
resistant to overfitting, and being able to identify
intricate non-linear correlations in the data.
KNN is an instance-based learning algorithm. When a new sample is presented, KNN locates the K training samples that are most similar to it (closest in distance) and uses the classes of these K neighbors to predict the new sample's class. Distances between samples are typically measured with metrics such as Euclidean or Manhattan distance. KNN is widely used in domains including recommendation systems, text categorization, and image recognition. It requires no explicit training procedure and is quite flexible with regard to the data distribution. Its drawbacks are a high computational cost at prediction time, which rises sharply with data volume, sensitivity to local noise in the data, and dependence of performance on the choice of K.
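As a minimal illustrative sketch (not the paper's actual pipeline), such a classifier can be fit with scikit-learn; the synthetic data, the feature count, and the choice K = 5 are assumptions for demonstration only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart disease features (the real dataset is not reproduced here).
X, y = make_classification(n_samples=500, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Distance-based methods benefit from feature scaling.
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

# K = 5 neighbors with Euclidean distance (p=2); Manhattan distance corresponds to p=1.
knn = KNeighborsClassifier(n_neighbors=5, p=2).fit(X_tr, y_tr)
print("test accuracy:", knn.score(X_te, y_te))
```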
Logistic regression is a linear model for classification problems. It maps the output of a linear regression to a probability between 0 and 1, representing the likelihood that a sample belongs to a particular class. Parameters are estimated by maximizing the likelihood function, with gradient descent as a popular solution technique. The model is straightforward, interpretable, and computationally efficient, and it handles linearly separable data well. Its drawbacks are the restrictions it places on the data distribution, which typically include feature independence, and the fact that it can only capture linear relationships, so its capacity to fit complex non-linear data is limited.
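The sketch below, under the same illustrative assumptions as above (synthetic data, scikit-learn), shows how the linear scores are turned into class probabilities:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data as a stand-in for the study's features.
X, y = make_classification(n_samples=500, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# The linear scores are passed through the sigmoid to yield probabilities in [0, 1];
# parameters are fit by maximizing the (log-)likelihood, here with the lbfgs solver.
clf = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X_tr, y_tr)
print("P(class = 1) for first test sample:", clf.predict_proba(X_te)[0, 1])
```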
The Naive Bayes classification technique is founded on Bayes' theorem and the assumption of conditional independence between features. It assumes that each feature is independent given the class, computes the posterior probability of each class given the features, and predicts the class with the highest posterior probability. The model is insensitive to missing values, works well on small-scale data, and trains quickly. Its drawbacks are that it is sensitive to the representation of the input data and that the conditional independence assumption is frequently violated in practice, which can compromise the model's accuracy.
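A minimal sketch, again assuming synthetic data and scikit-learn rather than the paper's own code, is:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# GaussianNB treats each feature as conditionally independent and Gaussian given the class;
# prediction picks the class with the highest posterior probability.
nb = GaussianNB().fit(X_tr, y_tr)
print("posterior probabilities for first test sample:", nb.predict_proba(X_te)[0])
```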
Gradient boosting is based on the concept of ensemble learning: it iteratively trains a number of weak learners and combines them into a strong learner. In each iteration, the next weak learner is fit to the gradient of the current model's loss function, gradually improving performance. The method can handle many data types, including categorical and numerical features, generalizes well, and fits intricate non-linear relationships. However, it is time-consuming to train, prone to overfitting, and very sensitive to hyperparameter selection, which requires careful tuning.
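Under the same illustrative assumptions (synthetic data, scikit-learn, arbitrary hyperparameter values), a sketch of the procedure is:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Each of the n_estimators shallow trees is fit to the gradient of the loss of the
# current ensemble; learning_rate and max_depth are the hyperparameters that most
# often need tuning to avoid overfitting.
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=3,
                                random_state=0).fit(X_tr, y_tr)
print("test accuracy:", gb.score(X_te, y_te))
```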
SVM separates the two classes of samples by finding the hyperplane that maximizes the margin between them. For non-linear problems, a kernel function maps the data into a high-dimensional space in which it becomes linearly separable. SVM handles both linear and non-linear problems, has strong generalization ability, and effectively prevents overfitting on small samples. Its drawbacks are a lengthy training period and considerable computational complexity, particularly on large amounts of data, as well as high sensitivity to the choice of kernel function and parameters, which demands specific tuning skill.
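The following sketch, using the same illustrative synthetic data and scikit-learn with an RBF kernel chosen only as an example, shows the idea:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# SVMs are sensitive to feature scales, so standardize first.
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

# The RBF kernel implicitly maps samples into a high-dimensional space where a
# maximum-margin hyperplane separates the classes; C and gamma usually need tuning.
svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_tr, y_tr)
print("test accuracy:", svm.score(X_te, y_te))
```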
4 MODEL TRAINING AND COMPARISON
The six algorithms were applied in the training process and show different performances. To evaluate each algorithm, accuracy and F1-score are calculated as in Equations (1) and (2), using True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN), where P denotes precision and R denotes recall.
\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}
\]
\[
F1 = \frac{2PR}{P + R} \tag{2}
\]
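As a worked illustration of Equations (1) and (2), the labels below are hypothetical and serve only to show how the counts feed the formulas:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical true and predicted labels, not results from the paper's models.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)          # Equation (1)
precision, recall = tp / (tp + fp), tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # Equation (2)

# The same values via scikit-learn's built-in metrics.
print(accuracy, accuracy_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```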
The ROC curve is a key tool for assessing the effectiveness of diagnostic tests, particularly in binary classification settings (Srinivasan & Mishra, 2024), and is also used here to compare the models. A greater area under the curve typically indicates higher performance. Figure 2 shows the confusion matrix reports, one per algorithm. Table 2 lists the accuracy and F1 scores, and Figure 3 shows the ROC curves.
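A brief sketch of how such a curve and its area are obtained, assuming hypothetical classifier scores (e.g. from predict_proba) rather than the paper's outputs, is:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and predicted scores for the positive class.
y_true   = [0, 0, 1, 1, 0, 1, 1, 0]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.3]

# roc_curve returns the false/true positive rates traced as the threshold varies;
# a larger area under this curve indicates better ranking of the two classes.
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print("AUC:", roc_auc_score(y_true, y_scores))
```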