KNN uses distance metrics to classify or predict new samples, based on the assumption that "similar objects have similar outputs." The algorithm works as follows. First, the value of K is selected: K is the number of neighbors, i.e., the K training samples closest to the test sample that are considered when making a prediction. The prediction process then involves several steps (a minimal sketch is given after this description): 1) Calculate the distances: for a given test sample, compute its distance to every sample in the training set; commonly used distance metrics include Euclidean distance, Manhattan distance, and cosine similarity. 2) Select the K nearest neighbors: sort the training samples by distance and select the K samples nearest to the test sample. 3) Classification task: the class of the test sample is determined by a voting mechanism over the K nearest neighbors, i.e., the category that occurs most often among them is the predicted category of the test sample. 4) Regression task: the prediction for the test sample is usually taken as the average of the values of the K nearest neighbors.
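To make these steps concrete, the following is a minimal sketch of plain KNN in Python; the function name knn_predict and the toy data are our own illustration, not code from this study's pipeline.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3, task="classification"):
    """Predict the label (classification) or value (regression) of one test sample."""
    # 1) Compute the Euclidean distance from the test sample to every training sample.
    distances = np.linalg.norm(X_train - x_test, axis=1)
    # 2) Take the indices of the K nearest neighbors.
    nearest = np.argsort(distances)[:k]
    if task == "classification":
        # 3) Majority vote over the neighbors' classes.
        return Counter(y_train[nearest]).most_common(1)[0][0]
    # 4) Regression: average the neighbors' target values.
    return float(np.mean(y_train[nearest]))

# Toy example with two well-separated classes.
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.2, 1.9]), k=3))  # -> 0
```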
KNN has several advantages: 1) Simple and easy to understand: KNN is one of the simplest machine learning algorithms; it is easy to understand and implement and does not require a complex model-training process. 2) Non-parametric model: KNN is a non-parametric algorithm that performs no explicit model fitting on the data, making it suitable for datasets whose distribution is unknown or complex. 3) Suitable for small datasets: KNN performs well on small datasets or when there are not many features, and can be used effectively for both classification and regression. 4) Can handle multi-class problems: KNN extends naturally to multi-class problems (i.e., it is not limited to binary classification), since it only needs to select the K nearest samples for voting; a short usage sketch follows this list.
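As an illustration of this last point, a KNN classifier handles a three-class problem without any change to the algorithm. The short scikit-learn sketch below is our own example on the Iris dataset, not the configuration used in this study.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Three-class toy problem: Iris has classes 0, 1 and 2.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# "Training" only stores the samples; prediction is a vote over the 5 nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))
```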
3 RESULTS AND DISCUSSION
The performance of the three ML models was
evaluated using metrics including accuracy, macro-
averaged and weighted-averaged precision, recall,
and F1-Score, as comprehensively presented in
Figure 1. Figure 2, on the other hand, shows the
importance scores of each feature in the decision tree
and random forest models.
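For reference, these metrics correspond to what scikit-learn's accuracy_score and classification_report produce; the sketch below uses placeholder labels rather than this study's actual predictions.

```python
from sklearn.metrics import accuracy_score, classification_report

# y_true: ground-truth labels of the test subset; y_pred: a model's predictions (placeholders).
y_true = [0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 0, 1]

print("Accuracy:", accuracy_score(y_true, y_pred))
# The report lists per-class precision/recall/F1 together with the
# macro-averaged and weighted-averaged summaries quoted in this section.
print(classification_report(y_true, y_pred, digits=2))
```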
The Decision Tree model demonstrated an
accuracy of 70.7%. The macro-averaged precision,
recall, and F1 score were 0.58, 0.60, and 0.58,
respectively. The weighted average precision, recall,
and F1 score were 0.76, 0.71, and 0.73, respectively.
Age and Chronic Disease were identified as the most
impactful features, with importance scores of 40.3%
and 23.5%, respectively.
The Random Forest model achieved the highest
accuracy of 82.1%, with macro-averaged precision,
recall, and F1 scores of 0.75, 0.53, and 0.51,
respectively. The weighted average precision, recall,
and F1 scores were 0.80, 0.82, and 0.76, respectively.
Age and Alcohol Consumption emerged as the most
significant features, contributing 38.6% and 13.4% to
the model's predictions, respectively.
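The importance scores discussed here are, we assume, the standard impurity-based importances exposed by scikit-learn tree models; a minimal sketch of reading such scores from fitted models is shown below (the synthetic data and feature names are placeholders, not the study's dataset).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the lung-cancer features; the names are illustrative only.
feature_names = ["Age", "Chronic Disease", "Alcohol Consumption", "Smoking"]
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)

for name, model in [("Decision Tree", DecisionTreeClassifier(random_state=0)),
                    ("Random Forest", RandomForestClassifier(random_state=0))]:
    model.fit(X, y)
    # feature_importances_ gives one impurity-based score per feature, summing to 1.
    scores = dict(zip(feature_names, np.round(model.feature_importances_, 3)))
    print(name, scores)
```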
Meanwhile, the KNN model attained an accuracy
of 81.6%. Its macro-averaged precision and recall
were 0.41 and 0.50, respectively, with a macro-
averaged F1 score of 0.45. The weighted average
precision, recall, and F1 scores were 0.67, 0.82, and
0.73, respectively. However, this model exhibited a
notable performance imbalance, with a precision of
0.00 for class 0, suggesting difficulty in correctly
identifying samples from this class.
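A per-class precision of 0.00 typically means the model produced no correct predictions for that class (for instance, if it never predicted class 0 at all, scikit-learn reports the precision as 0). A confusion matrix, sketched below with placeholder labels, makes this kind of imbalance visible.

```python
from sklearn.metrics import confusion_matrix

# Placeholder labels for a model that never predicts class 0.
y_true = [0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1]

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred, labels=[0, 1]))
# [[0 2]
#  [0 6]]  -> no correct class-0 predictions, hence precision 0.00 for class 0.
```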
Among the three models, the Random Forest
demonstrated the highest overall accuracy and
balanced performance across both classes, suggesting better generalization to the new data subset. This performance may be attributed to the ensemble learning nature of Random Forest, which effectively reduces variance and overfitting. In contrast, the KNN model's lack of feature importance scores is expected, as it is a distance-based algorithm that does not inherently assign weights to features during training. Its difficulty in
recognizing class 0 is probably due to the lack of
significant clustering or separation between class
samples. The lower accuracy of the Decision Tree model may be due to overfitting on training subset A: the model captures excessive noise specific to the training data, thereby reducing its ability to generalize to the unseen data in subset B. Feature importance
analysis indicates that 'Age' is consistently a
significant predictor in both the Random Forest and
Decision Tree models, underscoring its strong
predictive value within this dataset. In contrast,
'Chronic Disease' demonstrated a substantially higher
importance in the Decision Tree model, which may
suggest the presence of a biased decision rule heavily
reliant on this specific feature.
4 CONCLUSIONS
Overall, this study employed a clustering approach to
explore the generalizability of different models in
predicting lung cancer by training the models on one
subset and testing them on the other. We utilised the
K-means algorithm for clustering the original dataset