This work is also interested in the 1.0 training accuracy achieved by the random forest and gradient boosting algorithms on the first dataset. Therefore, this work trains these two algorithms with different numbers of estimators. The resulting accuracy for each number of estimators is shown in Figure 1.
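As a minimal sketch of this sweep (assuming the scikit-learn implementations of these algorithms; the synthetic data is only a stand-in for the actual dataset):

    # Sweep the number of estimators and record training accuracy.
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
    from sklearn.datasets import make_classification

    X_train, y_train = make_classification(n_samples=1000, random_state=0)  # placeholder data

    for n in (10, 50, 100, 200):
        for Model in (RandomForestClassifier, GradientBoostingClassifier):
            clf = Model(n_estimators=n, random_state=0).fit(X_train, y_train)
            print(Model.__name__, n, clf.score(X_train, y_train))  # training accuracy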
4 DISCUSSIONS
4.1 Performance
Accuracy is one of the most important indicators
when evaluating models. The author first analyzes the accuracy of all these algorithms on the given datasets and finds that, among the non-ensemble models, the decision tree has the best accuracy. This may be because the decision tree is more suitable for the given datasets. Because its accuracy is much higher than that of the other two non-ensemble algorithms, the voting algorithm built from all three of them performs worse than the decision tree alone. However, it can also be noticed that random forest and gradient boosting reach similar or better accuracy compared to the decision tree, which is expected, since these two algorithms are based on the decision tree and improve upon it. When distance-weighted kNN is applied to the mushroom dataset, it reaches a training accuracy of 1.0, which suggests it overfits the training data; this is unsurprising, since under distance weighting every training point is its own nearest neighbor at distance zero. However, since non-weighted (uniform) kNN is used on the heart attack dataset, the kNN algorithm does not show overfitting there.
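This effect can be reproduced with a minimal sketch (assuming scikit-learn's KNeighborsClassifier; the synthetic data is a placeholder for the mushroom dataset):

    # Distance-weighted kNN scores 1.0 on its own training data, because each
    # training point is its own nearest neighbor at distance zero (absent
    # duplicate points with conflicting labels); uniform weighting does not.
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=500, random_state=0)  # placeholder data

    knn_dist = KNeighborsClassifier(n_neighbors=5, weights="distance").fit(X, y)
    knn_unif = KNeighborsClassifier(n_neighbors=5, weights="uniform").fit(X, y)
    print(knn_dist.score(X, y))  # 1.0 by construction
    print(knn_unif.score(X, y))  # typically below 1.0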
To avoid overfitting, this work limited the maximum depth and the number of features considered at each split when training decision tree models. Given these constraints, the decision tree has a lower training accuracy but better test performance.
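A minimal sketch of this regularization, assuming scikit-learn's DecisionTreeClassifier (the parameter values are illustrative, not the ones used in this work):

    # Limiting depth and the features considered per split regularizes the
    # tree: training accuracy drops, but test accuracy often improves.
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=1000, random_state=0)  # placeholder data
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    tree = DecisionTreeClassifier(max_depth=5, max_features="sqrt", random_state=0)
    tree.fit(X_tr, y_tr)
    print(tree.score(X_tr, y_tr))  # lower than an unconstrained tree
    print(tree.score(X_te, y_te))  # often higher than an unconstrained tree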
It can also be noticed that random forest and gradient boosting have very high training accuracy. This does not necessarily mean the models are overfitted: as shown in Figure 1, accuracy does not drop as the number of estimators (the underlying decision trees, or the number of boosting iterations) increases.
4.2 Time Consumption
Among the three non-ensemble algorithms, SVC has
the worst efficiency: it takes much longer when applied to a large dataset. The kNN algorithm has a lower training time but a longer prediction time. This is because the algorithm is not actually trained into a fixed model; instead, it stores all the training data and consults it when predicting.
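This fit/predict asymmetry can be measured directly; a minimal sketch, assuming scikit-learn (the dataset is a synthetic placeholder):

    # kNN is a lazy learner: fitting mostly stores the data, while prediction
    # searches it, so fit is cheap and predict is comparatively expensive.
    import time
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=20000, random_state=0)  # placeholder data

    for clf in (KNeighborsClassifier(), DecisionTreeClassifier(random_state=0)):
        t0 = time.perf_counter(); clf.fit(X, y); t1 = time.perf_counter()
        clf.predict(X); t2 = time.perf_counter()
        print(type(clf).__name__, "fit:", t1 - t0, "predict:", t2 - t1)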
The voting algorithm must first train all three of these algorithms, so its total time cost is roughly the sum of the three. Since this work includes the SVC algorithm, which is slow on large datasets, the voting algorithm also has very poor efficiency (see the sketch at the end of this subsection). Random forest and gradient boosting are both based on decision trees, and both expose parameters that control the number of underlying trees or the number of boosting iterations, so their efficiency is heavily influenced by these parameters. As a result, they are much slower than a single decision tree, but they are still more efficient than SVC on large datasets.
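The additive training cost of voting noted above follows directly from its construction; a minimal sketch, assuming scikit-learn's VotingClassifier:

    # A hard-voting ensemble must fit every base classifier, so its training
    # time is roughly the sum of its parts (including the slow SVC).
    from sklearn.ensemble import VotingClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=2000, random_state=0)  # placeholder data

    vote = VotingClassifier(
        estimators=[("tree", DecisionTreeClassifier(random_state=0)),
                    ("svc", SVC()),
                    ("knn", KNeighborsClassifier())],
        voting="hard",
    )
    vote.fit(X, y)  # fits all three base classifiers internally
    print(vote.score(X, y))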
4.3 Randomness and Stability
The kNN and SVC algorithms do not rely on randomness, so they always produce a stable outcome. However, the decision tree uses randomness when splitting nodes, for example when only a random subset of features is considered at each split. As a result, the decision tree is not stable: with different random seeds, it may reach different accuracy.
Random forest is based on decision trees, but it combines the results of multiple decision trees, so it has higher stability. Similar behavior can be observed in gradient boosting, which iterates many times and thereby dampens the instability introduced by randomness.
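A minimal sketch of this stability comparison, assuming scikit-learn (the seeds and data are illustrative):

    # A single tree that samples features per split varies with the random
    # seed; a random forest averages many such trees, so it varies less.
    import statistics
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=1000, random_state=0)  # placeholder data
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for Model in (DecisionTreeClassifier, RandomForestClassifier):
        scores = [Model(max_features="sqrt", random_state=seed)
                  .fit(X_tr, y_tr).score(X_te, y_te) for seed in range(10)]
        print(Model.__name__, "std dev:", statistics.stdev(scores))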
5 CONCLUSIONS
This work presents a comprehensive comparison of
ensemble and non-ensemble machine learning
algorithms, focusing on their performance, efficiency,
and stability. The analysis includes decision trees,
support vector classification, K-nearest neighbors,
random forests, gradient boosting, and voting
algorithms.
From the evaluation, it could be observed that
ensemble methods, especially random forests and
gradient boosting, generally outperform non-
ensemble methods in terms of accuracy. This can be
attributed to their ability to combine multiple models,
thereby reducing overfitting and enhancing
generalization. However, the voting algorithm did not perform as well as expected, possibly because its weaker base classifiers drag down the combined vote; in addition, the inclusion of SVC makes it inefficient on large datasets.
In terms of training and prediction time, non-
ensemble methods such as kNN and decision trees
exhibit faster training times, but their prediction
efficiency varies. kNN, in particular, exhibits longer
prediction times due to its reliance on the entire
training dataset. Ensemble methods are slower during
training due to the complexity of combining multiple
models, but are still more efficient than SVC on large
datasets.
Stability analysis shows that non-ensemble
methods such as kNN and SVC provide consistent