A Parameter-Free Self-Training Algorithm for Dual Choice Strategy
Wei Zhao, Qingsheng Shang*, Jikui Wang, Xiran Li, Xueyan Huang and Cuihong Zhang
College of Information Engineering and Artificial Intelligence,
Lanzhou University of Finance and Economics, Lanzhou 730020, Gansu, China
Keywords: Self-Training, High-Confidence Samples, Dual Choice Strategy.
Abstract: In the field of machine learning, semi-supervised learning has become a research hotspot. Self-training algorithms improve classification performance by iteratively adding selected high-confidence samples to the labeled sample set. However, existing methods often rely on parameter tuning to select high-confidence samples and fail to fully account for local neighborhood information and the information carried by the labeled samples. To address these issues, this paper proposes a parameter-free self-training algorithm with a dual choice strategy (FSTDC). Firstly, the problem of selecting the K value for the KNN classifier is solved by using natural neighbors to capture the local information of each sample; secondly, adaptive stable labels are defined to exploit the information of labeled samples. On this basis, a decision tree classifier is introduced to combine global information in a dual selection that further filters high-confidence samples. We conducted experiments on 12 benchmark datasets and compared FSTDC with several self-training algorithms. The experimental results show that the FSTDC algorithm achieves significant improvements in classification accuracy.
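To make the first step concrete, below is a minimal sketch of a natural-neighbor search in the style commonly used in this line of work. This is an illustration under our own assumptions, not the paper's implementation; the function name and the exact stopping rule are ours. The neighborhood size k grows from 1 until the number of samples that appear in no other sample's k-nearest-neighbor list stops changing, so no K value has to be tuned by the user.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def natural_neighbor_eigenvalue(X):
        """Grow k until the neighborhood structure stabilizes.

        A sample is an "orphan" at round k if it appears in no other
        sample's k-nearest-neighbor list; the search stops when the
        orphan count no longer changes, yielding a data-driven k.
        (Sketch under assumptions stated in the text, not the paper's code.)
        """
        n = len(X)
        # Full neighbor ordering for every sample (column 0 is the sample itself).
        _, order = NearestNeighbors(n_neighbors=n).fit(X).kneighbors(X)
        reverse_count = np.zeros(n, dtype=int)
        prev_orphans = -1
        for k in range(1, n):
            for i in range(n):
                # Sample i's k-th nearest neighbor gains one reverse neighbor.
                reverse_count[order[i, k]] += 1
            orphans = int((reverse_count == 0).sum())
            if orphans == prev_orphans:
                return k  # stable neighborhood size; no user-set K needed
            prev_orphans = orphans
        return n - 1

Variants of this search stop instead when the orphan count is unchanged for several consecutive rounds or reaches zero; the returned value can then serve as the K of the KNN classifier.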
1 INTRODUCTION
In the field of semi-supervised learning (SSL) (Van Engelen and Hoos, 2020), self-training algorithms play a crucial role, aiming to improve the performance of classifiers by combining a small amount of labeled data with a large amount of unlabeled data. Li et al. (Li et al., 2005) proposed the self-training with editing algorithm (SETRED), which uses data editing techniques to identify and reject potentially mislabeled samples during the labeling process. Despite the progress SETRED has made in improving the robustness of self-training, it relies on manually set thresholds.
To overcome this limitation, Wu et al. (Wu et al., 2018) proposed the self-training semi-supervised classification algorithm based on density peaks of data (STDP), which builds on the concept of density peak clustering (DPC) (Rodriguez and Laio, 2014) and can reveal the underlying structure of data distributions with different shapes. The STDP algorithm does not rely on specific assumptions about the data distribution, which expands the application scope of self-training algorithms. Zhao and Li (Zhao and Li, 2021) introduced the concept of natural neighbors into STDP and proposed a semi-supervised self-training method based on density peaks and natural neighbors (STDPNaN). Furthermore, Li et al. (Li et al., 2019) proposed a self-training method based on density peaks and an extended parameter-free local noise filter for k nearest neighbors (STDPNF), in which the extended parameter-free local noise filter (ENaNE) addresses the problem of mislabeled samples in STDP. The design of ENaNE cleverly exploits the information of both labeled and unlabeled data and efficiently filters noise. Building on this previous work, Wang et al. (Wang et al., 2023) proposed a self-training algorithm based on a two-stage data editing method with mass-based dissimilarity (STDEMB). Through its prototype-tree design, the STDEMB algorithm effectively edits mislabeled samples and selects high-confidence samples during self-training.
Based on the above work, this paper designs a parameter-free self-training algorithm with a dual choice strategy (FSTDC). It is not only parameter-free but also integrates the global and local information of the samples while making full use of the information of the labeled samples. In addition, a decision tree classifier is introduced for high-confidence sample selection. The algorithm is extensively validated by experiments on several datasets, demonstrating its effectiveness.
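To illustrate the dual choice idea, the following is a rough sketch of the selection loop. It is our simplification, not the FSTDC algorithm itself: the natural-neighbor and adaptive-stable-label criteria defined later in the paper are replaced here by plain prediction agreement between a local view (KNN) and a global view (decision tree), and the function name is ours.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    def dual_choice_self_training(X_l, y_l, X_u, k):
        """Self-training loop that keeps only unlabeled samples on which
        a local learner (KNN) and a global learner (decision tree) agree.
        (Simplified stand-in for the paper's dual choice strategy.)"""
        X_l, y_l, X_u = X_l.copy(), y_l.copy(), X_u.copy()
        while len(X_u) > 0:
            knn = KNeighborsClassifier(n_neighbors=min(k, len(X_l))).fit(X_l, y_l)
            tree = DecisionTreeClassifier(random_state=0).fit(X_l, y_l)
            p_knn, p_tree = knn.predict(X_u), tree.predict(X_u)
            agree = p_knn == p_tree  # dual choice: both views give the same label
            if not agree.any():
                break  # no high-confidence samples remain
            X_l = np.vstack([X_l, X_u[agree]])        # promote agreed samples
            y_l = np.concatenate([y_l, p_knn[agree]])
            X_u = X_u[~agree]
        return X_l, y_l

In FSTDC the neighborhood size k itself comes from the natural-neighbor search, so the whole procedure remains parameter-free.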