than other models. Although Omniglot is a relatively
simple dataset, the stable performance of MTKD on
this dataset further demonstrates its robustness. The
results indicate that MTKD not only excels at
learning from small amounts of data but is also
adaptable to different types of datasets, including
those with relatively simple or homogeneous classes.
However, the results also show that, while MTKD performs well on Omniglot, it does not achieve the best performance among all methods. The teachers' source dataset is ImageNet, which differs markedly from the handwritten-character dataset Omniglot, leaving a substantial domain gap.
According to Table 3, in the 5-shot setting the performance of all models improves as the number of training samples increases. Nevertheless, MTKD retains its lead, especially on the Mini-ImageNet and CUB datasets, which suggests that it not only benefits from additional samples but also consistently maintains a significant performance margin over the other methods. The improved accuracy with more samples further demonstrates MTKD's ability to exploit additional information, refine classification decisions, and reduce errors.
A key factor contributing to the enhanced
performance of MTKD is the use of distillation from
multiple teachers. By leveraging several teacher
models, the student model can assimilate diverse and
complementary insights from various sources,
leading to stronger feature representations. This helps the student model generalize better, especially when the training data is limited or noisy.
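To make this concrete, the following PyTorch-style sketch shows one common way to combine the soft labels of several teachers: a temperature-scaled KL-divergence term averaged over the teachers. The temperature value, the equal teacher weighting, and the function name are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def multi_teacher_soft_loss(student_logits, teacher_logits_list, T=4.0):
    """Average temperature-scaled KL divergence between the student's
    softened predictions and those of each teacher.
    Equal teacher weights and T = 4.0 are illustrative assumptions."""
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    loss = 0.0
    for teacher_logits in teacher_logits_list:
        p_teacher = F.softmax(teacher_logits.detach() / T, dim=1)
        # The T^2 factor keeps the gradient scale comparable to the hard-label loss.
        loss = loss + T * T * F.kl_div(log_p_student, p_teacher,
                                       reduction="batchmean")
    return loss / len(teacher_logits_list)
```

In this setting, teacher_logits_list would hold the outputs of the frozen ResNet50 and DenseNet-121 teachers for the current batch, while student_logits comes from the ResNet18 student being trained.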
Although Transfer Learning + Finetune is competitive on the Omniglot dataset, its results are inconsistent across datasets.
This inconsistency suggests that while transfer
learning helps leverage pre-trained knowledge, it may
struggle to adapt effectively to new categories and
tasks that differ from the source domain. In contrast,
MTKD is specifically designed for few-shot learning
tasks, making it more reliable and adaptable across
various scenarios.
Furthermore, the success of MTKD can also be
attributed to its ability to simultaneously learn both
local and global features. This allows the model to
capture fine details while also recognizing broader
patterns, giving it a distinct advantage over models
that focus predominantly on one aspect. This
balanced feature-learning approach ensures that the
model can adapt to a wide range of tasks, regardless
of the complexity of the dataset.
5 CONCLUSION
This study proposed and validated an FSL approach
based on KD. By utilizing ResNet50 and DenseNet-
121 as teacher models and ResNet18 as the student
model, the study effectively applied the principles of
knowledge distillation. This allowed the student
model, even with a limited amount of training data, to
achieve performance surpassing other few-shot
learning methods. Additionally, the paper designed a
comprehensive loss function, combining soft-label
loss, hard-label loss, and attention-weighted feature
distillation, which further enhances the student
model's feature learning capabilities while
maintaining prediction accuracy.
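For illustration, the overall objective can be written as the weighted sum below; the balancing coefficients, the distillation temperature, and the exact form of the attention weighting are assumptions, since the paper's precise formulation is not reproduced here.

$$
\mathcal{L}_{\mathrm{total}} = \alpha\,\mathcal{L}_{\mathrm{soft}} + \beta\,\mathcal{L}_{\mathrm{hard}} + \gamma\,\mathcal{L}_{\mathrm{feat}},
$$
$$
\mathcal{L}_{\mathrm{soft}} = \frac{1}{|\mathcal{T}|}\sum_{t \in \mathcal{T}} T^{2}\,\mathrm{KL}\!\left(\sigma(z_{t}/T)\,\|\,\sigma(z_{s}/T)\right),\qquad
\mathcal{L}_{\mathrm{hard}} = \mathrm{CE}\!\left(\sigma(z_{s}),\,y\right),\qquad
\mathcal{L}_{\mathrm{feat}} = \sum_{l} w_{l}\,\bigl\|A_{l}^{s} - A_{l}^{t}\bigr\|_{2}^{2},
$$

where σ is the softmax, z_s and z_t are the student and teacher logits, T is the distillation temperature, the first sum runs over the set of teachers, A_l^s and A_l^t are the student and teacher feature maps at layer l, w_l are the attention-derived layer weights, and α, β, and γ are balancing coefficients.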
The experimental results show that, after the teacher models were thoroughly trained on a large dataset, distilling their knowledge into the student model not only reduced the parameter count and computational cost but also significantly improved the student model's generalization ability in few-shot tasks. Specifically, in one-shot and few-shot scenarios, the distilled student outperformed both a student trained independently on the small datasets and the other few-shot learning methods. These results validate the effectiveness of knowledge distillation for few-shot learning and demonstrate its practical value, especially in resource-constrained applications. Future work will explore more sophisticated distillation strategies, such as adaptive adjustment of the temperature parameter, to achieve more robust performance across a broader range of tasks and datasets.