This experiment illustrates several points. First, a repetition penalty of 1.5 distorts the generated content and is unsuited to the rigor the medical field demands. Second, the model's image parsing is strong enough to identify foreign objects in the chest cavity on monochrome X-rays, although it requires additional prompting before identifying them as steel nails, which suggests that the model has not been trained on medical data. What is certain is that lightweight models hold considerable potential for future deployment across many domains. Indeed, researchers have already trained a lightweight model for the gaming domain, VideoGameBunny (VGB), based on the Bunny model architecture and using game screenshots as the training set, and its performance is also excellent (Taesiri & Bezemer, 2024).
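For reference, the following minimal sketch shows how such a repetition-penalty comparison can be reproduced with the Hugging Face transformers API. The prompt is a hypothetical stand-in, and the image-handling step of Bunny's multimodal pipeline is omitted for brevity; this is an illustration of the parameter being varied, not the paper's exact test harness.

```python
# Sketch: compare a neutral repetition penalty against the 1.5 setting
# that distorted output in the experiment. The prompt is hypothetical,
# and image input is omitted; only the generation parameters matter here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BAAI/Bunny-v1_0-2B-zh"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, trust_remote_code=True
)

prompt = "Describe the abnormality visible on this chest X-ray."  # hypothetical
inputs = tokenizer(prompt, return_tensors="pt")

for penalty in (1.0, 1.5):
    output_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        repetition_penalty=penalty,  # values > 1.0 discourage repeated tokens
    )
    print(f"--- repetition_penalty={penalty} ---")
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```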
3.5.2 Abnormal Input Robustness
This final experiment challenges the model's Chinese language processing, with anomalous inputs consisting of missing key prepositions, homophone substitutions, mixed-in non-textual symbols, mixed Chinese and English, and extremely long single sentences without punctuation. The model handles most of the abnormal inputs correctly, but it outputs only English when the request for a Chinese response is itself written in English. When the model is not explicitly prompted to reply in Chinese, it is likely to reply in English, indicating that although the model has been optimized for Chinese, English still takes priority.
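A minimal sketch of this probe is shown below. The test strings are hypothetical stand-ins for the paper's actual cases, and ask_model() is a placeholder stub for the real Bunny inference call; the check simply detects whether the reply contains any Chinese characters.

```python
# Sketch of the abnormal-input probe described above. The test strings are
# hypothetical stand-ins for the paper's cases; ask_model() is a stub that
# should be replaced with a real call into the Bunny chat interface.
import re

abnormal_inputs = {
    "missing preposition":         "请把这张图片的内容描述我",   # dropped 给 before 我
    "homophone substitution":      "请描述这张途片的内容",       # 途 in place of 图
    "symbol mixing":               "请#描述¥这张图片@的内容",
    "Chinese/English mixing":      "请describe这张picture的内容",
    "English request for Chinese": "Please answer in Chinese: what is in this image?",
}

def ask_model(prompt: str) -> str:
    """Placeholder for the actual model call; returns a canned reply."""
    return "(model reply here)"

def contains_chinese(text: str) -> bool:
    """True if the reply contains at least one CJK character."""
    return re.search(r"[\u4e00-\u9fff]", text) is not None

for label, prompt in abnormal_inputs.items():
    reply = ask_model(prompt)
    print(f"{label}: replied in Chinese = {contains_chinese(reply)}")
```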
4 CONCLUSIONS
Based on BAAI's Bunny-v1_0-2B-zh model, this paper explores the application boundaries of a lightweight model through stepwise parameter tuning. The experiments were designed in a targeted way rather than as large-scale tests on public datasets, so the experimental data may contain some error; they were also run only on the Bunny-v1_0-2B-zh model, which cannot represent other lightweight models, so the results are for reference only. With only 2 billion parameters, Bunny-v1_0-2B-zh is smaller than typical lightweight models, yet it shows remarkable performance. It not only returns results on mobile devices in an average of 2 minutes and 30 seconds, but also exhibits image recognition capabilities that are not inferior to those of larger models, which gives it application potential across a wide range of domains and bodes well for the development of lightweight models. This paper provides a reference for the future optimization and deployment of lightweight models and contributes to the lightweight development of AI.
REFERENCES
Beijing Academy of Artificial Intelligence (BAAI). 2024. Bunny-v1_0-2B-zh. Retrieved April 1, 2025, from https://huggingface.co/BAAI/Bunny-v1_0-2B-zh/tree/main/images
Brown, T. B., Mann, B., Ryder, N., et al. 2020. Language
models are few-shot learners. Advances in Neural
Information Processing Systems, 33, 1877–1901.
Chen, W., Li, M., Zhou, Y., et al. 2023. EcoMLM: A
holistic efficiency evaluation framework for
lightweight multimodal models. In Proceedings of the
2023 ACM International Conference on Mobile
Systems, Applications, and Services (MobiSys) (pp. 1–
15).
He, M., Liu, Y., Wu, B., et al. 2024. Efficient multimodal
learning from data-centric perspective. arXiv preprint
arXiv:2402.11530.
Liu, Y., Chen, Z., Wang, Y., et al. 2023. CPM-M3:
Multilingual multimodal pre-training with curriculum
learning. In Proceedings of the 61st Annual Meeting of
the Association for Computational Linguistics (ACL)
(pp. 10234–10248).
Mehta, S., & Rastegari, M. 2022. MobileViT: Light-weight,
general-purpose, and mobile-friendly vision
transformer. In Proceedings of the 10th International
Conference on Learning Representations (ICLR).
Taesiri, M. R., & Bezemer, C.-P. 2024. VideoGameBunny:
Towards vision assistants for video games. arXiv
preprint arXiv:2407.15295.
Touvron, H., Lavril, T., Izacard, G., et al. 2023. LLaMA:
Open and efficient foundation language models. arXiv
preprint arXiv:2302.13971.
Wu, X., Zhang, B., Deng, Z., et al. 2024. Vision-language
dataset distillation. arXiv preprint
arXiv:2308.07545v4.
Youlai Doctor. 2024. Sternum fracture (33). Retrieved April 1, 2025, from https://www.youlai.cn/dise/imagedetail/343_72859.html
Yuan, J., Gao, H., Dai, D., et al. 2025. Native sparse attention: Hardware-aligned and natively trainable sparse attention. arXiv preprint arXiv:2502.11089v2.
Zhang, Y., Liu, Z., Wang, X., et al. 2024. Gradient
complexity profiling: A systematic approach to
lightweight model capability boundary analysis. IEEE
Transactions on Pattern Analysis and Machine
Intelligence, 46(3), 1452–1467.