incorporation of self-learning and artificial
intelligence enables these systems to adapt their
behavior over time, providing personalized emotional
feedback and fostering stronger emotional bonds
between pets and users (Shenoy et al., 2022; Yang et
al., 2020). The development of large language models (LLMs) has further enhanced the ability of intelligent systems to understand and manage complex emotions, enabling richer interaction scenarios through natural language processing and generation (Chandraumakantham et al., 2024). These systems can dynamically adapt to users' emotional states and expectations, opening new possibilities for affective computing (Jiang et al., 2025).
This study aims to develop an intelligent
electronic pet interaction system that leverages
emotional feedback and a self-learning mechanism.
By integrating multimodal emotion perception technology and using an LLM as the core decision engine, the system can perceive users' emotional
states, environmental changes, and evolving
preferences in real time. Through continuous self-
learning, the system adapts the pet's behavior to
provide personalized emotional feedback, enhancing
the user experience and fostering stronger emotional
connections with users.
Simulation experiments using public multimodal datasets are conducted to validate the multimodal perception and feedback effects of the LLM-based decision engine. The AffectNet dataset is used to classify several basic emotions and obtain facial expression information, and the IEMOCAP dataset is used to capture the correspondence between speech, expression, and emotion. In addition, other public data are used to define the categories of environmental factors and user preferences. These experiments assess the model's ability to understand user preferences and generate personalized feedback, and evaluate its adaptability, feedback accuracy, and self-learning potential in a non-real-time offline environment.
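For illustration, a minimal Python sketch of such an offline accuracy check is given below; the label set and helper function are assumptions for exposition rather than the evaluation code used in this study.

# Illustrative offline evaluation helper (not the evaluation code used in this
# study): compares predicted emotion labels against ground-truth labels from a
# dataset such as AffectNet and reports overall and per-class accuracy.
from collections import defaultdict

EMOTIONS = ["Happy", "Sad", "Anger", "Normal", "Disgust", "Fear", "Surprised"]

def per_class_accuracy(y_true, y_pred):
    """Return overall accuracy and per-class accuracy for emotion predictions."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    per_class = {e: correct[e] / total[e] for e in EMOTIONS if total[e]}
    overall = sum(correct.values()) / max(sum(total.values()), 1)
    return overall, per_class

# Toy usage with three labeled samples.
overall, per_class = per_class_accuracy(
    ["Happy", "Sad", "Happy"], ["Happy", "Normal", "Happy"])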
Currently, smart pet designs largely focus on
improving technical aspects such as emotion
recognition accuracy and user interaction. The
primary goal of these advancements is to enhance
user experience, strengthen emotional connections,
and enable the device to empathize with the user
(Abdollahi et al., 2023). This research aims to provide
a theoretical foundation and technical support for the
development of emotionally supportive devices. It
encourages the use of smart electronic pets in family
companionship, elderly care, and emotional therapy
while advocating for the integration of multimodal perception and LLMs in affective computing.
2 METHODOLOGY
2.1 Overall System Architecture
The intelligent electronic pet interaction system
utilizes a layered architecture for emotion-driven
dynamic interaction (Figure 1). It consists of three
layers: the perception layer integrates user emotions,
environmental factors, and preferences, using
multimodal sensors and deep learning models to
analyze real-time data; the decision layer, powered by an LLM, merges multimodal inputs to generate
personalized feedback; and the execution layer
produces pet movements and voice feedback through
drive modules like servos and speakers, while
collecting user behavior data for model optimization.
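For illustration, the following Python sketch outlines this perception-decision-execution flow under the assumption of a placeholder LLM backend; the class and method names are illustrative and not part of the system implementation.

# Illustrative sketch of the three-layer data flow (class and method names are
# assumptions, not taken from the system implementation).
from dataclasses import dataclass

@dataclass
class PerceptionOutput:
    """Structured features produced by the perception layer."""
    emotion: str          # e.g., "Happy"
    environment: str      # e.g., "sunny and warm"
    user_pref_style: str  # e.g., "gentle"
    user_pref_tone: str   # e.g., "encouraging"

class StubLLM:
    """Placeholder for the real LLM backend; returns a canned instruction."""
    def generate(self, prompt: str) -> dict:
        return {"Action": "The pet wags its tail happily.",
                "Speech": "Woof! I'm so glad to see you!"}

class DecisionLayer:
    """Turns perception features into an action/speech instruction via the LLM."""
    def __init__(self, llm):
        self.llm = llm

    def decide(self, p: PerceptionOutput) -> dict:
        prompt = (f"{{emotion: {p.emotion}, environment: {p.environment}, "
                  f"user-pref-style: {p.user_pref_style}, "
                  f"user-pref-tone: {p.user_pref_tone}}}")
        return self.llm.generate(prompt)

class ExecutionLayer:
    """Would drive servos and the speaker; here it simply reports the instruction."""
    def execute(self, instruction: dict) -> None:
        print("Action:", instruction.get("Action"))
        print("Speech:", instruction.get("Speech"))

# End-to-end flow: perception -> decision -> execution.
features = PerceptionOutput("Happy", "sunny and warm", "gentle", "encouraging")
ExecutionLayer().execute(DecisionLayer(StubLLM()).decide(features))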
The data flow involves extracting features from
the perception layer and structuring them into
prompts (e.g., {emotion: Happy, environment: sunny
and warm, user-pref-style: gentle, user-pref-tone:
encouraging}), which are fed into the LLM. The LLM
generates feedback instructions (e.g., {[Action]: The
pet jumps up in a playful manner, wagging its tail
excitedly. [Speech]: “Woof woof! That's fantastic
news, friend! I'm so proud of you and I can't wait to
celebrate together!”}), guiding the execution layer.
Feedback is adjusted dynamically through
reinforcement learning, with preference models
undergoing closed-loop optimization.
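As an illustration of this closed loop, the following sketch uses a simple bandit-style update that nudges style and tone weights toward options that elicit positive user reactions; the actual reinforcement learning procedure used by the system may differ.

# Hypothetical closed-loop preference update: a simple bandit-style rule that
# nudges style/tone weights toward options the user reacts positively to.
import random

class PreferenceModel:
    def __init__(self, styles, tones, lr=0.1):
        self.weights = {("style", s): 0.5 for s in styles}
        self.weights.update({("tone", t): 0.5 for t in tones})
        self.lr = lr  # step size of the update

    def choose(self, kind: str) -> str:
        """Pick the currently highest-weighted option of the given kind."""
        options = [(k[1], w) for k, w in self.weights.items() if k[0] == kind]
        return max(options, key=lambda x: x[1])[0]

    def update(self, kind: str, option: str, reward: float) -> None:
        """Move the chosen option's weight toward the observed reward (0 or 1)."""
        key = (kind, option)
        self.weights[key] += self.lr * (reward - self.weights[key])

# One optimization cycle: choose a style/tone, observe the user's reaction
# (simulated here), and update the preference weights.
model = PreferenceModel(styles=["gentle", "lively"], tones=["encouraging", "calm"])
style, tone = model.choose("style"), model.choose("tone")
reward = random.choice([0, 1])  # stand-in for the logged user reaction
model.update("style", style, reward)
model.update("tone", tone, reward)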
2.2 Perception Layer
2.2.1 User Emotion Perception Module
This module enhances emotion recognition by
integrating visual and auditory signals (Figure 1). A 1080p camera captures facial images at 15 FPS, which a ResNet-50 model trained on the AffectNet dataset classifies into seven basic emotions: Happy, Sad, Anger, Normal, Disgust, Fear, and Surprised. For auditory
perception, a directional microphone captures sound,
while emotion intensity is assessed through speech-
to-text and voiceprint analysis. In case of a mismatch
between visual and auditory signals, voice features
are prioritized and flagged for verification. The
module outputs structured emotion labels, expression
features, and speech text for LLM decision-making.
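The following sketch illustrates the visual branch and the mismatch rule described above, assuming a torchvision ResNet-50 whose final layer is replaced by a seven-way classifier; preprocessing, AffectNet fine-tuning, and the audio branch are omitted, and the fuse() helper is hypothetical.

# Minimal sketch of the visual branch and the stated mismatch rule. The
# ResNet-50 head is replaced with a 7-way classifier as described above.
import torch
import torch.nn as nn
from torchvision import models

EMOTIONS = ["Happy", "Sad", "Anger", "Normal", "Disgust", "Fear", "Surprised"]

visual_model = models.resnet50(weights=None)  # would be fine-tuned on AffectNet
visual_model.fc = nn.Linear(visual_model.fc.in_features, len(EMOTIONS))
visual_model.eval()

def classify_face(frame: torch.Tensor) -> str:
    """frame: preprocessed image tensor of shape (1, 3, 224, 224)."""
    with torch.no_grad():
        logits = visual_model(frame)
    return EMOTIONS[int(logits.argmax(dim=1))]

def fuse(visual_emotion: str, audio_emotion: str) -> dict:
    """On a visual/auditory mismatch, prioritize the voice estimate and flag it."""
    if visual_emotion != audio_emotion:
        return {"emotion": audio_emotion, "needs_verification": True}
    return {"emotion": visual_emotion, "needs_verification": False}

result = fuse(classify_face(torch.randn(1, 3, 224, 224)), "Happy")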
2.2.2 Environmental Factors Perception
Module
Environmental perception adapts the pet's behavior to the physical environment using sensor networks and visual scene analysis (Figure 1). Sensors monitor