An LLM-Based Interaction System with Multimodal Emotion
Recognition and Self-Learning Mechanism in Intelligent Electronic
Pets
Ruize Wang
College of Design and Innovation, Tongji University, Shanghai, China
Keywords: Intelligent Electronic Pet, Emotional Feedback, Self-Learning Mechanism, Multimodal Perception, Large
Language Model (LLM).
Abstract: This study presents an LLM-based interaction system for intelligent electronic pets, aiming to enhance
emotional adaptability and personalized feedback through multimodal perception technology and self-
learning. Traditional electronic pets rely on fixed interaction modes, limiting their ability to provide
individualized emotional responses. The proposed system, structured into three layers—perception, decision,
and execution—uses various sensors to gather user emotions, environmental data, and preferences, then
applies LLM technologies like GPT-4 to generate adaptive feedback. The system's self-learning capability
continuously optimizes responses based on evolving user interactions. Virtual user samples were created to
simulate system decision-making, and the feedback was evaluated across multiple dimensions. Results
showed superior emotional alignment, feedback diversity, and adaptability compared to unimodal and rule-
based models, highlighting the system's exceptional self-learning capabilities. This research underscores the
critical role of LLMs in multimodal emotion processing and self-learning, offering theoretical and technical
guidance for the use of intelligent electronic pets in emotional support and companionship applications.
1 INTRODUCTION
Smart electronic pets, or companion robots, have
gained significant attention for their emotional
support capabilities, especially in healthcare
applications such as aiding mental health patients,
supporting the elderly, and facilitating emotional
healing (Nimmagadda, Arora, & Martin, 2022).
However, traditional systems often rely on fixed
interaction modes, limiting their adaptability to users'
emotional changes and hindering the establishment of
long-lasting emotional connections with users. This
challenge, coupled with the lack of research on the
evolution of emotional connections over time in
human-computer interactions, impedes the
development of deeper emotional bonds between
smart electronic pets and their users (Kumar et al.,
2024).
Recent studies in artificial intelligence and
robotics focus on enhancing emotion recognition
technologies, including facial expression recognition
and speech emotion analysis. Despite advances, these
technologies face challenges regarding real-time
responsiveness and the accuracy of emotional
feedback, which are critical for adjusting robot
behavior and refining personalized emotion models
(Spezialetti, Placidi, & Rossi, 2020). To address these
issues, research is increasingly emphasizing
multimodal perception technology and self-learning
mechanisms to improve real-time responsiveness,
accuracy, and adaptability, allowing electronic pets to
interact more naturally and build deeper emotional
connections with users (Ramaswamy &
Palaniswamy, 2024).
The emotional perception capabilities of
intelligent electronic pets have significantly
improved with advancements in multimodal sensing
technology. Integrating facial expression recognition,
voice analysis, tactile perception, and EEG signals
with multimodal fusion strategies enhances the
system's ability to capture user emotions and respond
more effectively (Ramaswamy & Palaniswamy,
2024; Tuncer et al., 2022). Additionally, the
incorporation of self-learning and artificial
intelligence enables these systems to adapt their
behavior over time, providing personalized emotional
feedback and fostering stronger emotional bonds
between pets and users (Shenoy et al., 2022; Yang et
al., 2020). The development of large language models
(LLMs) has further enhanced the ability of intelligent
systems to understand and manage complex
emotions, facilitating improved interaction scenarios
through natural language processing and generation
technology (Chandraumakantham et al., 2024). These
systems can dynamically adapt to users' emotional
states and anticipations, presenting new possibilities
for affective computing (Jiang et al., 2025).
This study aims to develop an intelligent
electronic pet interaction system that leverages
emotional feedback and a self-learning mechanism.
By integrating multimodal emotion perception
technology and using LLM as the core decision
engine, the system can perceive users' emotional
states, environmental changes, and evolving
preferences in real time. Through continuous self-
learning, the system adapts the pet's behavior to
provide personalized emotional feedback, enhancing
the user experience and fostering stronger emotional
connections with users.
Simulation experiments using public multimodal
datasets are conducted to validate the multimodal
perception and feedback effects of the decision
engine based on LLM. The AffectNet dataset is used
here to classify several basic emotions and obtain
facial expression information. The IEMOCAP dataset
is also used to obtain information on the
correspondence between speech, expression, and
emotion. In addition, publicly available data are used
to define the categories of environmental factors and
user preferences. These experiments assess the
model's ability to understand user preferences and
generate personalized feedback, and evaluate its
adaptability, feedback accuracy, and self-learning
potential in a non-real-time, offline environment.
Currently, smart pet designs largely focus on
improving technical aspects such as emotion
recognition accuracy and user interaction. The
primary goal of these advancements is to enhance
user experience, strengthen emotional connections,
and enable the device to empathize with the user
(Abdollahi et al., 2023). This research aims to provide
a theoretical foundation and technical support for the
development of emotionally supportive devices. It
encourages the use of smart electronic pets in family
companionship, elderly care, and emotional therapy
while advocating for the integration of multimodal
perception and LLM in emotional computing.
2 METHODOLOGY
2.1 Overall System Architecture
The intelligent electronic pet interaction system
utilizes a layered architecture for emotion-driven
dynamic interaction (Figure 1). It consists of three
layers: the perception layer integrates user emotions,
environmental factors, and preferences, using
multimodal sensors and deep learning models to
analyze real-time data; the decision layer, powered by
LLM, merges multimodal inputs to generate
personalized feedback; and the execution layer
produces pet movements and voice feedback through
drive modules like servos and speakers, while
collecting user behavior data for model optimization.
The data flow involves extracting features from
the perception layer and structuring them into
prompts (e.g., {emotion: Happy, environment: sunny
and warm, user-pref-style: gentle, user-pref-tone:
encouraging}), which are fed into the LLM. The LLM
generates feedback instructions (e.g., {[Action]: The
pet jumps up in a playful manner, wagging its tail
excitedly. [Speech]: “Woof woof! That's fantastic
news, friend! I'm so proud of you and I can't wait to
celebrate together!”}), guiding the execution layer.
Feedback is adjusted dynamically through
reinforcement learning, with preference models
undergoing closed-loop optimization.
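To make this perception-to-decision data flow concrete, the following minimal Python sketch illustrates how perception-layer features could be packed into a structured prompt and how a decision-layer reply might be split into action and speech instructions. The field names and the fallback behavior are illustrative assumptions rather than the deployed implementation.

import json

def build_prompt(emotion, environment, pref_style, pref_tone):
    # Pack perception-layer outputs into the structured prompt fed to the LLM.
    state = {
        "emotion": emotion,
        "environment": environment,
        "user-pref-style": pref_style,
        "user-pref-tone": pref_tone,
    }
    return ("You are the emotional brain of a robotic pet. Given the state below, "
            "reply as JSON with the keys 'Action' and 'Speech'.\n" + json.dumps(state))

def parse_feedback(llm_reply):
    # Split the LLM reply into an action instruction and a speech line for the
    # execution layer; fall back to a neutral behavior if the reply is malformed.
    try:
        reply = json.loads(llm_reply)
        return reply["Action"], reply["Speech"]
    except (json.JSONDecodeError, KeyError):
        return "The pet tilts its head attentively.", "Woof?"

prompt = build_prompt("Happy", "sunny and warm", "gentle", "encouraging")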
2.2 Perception Layer
2.2.1 User Emotion Perception Module
This module enhances emotion recognition by
integrating visual and auditory signals (Figure 1). A
1080P camera captures facial images at 15 FPS, and a
ResNet-50 model trained on the AffectNet dataset
classifies seven basic emotions: Happy, Sad, Anger,
Normal, Disgust, Fear, and Surprise. For auditory
perception, a directional microphone captures sound,
while emotion intensity is assessed through speech-
to-text and voiceprint analysis. In case of a mismatch
between visual and auditory signals, voice features
are prioritized and flagged for verification. The
module outputs structured emotion labels, expression
features, and speech text for LLM decision-making.
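A minimal sketch of the visual-auditory fusion rule described above is given below; the confidence values and output dictionary are assumptions for illustration, not the module's actual interface.

def fuse_emotion(visual_label, visual_conf, audio_label, audio_conf):
    # Combine the facial-expression and speech-based emotion estimates.
    # On a mismatch, voice features are prioritized and the result is
    # flagged for verification, mirroring the rule stated above.
    if visual_label == audio_label:
        return {"emotion": visual_label,
                "confidence": max(visual_conf, audio_conf),
                "needs_verification": False}
    return {"emotion": audio_label,
            "confidence": audio_conf,
            "needs_verification": True}

# Example: the camera suggests Normal but the voice suggests Happy -> trust the voice.
label = fuse_emotion("Normal", 0.58, "Happy", 0.74)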
2.2.2 Environmental Factors Perception
Module
Environmental perception adjusts pet behavior to
physical environments using sensor networks and
visual scene analysis (Figure 1). Sensors monitor
temperature, humidity, light intensity, and noise
levels, while YOLOv5 is used to recognize objects,
like an "umbrella" for a rainy-day response. Semantic
segmentation determines spatial layouts such as
"crowded" or "open". Data from sensors and visual
semantics are fused, creating environmental labels
like {environment: Cold and rainy} for LLM
decision-making.
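The fusion of sensor readings and visual semantics into an environmental label can be sketched as follows; the thresholds and descriptor names are illustrative assumptions rather than the system's actual rules.

def build_environment_label(temp_c, humidity, lux, noise_db, detected_objects):
    # Map raw sensor readings and YOLOv5 detections onto a coarse
    # environment label for the LLM prompt (thresholds are illustrative).
    descriptors = ["Cold" if temp_c < 15 else "Hot" if temp_c > 28 else "Warm"]
    if "umbrella" in detected_objects or humidity > 85:
        descriptors.append("rainy")
    elif lux > 10000:
        descriptors.append("sunny")
    if noise_db > 70:
        descriptors.append("noisy")
    return {"environment": " and ".join(descriptors)}

# e.g. returns {'environment': 'Cold and rainy'}
label = build_environment_label(temp_c=9, humidity=90, lux=800,
                                noise_db=45, detected_objects={"umbrella"})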
Figure 1: Framework of the intelligent electronic pet interaction system (Picture credit: Original).
2.2.3 User Preference Module
This module forms the basis for self-learning (Figure
1). Initially, the system interacts with the user using a
default template (e.g., 'gentle and encouraging') and
utilizes few-shot prompts to guide the LLM in
generating preliminary feedback. After each
interaction, the system records explicit ratings and
implicit behaviors. A reward function improves
learning, and a vector database stores interaction
fragments for future reference. The system refines
response styles based on past interactions, creating a
personalized strategy library by clustering user
behavior patterns over time.
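As a sketch of how explicit ratings and implicit behaviors could be folded into the reward function, consider the following; the behavioral cues, weights, and scoring scale are assumptions for illustration only.

def interaction_reward(explicit_rating, petting_duration_s, user_smiled, user_walked_away):
    # Blend an explicit 1-5 rating with implicit behavioral cues into a scalar
    # reward used to reinforce or discourage the most recent response style.
    reward = (explicit_rating - 3) / 2.0             # map 1..5 onto -1..1
    reward += 0.05 * min(petting_duration_s, 10)     # touching the pet is positive
    reward += 0.3 if user_smiled else 0.0
    reward -= 0.5 if user_walked_away else 0.0
    return max(-1.0, min(1.0, reward))

# Each scored interaction fragment would then be embedded and stored in a
# vector database for later retrieval and clustering of behavior patterns.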
2.3 Decision Layer
The LLM integrates data from the user's emotions,
environmental factors, and preference modules to
make emotional decisions and provide feedback
(Figure 1). The LLM serves as the system's
"emotional brain," transforming multimodal data into
structured prompts, such as {emotion: Happy,
environment: sunny and warm, user-pref-style:
gentle, user-pref-tone: encouraging}. Prompt
engineering ensures valid responses, while multi-
round memory preserves conversation history for
consistency. A temperature coefficient controls
randomness in feedback generation. The action
library defines 50 fundamental behaviors, which
the LLM combines to create compound actions or
generate more complex actions independently.
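A minimal sketch of the decision call, assuming the OpenAI chat-completions client, is shown below; the system prompt, memory handling, and default temperature are illustrative rather than the exact production configuration.

from openai import OpenAI  # assumes the OpenAI chat-completions client is available

client = OpenAI()
history = [{"role": "system",
            "content": "You are a robotic pet. Answer as JSON with 'Action' "
                       "(one or more of the 50 library behaviors, possibly combined) "
                       "and 'Speech'."}]

def decide(structured_prompt, temperature=0.7):
    # Append the new multimodal state, keep multi-round memory for consistency,
    # and sample a reply whose randomness is set by the temperature coefficient.
    history.append({"role": "user", "content": structured_prompt})
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=history,
        temperature=temperature,
    ).choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply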
2.4 Execution Layer
The execution layer translates JSON instructions
from the decision layer into physical actions. It
includes the multimodal output module and real-time
feedback acquisition module (Figure 2). Pet
movements are controlled by a servo system with 50
basic behaviors, using a PWM signal smoothing
algorithm.
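The text does not detail the PWM signal smoothing algorithm; the sketch below assumes simple exponential smoothing of servo set-points as one plausible realization.

def smooth_servo_angles(current_deg, target_deg, alpha=0.2):
    # Exponentially smooth servo set-points so that compound actions from the
    # decision layer translate into fluid motion before being converted into
    # PWM duty cycles (the smoothing factor alpha is illustrative).
    return [c + alpha * (t - c) for c, t in zip(current_deg, target_deg)]

# One control tick: move three joints a fraction of the way toward the target pose.
pose = smooth_servo_angles([90.0, 45.0, 120.0], [60.0, 80.0, 100.0])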
Figure 2: Framework of the execution layer (Picture credit: Original).
The TTS engine converts text feedback into adaptive
speech based on user preferences. User
feedback is captured via cameras and microphones,
processed for reinforcement learning, and fed back to
the preference model for real-time optimization. This
layer forms a self-iterative loop of emotional
decision-making, behavior output, and feedback
optimization, ensuring authentic emotional
expression and adaptability.
3 EXPERIMENTS
3.1 Verification of the Multimodal
Processing and Feedback
Capabilities of the LLM Decision
Engine
Given the maturity and accuracy of existing emotion
perception technology and environmental perception
technology, this simulation experiment assumes that
the three perception modules have already produced
their outputs. It generates user samples labeled with
these outputs, which are processed by the LLM decision engine to
simulate system functionality and validate the
multimodal approach's superiority over alternative
models.
User samples are created using the AffectNet and
IEMOCAP datasets, which include seven basic
emotions: Happy, Sad, Anger, Normal, Disgust, Fear,
and Surprise. Each emotion is paired with
corresponding facial expressions and speech texts,
forming user emotion labels. For example, {Happy,
"smiling face, bright eyes", "I got the job! I'm so
excited!"}. Five environmental factors (e.g., Sunny
and warm, Cold and rainy) and five user preferences
(e.g., gentle and encouraging, playful and cheerful)
are also selected as labels. A complete sample
consists of emotion, facial description, speech text,
environmental factors, and user preferences (e.g.,
{emotion: Happy, facial_desc: “smiling face, bright
eyes”, speech_text: “I got the job! I'm so excited!”,
environment: sunny and warm, user_pref_style:
gentle, user_pref_tone: encouraging}) (Figure 3).
Figure 3: Schematic diagram of virtual user sample settings (Picture credit: Original).
Table 1: Virtual user sample grouping.
Category    Variable       Quantity
Group 1     Emotion        7
Group 2     Environment    5
Group 3     Preference     5
Group 4     Random         18
The samples were categorized into four groups for
analysis: one with only the emotion label changed (7
groups); one with only the environment label changed
(5 groups); one with only the user preference label
changed (5 groups); and a random combination of
three labels (18 groups) ensuring coverage of all
labels (Table 1). This analysis evaluates the system's
ability to adapt to emotional changes, environmental
shifts, user preferences, and overall performance.
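The construction of the four sample groups can be sketched as follows; only a few environment and preference labels are named in the text, so the remaining label values here are placeholders.

import itertools, random

EMOTIONS = ["Happy", "Sad", "Anger", "Normal", "Disgust", "Fear", "Surprise"]
ENVIRONMENTS = ["Sunny and warm", "Cold and rainy",                      # named in the text
                "Hot and humid", "Quiet and dim", "Noisy and crowded"]   # placeholders
PREFERENCES = ["gentle and encouraging", "playful and cheerful",         # named in the text
               "brief and calm", "formal and polite", "warm and talkative"]  # placeholders

BASE = {"emotion": "Happy", "environment": ENVIRONMENTS[0], "preference": PREFERENCES[0]}

def vary(key, values):
    # Hold the other two labels fixed and sweep a single label (Groups 1-3).
    return [{**BASE, key: v} for v in values]

group1 = vary("emotion", EMOTIONS)           # 7 samples
group2 = vary("environment", ENVIRONMENTS)   # 5 samples
group3 = vary("preference", PREFERENCES)     # 5 samples
# Group 4: 18 random label combinations; a real run would additionally check
# that every label value appears at least once across the drawn combinations.
group4 = random.sample(list(itertools.product(EMOTIONS, ENVIRONMENTS, PREFERENCES)), 18)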
Figure 4: Experimental flow chart for verification of the multimodal processing and feedback capabilities of the LLM decision
engine (Picture credit: Original).
This study employs a controlled comparison
experiment (Figure 4). The experiment uses
three models: (A) multimodal + GPT, (B) unimodal
(emotion-only) + GPT, and (C) preset rule model.
Virtual user samples are processed by these models to
evaluate feedback performance. Sample data are
converted into structured prompts and sent to the
GPT-4 API, which returns behavioral (1 sentence) and
voice (1-2 sentences) feedback. Feedback is collected for
analysis, with a temperature coefficient of 0.7 and a
150-character limit to control randomness and length.
Results compare the models across four
dimensions: content quality, including high-
frequency words, emotional word density, and tone
style matching; behavior strategy, including action
diversity statistics and response suitability analysis;
language structure, including BLEU score,
vocabulary richness, average sentence length, and
number of emotional words; and qualitative analysis,
including user ratings of feedback and evaluation of
typical cases. These comparisons demonstrate the
advantages of the multimodal + GPT model over the
other models, identify shortcomings, and explore
optimization methods.
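A minimal sketch of three of these quantitative measures (emotional word density, vocabulary richness, and BLEU-based template similarity) is shown below; it assumes the nltk package and an illustrative emotion lexicon, and is not the exact evaluation script.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

EMOTION_WORDS = {"joyful", "warm", "together", "snuggles", "proud", "soothing"}  # illustrative lexicon

def emotional_word_density(text):
    # Share of tokens that belong to the emotion lexicon.
    tokens = [t.strip('.,!?"') for t in text.lower().split()]
    return sum(t in EMOTION_WORDS for t in tokens) / max(len(tokens), 1)

def vocabulary_richness(text):
    # Type-token ratio as a simple measure of vocabulary diversity.
    tokens = text.lower().split()
    return len(set(tokens)) / max(len(tokens), 1)

def template_similarity(candidate, reference):
    # BLEU of a feedback sentence against a template (or a previous reply);
    # low scores indicate little reliance on fixed wording.
    smooth = SmoothingFunction().method1
    return sentence_bleu([reference.split()], candidate.split(), smoothing_function=smooth)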
3.2 Verification of User Understanding
and Personalized Adjustment of
LLM Under Self-learning
Mechanism
Smart pets can comprehend user preferences, adapt
feedback dynamically, and produce personalized
responses. This capability is crucial for fostering a
sense of equal emotional engagement with smart pets
and forming emotional bonds. This experiment
evaluates the system’s self-learning capability and
personalized adaptability using the multimodal +
GPT model through multiple rounds of interaction.
The experimental process (Figure 5) involves
selecting a fixed user sample combination, initially
without the user preference label. Subsequent rounds
incorporate feedback and prompt words to simulate
user feedback (e.g., {user_pref_style: brief,
user_pref_tone: calm}, with prompt words {The
behavior should be concise. Maintain a calm and brief
tone.}). By using historical context, GPT
autonomously learns and refines user preferences
over multiple rounds.
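This multi-round procedure can be sketched as the loop below, which withholds the preference label and instead injects the simulated feedback prompt after every round; call_llm is a stand-in for the GPT-4 call of Section 2.3, so the sketch shows only the history mechanics, not the real model behavior.

def call_llm(messages):
    # Stand-in for the GPT-4 call of Section 2.3; a real run would send
    # `messages` to the chat API and return the generated feedback text.
    return "[Action]: The pet gently nudges the user's hand. [Speech]: 'Well earned.'"

history = [{"role": "system", "content": "You are a robotic pet."}]
state = "{emotion: Happy, environment: sunny and warm}"   # preference label withheld
preference_hint = "The behavior should be concise. Maintain a calm and brief tone."
replies = []

for round_idx in range(1, 21):                             # 20 interaction rounds
    history.append({"role": "user", "content": state})
    reply = call_llm(history)
    history.append({"role": "assistant", "content": reply})
    # Simulated user feedback after each round steers later replies toward the
    # "brief and calm" preference without ever stating it as an explicit label.
    history.append({"role": "user", "content": preference_hint})
    replies.append(reply)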
The experiment consists of 20 rounds, with
feedback results analyzed across three key aspects:
action evolution, tracking convergence to the desired
style; speech optimization, including sentence length,
vocabulary diversity, and other linguistic aspects; and
BLEU score, assessing feedback content similarity.
Figure 5: Experimental flow chart for verification of user understanding and personalized adjustment of LLM under self-learning mechanism (Picture credit: Original).
Qualitative analysis evaluates system learning
efficiency, personalization, and error detection. This
experiment complements the multimodal study,
testing the system’s self-learning and user preference
adaptation capabilities, identifying areas for
improvement and optimization.
4 RESULTS AND DISCUSSIONS
4.1 Experimental Results and
Evaluation of the Multimodal
Processing and Feedback
Capabilities of the LLM Decision
Engine
The experimental results show that model A, utilizing
multimodal input combined with GPT, outperforms
unimodal model B and rule-based model C in
emotional fit, feedback diversity, and personalized
adaptability. Model A uses comforting words like
“together” (18 times), “snuggles” (10 times), “joyful”
(8 times), and “warm” (9 times). Its emotional word
density is 28%, higher than model B’s 19% and
model C’s 12% (Figure 6). Model A also excels in
emotional fit. For example, when the user expresses
fear, model A provides physical and verbal comfort,
enhancing emotional support authenticity, while
model C’s “It’s okay” seems robotic, and model B
lacks context adaptation (Figure 7). Model A tailors
responses based on the environment, showing a
“gentle+encouraging” tone, like “purring softly” in a
“Cold and rainy” setting. Model B occasionally
conflicts with style (e.g., “Woof woof” in formal
settings), while model C relies on fixed templates.
In terms of behavioral strategies, Model A offers 22
unique actions, including "hops," "cuddles," and
"wraps ears," while Models B and C support only 14
and 9 actions, respectively.
Figure 6: Model (A) (B) (C) High-frequency words and emotional word density statistics (Picture credit: Original).
Figure 7: Diagram of comparative analysis of typical feedback content of models (A) (B) (C) (Picture credit: Original).
Figure 8: Model (A) (B) (C) High-frequency words and emotional word density statistics (Picture credit: Original).
For instance, in the Anger scenario, Model A combines behavior and language
effectively, saying, "The pet gently nuzzles against
the user's leg, emitting a soothing purring sound,"
while Model B repeats, "The pet nudges the user's
hand." In the Fear scenario, Model A combines
physical and verbal comfort, whereas Model B only
“rubs against leg.” In the Disgust scenario, Model B’s
response, "The virtual pet grimaces and moves
away," could increase discomfort.
Regarding language structure, Model A shows
significantly higher vocabulary diversity compared to
B and C. The average BLEU score of 0.04 indicates a
low level of template reliance (Figure 8). Model A
generates content with an average sentence length of
12 words, each containing 2.5 emotional words,
demonstrating its ability to produce diverse and
emotionally engaging responses.
4.2 Experimental Results and
Evaluation of User Understanding
and Personalized Adjustment of
LLM under Self-learning
Mechanism
Multiple rounds of interaction experiments
demonstrate the system's ability to align with user
preferences through self-learning. For example,
feedback evolved from "Wow, that's fantastic! We
should celebrate!" to "Well earned" by the 15th round
in response to the "brief and calm" preference.
Behaviorally, the action "jumps up excitedly, tail
wagging at rapid pace" (40%) was gradually replaced
by "gently nudges the user's hand" (65%), reflecting
the preference for "simple interaction."
Figure 9: BLEU score statistics for each round of interactive feedback results compared with the previous round of feedback
results (Picture credit: Original).
BLEU score analysis shows that as the interaction
rounds increase, feedback similarity rises in a zigzag
manner, stabilizing at 0.08 after the 15th round,
indicating the model's gradual adaptation to the user
preference template (Figure 9). Vocabulary diversity
dropped from 0.72 in round 1 to 0.55 in round 20,
showing language style convergence. Excessive
repetition, like "Well done" appearing 8 times, may
reduce freshness, but in real environments the same
emotion and environment rarely persist for 15 or more
rounds, and changes in other factors will increase
diversity.
Figure 10: Diagram of self-learning process and key nodes
(Picture credit: Original).
Qualitative analysis revealed that emotional
intensity transitioned from "thrilled for you" to
"Proud of you" between the 2nd and 4th rounds. By
the 10th round, actions settled on "nudge/purr," with
language becoming a 2-3-word phrase, completing
the learning process (Figure 10). Feedback in later
rounds received a user rating of 8.5/10 for
naturalness, higher than the initial 6.2/10, though
users noted a lack of surprise.
The experiment also identified constraints in the
self-learning mechanism. When user preferences
change, the model requires 5-7 rounds to fully adjust,
suggesting the need for better long-term memory
optimization. Despite this, the system's personalized
adaptation across multiple rounds validates the
efficacy of its self-learning mechanism.
4.3 Discussions
The intelligent electronic pet system in this study
shows significant advantages in naturalness and
personalization of emotional interaction through
LLM-driven multimodal perception and self-
learning. Experiments validate its ability to integrate
user emotions, environment, and preferences,
outperforming unimodal and rule-based models in
emotional fit and behavior diversity. The self-
learning mechanism ensures rapid convergence to
user preferences within 10 interaction rounds.
However, the system faces key limitations:
multimodal fusion causes reasoning delays of up to
600ms, dynamic preference adjustments lag, and
repetitive feedback reduces interaction freshness.
To validate real-world scenarios, future research
should include long-term tests with diverse groups,
assess emotional connection efficiency in the elderly,
explore strategies for managing conflicting user
needs, and incorporate physiological signals such as
EEG for improved emotion recognition. Future
efforts should focus on expanding multimodal
dimensions, optimizing lightweight LLM
architecture for real-time performance, and
developing an emotional migration mechanism to
predict emotional trends.
This study highlights the potential of LLM in
affective computing while stressing the need to
address security risks and ethical concerns in
generated content. Advancing the integration of
affective computing and embodied intelligence is the
key future direction.
5 CONCLUSIONS
This study has successfully developed and validated
an intelligent electronic pet interaction system that
integrates multimodal perception and an LLM to
enhance emotional engagement and personalization.
The experimental results highlight the system's ability
to dynamically adjust to user needs, providing
personalized emotional feedback through a self-
learning mechanism. This mechanism allows the
system to adapt and refine its interactions based on
user preferences, improving both the naturalness and
personalization of the pet's responses. The
multimodal fusion approach driven by LLM
significantly enhances the system's situational
awareness, enabling it to respond effectively to a wide
range of emotional cues and environmental
conditions.
Additionally, the self-learning and historical
memory components allow the system to rapidly
converge toward user preferences, ensuring that
interactions become more attuned to the user's
emotional state over time. However, challenges
remain in optimizing real-time performance, reducing
reasoning delays, and improving the consistency of
feedback. Despite these challenges, the system
demonstrates a strong foundation for emotional
support interactions.
This study introduces innovative concepts for the
integration of emotional computing and embodied
intelligence, offering valuable insights into the
potential applications of intelligent electronic pets in
areas such as emotional therapy, elderly care, and
mental health support. The ultimate goal is to push the
boundaries of emotional computing, improving the
depth and authenticity of human-computer emotional
interaction.
REFERENCES
Abdollahi, H., Mahoor, M. H., Zandie, R., Siewierski, J., &
Qualls, S. H. (2022). Artificial emotional intelligence in
socially assistive robots for older adults: a pilot study.
IEEE Transactions on Affective Computing, 14(3),
2020-2032.
Kumar, C. O., Gowtham, N., Zakariah, M., & Almazyad, A.
(2024). Multimodal emotion recognition using feature
fusion: An LLM-based approach. IEEE Access.
Jiang, Y., Shao, S., Dai, Y., & Hirota, K. (2024, July). A
LLM-Based Robot Partner with Multi-modal Emotion
Recognition. In International Conference on Intelligent
Robotics and Applications (pp. 71-83). Singapore:
Springer Nature Singapore.
Yang, J., Wang, R., Guan, X., Hassan, M. M., Almogren,
A., & Alsanad, A. (2020). AI-enabled emotion-aware
robot: The fusion of smart clothing, edge clouds and
robotics. Future Generation Computer Systems, 102,
701-709.
Kumar, S. S., Apsal, M., Raishan, A. A., Jessy, R. M., &
Prasad, V. S. K. (2024). A systematic review of the
design and implementation of emotionally intelligent
companion robots. International Research Journal of
Engineering and Technology (IRJET), 11(9), 1–15.
Nimmagadda, R., Arora, K., & Martin, M. V. (2022).
Emotion recognition models for companion robots. The
Journal of Supercomputing, 78(11), 13710-13727.
Ramaswamy, M. P. A., & Palaniswamy, S. (2024).
Multimodal emotion recognition: A comprehensive
review, trends, and challenges. Wiley Interdisciplinary
Reviews: Data Mining and Knowledge Discovery,
14(6), e1563.
Shenoy, S., Jiang, Y., Lynch, T., Manuel, L. I., & Doryab,
A. (2022, August). A Self Learning System for Emotion
Awareness and Adaptation in Humanoid Robots. In
2022 31st IEEE International Conference on Robot and
Human Interactive Communication (RO-MAN) (pp.
912-919). IEEE.
Spezialetti, M., Placidi, G., & Rossi, S. (2020). Emotion
recognition for human-robot interaction: Recent
advances and future perspectives. Frontiers in Robotics
and AI, 7, 532279.
Tuncer, T., Dogan, S., Baygin, M., & Acharya, U. R. (2022).
Tetromino pattern based accurate EEG emotion
classification model. Artificial Intelligence in Medicine,
123, 102210.