Research on Intention Recognition and Security of Multimodal
Human-Computer Interaction System Based on Deep Learning
Ge Zhang
Faculty of Science and Engineering, University of Nottingham, Ningbo, Zhejiang, China
Keywords: Deep Learning, Single-Modal Interactive System, Multi-Modal Interactive System.
Abstract: With the development of science and technology in recent years, human-computer interaction systems have been evolving toward more natural, efficient, and secure designs, while interaction scenarios have become increasingly complex. However, traditional interaction systems that rely on a single modality (such as voice or vision) have gradually exposed problems such as low security, insufficient robustness, and incomplete understanding of user intent, and can no longer support the needs of modern, more immersive and intelligent interaction systems. This paper surveys typical applications of single modalities in human-computer interaction systems and analyzes the limitations and potential safety hazards they expose in complex interactive environments. On this basis, the paper further explores multimodal interaction based on deep learning, especially the key role of convolutional neural networks (CNNs) and recurrent neural networks (RNNs) in user intent recognition, to provide theoretical support and reference for building safer and more intelligent human-computer interaction systems.
1 INTRODUCTION
With the rapid development of information
technology, the way humans interact with computers
is constantly evolving, from traditional
keyboard-and-mouse input to touch control, voice
recognition, gesture recognition, and other more
natural interaction methods. In recent
years, the rise of technologies such as virtual reality
(VR), augmented reality (AR) and mixed reality
(MR) has further promoted the development of
human-computer interaction in an immersive and
natural direction.
As the volume and variety of data modalities grow,
multimodal interaction systems have emerged. A
multimodal human-computer interaction system
combines multiple input and output channels: input
can come not only from traditional keyboards and
mice but also from voice and face recognition, while
output can use traditional screens as well as
synthesized speech, facial-expression synthesis, and
gestures. The development of multimodality enables
a system to process data of different modalities at
the same time, improving interaction accuracy and
the user experience.
Although single-modality interaction has been
studied in depth, single-modality biometric systems
rely on only one source of information, a single
biometric trait, for measurement and verification.
These systems are susceptible to environmental
changes, sensor quality, and the user's physiological
state, all of which reduce recognition accuracy. In
addition, single-modality systems are vulnerable to
counterfeiting: when a fake fingerprint is presented
to the sensor, for example, they are open to
presentation attacks or spoofing (Benaliouche &
Touahria, 2014). In contrast, multimodal biometric
systems can effectively improve recognition
accuracy and resistance to attacks, and enhance
system security, by integrating multiple sources of
biometric information.
This paper explores the application of single
modalities in human-computer interaction systems
and examines the important role of multimodal
interaction in improving system performance and
security, focusing on the core contribution of
convolutional neural networks (CNNs) and recurrent
neural networks (RNNs) to user intention recognition.
This paper aims to provide theoretical support and
reference for technological innovation and
application in the field of human-computer
interaction, and to promote the joint advancement of
technology and practical applications in this field.
2 IDENTITY RECOGNITION
BASED ON SINGLE-MODAL
INFORMATION
Single-modal recognition mainly relies on biometric
authentication, which confirms an individual's
identity from physiological characteristics, such as
the iris and fingerprints, or behavioural
characteristics, such as voice and signature. Over the
past two decades, this technology has been widely
used in security, medical care, identity verification,
Internet security, and other fields.
Fingerprint recognition is one of the most common
and widely used technologies. Because fingerprints
are unique and can be matched accurately,
fingerprint recognition is often used by law
enforcement agencies; the earliest automatic
fingerprint recognition systems were built to confirm
the identities of suspects. Because the technology is
relatively mature and low-cost, and the equipment is
easy to install, it is also often used in community or
university security and in smart door locks (Pato &
Millett, 2010).
Facial recognition is a biometric technology that
uses facial images to identify or verify a person. It
distinguishes individuals through facial features
such as the position, size, and spatial relationships of
the eyes, nose, and ears, and it offers convenience,
speed, and contactless operation. The technology has
been widely used in payment, surveillance, security
inspection, and other scenarios (Pato & Millett,
2010).
Iris recognition is a biometric authentication
technology that analyses the texture pattern of the
human iris to identify a person. The iris is highly
unique and stable, making it one of the most reliable
biometric features known to date. The technology
has been deployed where high confidentiality is
required, such as military and medical facilities
(Pato & Millett, 2010).
Although single-modal biometric systems have
achieved remarkable results in accuracy and
convenience, they can be unreliable: because they
depend on a single feature, that feature is easily
degraded or contaminated, leaving the security
system unable to cope with certain threats. Problems
remain in security and environmental adaptability,
and such systems cannot meet the stability
requirements of future human-computer interaction
across multiple scenarios, devices, and users.
First, single-modal systems generally face
environmental sensitivity issues. For example, the
performance and accuracy of facial recognition
decrease significantly under strong light or backlight,
with occlusion or changes of expression, and as the
face ages. Fingerprint recognition may be affected
by dirt, scars, and the wearing of the finger's ridge
pattern. Iris recognition may suffer reduced
performance due to iris lesions.
Second, with recent technological developments,
attackers have built spoofing systems that can
impersonate these anatomical and behavioural
features, and unimodal systems are especially
vulnerable to such attacks because forging a single
biometric attribute is relatively simple. For example,
a fingerprint can be replicated by moulding a copy
of a finger's ridge pattern, and a person's face can be
forged using neural-texture algorithms or deepfake
techniques to fool facial recognition (Ashwini et al.,
2024).
3 MULTIMODAL HUMAN-
COMPUTER INTERACTION
BASED ON DEEP LEARNING
3.1 User Intention Recognition Based
on CNN
Given that unimodal interaction methods have
obvious deficiencies in both robustness and security,
this paper will systematically review the advantages
of multimodal interaction based on deep learning in
terms of user intent recognition and security.
In terms of user intent recognition, CNNs have a
strong ability to extract features from static and
dynamic images, such as facial expressions, and to
fuse them with other modalities (such as speech,
electromyographic signals, and head orientation).
For example, Lueangwitchajaroen et al. proposed a
multimodal fusion method based on EfficientNet-B7
whose fusion strategies can be roughly divided into
three categories. The first is early fusion, which
directly concatenates the raw features of multiple
modalities, such as RGB frames and optical-flow
images, as CNN input. The second is intermediate
fusion: after each modality (RGB frame and
optical-flow image) passes through the intermediate
layers of its CNN, the features are concatenated or
weighted-fused and then fed into a shared fully
connected layer or attention layer. The third is late
fusion, in which an independent CNN extracts
features for each modality and generates a separate
softmax output, and the combined outputs are passed
through a classification stage to make the final
decision. This strategy effectively combines the
spatial structure and appearance information
provided by the RGB frames (the original visual
information), the temporal dynamics captured by
optical flow computed from those frames (the
direction and speed of movement), and the structural
information about posture and action provided by
the 2D skeleton coordinates of human key points. It
greatly enhances the accuracy of human action
recognition and achieves high-precision
classification (Lueangwitchajaroen et al., 2024).
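To make the three strategies concrete, the following is a minimal PyTorch sketch; the layer sizes, channel counts, and class count are illustrative assumptions, not the EfficientNet-B7 configuration of the cited work.

```python
# Minimal sketches of early, intermediate, and late fusion for
# RGB + optical-flow action recognition. All sizes are assumed.
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Concatenate RGB (3 ch) and flow (2 ch) inputs channel-wise."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3 + 2, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, num_classes))
    def forward(self, rgb, flow):
        return self.backbone(torch.cat([rgb, flow], dim=1))

class IntermediateFusion(nn.Module):
    """Fuse mid-level feature maps from per-modality branches."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.rgb_branch = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.flow_branch = nn.Sequential(nn.Conv2d(2, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_classes))
    def forward(self, rgb, flow):
        feats = torch.cat([self.rgb_branch(rgb), self.flow_branch(flow)], dim=1)
        return self.head(feats)

def late_fusion(logits_rgb, logits_flow):
    """Average per-modality softmax outputs for the final decision."""
    return (logits_rgb.softmax(-1) + logits_flow.softmax(-1)) / 2
```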
In terms of security improvement, introducing a
multimodal fusion CNN architecture can significantly
improve the accuracy and security of biometric
recognition systems. For example, Alay and
Al-Baity (2020) designed independent CNN models
for iris, face, and finger-vein biometrics, performed
deep feature extraction, and used techniques such as
data augmentation to prevent overfitting and
improve generalization. The three modalities'
features were then fused through feature-level and
score-level fusion strategies. Experimental results
show that this method achieved recognition accuracy
of up to 100% on the SDUMLA-HMT dataset,
greatly improving the robustness and security of
biometric systems in identity authentication tasks
(Alay & Al-Baity, 2020).
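As a rough illustration of the difference between the two fusion levels, the sketch below assumes each modality's CNN already yields an embedding and a softmax score vector; the dimensions and weights are hypothetical, not those of the cited system.

```python
# Hedged sketch contrasting feature-level and score-level fusion
# for three biometric modalities. All sizes and weights are assumed.
import torch
import torch.nn as nn

embed_dim, num_ids = 128, 100                        # assumed sizes
joint_classifier = nn.Linear(3 * embed_dim, num_ids)

def feature_level_fusion(e_iris, e_face, e_vein):
    # Concatenate deep features, then classify the joint vector.
    return joint_classifier(torch.cat([e_iris, e_face, e_vein], dim=-1))

def score_level_fusion(p_iris, p_face, p_vein, w=(0.4, 0.3, 0.3)):
    # Weighted sum of per-modality match scores (softmax outputs).
    return w[0] * p_iris + w[1] * p_face + w[2] * p_vein

# Usage with random stand-in tensors for one probe sample:
e = [torch.randn(1, embed_dim) for _ in range(3)]
p = [torch.rand(1, num_ids).softmax(dim=-1) for _ in range(3)]
logits = feature_level_fusion(*e)
fused_scores = score_level_fusion(*p)
```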
3.2 User Intent Recognition Based on
RNN
A recurrent neural network (RNN) is a type of deep
learning model designed specifically for processing
sequential data. By carrying a hidden state across
time steps, an RNN retains memory of previous
inputs, which makes it well suited to applications
where order matters, such as natural language
processing and speech recognition. Hochreiter and
Schmidhuber later introduced the long short-term
memory (LSTM) network, which enables RNNs to
process long user-behaviour sequences, such as
action streams and spoken passages, solving the
long-range dependency problem.
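As a minimal illustration, the sketch below classifies a user-behaviour sequence with an LSTM in PyTorch; the feature dimension, hidden size, and number of intent classes are assumed values.

```python
# Minimal LSTM-based intent classifier over a behaviour sequence.
import torch
import torch.nn as nn

class IntentLSTM(nn.Module):
    def __init__(self, feat_dim=64, hidden=128, num_intents=8):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_intents)

    def forward(self, x):            # x: (batch, time, feat_dim)
        _, (h_n, _) = self.lstm(x)   # h_n: final hidden state
        return self.fc(h_n[-1])      # intent logits

model = IntentLSTM()
logits = model(torch.randn(4, 50, 64))  # 4 sequences of 50 steps
```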
In terms of user intent recognition, Zhang et al.
(2024) proposed an innovative model, the
Context-embedded Hypergraph Attention Network
(C-HAN), which models relationships between
contexts through a hypergraph structure and
combines this with a self-attention mechanism to
understand users' behavioural motivations and
intentions more comprehensively. The model
consists of two modules. The context-embedded
hypergraph attention module captures users' latent
intention aggregation within a sub-session or
behaviour cluster by modeling contextual coherence.
The self-attention module models the item sequence
and captures sequential dependencies between
session items, thereby tracking the user's behaviour
trajectory and intention shifts over time.
Experimental results show that the model
outperforms current mainstream methods on datasets
such as Diginetica, Tmall, and Nowplaying; on
Diginetica, C-HAN improves Precision@20 by
about 6.5% over a strong baseline, indicating a
significant improvement in user intent modeling and
recommendation accuracy (Zhang et al., 2024).
Although C-HAN does not integrate multimodal
features in the traditional sense (such as vision and
speech), it incorporates rich contextual information
(time, sequence, and window features) and a
sequence attention mechanism through the
hypergraph structure, which essentially embodies
the modeling idea of multimodal fusion and offers
important reference value for complex
human-computer interaction and user intent
modeling tasks.
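The self-attention half of such a session model can be sketched briefly. The snippet below uses PyTorch's nn.MultiheadAttention over assumed item embeddings; it only approximates C-HAN's sequence module, and the hypergraph component is omitted entirely.

```python
# Self-attention over session items to rank candidate next items.
# Sizes and the last-position readout are illustrative assumptions.
import torch
import torch.nn as nn

d_model, n_items = 64, 1000
item_emb = nn.Embedding(n_items, d_model)
attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

session = torch.randint(0, n_items, (1, 12))  # one session, 12 item IDs
x = item_emb(session)                         # (1, 12, d_model)
ctx, _ = attn(x, x, x)                        # sequential dependencies
session_repr = ctx[:, -1]                     # last position as intent query
scores = session_repr @ item_emb.weight.T     # scores over all items
```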
In terms of security improvement, the IoT-RNN
model proposed by Al Mudawi et al. (2025)
integrates multiple inertial, spatial, and
environmental sensing modalities (such as
accelerometer, gyroscope, GPS, and WiFi signals) to
build two RNN models, one for activity
classification and one for localization classification,
achieving collaborative modeling of human activity
recognition and localization. It effectively improves
the system's security monitoring capability in
complex environments and reduces the security risks
of false detection and misrecognition (Al Mudawi et
al., 2025). Selvaraj and Nithiyanantham (2025) use a
multi-scale Residual Attention Network (RAN) to
extract trimodal voice, fingerprint-image, and iris
features, apply an Enhanced Lichtenberg Algorithm
(ELA) for feature-level fusion, and then use a
Dilated Adaptive RNN (DARNN) for classification
and recognition. Their experiments show that the
recognition accuracy of the ELA-DARNN model
(96.01%) is significantly higher than that of
traditional SVM, CNN, CNN-AlexNet, and
Dil-ARNN models (87.94%-93.25%). This study
shows that applying RNNs to multimodal feature
fusion can greatly improve the anti-attack capability
and recognition accuracy of biometric systems,
thereby significantly improving overall system
security and robustness (Selvaraj &
Nithiyanantham, 2025).
4 LIMITATIONS AND FUTURE
PROSPECTS
4.1 Limitations
Although multimodal interaction has made
significant progress in security and robustness, it
still has many limitations. Because multiple types of
data must be collected, synchronization between
data streams is a challenge that places high demands
on sensor accuracy: the delay between a camera and
a microphone, for example, will disturb the joint
analysis of voice and face, as sketched below.
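One common mitigation is to timestamp every sample and pair each video frame with the nearest latency-corrected audio window. The NumPy sketch below uses synthetic timestamps and an assumed 45 ms microphone latency, not a specific sensor API.

```python
# Align video frames to audio windows by timestamp, correcting a
# known (assumed) audio latency before nearest-neighbour matching.
import numpy as np

video_ts = np.arange(0.0, 5.0, 1 / 30)         # 30 fps frame times (s)
audio_ts = np.arange(0.0, 5.0, 0.01) + 0.045   # audio windows, 45 ms late

def align(video_ts, audio_ts, offset=0.045, tolerance=0.02):
    """Pair each frame with the closest latency-corrected audio window."""
    corrected = audio_ts - offset
    idx = np.searchsorted(corrected, video_ts)
    idx = np.clip(idx, 1, len(corrected) - 1)
    # Choose the nearer of the two neighbouring candidates.
    left_closer = (video_ts - corrected[idx - 1]) < (corrected[idx] - video_ts)
    nearest = np.where(left_closer, idx - 1, idx)
    ok = np.abs(corrected[nearest] - video_ts) <= tolerance
    return nearest, ok

pairs, valid = align(video_ts, audio_ts)  # indices + in-tolerance mask
```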
Second, the fusion of different modalities is itself a
challenge. Modalities are heterogeneous: their data
differ in type, temporal synchronization, spatial
resolution, and so on, which increases the
complexity of fusion (Azofeifa et al., 2022). At the
same time, because multimodal systems process
more data types and larger data volumes, they
consume more resources to complete more
computation; how to balance real-time performance
against resource consumption is another key issue
(Azofeifa et al., 2022).
4.2 Future Prospects
With the rapid development of technologies such as
VR and AR, interactive systems built on platforms
such as Unity may become a promising direction.
As a high-performance, cross-platform 3D engine,
Unity has unique advantages for multimodal fusion:
it supports real-time 3D rendering and can connect
seamlessly to a variety of sensors and data streams,
making it well suited to secure multimodal
human-computer interaction systems. For example,
head-mounted cameras can capture the user's gaze in
real time to analyze attention and optimize content
presentation and interface layout; handheld
controllers or sensors can recognize hand
movements and object manipulation; and
microphones can interpret the user's speech to
enable voice control (Tang et al., 2024). Multimodal
fusion mechanisms built on Unity have shown
significant potential for identifying user intentions,
enhancing security, and improving interaction
efficiency, and they are expected to play an
important role in entertainment, games, virtual
teaching, medical rehabilitation, and other scenarios.
5 CONCLUSIONS
In the context of the rapid development of
human-computer interaction technology, this paper
has analyzed the limitations of traditional
single-modal interaction methods in the face of
diverse needs, especially regarding robustness and
security. It has also examined the advantages of
deep-learning-based multimodal interaction systems
in the core task of user intent recognition, showing
their significant progress in multimodal fusion and
in improving the naturalness and security of
interaction.
In addition, the paper has shown that multimodal
interaction still faces great challenges in
synchronizing data across modalities, handling the
heterogeneity of modal fusion, and balancing
resource consumption against real-time performance.
To meet these challenges, the paper proposes, as a
research direction, a multimodal human-computer
interaction platform built on high-performance 3D
engines such as Unity, in the hope that it can play a
key role in important future scenarios such as
education and training and medical rehabilitation.
REFERENCES
Alay, N., & Al-Baity, H. H. (2020). Deep Learning
Approach for Multimodal Biometric Recognition
System Based on Fusion of Iris, Face, and Finger Vein
Traits. Sensors, 20(19), 5523.
Al Mudawi, N., Azmat, U., Alazeb, A., Alhasson, H. F.,
Alabdullah, B., Rahman, H., Liu, H., & Jalal, A. (2025).
IoT powered RNN for improved human activity
recognition with enhanced localization and
classification. Scientific Reports, 15(1), 10328.
Ashwini, K., Keshava Murthy, G. N., Raviraja, S., &
Srinidhi, G. A. (2024). A novel multimodal biometric
person authentication system based on ECG and iris data.
BioMed Research International, 2024(1), 8112209.
Azofeifa, J. D., Noguez, J., Ruiz, S., Molina-Espinosa, J.
M., Magana, A. J., & Benes, B. (2022, February).
Systematic review of multimodal human–computer
interaction. In Informatics (Vol. 9, No. 1, p. 13). MDPI.
Benaliouche, H., & Touahria, M. (2014). Comparative
study of multimodal biometric recognition by fusion of
iris and fingerprint. The Scientific World Journal,
2014(1), 829369.
Lueangwitchajaroen, P., Watcharapinchai, S., Tepsan, W.,
& Sooksatra, S. (2024). Multi-Level Feature Fusion in
CNN-Based Human Action Recognition: A Case Study
on EfficientNet-B7. Journal of Imaging, 10(12), 320.
Pato, J. N., & Millett, L. I. (Eds.). (2010). Biometric
recognition: Challenges and opportunities. The
National Academies Press.
Selvaraj, U., & Nithiyanantham, J. (2025). Security-aware
user authentication based on multimodal biometric data
using dilated adaptive RNN with optimal weighted
feature fusion. Network (Bristol, England), 1–41.
Tang, J., Gong, M., Jiang, S., Dong, Y., & Gao, T. (2024).
Multimodal human-computer interaction for virtual
reality. Applied and Computational Engineering, 42,
201-207.
Zhang, Z., Zhang, H., Zhang, Z., & Wang, B. (2024).
Context-embedded hypergraph attention network and
self-attention for session recommendation. Scientific
Reports, 14(1), 19413.