as RGB frames and optical flow images as CNN input; the second is intermediate fusion, in which, after each modality (RGB frame and optical flow image) passes through the intermediate layers of the CNN, the resulting features are concatenated or weighted-fused and then fed into a shared fully connected layer or attention layer; the third is late fusion, in which an independent CNN extracts features for each modality and produces a separate softmax output, and the combined outputs are then passed through the classification system to make the final decision. This fusion strategy effectively combines the spatial structure and appearance information provided by the RGB frames (the original visual information), the temporal dynamics provided by the optical flow, which captures the direction and speed of movement, and the structural information about posture and action provided by the human key points of the 2D skeleton coordinates. Together, these greatly enhance the accuracy of the human action recognition system and yield high-precision classification output (Lueangwitchajaroen et al., 2024).
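To make the three strategies concrete, the following is a minimal late-fusion sketch in PyTorch: each modality stream (RGB, optical flow, and a rasterized 2D-skeleton map) gets its own small CNN and softmax, and the class scores are averaged. All layer sizes, input shapes, and the equal-weight averaging are illustrative assumptions, not the architecture of the cited work.

```python
# Minimal sketch of late fusion for three streams (RGB, optical flow,
# 2D skeleton). Module sizes and the score averaging are illustrative
# assumptions, not the architecture from the cited paper.
import torch
import torch.nn as nn

class StreamCNN(nn.Module):
    """A small per-modality CNN that ends in class logits."""
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling -> (N, 32, 1, 1)
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):
        h = self.features(x).flatten(1)
        return self.classifier(h)  # per-stream logits

class LateFusion(nn.Module):
    """Each stream produces its own softmax; the scores are combined."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.rgb = StreamCNN(3, num_classes)       # RGB frame
        self.flow = StreamCNN(2, num_classes)      # x/y optical flow
        self.skeleton = StreamCNN(1, num_classes)  # rasterized 2D key points

    def forward(self, rgb, flow, skel):
        probs = [
            torch.softmax(self.rgb(rgb), dim=1),
            torch.softmax(self.flow(flow), dim=1),
            torch.softmax(self.skeleton(skel), dim=1),
        ]
        return torch.stack(probs).mean(dim=0)  # simple score averaging

model = LateFusion()
out = model(torch.randn(4, 3, 64, 64),   # RGB frames
            torch.randn(4, 2, 64, 64),   # optical flow
            torch.randn(4, 1, 64, 64))   # skeleton heat maps
print(out.shape)  # torch.Size([4, 10])
```

Intermediate fusion would instead concatenate the pooled features before a shared classifier; early fusion would stack the modalities along the channel dimension of a single CNN input.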
In terms of security improvement, introducing a multimodal fusion CNN architecture can significantly improve the accuracy and security of biometric recognition systems. For example, Alay et al. designed independent CNN models for iris, face, and finger-vein biometrics to perform deep feature extraction, using techniques such as data augmentation to prevent overfitting and improve generalization. The three modal features were then fused through feature-level and score-level fusion strategies. Experimental results show that this method achieved a recognition accuracy of up to 100% on the SDUMLA-HMT dataset, greatly improving the robustness and security of biometric recognition systems in identity authentication tasks (Alay et al., 2020).
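The difference between the two fusion levels can be sketched in a few lines: feature-level fusion concatenates the deep features before matching, while score-level fusion matches each modality separately and combines the resulting scores. The cosine matcher, feature dimensions, and weights below are assumptions for illustration, not the exact pipeline of Alay et al.

```python
# Minimal sketch of feature-level vs. score-level fusion for three
# biometric modalities. Feature dims, weights, and the cosine matcher
# are illustrative assumptions.
import numpy as np

def cosine_score(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

modalities = ("iris", "face", "finger_vein")
# Deep features extracted by three independent CNNs (assumed 256-d).
probe = {m: np.random.randn(256) for m in modalities}
gallery = {m: np.random.randn(256) for m in modalities}

# Feature-level fusion: concatenate modality features, then match once.
feature_fused_score = cosine_score(
    np.concatenate([probe[m] for m in modalities]),
    np.concatenate([gallery[m] for m in modalities]),
)

# Score-level fusion: match per modality, then combine weighted scores.
weights = {"iris": 0.4, "face": 0.3, "finger_vein": 0.3}  # assumed
score_fused = sum(weights[m] * cosine_score(probe[m], gallery[m])
                  for m in modalities)

accept = score_fused > 0.5  # decision threshold is application-specific
```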
3.2 User Intent Recognition Based on
RNN
The Recurrent Neural Network (RNN) is a type of deep learning model specifically designed for processing sequential data. An RNN maintains a memory of previous inputs through its hidden state as it processes a sequence, which makes it well suited to applications where order matters, such as natural language processing and speech recognition. Building on this, Hochreiter and Schmidhuber introduced the long short-term memory (LSTM) network, which enables RNNs to process long user behavior sequences, such as action streams and speech passages, by alleviating the long-range dependency problem.
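As a concrete illustration, the following is a minimal LSTM intent classifier over a sequence of discrete user actions; the vocabulary size, dimensions, and intent classes are assumed for the example and do not come from any cited system.

```python
# Minimal sketch of an LSTM classifier over a user behavior sequence.
# Action vocabulary, dimensions, and intent classes are assumptions.
import torch
import torch.nn as nn

class IntentLSTM(nn.Module):
    def __init__(self, num_actions=500, embed_dim=64,
                 hidden_dim=128, num_intents=8):
        super().__init__()
        self.embed = nn.Embedding(num_actions, embed_dim)
        # The LSTM's gated cell state carries long-range context,
        # mitigating the vanishing-gradient problem of plain RNNs.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_intents)

    def forward(self, action_ids):        # (batch, seq_len) of action IDs
        _, (h_n, _) = self.lstm(self.embed(action_ids))
        return self.head(h_n[-1])         # intent logits from final state

model = IntentLSTM()
logits = model(torch.randint(0, 500, (4, 20)))  # 4 sequences of 20 actions
```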
In terms of user intent recognition, Zhigao et al. proposed an innovative model, the Context-embedded Hypergraph Attention Network (C-HAN), which models the relationships among contexts through a hypergraph structure and combines this with a self-attention mechanism to understand the user's behavioral motivations and intentions more comprehensively. The model consists of two modules: a context-embedded hypergraph attention module and a self-attention module. The former captures the latent intentions that users aggregate within a sub-session or behavior cluster by modeling contextual coherence; the latter models the item sequence and captures the sequential dependencies between session items, thereby tracking the user's behavior trajectory and intent shifts along the time dimension. Experimental results show that the model outperforms current mainstream methods on datasets such as Diginetica, Tmall, and Nowplaying. On Diginetica, C-HAN improves Precision@20 by about 6.5% over a strong baseline, indicating a significant gain in user intent modeling and recommendation accuracy (Zhigao et al., 2024). Although C-HAN does not integrate multimodal features in the traditional sense (such as vision or speech), it incorporates rich contextual information (time, sequence, window features, etc.) through its hypergraph structure and sequence attention mechanism, which essentially embodies the modeling idea of multimodal fusion and offers a valuable reference for complex human-computer interaction and user intent modeling tasks.
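The sequential half of this idea can be sketched as self-attention over the items of a single session, as below; the dimensions, single attention head, and last-position readout are assumptions, and the hypergraph module of C-HAN is omitted entirely.

```python
# Minimal sketch of the sequential module of a C-HAN-style model:
# self-attention over session items. Dimensions and the single-head
# setup are assumptions; the hypergraph attention module is omitted.
import torch
import torch.nn as nn

class SessionSelfAttention(nn.Module):
    def __init__(self, num_items=10000, dim=64, max_len=256):
        super().__init__()
        self.item_embed = nn.Embedding(num_items, dim)
        self.pos_embed = nn.Embedding(max_len, dim)  # positional information
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.project = nn.Linear(dim, dim)

    def forward(self, item_ids):                 # (batch, session_len)
        pos = torch.arange(item_ids.size(1), device=item_ids.device)
        x = self.item_embed(item_ids) + self.pos_embed(pos)
        # Each item attends to the others, capturing sequential
        # dependencies between session items.
        h, _ = self.attn(x, x, x)
        session_repr = h[:, -1]                   # last position as summary
        # Score all candidate items against the session representation.
        return self.project(session_repr) @ self.item_embed.weight.T

model = SessionSelfAttention()
scores = model(torch.randint(0, 10000, (2, 12)))  # (2, 10000) item scores
```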
In terms of security improvement, the IoT-RNN model proposed by Mudawi et al. integrates multiple inertial, spatial, and environmental sensing modalities (such as accelerometer, gyroscope, GPS, and WiFi signals) and builds two RNN models, one for activity classification and one for localization classification, achieving collaborative modeling of human activity recognition and positioning. This effectively improves the system's security monitoring capability in complex environments and reduces the security risks posed by false detection and misrecognition (Mudawi et al., 2025).
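A minimal sketch of this two-model setup follows: fused per-timestep sensor features feed two separate recurrent classifiers, one for activity and one for location. The GRU cells, feature layout, and class counts are assumptions rather than the paper's configuration.

```python
# Minimal sketch of the two-model IoT-RNN idea: fused sensor sequences
# feed separate recurrent classifiers for activity and location.
# Feature sizes and class counts are illustrative assumptions.
import torch
import torch.nn as nn

class SensorRNN(nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):          # x: (batch, time, features)
        _, h_n = self.rnn(x)
        return self.head(h_n[-1])  # logits from the final hidden state

# Assumed per-timestep layout: accel (3) + gyro (3) + GPS (2) + WiFi (4).
sensors = torch.randn(8, 50, 12)
activity_model = SensorRNN(12, 64, num_classes=6)  # e.g. walk, sit, ...
location_model = SensorRNN(12, 64, num_classes=5)  # e.g. indoor zones
activity_logits = activity_model(sensors)
location_logits = location_model(sensors)
```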
Selvaraj et al. likewise use a Multi-scale Residual Attention Network (RAN) to extract trimodal features from voice, fingerprint images, and the iris, apply the Enhanced Lichtenberg Algorithm (ELA) for feature-level fusion, and then use a Dilated Adaptive RNN (DARNN) for classification and recognition. The