as RGB frames and optical flow images as CNN input; the second is intermediate fusion, in which, after each modality (RGB frame and optical flow image) passes through the intermediate layers of the CNN, the resulting features are concatenated or weighted-fused and then fed into a shared fully connected layer or attention layer; the third is late fusion, in which an independent CNN extracts features for each modality and produces a separate softmax output, and the combined outputs are then passed through the classification system to make the final decision. This fusion strategy effectively combines the spatial structure and appearance information provided by the RGB frames (the original visual information), the temporal dynamics provided by the optical flow, which captures the direction and speed of movement, and the structural information about posture and action provided by the human key points of the 2D skeleton coordinates. Together, these greatly enhance the accuracy of the human action recognition system and yield high-precision classification output (Lueangwitchajaroen et al., 2024).
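To make the three strategies concrete, the following is a minimal late-fusion sketch in PyTorch: each modality stream (RGB, optical flow, and a rasterized 2D-skeleton map) gets its own small CNN and softmax, and the class scores are averaged. All layer sizes, input shapes, and the equal-weight averaging are illustrative assumptions, not the architecture of the cited work.

```python
# Minimal sketch of late fusion for three streams (RGB, optical flow,
# 2D skeleton). Module sizes and the score averaging are illustrative
# assumptions, not the architecture from the cited paper.
import torch
import torch.nn as nn

class StreamCNN(nn.Module):
    """A small per-modality CNN that ends in class logits."""
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling -> (N, 32, 1, 1)
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):
        h = self.features(x).flatten(1)
        return self.classifier(h)  # per-stream logits

class LateFusion(nn.Module):
    """Each stream produces its own softmax; the scores are combined."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.rgb = StreamCNN(3, num_classes)       # RGB frame
        self.flow = StreamCNN(2, num_classes)      # x/y optical flow
        self.skeleton = StreamCNN(1, num_classes)  # rasterized 2D key points

    def forward(self, rgb, flow, skel):
        probs = [
            torch.softmax(self.rgb(rgb), dim=1),
            torch.softmax(self.flow(flow), dim=1),
            torch.softmax(self.skeleton(skel), dim=1),
        ]
        return torch.stack(probs).mean(dim=0)  # simple score averaging

model = LateFusion()
out = model(torch.randn(4, 3, 64, 64),   # RGB frames
            torch.randn(4, 2, 64, 64),   # optical flow
            torch.randn(4, 1, 64, 64))   # skeleton heat maps
print(out.shape)  # torch.Size([4, 10])
```

Intermediate fusion would instead concatenate the pooled features before a shared classifier; early fusion would stack the modalities along the channel dimension of a single CNN input.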
In terms of security improvement, introducing a multimodal fusion CNN architecture can significantly improve the accuracy and security of biometric recognition systems. For example, Alay et al. designed independent CNN models for iris, face, and finger-vein biometrics to perform deep feature extraction, using techniques such as data augmentation to prevent overfitting and improve generalization. The three modal features were then fused through feature-level and score-level fusion strategies. Experimental results show that this method achieved a recognition accuracy of up to 100% on the SDUMLA-HMT dataset, greatly improving the robustness and security of biometric recognition systems in identity authentication tasks (Alay et al., 2020).
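The difference between the two fusion levels can be sketched in a few lines: feature-level fusion concatenates the deep features before matching, while score-level fusion matches each modality separately and combines the resulting scores. The cosine matcher, feature dimensions, and weights below are assumptions for illustration, not the exact pipeline of Alay et al.

```python
# Minimal sketch of feature-level vs. score-level fusion for three
# biometric modalities. Feature dims, weights, and the cosine matcher
# are illustrative assumptions.
import numpy as np

def cosine_score(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

modalities = ("iris", "face", "finger_vein")
# Deep features extracted by three independent CNNs (assumed 256-d).
probe = {m: np.random.randn(256) for m in modalities}
gallery = {m: np.random.randn(256) for m in modalities}

# Feature-level fusion: concatenate modality features, then match once.
feature_fused_score = cosine_score(
    np.concatenate([probe[m] for m in modalities]),
    np.concatenate([gallery[m] for m in modalities]),
)

# Score-level fusion: match per modality, then combine weighted scores.
weights = {"iris": 0.4, "face": 0.3, "finger_vein": 0.3}  # assumed
score_fused = sum(weights[m] * cosine_score(probe[m], gallery[m])
                  for m in modalities)

accept = score_fused > 0.5  # decision threshold is application-specific
```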
3.2 User Intent Recognition Based on
RNN
The Recurrent Neural Network (RNN) is a type of deep learning model specifically designed for processing sequential data. An RNN maintains a memory of previous inputs through its hidden state as it processes a sequence, which makes it well suited to applications where order matters, such as natural language processing and speech recognition. Building on this, Hochreiter and Schmidhuber introduced the long short-term memory (LSTM) network, which enables RNNs to process long user behavior sequences, such as action streams and speech passages, by alleviating the long-range dependency problem.
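As a concrete illustration, the following is a minimal LSTM intent classifier over a sequence of discrete user actions; the vocabulary size, dimensions, and intent classes are assumed for the example and do not come from any cited system.

```python
# Minimal sketch of an LSTM classifier over a user behavior sequence.
# Action vocabulary, dimensions, and intent classes are assumptions.
import torch
import torch.nn as nn

class IntentLSTM(nn.Module):
    def __init__(self, num_actions=500, embed_dim=64,
                 hidden_dim=128, num_intents=8):
        super().__init__()
        self.embed = nn.Embedding(num_actions, embed_dim)
        # The LSTM's gated cell state carries long-range context,
        # mitigating the vanishing-gradient problem of plain RNNs.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_intents)

    def forward(self, action_ids):        # (batch, seq_len) of action IDs
        _, (h_n, _) = self.lstm(self.embed(action_ids))
        return self.head(h_n[-1])         # intent logits from final state

model = IntentLSTM()
logits = model(torch.randint(0, 500, (4, 20)))  # 4 sequences of 20 actions
```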
In terms of user intent recognition, Zhigao et al. proposed an innovative model, the Context-embedded Hypergraph Attention Network (C-HAN), which models the relationships among contexts through a hypergraph structure and combines this with a self-attention mechanism to understand the user's behavioral motivations and intentions more comprehensively. The model consists of two modules: a context-embedded hypergraph attention module and a self-attention module. The former captures the latent intentions that users aggregate within a sub-session or behavior cluster by modeling contextual coherence; the latter models the item sequence and captures the sequential dependencies between session items, thereby tracking the user's behavior trajectory and intent shifts along the time dimension. Experimental results show that the model outperforms current mainstream methods on datasets such as Diginetica, Tmall, and Nowplaying. On Diginetica, C-HAN improves Precision@20 by about 6.5% over a strong baseline, indicating a significant gain in user intent modeling and recommendation accuracy (Zhigao et al., 2024). Although C-HAN does not integrate multimodal features in the traditional sense (such as vision or speech), it incorporates rich contextual information (time, sequence, window features, etc.) through its hypergraph structure and sequence attention mechanism, which essentially embodies the modeling idea of multimodal fusion and offers a valuable reference for complex human-computer interaction and user intent modeling tasks.
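The sequential half of this idea can be sketched as self-attention over the items of a single session, as below; the dimensions, single attention head, and last-position readout are assumptions, and the hypergraph module of C-HAN is omitted entirely.

```python
# Minimal sketch of the sequential module of a C-HAN-style model:
# self-attention over session items. Dimensions and the single-head
# setup are assumptions; the hypergraph attention module is omitted.
import torch
import torch.nn as nn

class SessionSelfAttention(nn.Module):
    def __init__(self, num_items=10000, dim=64, max_len=256):
        super().__init__()
        self.item_embed = nn.Embedding(num_items, dim)
        self.pos_embed = nn.Embedding(max_len, dim)  # positional information
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.project = nn.Linear(dim, dim)

    def forward(self, item_ids):                 # (batch, session_len)
        pos = torch.arange(item_ids.size(1), device=item_ids.device)
        x = self.item_embed(item_ids) + self.pos_embed(pos)
        # Each item attends to the others, capturing sequential
        # dependencies between session items.
        h, _ = self.attn(x, x, x)
        session_repr = h[:, -1]                   # last position as summary
        # Score all candidate items against the session representation.
        return self.project(session_repr) @ self.item_embed.weight.T

model = SessionSelfAttention()
scores = model(torch.randint(0, 10000, (2, 12)))  # (2, 10000) item scores
```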
In terms of security improvement, the IoT-RNN model proposed by Mudawi et al. integrates multiple inertial, spatial, and environmental sensing modalities (such as accelerometer, gyroscope, GPS, and WiFi signals) and builds two RNN models, one for activity classification and one for localization classification, achieving collaborative modeling of human activity recognition and positioning. This effectively improves the system's security monitoring capability in complex environments and reduces the security risks posed by false detection and misrecognition (Mudawi et al., 2025).
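A minimal sketch of this two-model setup follows: fused per-timestep sensor features feed two separate recurrent classifiers, one for activity and one for location. The GRU cells, feature layout, and class counts are assumptions rather than the paper's configuration.

```python
# Minimal sketch of the two-model IoT-RNN idea: fused sensor sequences
# feed separate recurrent classifiers for activity and location.
# Feature sizes and class counts are illustrative assumptions.
import torch
import torch.nn as nn

class SensorRNN(nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):          # x: (batch, time, features)
        _, h_n = self.rnn(x)
        return self.head(h_n[-1])  # logits from the final hidden state

# Assumed per-timestep layout: accel (3) + gyro (3) + GPS (2) + WiFi (4).
sensors = torch.randn(8, 50, 12)
activity_model = SensorRNN(12, 64, num_classes=6)  # e.g. walk, sit, ...
location_model = SensorRNN(12, 64, num_classes=5)  # e.g. indoor zones
activity_logits = activity_model(sensors)
location_logits = location_model(sensors)
```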
Selvaraj et al. likewise use a Multi-scale Residual Attention Network (RAN) to extract trimodal features from voice, fingerprint images, and the iris, apply the Enhanced Lichtenberg Algorithm (ELA) for feature-level fusion, and then use a Dilated Adaptive RNN (DARNN) for classification and recognition. The