A Comprehensive Analysis of Deep Learning Techniques and
Applications
Shuo Song
Fakulti Teknologi dan Sains Maklumat, Universiti Kebangsaan Malaysia, Malaysia
Keywords: Facial Emotion Recognition (FER), Deep Learning, Human-Computer Interaction (HCI).
Abstract: Facial emotion recognition (FER) is a key research direction in the field of artificial intelligence and human-
computer interaction (HCI). This technology is of great significance for improving the emotion understanding
ability of virtual reality, intelligent customer service, intelligent driving and other systems. Currently, deep
learning models can significantly improve the recognition accuracy of FER, compensating for the difficulty traditional methods have in coping with complex and changing real-world scenarios. This paper
systematically reviews the development history of FER, and analyzes its typical applications in human-
computer interaction from the perspectives of both traditional methods and deep learning methods. In
summary, it is found that deep learning methods have made significant progress in multimodal emotion
recognition, real-time performance optimization, and personalized modeling, but still face challenges such as
data imbalance, occlusion interference and cross-domain adaptability. This paper provides an outlook on the
future development trend of FER, aiming to provide reference and inspiration for subsequent research.
1 INTRODUCTION
Facial expression is one of the most significant forms of human nonverbal behavior and is widely used in interpersonal communication. Facial expressions not only reflect a subject's emotional state but also serve group communication and other social behaviors. In the past few years, due to
the rapid development of artificial intelligence and
computer vision, Facial Emotion Recognition (FER)
has gradually become one of the key research focuses
of automated affective computing. The purpose of
Facial Emotion Recognition is to automatically recognize a subject's emotion from a face image or video stream and classify it into discrete categories, thereby improving the intelligence level of Human-Computer Interaction (HCI).
In the HCI field, FER is widely used in intelligent
customer service, education technology, mental
health monitoring, intelligent driving, and Virtual
Reality (VR) and Augmented Reality (AR) scenarios.
For example, in intelligent customer service systems,
FER can be used to analyze the user's emotional
fluctuations, so as to optimize the interaction of
customer service robots or voice assistants and
improve the user experience; in intelligent driving
systems, FER can be used to monitor the driver's
attention and fatigue in real time, so as to improve the
safety of driving; in mental health monitoring, FER
can be used to combine physiological signals and
voice analysis to assist in the diagnosis of emotional
disorders. In addition, in the VR/AR environment,
FER can enhance the realism of virtual characters'
expressions, making the interaction more natural and
immersive.
Although deep learning has greatly advanced the performance of facial expression recognition, many challenges remain in applying it to practical FER systems, such as imbalanced data categories, poor cross-domain generalization, sensitivity to occlusion and illumination variations, and demanding real-time performance requirements. Mollahosseini, Chan,
and Mahoor (2016) noted that, in the FER2013 data
set, there were only 547 samples in the disgust category, which biased model predictions against that class. Li et al. (2018)
enhanced the stability of CNN models in complex
environments by introducing an occlusion simulation
mechanism. In addition, Mollahosseini et al. (2016)
validated the superior performance of their model in
cross-domain recognition through cross-validation
experiments conducted on seven publicly available
databases, while Lopes et al. (2017) found that data augmentation and preprocessing techniques (e.g., rotation, cropping, and luminance normalization)
significantly improved the model performance on
multiple databases (Mellouk & Handouzi, 2020). For
real-time application requirements, Pranav, Kamal,
Chandran, and Supriya (2020) proposed a lightweight
convolutional network with a simple structure for
emotion recognition on mobile devices. Overall, FER
research is evolving towards model lightweighting,
multimodal fusion and high robustness.
This paper aims to present the development of FER technology by analyzing its applications and challenges in HCI. It first introduces traditional methods and their application settings; it then discusses the practical performance of deep learning-based FER models; finally, it summarizes the remaining problems and looks ahead to future development directions.
2 OVERVIEW OF METHODS
FOR FACIAL EMOTION
RECOGNITION
2.1 Traditional Methods
Early FER methods relied heavily on feature
engineering techniques that used geometric and
texture features to extract distinctive patterns from facial images for emotion classification.
Geometric feature-based methods mainly analyze
the spatial relationships between key points on the
face, such as the positions of the eyes, eyebrows, nose,
and mouth. The Facial Action Coding System (FACS),
proposed by Ekman and Friesen, is a widely
recognized approach in this category. It decomposes
facial expressions into a set of Action Units (AUs)
based on underlying muscle movements, allowing for
the recognition of various emotional states. In HCI
systems, FACS is commonly applied in intelligent
customer service and remote sentiment analysis to
enhance user experience—intelligent assistants, for
instance, can adapt their interactions based on users’
subtle facial cues to improve naturalness and
engagement. In contrast, texture feature-based
methods determine emotional states by analyzing
local texture patterns in facial regions. Techniques
such as Local Binary Pattern (LBP) and Gabor
wavelets are commonly used in this context. LBP
captures local pixel differences and demonstrates
robustness to lighting changes, while Gabor wavelets
extract features across multiple scales and
orientations, thereby improving recognition accuracy.
These texture-based approaches are extensively used
in applications such as distance education and
affective computing, where FER systems can monitor
students’ concentration and dynamically adapt
teaching strategies to enhance learning outcomes.
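To make the texture-feature idea concrete, the following sketch computes a uniform LBP histogram over a grayscale face image, which could then serve as input to a classifier. It assumes scikit-image and NumPy; the function name and parameters are illustrative, not drawn from any cited system.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(gray_face, points=8, radius=1):
    """Return a normalized histogram of uniform LBP codes for one face image."""
    # Encode each pixel by thresholding its circular neighborhood against it.
    codes = local_binary_pattern(gray_face, P=points, R=radius, method="uniform")
    # The "uniform" variant produces points + 2 distinct code values.
    n_bins = points + 2
    hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins), density=True)
    return hist  # compact texture descriptor, robust to monotonic lighting changes

# Example on a random stand-in image; a real pipeline would pass a cropped face.
print(lbp_histogram(np.random.rand(48, 48)).shape)  # (10,)
```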
2.2 Traditional Machine Learning
Methods
Before the rise of deep learning, FER algorithms were
predominantly based on traditional machine learning
approaches such as Support Vector Machines (SVMs), Hidden Markov Models (HMMs), and k-Nearest Neighbors (kNN). Typically, these approaches require extracting geometric features (distances between facial keypoints, angles) or texture features (local binary patterns, responses of filter banks such as Gabor filters) from the image and feeding the resulting vectors into the machine learning model. The major disadvantage of this feature-engineering approach is that it depends on domain expertise and generalizes poorly in challenging scenarios.
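A minimal sketch of such a handcrafted-feature pipeline, assuming scikit-learn and using random stand-in arrays in place of real LBP histograms and emotion labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((200, 10))         # stand-in for 200 LBP histograms (10 bins each)
y = rng.integers(0, 7, size=200)  # stand-in for seven emotion labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# RBF-kernel SVM over standardized handcrafted feature vectors.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```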
Specifically, traditional classifiers struggle to establish robust classification boundaries for high-dimensional, nonlinearly distributed expression data. For example, SVMs may overfit or underfit on high-dimensional or unevenly distributed data and are highly sensitive to hyperparameter selection. Although HMMs are suited to sequence modeling, their state-transition assumptions are overly simplistic, making it difficult to capture the nonlinear and complex dynamics of facial expressions. In addition, distance-based methods such as kNN suffer from the curse of dimensionality: their performance does not improve significantly as training data grows, while their computational cost remains high.
In dynamic interaction environments, such as
video surveillance, emotional robots, or intelligent
customer service systems, the user's facial expression
changes rapidly and is greatly affected by lighting,
posture, occlusion, etc., and traditional methods are
less robust to these disturbances. For example, in
intelligent surveillance systems, SVMs are typically used to analyze emotional states over long sequences of static images, but they struggle to respond to instantaneous expression changes in real time and
cannot dynamically predict the emotional state from context. Therefore, although traditional
machine learning methods played a key role in the
early research of FER, their lack of adaptability to
environmental changes, feature complexity, and real-
time availability limits their wide application in real
interactive systems. These limitations have provided
an opportunity for the rise of deep learning methods,
which have significantly improved the performance
and application scope of emotion recognition systems
through end-to-end learning and automatic feature
extraction mechanisms.
2.3 Deep Learning Methods
In the past few years, with the rapid development of
deep learning, FER accuracy has achieved major breakthroughs. Compared with traditional machine
learning methods which need to manually extract
features, deep learning models can automatically
learn emotionally relevant higher-order semantic
features from raw images in an end-to-end way,
which greatly improves the accuracy and robustness
of recognition. Among them, Convolutional Neural Networks (CNNs) are the most widely used models for recognizing static facial images. Through stacked convolution and pooling layers, local features are extracted from images and aggregated into global representations. This class of models has achieved far better recognition performance than traditional methods on mainstream datasets such as FER2013, CK+, and JAFFE.
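As a concrete illustration, a minimal CNN of this kind for 48x48 grayscale inputs (the FER2013 format) might look as follows; this is a generic sketch assuming PyTorch, not any specific published architecture:

```python
import torch
import torch.nn as nn

class FerCnn(nn.Module):
    """Small CNN for 48x48 grayscale faces, seven emotion classes."""
    def __init__(self, num_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 48 -> 24
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 24 -> 12
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # 12 -> 6
        )
        self.classifier = nn.Linear(128 * 6 * 6, num_classes)

    def forward(self, x):
        x = self.features(x)                  # (B, 128, 6, 6)
        return self.classifier(x.flatten(1))  # (B, num_classes) logits

logits = FerCnn()(torch.randn(4, 1, 48, 48))  # smoke test: shape (4, 7)
```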
In addition, for video sequences or dynamic
expression analysis, Recurrent Neural Networks
(RNNs) and their variants, such as Long Short-Term
Memory Networks (LSTMs), can efficiently model
emotional changes in time series and enhance the
understanding of expression transition processes. For
example, in intelligent customer service scenarios,
LSTMs can continuously track subtle changes in user
expressions to adjust interaction strategies and
achieve more emotionally adaptive feedback.
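A hedged sketch of this CNN-plus-LSTM pattern, reusing the FerCnn trunk from the previous sketch (PyTorch assumed; shapes and hyperparameters are illustrative):

```python
import torch
import torch.nn as nn

class CnnLstmFer(nn.Module):
    """Per-frame CNN features pooled over time by an LSTM."""
    def __init__(self, cnn, feat_dim=128 * 6 * 6, hidden=256, num_classes=7):
        super().__init__()
        self.cnn = cnn.features  # reuse the convolutional trunk from FerCnn
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clips):  # clips: (B, T, 1, 48, 48)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).flatten(1)  # (B*T, feat_dim)
        out, _ = self.lstm(feats.view(b, t, -1))          # (B, T, hidden)
        return self.head(out[:, -1])  # classify from the last time step

model = CnnLstmFer(FerCnn())
print(model(torch.randn(2, 16, 1, 48, 48)).shape)  # torch.Size([2, 7])
```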
Deep learning methods also support multimodal
emotion fusion, such as combining facial images,
speech signals, physiological features, and other
information for comprehensive discrimination,
further improving the accuracy and application
breadth of emotion recognition. Compared with
traditional methods, deep models are more fault-
tolerant to disturbing factors such as lighting changes,
occlusion, and individual differences in expressions.
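One common realization of such fusion is late, feature-level fusion; the following sketch (PyTorch assumed; dimensions and per-modality encoder outputs are placeholder assumptions) concatenates face and audio embeddings before joint classification:

```python
import torch
import torch.nn as nn

class LateFusionFer(nn.Module):
    """Concatenate face and audio embeddings, then classify jointly."""
    def __init__(self, face_dim=128, audio_dim=64, num_classes=7):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(face_dim + audio_dim, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, face_emb, audio_emb):
        # Assumes upstream encoders already produced fixed-size embeddings.
        return self.fuse(torch.cat([face_emb, audio_emb], dim=-1))

model = LateFusionFer()
logits = model(torch.randn(4, 128), torch.randn(4, 64))  # (4, 7)
```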
However, deep learning methods also face some
challenges, such as the dependence on large-scale
high-quality data, high consumption of computational
resources, and poor interpretability.
Nevertheless, with the improvement of computing
power and optimization of training strategies, deep
learning-based FER systems have gradually matured
and have been widely used in many HCI application
scenarios, such as smart home, virtual reality, and
driving assistance, becoming one of the key
technologies to promote the intelligence of human-
computer interaction.
3 DEEP LEARNING-BASED
FACIAL EMOTION
RECOGNITION IN
HUMAN-COMPUTER
INTERACTION
3.1 Emotional Perception System for
Students in Virtual Educational
Environments
Students' facial expressions can reflect their level of
understanding, learning status and psychological
changes. However, in virtual teaching and distance
education scenarios, it is often difficult for teachers to
capture students' emotional feedback through
traditional methods. Therefore, real-time recognition
of student emotions has become a key tool to enhance
the online education experience. Hans and Rao (2021) designed a deep learning architecture
combining Convolutional Neural Networks (CNNs)
and Long Short-Term Memory Networks (LSTMs)
for analyzing sequences of facial images in videos
and extracting spatio-temporal features to recognize
continuous emotional changes. The model achieves 78.52% and 63.35% recognition accuracy on the CREMA-D training set and the RAVDESS test set, respectively. The method enhances the
model's focus on facial dynamics by preprocessing
video images with the OpenFace tool, extracting
facial regions and masking background interference.
This deep learning-based student emotion perception
system can assist teachers to dynamically understand
students' attention and emotion changes, and provide
reference for personalized teaching content pushing
and interaction strategy adjustment.
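The cited system performs this face extraction with OpenFace; as an illustration of the same cropping-and-masking idea with a different, widely available tool, the sketch below uses OpenCV's bundled Haar cascade (a stand-in, not the authors' pipeline):

```python
import cv2

# OpenCV ships a pretrained Haar cascade for frontal faces.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_largest_face(frame_bgr, size=(48, 48)):
    """Return the largest detected face, resized; None if no face is found."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # largest by area
    return cv2.resize(gray[y:y + h, x:x + w], size)     # background cropped away
```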
3.2 Non-Contact Emotion Monitoring
in Intelligent Healthcare Systems
In smart healthcare and telecare scenarios, patients
may not be able to actively express their emotional
states due to language barriers, ambiguous
consciousness or physiological factors. Compared to
traditional emotion monitoring methods that rely on
physiological signals such as galvanic skin response,
heart rate, or body temperature, facial expression
recognition (FER) systems have the advantages of
being non-contact, real-time, and easier to deploy,
which is especially suitable for healthcare
environments where it is inconvenient to wear
sensing devices. Bisogni et al. (2022) proposed a multi-resolution deep convolutional network model for medical scenarios, which combines a transfer learning strategy with image data augmentation techniques to improve the
generalization ability of the model under different
lighting, angle and resolution conditions. The system
is tested on mainstream datasets such as Cohn-
Kanade (CK+), Karolinska Directed Emotional Faces
(KDEF), and Static Facial Expressions in the Wild (SFEW),
and outperforms the traditional model in terms of
recognition accuracy and environmental adaptability.
The system architecture includes three core modules:
facial feature point detection, multi-resolution image
processing and convolutional feature extraction. Its
final recognition results can be fed back to medical
terminals via an Internet of Things (IoT) platform to
help doctors determine whether a patient is in an
emotional state such as anxiety, pain, or depression,
enabling more detailed and personalized remote care.
The method is especially suitable for scenarios such
as geriatric care, psychiatric treatment and post-
operative rehabilitation, reflecting the unique value of
FER in humanized healthcare.
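A minimal sketch of the transfer-learning-with-augmentation recipe described here, assuming PyTorch/torchvision and a generic ResNet-18 backbone rather than the authors' multi-resolution architecture:

```python
import torch.nn as nn
from torchvision import models, transforms

# Augmentations of the kind the study credits: rotation, cropping, lighting jitter.
train_tf = transforms.Compose([
    transforms.RandomRotation(10),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ColorJitter(brightness=0.3),
    transforms.ToTensor(),
])

# Start from ImageNet weights and swap the head for seven emotion classes.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 7)

# Optionally freeze early layers when the target medical dataset is small.
for p in model.layer1.parameters():
    p.requires_grad = False
```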
3.3 Abnormal Emotion Recognition
and Early Warning in Public
Security Scenarios
In crowded public scenes such as stations, airports, and shopping malls, facial signs of emotional change can provide significant clues for identifying abnormal behavior. Akhand et al. (2021) designed a deep
convolutional neural network based on transfer
learning. Pre-trained models (VGG-16, ResNet-50,
DenseNet-161) are used as feature extractors in their architecture, and KDEF is combined with JAFFE for fine-tuned training. The experimental results show that the model obtains 99.52% recognition accuracy on the JAFFE dataset and exhibits good cross-view robustness on the multi-angle face images of the KDEF dataset (Saleem Abdullah & Abdulazeez, 2021). Because recognition covers profile faces and deviated head poses, the system is well suited to real security scenes. In actual deployment, the system can ingest face video streams collected by cameras and flag, in real time, individuals showing abnormal expressions such as anger, fear, or surprise, helping security personnel respond rapidly and improving the security management of public places.
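As a sketch of the pretrained-backbone-as-feature-extractor setup described above (PyTorch/torchvision assumed; VGG-16 chosen because it appears in the cited work, with other details illustrative):

```python
import torch.nn as nn
from torchvision import models

# Load VGG-16 with ImageNet weights and freeze the convolutional trunk so it
# acts purely as a feature extractor.
backbone = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
for p in backbone.features.parameters():
    p.requires_grad = False

# Replace the final fully connected layer with a 7-way emotion head; only the
# classifier is then fine-tuned on the expression datasets.
backbone.classifier[6] = nn.Linear(backbone.classifier[6].in_features, 7)
```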
4 CHALLENGES AND FUTURE
DIRECTIONS IN FACIAL
EMOTION RECOGNITION
The variability of facial expressions among
individuals presents significant challenges for
classification models in facial expression recognition
(FER). Accessories such as glasses and masks, along
with changes in head pose, further reduce recognition
accuracy. Moreover, FER datasets often suffer from
emotional category imbalance—particularly with
emotions like fear and disgust, which have fewer
samples—leading to potential model bias. Another
issue lies in the poor generalization ability across
different datasets, resulting in inconsistent
performance in varying environments (Ranganathan et al., 2016). Additionally, the high computational demands
of deep learning models limit their deployment in
real-time applications. To address these challenges,
researchers are exploring several innovative
strategies. Few-Shot Learning and Zero-Shot
Learning are being investigated to enable FER
systems to recognize emotions with limited annotated
data. Multimodal FER, which integrates facial
expressions, voice, body language, and physiological
signals, is also gaining attention for its potential to
enhance emotional recognition accuracy.
Furthermore, techniques such as adversarial training
and domain adaptation are being developed to
improve robustness to unseen environments. Edge AI
technologies are also emerging as a promising
solution, aiming to design lightweight, real-time FER
models that can operate efficiently on mobile and
embedded devices.
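One simple, widely used mitigation for the class imbalance discussed above is to weight the training loss inversely to class frequency. The sketch below assumes PyTorch and seven emotion classes; the per-class counts follow the commonly reported FER2013 distribution, including the 547-sample disgust class noted earlier:

```python
import torch
import torch.nn as nn

# Commonly reported FER2013 class counts (angry, disgust, fear, happy,
# sad, surprise, neutral); note the small disgust class.
counts = torch.tensor([4953., 547., 5121., 8989., 6077., 4002., 6198.])

# Inverse-frequency weights, normalized so they average to 1.
weights = counts.sum() / (len(counts) * counts)
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 7)          # stand-in model outputs
labels = torch.randint(0, 7, (8,))  # stand-in ground-truth labels
loss = criterion(logits, labels)    # rare classes now contribute more
```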
5 CONCLUSION
This paper systematically reviews the development
history of FER technology, focusing on the typical
applications of deep learning-based methods in the
field of HCI. By comparing traditional machine
learning methods with current mainstream deep
learning architectures, the advantages of deep models
including CNN and RNN in feature extraction and
emotion classification are analyzed. Meanwhile, the performance, deployment conditions
and challenging factors of current FER systems at the
application level are sorted out based on several
representative studies in the context of actual HCI
scenarios such as education, healthcare and security.
It is found that deep learning methods
significantly improve the recognition accuracy and
robustness of FER systems in complex environments,
especially in dealing with non-frontal viewpoints,
face occlusion, and lighting changes, where substantial breakthroughs have been achieved. In
addition, the combination of transfer learning, data
augmentation, multi-scale modeling and other
techniques further enhances the model's
generalization ability and practical deployment
feasibility.
Although the current FER technology has been
well applied in many fields, there are still problems
such as insufficient real-time computing capability,
weak privacy protection mechanism, and unstable
cross-domain migration effect. Future research can
start from the directions of multimodal emotion
fusion, privacy computation under the federated
learning framework, and lightweight design of
models, to promote the in-depth development of FER
systems in personalized emotion understanding and
intelligent interaction, and to lay a solid foundation
for the construction of a more natural and efficient
human-computer collaboration system.
REFERENCES
Akhand, M.A.H., Roy, S., Siddique, N., Kamal, M.A.S.,
Shimamura, T., 2021. Facial emotion recognition using
transfer learning in the deep CNN. Electronics, 10,
1036.
Bisogni, C., Castiglione, A., Hossain, S., Narducci, F.,
Umer, S., 2022. Impact of deep learning approaches on
facial expression recognition in healthcare industries.
IEEE Transactions on Industrial Informatics, 18(8),
5619–5627.
Hans, A.S.A., Rao, S., 2021. A CNN-LSTM based deep
neural networks for facial emotion detection in videos.
IJASIS, 7(1), 11–20.
Li, S., Deng, W., 2022. Deep facial expression recognition:
a survey. IEEE Transactions on Affective Computing,
13(3), 1195–1215.
Mellouk, W., Handouzi, W., 2020. Facial emotion
recognition using deep learning: review and insights.
Procedia Computer Science, 175, 689–694.
Mollahosseini, A., Chan, D., Mahoor, M.H., 2016. Going
deeper in facial expression recognition using deep
neural networks. In IEEE Winter Conference on
Applications of Computer Vision (WACV), Lake
Placid, NY, USA, 1–10.
Pranav, E., Kamal, S., Chandran, C.S., Supriya, M.H., 2020.
Facial emotion recognition using deep convolutional
neural network. In 6th International Conference on
Advanced Computing and Communication Systems
(ICACCS), Coimbatore, India, 317–320.
Ranganathan, H., Chakraborty, S., Panchanathan, S., 2016.
Multimodal emotion recognition using deep learning
architectures. In IEEE Winter Conference on
Applications of Computer Vision (WACV), Lake
Placid, NY, USA, 1–9.
Saleem Abdullah, S.M., Abdulazeez, A.M., 2021. Facial
expression recognition based on deep learning
convolution neural network: a review. J. Smart
Computing and Data Mining, 2(1), 53–65.