experiments conducted on seven publicly available
databases, while Lopes et al. (2017) found that data
augmentation and preprocessing techniques (e.g.,
rotation, cropping, and luminance normalization)
significantly improved model performance across
multiple databases (Mellouk & Handouzi, 2020). For
real-time application requirements, Pranav, Kamal,
Chandran, and Supriya (2020) proposed a lightweight
convolutional network with a simple structure for
emotion recognition on mobile devices. Overall, FER
research is evolving towards lightweight models,
multimodal fusion, and greater robustness.
This paper therefore aims to trace the development of
Facial Emotion Recognition (FER) technology by
analyzing its applications and challenges in HCI.
First, it introduces traditional methods and their
application environments; second, it discusses the
practical performance of deep learning based FER
models; finally, it summarizes open problems and
looks ahead to future development directions.
2 OVERVIEW OF METHODS
FOR FACIAL EMOTION
RECOGNITION
2.1 Traditional Methods
Early FER methods relied heavily on feature
engineering, using geometric and texture features to
extract distinctive patterns from facial images for
emotion classification.
Geometric feature-based methods mainly analyze
the spatial relationships between key points on the
face, such as the positions of the eyes, eyebrows, nose,
and mouth. The Facial Action Coding System (FACS),
proposed by Ekman and Friesen, is a widely
recognized approach in this category. It decomposes
facial expressions into a set of Action Units (AUs)
based on underlying muscle movements, allowing for
the recognition of various emotional states. In HCI
systems, FACS is commonly applied in intelligent
customer service and remote sentiment analysis to
enhance user experience—intelligent assistants, for
instance, can adapt their interactions based on users’
subtle facial cues to improve naturalness and
engagement. In contrast, texture feature-based
methods determine emotional states by analyzing
local texture patterns in facial regions. Techniques
such as Local Binary Pattern (LBP) and Gabor
wavelets are commonly used in this context. LBP
captures local pixel differences and demonstrates
robustness to lighting changes, while Gabor wavelets
extract features across multiple scales and
orientations, thereby improving recognition accuracy.
These texture-based approaches are extensively used
in applications such as distance education and
affective computing, where FER systems can monitor
students’ concentration and dynamically adapt
teaching strategies to enhance learning outcomes.
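To make the texture features described above concrete, the following is a minimal NumPy sketch of the basic 8-neighbour LBP operator (a simplified illustration, not the implementation of any particular system; the `lbp_histogram` helper name is our own). Each interior pixel is compared with its eight neighbours to form an 8-bit code, and the normalised code histogram serves as the texture descriptor. Because the codes depend only on sign comparisons, they are unchanged by adding a constant brightness offset, which illustrates the lighting robustness mentioned above.

```python
import numpy as np

def lbp_8neighbor(image):
    """Basic 8-neighbour Local Binary Pattern (LBP).

    Each interior pixel is compared with its 8 neighbours;
    a neighbour >= the centre contributes a 1-bit, yielding
    an 8-bit code. Returns the code map for the interior.
    """
    img = np.asarray(image, dtype=np.int32)
    center = img[1:-1, 1:-1]
    # Offsets of the 8 neighbours, clockwise from top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center)
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy:img.shape[0] - 1 + dy,
                       1 + dx:img.shape[1] - 1 + dx]
        codes |= ((neighbor >= center).astype(np.int32) << bit)
    return codes

def lbp_histogram(image, bins=256):
    """Normalised histogram of LBP codes, used as a texture feature."""
    codes = lbp_8neighbor(image)
    hist, _ = np.histogram(codes, bins=bins, range=(0, bins))
    return hist / max(hist.sum(), 1)
```

In practice the histogram is computed per facial region and the regional histograms are concatenated before classification.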
2.2 Traditional Machine Learning
Methods
Before the rise of deep learning, FER algorithms were
predominantly based on traditional machine learning
approaches such as support vector machines (SVMs),
hidden Markov models (HMMs), and k-nearest
neighbors (kNN). These approaches typically require
extracting geometric features (e.g., distances and
angles between facial keypoints) or texture features
(e.g., local binary patterns, responses of filters such
as Gabor filters) from the image and feeding the
resulting vectors into the classifier. This feature-
engineering approach has a major disadvantage: it
depends on domain-expert knowledge and
generalizes poorly in challenging scenarios.
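The two-stage pipeline described above can be sketched in plain NumPy. Here the landmark ordering, the chosen distance pairs, and the `NearestNeighborFER` class are illustrative assumptions rather than any standard, and a minimal 1-NN classifier stands in for the SVM/kNN stage:

```python
import numpy as np

def geometric_features(landmarks):
    """Toy geometric feature vector from facial landmarks.

    `landmarks` is an (N, 2) array of (x, y) keypoints; the
    ordering is a hypothetical convention: 0=left eye, 1=right
    eye, 2=nose tip, 3=left mouth corner, 4=right mouth corner.
    Distances are normalised by the inter-ocular distance so
    the features are scale-invariant.
    """
    lm = np.asarray(landmarks, dtype=float)
    iod = np.linalg.norm(lm[0] - lm[1]) or 1.0
    pairs = [(0, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
    return np.array([np.linalg.norm(lm[i] - lm[j]) / iod
                     for i, j in pairs])

class NearestNeighborFER:
    """Minimal 1-NN classifier over hand-crafted feature vectors."""
    def fit(self, X, y):
        self.X, self.y = np.asarray(X, dtype=float), np.asarray(y)
        return self
    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Pairwise Euclidean distances to all training vectors.
        d = np.linalg.norm(X[:, None, :] - self.X[None, :, :], axis=2)
        return self.y[d.argmin(axis=1)]
```

The point of the sketch is the division of labour: all discriminative power must come from the hand-designed feature extractor, which is exactly the expert-knowledge dependence criticized above.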
Specifically, traditional classifiers struggle to
establish robust classification boundaries for high-
dimensional, nonlinearly distributed expression data.
For example, an SVM may overfit or underfit on
high-dimensional data or unevenly distributed
categories, and it is highly sensitive to
hyperparameter selection. Although HMMs are
suited to sequence modeling, their simplified state-
transition assumptions make it difficult to capture the
nonlinear, complex dynamics of facial expressions.
In addition, distance-based methods such as kNN
suffer from the curse of dimensionality in high-
dimensional spaces: their performance does not
improve significantly as training data grows, and
their computational cost is high.
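The curse-of-dimensionality effect on distance-based methods like kNN can be demonstrated in a few lines of NumPy: as the feature dimension grows, pairwise distances between points concentrate around their mean, so the gap between nearest and farthest neighbours shrinks and distances carry less discriminative information. Uniform random points are a simplifying assumption here, and `distance_concentration` is our own illustrative helper:

```python
import numpy as np

def distance_concentration(dim, n=1000, seed=0):
    """Ratio of std to mean of distances from one random point
    to the others. A small ratio means distances have concentrated,
    which is what undermines kNN on raw high-dimensional features.
    """
    rng = np.random.default_rng(seed)
    pts = rng.random((n, dim))  # n points in the unit hypercube
    d = np.linalg.norm(pts - pts[0], axis=1)[1:]  # drop self-distance
    return d.std() / d.mean()

# The concentration ratio drops as the feature dimension increases.
low_dim_ratio = distance_concentration(2)
high_dim_ratio = distance_concentration(1000)
```

In two dimensions the ratio is substantial, whereas at a thousand dimensions it collapses towards zero, so "nearest" neighbours are barely nearer than any other point.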
In dynamic interaction environments such as
video surveillance, emotional robots, and intelligent
customer service systems, users' facial expressions
change rapidly and are strongly affected by lighting,
pose, occlusion, and similar factors, to which
traditional methods are not robust. For example, in
intelligent surveillance systems, SVMs are typically
used to analyze emotional states over long sequences
of static images, but they struggle to respond to
instantaneous expression changes in real time, and it