
convolution kernels can automatically learn features such as edges and textures. Each convolution kernel produces a feature map indicating where its learned pattern appears in the input, so applying multiple kernels yields multiple feature maps. A pooling layer then reduces the spatial size of the feature maps, cutting the computational cost while retaining the most important features. After several convolution and pooling stages, the feature maps are typically flattened into a one-dimensional vector. Finally, the fully connected layer maps the extracted features to the output categories, and an activation function produces the classification. The entire process can be summarized as follows: the input image passes through a stack of convolutional layers, activation functions, and pooling layers that progressively extract features, and the result is passed to the fully connected layer for classification. During training, the weights are adjusted using gradients of the loss function so that the loss decreases on subsequent forward passes (Huang, 2024). A CNN can thus automatically learn features at different levels, from low-level cues such as edges and textures to high-level structures such as face contours and the positions of facial organs, and it can capture complex nonlinear features without manual feature extraction.
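The pipeline described above can be sketched end to end. The following is a minimal NumPy illustration with random stand-in weights; the image size, kernel count, and the choice of seven emotion classes are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(image, kernels):
    """Valid convolution: each kernel produces one feature map."""
    h, w = image.shape
    k = kernels.shape[1]
    out = np.zeros((kernels.shape[0], h - k + 1, w - k + 1))
    for n, ker in enumerate(kernels):
        for i in range(h - k + 1):
            for j in range(w - k + 1):
                out[n, i, j] = np.sum(image[i:i + k, j:j + k] * ker)
    return out

def relu(x):
    return np.maximum(x, 0)

def max_pool(maps, size=2):
    """Downsample each feature map by taking the max over size x size windows."""
    n, h, w = maps.shape
    return maps[:, :h - h % size, :w - w % size] \
        .reshape(n, h // size, size, w // size, size).max(axis=(2, 4))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# A 28x28 grayscale "image" with 8 learned 3x3 kernels (random stand-ins here)
image = rng.standard_normal((28, 28))
kernels = rng.standard_normal((8, 3, 3))

maps = relu(conv2d(image, kernels))            # 8 feature maps, 26x26 each
pooled = max_pool(maps)                        # pooled to 8 maps, 13x13 each
flat = pooled.reshape(-1)                      # flattened to a 1352-element vector
weights = rng.standard_normal((7, flat.size))  # fully connected layer, 7 classes
probs = softmax(weights @ flat)                # class probabilities summing to 1
```

In practice, the kernels and fully connected weights would of course be learned by backpropagation through the loss gradients rather than drawn at random; the sketch only traces the forward pass from image to class probabilities.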
3 FACIAL RECOGNITION
SPECIFIC APPLICATIONS
3.1 Emotional Recognition
Facial emotion recognition is a technology based on
facial expression analysis, used to identify an
individual's emotional state. By analyzing the
changes in facial expressions, such as the movements
of eyebrows, eyes, lips and other parts, the system can
determine the person's emotional category, such as
happiness, anger, sadness, and surprise. With the
advancement of deep learning and computer vision
technology, especially the application of CNN, the
accuracy and practicality of facial emotion
recognition have been significantly improved. Wang
et al. proposed a feature extraction method that fuses the Complete Local Binary Pattern (CLBP) with geometric salient features. The Dlib library is used to locate facial feature points; a feature-ratio vector is constructed from the regions where facial expressions change most significantly; and the fine-grained texture features obtained by fusing CLBP with the geometric salient features serve as the input feature vector for expression classification. In experiments on the CK+ database, the algorithm achieved an accuracy as high as 92.5% (Wang et al., 2020). Wang proposed a recognition
method that incorporates Faster R-CNN into the facial recognition process. First, Multi-Task Cascaded Convolutional Networks (MTCNN) locate the facial key points in the image and generate a 3D reference model, which is then projected onto an initial frontal face for comparison; finally, the comparison data are stored in a database, completing the processing of the facial image. The facial expression classification information is then fed into the MTCNN model, which extracts facial expression features in an end-to-end manner; after redundant information is removed, emotion-recognition labels are generated for the existing data. Experimental results show that the Faster R-CNN-based expression recognition achieves an accuracy above 90% (Wang, 2023).
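CLBP is a "completed" extension of the basic Local Binary Pattern that additionally encodes magnitude and center information. As a rough sketch of the underlying texture-descriptor idea only (not Wang et al.'s exact CLBP pipeline), a basic 8-neighbour LBP histogram over a face region can be computed as follows; the patch size and random test patch are illustrative assumptions:

```python
import numpy as np

def lbp_codes(gray):
    """Basic 8-neighbour Local Binary Pattern: threshold each neighbour
    against the centre pixel and pack the 8 bits into a code in [0, 255]."""
    # offsets of the 8 neighbours, clockwise from the top-left
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = gray.shape
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    center = gray[1:-1, 1:-1]
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = gray[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= (neighbour >= center).astype(np.uint8) << bit
    return codes

def lbp_histogram(gray):
    """Normalised 256-bin histogram of LBP codes: a fine-grained texture
    descriptor for the region."""
    hist = np.bincount(lbp_codes(gray).ravel(), minlength=256)
    return hist / hist.sum()

rng = np.random.default_rng(1)
patch = rng.integers(0, 256, size=(32, 32)).astype(np.int32)  # stand-in face region
feat = lbp_histogram(patch)  # 256-dimensional texture feature
```

A histogram of this kind is the sort of texture feature that can then be concatenated with geometric landmark features before being passed to an expression classifier.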
3.2 Disease Auxiliary Diagnosis
The application of face recognition technology to auxiliary disease diagnosis is also growing steadily, chiefly exploiting the technology's accuracy in recognizing regular feature patterns. A neural network learns the facial expressions and facial features of patients with different diseases, and face recognition is then used to make a preliminary judgment of whether a previously unseen patient is affected. Such a diagnosis can assist doctors in evaluating the patient's condition. For the auxiliary diagnosis of
depression, Li combines the Single Temporal Network (STNet) with the Full Temporal Network (FTNet). STNet consists of a spatial convolution network, a contour capture network, and a temporal attention mechanism connecting them to the temporal backbone network. The spatial convolution network adopts the VGG-16 architecture and comprises five spatio-temporal convolution blocks; the contour capture network comprises five contour capture blocks; and the temporal backbone network can be implemented with a Long Short-Term Memory (LSTM) model. FTNet is built on EfficientNet V2, with the first three stages connected by Fused-MBConv blocks and the last three by MBConv blocks. Then, the feature vectors of size 1000 produced by STNet and FTNet are concatenated into a feature vector of size 2000 and
The Development and Applications of Facial Recognition Technology
13