Figure 3: The pipeline of the study (Picture credit: Original).
Figure 4: The structure of CNN (Picture credit: Original).
conventional convolutional neural networks. The
study then analyzes three CNN methodologies,
exploring their respective advantages and limitations.
CNNs have notably addressed challenges such as the
inefficient processing of large image datasets,
incomplete feature extraction, and low recognition
accuracy. AlexNet excels in handling large-scale
datasets but requires substantial computational
resources and extended training times. In contrast,
ResNet overcomes issues like gradient vanishing and
model degradation, although it demands considerable
training data and is susceptible to overfitting. This
comprehensive examination underscores the progress
made in face recognition technology and identifies
potential areas for future research and development.
2.2.1 Introduction of CNN
CNN is a feedforward, deep-learning neural
network. It mainly consists of three parts: the
convolution layer, the pooling layer, and the fully
connected layer. The convolution layer
plays a crucial role in CNN and is responsible for
executing numerous calculations to extract local
features in the image. During the computation, a
convolution kernel, essentially a sliding window with
fixed weights, slides across the image; at each
position its weights are multiplied element-wise with
the covered pixels and summed to produce one value
of the convolved output. After the convolution is
completed, CNN requires a pooling layer to simplify the data.
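The sliding-window computation described above can be sketched as follows; this is a minimal illustration in NumPy, and the function and variable names are chosen for this sketch rather than taken from any particular library:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution: slide the kernel over the image,
    multiplying element-wise and summing at every position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # one kernel position: element-wise product, then sum
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A 3x3 vertical-edge kernel applied to a 5x5 example image
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])
feature_map = conv2d(image, kernel)  # shape (3, 3)
```

Each output value is one "set of convolved data" for one kernel position; a real convolution layer learns the kernel weights instead of fixing them by hand.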
There are two main types of pooling layers, namely
maximum pooling and average pooling. Figure 4
illustrates the structure. The primary function of the
pooling layer is to perform downsampling and feature
selection for different regions, reducing the number
of features and thus simplifying the model
parameters. This approach may compromise the
integrity of the data but significantly reduces data
complexity, thereby improving processing efficiency.
The role of the fully connected layer is to integrate the
local features extracted by the previous two layers to
form global features. Then, through linear
transformations and activation functions, the global
features are mapped to class scores, and the final
prediction result is output.
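The fully connected stage can be sketched as follows; the weights and the number of classes here are illustrative stand-ins, not values from a trained model:

```python
import numpy as np

def fully_connected(features, weights, bias):
    """Flatten the pooled feature maps, apply a linear
    transformation, and use softmax to obtain class scores."""
    x = features.reshape(-1)             # integrate local features
    logits = weights @ x + bias          # linear transformation
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
features = rng.standard_normal((2, 3, 3))  # e.g. two pooled 3x3 maps
weights = rng.standard_normal((4, 18))     # 4 hypothetical classes
bias = np.zeros(4)
probs = fully_connected(features, weights, bias)
prediction = int(np.argmax(probs))         # final prediction result
```

The softmax output sums to one, so the index with the largest score serves as the predicted class.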
2.2.2 Introduction of AlexNet
AlexNet is a milestone for CNNs in the field of
computer vision and drew widespread attention to the
technique. AlexNet adopts
a deeper network architecture with five convolutional
layers and three fully connected layers, which
significantly improves image classification
performance. To accelerate computation, AlexNet
uses multiple graphics processing units (GPUs) for
training, with each GPU handling part of the
computation, which significantly speeds up the
training process. In addition, AlexNet
used the non-saturating Rectified Linear Unit (ReLU)
function as its activation function, which is simpler to
calculate. The ReLU function can not only
significantly accelerate convergence during training
but also better mitigate the vanishing gradient
problem compared to the traditional Sigmoid
function. AlexNet implemented overlapping pooling,
where the stride (step size) is smaller than the window
size, leading to overlapping regions during the
pooling process. Additionally, AlexNet used data
augmentation and dropout techniques to prevent
overfitting and enhance recognition accuracy.
Dropout refers to randomly deactivating certain
neurons during training to reduce the co-adaptation
between neurons, thereby improving generalization.
By deepening the
network structure, AlexNet demonstrated the
potential of deep neural networks and triggered a
surge of interest in deep learning.
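The three techniques discussed above (the ReLU activation, overlapping pooling with a stride smaller than the window, and training-time dropout) can each be sketched in isolation; the window, stride, and dropout values below are illustrative, not AlexNet's exact configuration:

```python
import numpy as np

def relu(x):
    """Non-saturating activation: zero out negative inputs."""
    return np.maximum(x, 0)

def overlapping_max_pool(x, window=3, stride=2):
    """1D max pooling where stride < window, so regions overlap."""
    out_len = (len(x) - window) // stride + 1
    return np.array([x[i*stride : i*stride + window].max()
                     for i in range(out_len)])

def dropout(x, p=0.5, rng=None):
    """Training-time dropout: zero neurons with probability p
    and rescale the survivors by 1/(1-p)."""
    rng = rng or np.random.default_rng(0)
    mask = rng.random(x.shape) >= p
    return x * mask / (1 - p)

x = np.array([-2., -1., 0., 1., 2., 3., 4.])
a = relu(x)                  # [0. 0. 0. 1. 2. 3. 4.]
p = overlapping_max_pool(a)  # overlapping windows -> [0. 2. 4.]
d = dropout(a)               # some activations zeroed, rest doubled
```

Because stride (2) is smaller than the window (3), adjacent pooling regions share elements, which is the overlap AlexNet exploits; dropout's rescaling keeps the expected activation unchanged between training and inference.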