Machines (SVM) (Mollahosseini and Chan 2016). Yu
et al. utilized VGG16 as an end-to-end classifier with
multi-task learning and attention mechanisms (Yu and
Zhang 2015). Liu et al. achieved 76.1% accuracy by
augmenting VGG16's representations with a Deep Belief
Network (DBN) (Liu et al. 2016). Jung et al. achieved
77.1% accuracy using VGG16 as a base model within a
Hierarchical Committee (HC) (Jung et al. 2015).
The objective of this research is to present a
VGG16-based model for facial emotion analysis, and to
investigate its significance and effectiveness in the
field of psychotherapy based on the model's reasoning.
The empirical dataset employed in this study is the
Facial Expression Recognition 2013 dataset (FER2013),
a collection of 35,887 facial images, each labeled
with one of seven emotions: anger, disgust, fear,
happiness, sadness, surprise, or neutrality
(Dataset 2013). VGG16 is then
formulated as a feature extractor, and the resulting
multi-level features are fed into Multi-Layer
Perceptron (MLP) classifiers to perform emotion
classification. In addition, VGG16 is used as an
end-to-end emotion classifier, with structural
improvements and parameter optimization. Enhancement
techniques include data augmentation and model merging
to strengthen the performance and stability of the model.
Its responsiveness and relevance to different facial
expression features are also explored, and the model is
evaluated for its efficacy in recognizing and
regulating emotions associated with various
psychological disorders through applications in
psychotherapy. The results of the empirical study
demonstrate that the proposed facial emotion analysis
method significantly enhances the precision and
robustness of emotion recognition. The work carried
out in this article is of substantial importance for
advancing fields such as human-computer interaction,
mental health, and education.
2 METHODOLOGY
2.1 Dataset Description and
Preprocessing
The FER2013 dataset is a collection designed for
facial expression recognition, introduced by
Goodfellow et al. in a 2013 paper (Goodfellow et al.
2013). It comprises 35,887 grayscale images of faces,
each labeled with one of seven emotion classes.
FER2013 is widely used with CNNs and in computer
vision more broadly for objectives including facial
expression categorization, assessment, and
visualization. It serves as a resource for studying
human emotion features and their variations, and for
enhancing human-computer interaction. For
preprocessing FER2013, data standardization is
employed: the mean is subtracted from each pixel value
and the result is divided by the standard deviation,
transforming the data toward a standard normal
distribution and reducing bias and variance.
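The standardization step can be sketched as follows. This is a minimal illustration, assuming the mean and standard deviation are computed over the whole set of pixel values (the paper does not specify per-image versus dataset-wide statistics):

```python
import numpy as np

def standardize(images: np.ndarray, eps: float = 1e-7) -> np.ndarray:
    """Zero-mean, unit-variance standardization of pixel values.

    `images` is assumed to be a float array of shape (N, 48, 48)
    with raw intensities in [0, 255]; `eps` guards against a
    zero standard deviation.
    """
    mean = images.mean()
    std = images.std()
    return (images - mean) / (std + eps)

# Stand-in for a small batch of FER2013 images.
imgs = np.random.randint(0, 256, size=(4, 48, 48)).astype(np.float32)
standardized = standardize(imgs)
```

After this transformation the pixel values have approximately zero mean and unit variance, which typically stabilizes and speeds up training.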
2.2 Proposed Methodology Overview
This study is dedicated to leveraging the powerful
VGG16, a deep CNN architecture, in constructing a
robust model for facial emotion analysis. By
capitalizing on VGG16's remarkable feature
extraction capabilities and amalgamating them with
multi-level feature fusion, the precision of emotion
classification is significantly heightened. The
workflow comprises a series of carefully orchestrated
steps. It begins with image preprocessing: the images,
initially sized at 48x48 pixels, are opened with the
PIL library and resized to 224x224 pixels, the input
dimension VGG16 expects. The essence of the
model's efficacy lies in its ability to extract salient
features through the utilization of pre-trained weights
from the VGG16 model. VGG16, with its 16-layer
architecture, was trained on the expansive ImageNet
dataset, enabling it to discern more than a thousand
distinct object categories. The model's output, taken
from the final convolutional layer, yields a set of
512 feature maps.
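A minimal sketch of this preprocessing and feature-extraction stage is shown below, using the VGG16 convolutional base bundled with Keras. The use of Keras, and the helper name `extract_features`, are assumptions for illustration; the paper does not name its framework:

```python
import numpy as np
from PIL import Image
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input

# VGG16 convolutional base with ImageNet weights; for a 224x224 RGB
# input, the final convolutional block outputs a 7x7x512 tensor.
base = VGG16(weights="imagenet", include_top=False,
             input_shape=(224, 224, 3))

def extract_features(path: str) -> np.ndarray:
    # Open the 48x48 grayscale image, convert to 3-channel RGB,
    # and upscale to the 224x224 input size VGG16 expects.
    img = Image.open(path).convert("RGB").resize((224, 224))
    x = preprocess_input(np.asarray(img, dtype=np.float32)[None, ...])
    return base.predict(x, verbose=0)  # shape: (1, 7, 7, 512)
```

Flattening the resulting 7x7x512 tensor gives the feature vector that the downstream classifier consumes.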
These features are then passed through a Flatten
layer, which transforms the multi-dimensional feature
maps into a compact one-dimensional vector. A Dropout
layer follows to stave off overfitting, randomly
discarding a proportion of neurons during training.
Combined with the subsequent four-layer network
structure, comprising a Flatten layer, Dropout layer,
fully connected layer, and Softmax layer, this
furnishes the model with a strong capacity for
categorization. The
fully connected layer interconnects all input and
output neurons, while the Softmax layer serves as the
output layer for multi-class classification, yielding the
probability distribution over the categories. These
meticulous steps culminate in a