Modular Facial Expression Recognition using Double Channel DNNs
Sujata (https://orcid.org/0000-0003-4166-1502) and Suman K. Mitra
Dhirubhai Ambani Institute of Information and Communication Technology, Gandhinagar, Gujarat, India
Keywords:
CNN, DNN, VGG16, HOG, HSOG, SVM, KNN.
Abstract:
Recognizing human expressions is an important task for machines to understand emotional changes in humans.
However, the accurate features that are closely linked to changes in expression are difficult to extract due
to the influence of individual differences and variations in emotional intensity. The modular approach presented
here imitates the human ability to identify a person from a limited part of the face. In this article, we
demonstrate experimentally that certain parts of the face, such as the eyes, nose, lips and forehead, contribute
more to the recognition of expressions. A combination of two deep neural networks is also proposed to extract
features from the given facial images. Two preprocessing approaches, Histogram Equalization (to handle illumination)
and Data Augmentation (to increase the number of facial images), are applied to the facial regions used for
recognition of the facial expression. A two-channel architecture is used for implementation: one channel accepts
a grayscale face image as input, processed by VGG16_ft (fine-tuned VGG16), and the other channel accepts the
histogram of second-order gradients (HSOG) of the face image as input, processed by the proposed CNN model;
each channel extracts its features accordingly. The features from the two channels are then concatenated.
The final recognition result is computed using SVM and KNN classifiers. Experimental
results indicate that the proposed algorithm is able to recognize the six basic facial expressions (happiness,
sadness, anger, disgust, fear and surprise) with high accuracy. Fine-tuning a well-trained model is effective
for FER tasks when not enough samples can be collected.
1 INTRODUCTION
From human facial images, FER (Facial Expression Recognition) tries to predict the basic facial
expressions such as happiness, sadness, anger, surprise, fear and disgust. By analyzing the face images alone,
the method helps a machine to understand human emotions and intentions. It has gained a lot of attention because
of its potential applications, such as computer interfaces, health management, autonomous driving, detection of
abnormal human behavior and other similar tasks.
Histogram Equalization (HE) and Data Augmentation (DA) are pre-processing techniques that are applied to the
given facial images before a machine learns from them. HE is a simple but effective technique in image processing
that makes the distribution of gray values in an image more uniform and reduces the interference caused by
lighting. A CNN needs a huge dataset to generalize to a particular problem, and the publicly available FER
databases do not contain enough images to handle this. To create synthetic images from each original face image,
Simard et al. (Simard et al., 2003) suggested the DA procedure to extend the datasets.
Despite rapid recent developments, FER remains challenging due to factors such as lighting, head
deflection and occlusions of facial regions. These impediments can affect recognition performance and reduce
the accuracy of FER. As demonstrated in the past, hand-crafted features are no longer appropriate for expression
recognition tasks with such critical issues. Fortunately, Deep Neural Networks (DNNs) provide a satisfactory
solution to these problems, which hand-crafted techniques have not been able to handle.
Humans have the capability to recognize a person from a limited set of facial regions. For the recognition
of facial expressions, the use of full-face images can be redundant, since facial expressions fundamentally
deform certain specific zones of the face image. The facial landmark detection algorithm offered by Dlib helps
to extract the facial regions from a given face image. Dlib is an open-source machine learning library provided
by King (King, 2009).
It gives 68 landmark points on the face. Using those landmark points, we are able to extract regions of the
face such as the forehead, eyes, nose, and lips. Our proposed framework focuses only on these parts of the face,
but to verify its efficiency we also performed experiments with the complete facial image.
The suggested framework relies on a double-channel architecture that simultaneously processes the
grayscale face image and the HSOG (histogram of second-order gradients) face image. As shown in Fig. 2,
pre-processing steps such as histogram equalization (HE, to handle illumination) and Data Augmentation (DA, to
create synthetic images) are required for the input facial images. The detailed computation of the HSOG is
reported in Section 2.3. HSOG is a variant of the Histogram of Oriented Gradients (HOG) described in (Dalal
and Triggs, 2005) and extracts local information from the face image. Separate DNNs are used for the two
channels, the grayscale and the HSOG facial images. In one channel, the proposed VGG16_ft, initialized with the
original parameters of VGG16 trained on ImageNet, extracts expression-related features from the grayscale facial
images. The other channel processes the HSOG facial images with a proposed two-layer CNN, inspired by the
development of DeepID (Sun et al., 2015). The outputs of the two channels are concatenated into one large
feature vector. Finally, SVM and KNN with various distance measures are used for classification to predict the
basic facial expressions (anger, happiness, sadness, disgust, fear and surprise). To show its effectiveness, the
FER databases JAFFE (Lyons et al., 1998), VIDEO (Shikkenawis and Mitra, 2016), CK+ (Lucey et al., 2010) and
Oulu-Casia (Zhao et al., 2011) are used to test the framework in a modular way. This modular approach is another
significant contribution of the current work. In summary, the contributions of this work are as follows:
Firstly, a modular approach is used, in which the facial regions are extracted automatically from the full face.
Secondly, double channels of facial images, including grayscale images and their corresponding HSOG images,
are used for FER because of their complementary properties; to our knowledge, no DNN has previously been trained
on HSOG images.
Thirdly, the fine-tuning methodology is used to make full use of a well-learned pre-trained VGG16 model
(trained on ImageNet).
At last, the outputs of the two channels are combined to predict a robust outcome. Four benchmarking datasets
and a few real-world facial images are used to assess the effectiveness of our work.
The rest of the article is organized as follows. Section 2 provides details of the proposed framework.
Section 3 presents the results and analysis of the experiments. Section 4 concludes the study.
2 OUR PROPOSED METHOD
FOR FER
In this section, we describe the premises of our procedure and propose the structure, which improves the
effectiveness and precision of facial expression recognition. As mentioned above, we use the modular approach,
where we take only the forehead, eyes, nose and lips; from now on we work on these four facial regions. The
proposed FER procedure used in this paper relies on a double-channel design capable of effectively recognizing
expressions. Fig. 2 shows the methodology of the proposed framework, which is divided into three stages:
1) pre-processing, 2) dual-channel feature extraction, and 3) classification by SVM and KNN.
2.1 Pre-processing
Before expression recognition, the image must be preprocessed. Our preprocessing begins with the transformation
of the input image to grayscale, which minimizes the variation in facial images. This step is necessary because
the CNN described below expects a 3-channel input image, so the resulting grayscale face image is replicated
across the 3 channels. Next, we perform two procedures, Histogram Equalization (to handle lighting) and Data
Augmentation (to increase the number of face images in the database). The next sections describe each of these
steps in detail.
2.1.1 Histogram Equalization
In facial images, some further problems should also be considered. Due to the different lighting conditions when
the images are taken, segments of the face appear with varying brightness, which can cause severe interference
in the results of expression recognition. Therefore, we perform histogram equalization (HE) before recognition.
HE is a simple but effective technique in image processing that makes the distribution of gray values in an
image more uniform and reduces the interference caused by illumination.
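As an illustration, the following minimal sketch (assuming OpenCV and NumPy, which the paper does not specify)
performs the grayscale conversion, histogram equalization and 3-channel replication described above.

```python
import cv2
import numpy as np

def preprocess_face(img_bgr):
    """Minimal pre-processing sketch: grayscale conversion + histogram equalization.

    The 3-channel replication at the end is for the VGG16-based channel, which
    expects a 3-channel input; library choice and function name are assumptions.
    """
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    equalized = cv2.equalizeHist(gray)              # flatten the gray-value distribution
    return np.stack([equalized] * 3, axis=-1)       # replicate grayscale into 3 channels
```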
2.1.2 Data Augmentation
A CNN needs an immense dataset to generalize to a specific problem, but the freely accessible FER datasets do not
have enough images to deal with the issue. The data augmentation strategy expands the datasets by creating
synthetic facial images for every original facial image. Inspired by this technique, the following operations
are used as data augmentation: 1) flip the image vertically and horizontally; 2) rotate each image in the
dataset, by a right angle if the face image is square and by 180° if it is rectangular; 3) add small random
noise to the landmarks so as to introduce slight deformations of the faces. In this way, the resulting face
image differs from the original one used with the CNN for pre-training. This distinction between processed and
non-processed data could influence results, because the network has learned the features of the original face
images and may not be able to extract features from the processed face images. Consequently, we also report
results for a network trained with the original facial images.
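A rough sketch of these three augmentation operations is given below; the noise level and the function name are
illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def augment_face(img, landmarks=None, noise_std=1.5, rng=np.random.default_rng()):
    """Generate synthetic variants of one face image (illustrative sketch only)."""
    variants = [np.flipud(img), np.fliplr(img)]      # 1) vertical / horizontal flip
    if img.shape[0] == img.shape[1]:                 # 2) square crop: rotate by 90 degrees
        variants.append(np.rot90(img, k=1))
    else:                                            #    rectangular crop: rotate by 180 degrees
        variants.append(np.rot90(img, k=2))
    jittered = None
    if landmarks is not None:                        # 3) small random noise on landmark points
        jittered = landmarks + rng.normal(0.0, noise_std, landmarks.shape)
    return variants, jittered
```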
2.2 Feature Extraction from Grayscale
Facial Images
The absence of sufficient training samples restricts the performance of CNN-based FER approaches. Expanding the
data can partly deal with the issue of over-fitting. In addition, fine-tuning is used to extract the expression
features from the input face through a deep neural network (DNN) that has achieved great success in comparable
tasks.

Our proposed system uses a DNN for the extraction of expression features for FER based on the VGG network
introduced by Simonyan and Zisserman (Simonyan and Zisserman, 2014). It comes in two versions, VGG-16 and VGG-19
(sixteen and nineteen layers, respectively). VGG16 (Sujata and Mitra, 2020) was chosen for its successful
performance in visual recognition and its fast convergence. It has 138 million parameters and contains 13
convolutional layers, followed by 3 fully connected (FC) layers. The first two FC layers have 4,096 outputs and
the last layer has 2,622 outputs. Since the VGG network is not designed for FER tasks, we adjust the structure
to our needs. Figure 1 shows the fundamental module of the network.

Figure 1: Framework for the modified VGG16_ft network, used for extraction of the facial expression features
from the given face image.

Compared to the original VGG16, our VGG16_ft (where "ft" implies fine-tuning) is simplified by eliminating two
dense layers. The input size is 54×48 for the forehead, 39×117 for the eyes, 50×55 for the nose and 48×74 for
the lips.

We fix the structure of the first four CONV (convolution) blocks of VGG16_ft, but change the structure of the
fifth CONV block and rename each of its layers by appending "_ft" to the original layer name, so the first layer
of the fifth CONV block is called CONV5_1_ft. The parameters of these layers are shown in Table 1. Based on the
experiments, the last dense layer was kept and its size set to 1 × 1024; that output is the extracted feature
of the input image, denoted as feature vector "fv_1_1" for the forehead, "fv_1_2" for the eyes, "fv_1_3" for the
nose and "fv_1_4" for the lips. We reduce the learning rate of the layers belonging to the fifth CONV block to
one tenth (0.001) of the learning rate used for the other CONV blocks (0.01), to ensure that more reliable
information is learned. Finally, the initial part of the network is initialized with the weights of the VGG16
model trained on the ImageNet dataset. ReLU (Rectified Linear Unit) is applied after each convolutional layer.
Table 1: Parameters set for the fifth block.

            CONV5_1_ft   CONV5_2_ft   CONV5_3_ft   POOL5_ft
Filters     256          256          512          -
Size        7×7          3×3          3×3          2×2
Stride      1            1            1            2
Pad         3            0            0            0
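The following Keras sketch illustrates one plausible way to assemble VGG16_ft from a pre-trained VGG16. Layer
names, the use of 'same' padding (instead of the exact padding values of Table 1) and other details are
assumptions; the per-block learning rates of 0.01 and 0.001 would be configured in the optimizer and are not
shown.

```python
from tensorflow.keras import Model, layers
from tensorflow.keras.applications import VGG16

def build_vgg16_ft(input_shape=(54, 48, 3)):
    """Illustrative sketch of the VGG16_ft channel (details assumed).

    Blocks 1-4 come from VGG16 pre-trained on ImageNet and are frozen; a
    re-defined fifth block (filter counts from Table 1) is trainable, followed
    by a single 1024-dim dense layer whose activation is the feature vector fv_1.
    """
    base = VGG16(weights="imagenet", include_top=False, input_shape=input_shape)
    for layer in base.layers:
        layer.trainable = False                      # freeze the first four blocks

    x = base.get_layer("block4_pool").output
    x = layers.Conv2D(256, 7, padding="same", activation="relu", name="conv5_1_ft")(x)
    x = layers.Conv2D(256, 3, padding="same", activation="relu", name="conv5_2_ft")(x)
    x = layers.Conv2D(512, 3, padding="same", activation="relu", name="conv5_3_ft")(x)
    x = layers.MaxPooling2D(2, strides=2, name="pool5_ft")(x)
    x = layers.Flatten()(x)
    fv_1 = layers.Dense(1024, activation="relu", name="fv_1")(x)   # extracted feature vector
    return Model(base.input, fv_1)
```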
2.3 Feature Extraction from HSOG (Histogram of Second-Order Gradients) Facial Images

To the best of our knowledge, there is no model trained on HSOG images, so we first compute the HSOG facial
images.

Figure 2: Illustration of the proposed framework.
2.3.1 Computation of First Order Oriented
Gradient Maps (OGMs)
The image descriptor begins by computing the first-order oriented gradient map (OGM). The initial step in many
feature detectors in image processing is to ensure normalized color and gamma values; this preprocessing has
little effect on performance. The first step of the computation is the calculation of the gradient values. The
most widely used strategy is to apply 1-D centered, point discrete derivative masks in the horizontal and
vertical directions. In particular, this strategy requires filtering the color or intensity information of the
image with the filter kernels [-1, 0, 1] and [-1, 0, 1]^T. The resulting image is denoted as G_1. After that,
these gradient maps are convolved with a Gaussian kernel G_2, defined as

G = G_1 * G_2    (1)

A change in image contrast, in which the intensity values are multiplied by a constant, results in the computed
gradients being multiplied by the same constant. These properties are important for building an image descriptor
for facial expression recognition. Using this first-order OGM G, the second-order gradients are computed in the
next step.
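A small NumPy/SciPy sketch of this first-order step is shown below; the Gaussian width sigma is an assumed value
and the function name is illustrative.

```python
import numpy as np
from scipy.ndimage import convolve, gaussian_filter

def first_order_ogm(image, sigma=1.0):
    """Sketch of the first-order oriented gradient maps, G = G_1 * G_2 (Eq. 1)."""
    img = image.astype(np.float64)
    kernel = np.array([[-1.0, 0.0, 1.0]])           # 1-D centred derivative mask
    gx = convolve(img, kernel)                      # horizontal gradient (G_1, x-direction)
    gy = convolve(img, kernel.T)                    # vertical gradient (G_1, y-direction)
    # smooth the raw gradient maps with a Gaussian kernel G_2
    return gaussian_filter(gx, sigma), gaussian_filter(gy, sigma)
```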
2.3.2 Computation of Second-Order Gradients

Once the first-order OGMs are computed, they are used as inputs to the second-order gradient calculation over
the image region I. At each pixel location, the gradient magnitude Mag and gradient orientation Φ of the OGM
are computed as

Mag(x,y) = sqrt( (∂G(x,y)/∂x)^2 + (∂G(x,y)/∂y)^2 )    (2)

Φ(x,y) = arctan( (∂G(x,y)/∂y) / (∂G(x,y)/∂x) )    (3)

∂G(x,y)/∂x = G(x+1,y) − G(x−1,y)    (4)

∂G(x,y)/∂y = G(x,y+1) − G(x,y−1)    (5)

The orientation Φ lies in the range [−π/2, π/2] and is mapped from [−π/2, π/2] to [0, 2π]. A pivotal issue when
computing the second-order gradients is the sensitivity of the resulting local image descriptor to noise. Using
the Gaussian kernel to simulate human simple cells and to smooth the first-order gradients gives the descriptor
a desirable robustness to noise.
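The magnitude and orientation of Eqs. (2)-(5) can be computed as in the following sketch; arctan2 with a modulo
is used here as one way to realize the mapping to [0, 2π], which the paper does not detail.

```python
import numpy as np

def gradient_magnitude_orientation(G):
    """Compute Mag and Phi of Eqs. (2)-(5) from a smoothed first-order gradient map G."""
    dx = np.zeros_like(G)
    dy = np.zeros_like(G)
    dx[:, 1:-1] = G[:, 2:] - G[:, :-2]              # G(x+1, y) - G(x-1, y), Eq. (4)
    dy[1:-1, :] = G[2:, :] - G[:-2, :]              # G(x, y+1) - G(x, y-1), Eq. (5)
    mag = np.sqrt(dx ** 2 + dy ** 2)                # Eq. (2)
    phi = np.arctan2(dy, dx) % (2.0 * np.pi)        # Eq. (3), mapped to [0, 2*pi)
    return mag, phi
```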
To the best of our knowledge so far, no existing model is trained on HSOG images.
Thus, we build a two-layer CNN model that automatically extracts features from the HSOG facial images.

Fig. 3 illustrates the proposed CNN structure, which consists of an input layer, two convolution layers C_1 and
C_2, and two sub-sampling layers S_1 and S_2. All parameters used in the proposed CNN are listed in Table 2.
Table 2: Parameters set for the proposed CNN.

            C_1    S_1    C_2    S_2
Filters     64     -      256    -
Size        7×7    2×2    3×3    2×2
Stride      1      2      1      2
Pad         3      0      0      0
The output is then given to a fully connected (dense) layer with 1024 neurons, from which the feature vector
fv_2 of size 1 × 1024 is extracted. To handle non-linearity, ReLU activations are included after the S_1 and S_2
layers. We then extract the feature vector "fv_2_1" for the forehead, "fv_2_2" for the eyes, "fv_2_3" for the
nose and "fv_2_4" for the lips.
Figure 3: Framework for the proposed CNN used for ex-
traction of the expression features from the HSOG facial
images.
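A hedged Keras sketch of this two-layer CNN, following Table 2 and the ReLU placement described above (the input
size and other details are assumptions), could look as follows.

```python
from tensorflow.keras import Input, Model, layers

def build_hsog_cnn(input_shape=(54, 48, 1)):
    """Sketch of the two-layer CNN for HSOG images (parameters from Table 2)."""
    inp = Input(shape=input_shape)
    x = layers.ZeroPadding2D(3)(inp)                               # pad 3 for C_1
    x = layers.Conv2D(64, 7, strides=1, name="C_1")(x)
    x = layers.MaxPooling2D(2, strides=2, name="S_1")(x)
    x = layers.Activation("relu")(x)                               # ReLU after S_1
    x = layers.Conv2D(256, 3, strides=1, name="C_2")(x)
    x = layers.MaxPooling2D(2, strides=2, name="S_2")(x)
    x = layers.Activation("relu")(x)                               # ReLU after S_2
    x = layers.Flatten()(x)
    fv_2 = layers.Dense(1024, activation="relu", name="fv_2")(x)   # feature vector fv_2
    return Model(inp, fv_2)
```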
2.4 Concatenation of Different Outputs
and Classification
The feature vectors from the forehead (fv_1_1), eyes (fv_1_2), nose (fv_1_3) and lips (fv_1_4) are concatenated
into one long feature vector for the facial expression. These are the features that originate from the grayscale
images using VGG16_ft with the fine-tuning method. Likewise, the feature vector fv_2 is a concatenation of the
feature vectors from the forehead (fv_2_1), eyes (fv_2_2), nose (fv_2_3) and lips (fv_2_4); these fv_2 features
come from the HSOG facial images using the proposed CNN architecture. Finally, we obtain the full feature vector
"fv", which is the combination of fv_1 and fv_2 and goes to the next stage for classification.

In the classification step, the similarity between the extracted features of the training set and those of the
test set is assessed by SVM and K-nearest-neighbor (K = 1, 2, 3) classifiers with different distance measures.
The Support Vector Machine (SVM) algorithm is applied to classification. Once all local image descriptors have
been converted into a fixed-length feature vector, the similarity between each pair of feature vectors is
computed. Finally, each test image is classified into the class with the greatest SVM decision value. We tune
the parameters of the classifier on the training set and report the recognition accuracy on the test set.

The other classifier is the K-nearest-neighbor (K = 1, 2, 3) classifier with various distance measures;
Euclidean distance, Chi-square distance, and histogram intersection (HI) are used in our experiments. Mean
squared error (MSE) is used to compute the loss for the SVM and KNN.
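A minimal scikit-learn sketch of the fusion and classification stage is given below; the function name and the
data layout are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

def classify_concatenated(fv1_parts, fv2_parts, labels, k=1):
    """Concatenate per-region features from both channels and fit SVM / KNN.

    fv1_parts / fv2_parts: lists of per-region feature matrices (forehead, eyes,
    nose, lips) from VGG16_ft and the proposed CNN, one row per sample.
    """
    fv = np.concatenate(fv1_parts + fv2_parts, axis=1)   # full feature vector "fv"
    svm = SVC(decision_function_shape="ovr").fit(fv, labels)
    knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean").fit(fv, labels)
    return svm, knn
```

Chi-square or histogram-intersection distances could be supplied to KNeighborsClassifier as a user-defined
callable metric, which scikit-learn supports.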
3 EXPERIMENTAL ANALYSIS
AND RESULTS
To support the theoretical conclusions of the proposed framework, tests have been conducted on several genuine
datasets, as documented in this section.

FER experiments have been conducted on four FER datasets. Facial images generally have very large dimensions,
and handling such high-dimensional information is hard for machines. Hence, the modular methodology is applied,
in which only certain informative areas of the face are considered. Facial expressions give signals of an
individual's emotional state, even without verbal communication. The eyes are the most open aspect of an
individual's face and reveal a lot about their feelings. Besides the eyes, the lips, forehead and nose are also
informative regions. During the expression analysis task, we observed that, in addition to the eyes, nose and
lips/mouth, the forehead also plays a significant role in expressions. Most FER strategies are currently applied
to full face images; this article focuses only on certain informative areas of the face, as discussed. For
comparison, we performed holistic experiments (where the full face image is used) as well as modular ones.

To evaluate the effectiveness of our method, our FER methodology is implemented with Keras on the macOS Mojave
platform. To make the assessments correct and effective, 4 benchmark datasets consisting of facial images were
used; they are described below.
3.0.1 JAFFE
The JAFFE (Lyons et al., 1998) database contains 213 facial images of 10 Japanese female models with 7 facial
expressions (6 basic facial expressions + 1 neutral). Out of the 213 images, 140 randomly chosen images were
used for training and the remaining 73 for testing. Fig. 4 shows the trends of accuracy and loss during training
and testing as the number of iterations increases. Table 3 reports the average accuracy for both the holistic
and the modular approach. On the JAFFE dataset the average accuracy is 95.67%.
Figure 4: Curves of Accuracy and Loss during training and
testing phases for JAFFE dataset.
3.0.2 Video
The VIDEO (Shikkenawis and Mitra, 2016) database contains videos of 11 persons; each video contains four
different facial expressions: smiling, angry, open mouth, and normal. Out of 6668 images, 70% were randomly
chosen for training and the remaining 30% used for testing. Fig. 5 shows the trends of accuracy and loss during
training and testing as the number of iterations increases. Table 3 reports the average accuracy for both the
holistic and the modular approach. On the VIDEO dataset the average accuracy is 97.77%.
Figure 5: Curves of Accuracy and Loss during training and
testing phases for VIDEO dataset.
3.0.3 CK+
CK+ (Lucey et al., 2010) contains 593 sequences across 123 subjects showing 8 facial expressions. All sequences
are captured from the neutral face to the peak expression. Participants were 18 to 50 years of age, 69% female,
81% Euro-American, 13% Afro-American, and 6% from other groups. This paper uses image sequences of 99 subjects
with 7 facial expressions. Fig. 6 shows the trends of accuracy and loss during training and testing as the
number of iterations increases. Table 3 reports the average accuracy for both the holistic and the modular
approach. On the CK+ dataset the average accuracy is 94.78%.
Figure 6: Curves of Accuracy and Loss during training and
testing phases for CK+ dataset.
3.0.4 Oulu-Casia
Oulu-Casia (Zhao et al., 2011) contains 6 facial expressions (anger, happiness, surprise, fear, disgust and
sadness) from 80 different subjects between 23 and 58 years of age; 73.8% of the persons are male. Out of 3360
images, 70% were randomly chosen for training and the remaining 30% used for testing. Fig. 7 shows the trends of
accuracy and loss during training and testing as the number of iterations increases. Table 3 reports the average
accuracy for both the holistic and the modular approach. On the Oulu-Casia dataset the average accuracy is 95.86%.
Figure 7: Curves of Accuracy and Loss during training and
testing phases for Oulu-Casia dataset.
Table 3: Comparison between the holistic and modular approaches in our proposed framework with SVM and KNN as
classifiers, for all datasets (average accuracy (%) over 50 iterations). KNN is reported with Euclidean,
Chi-square and histogram-intersection (HI) distances.

                         Holistic                               Modular
Datasets       SVM    KNN-Eucl.  KNN-ChiSq.  KNN-HI     SVM    KNN-Eucl.  KNN-ChiSq.  KNN-HI
JAFFE          90.02  87.32      81.42       80.02      95.67  92.14      88.32       86.97
VIDEO          91.55  87.43      79.89       77.12      97.77  91.54      88.12       86.91
CK+            90.15  87.71      86.41       83.54      94.78  91.24      86.79       85.64
OULU-CASIA     89.40  86.30      79.79       75.20      95.86  88.76      84.61       82.20
3.1 Expressions at Different Intensity
Rates
There are several models of the nature of emotion and of how it is represented in the body and brain. The goal
here is to determine the distinct level of an emotion: with this approach, we find the predominant emotion as
well as the rates of the emotions present on the face. We introduce a new strategy to analyze the level of an
emotion as the face moves from one phase of the emotion to the next higher state. Some of the emotions are
influenced by changes over the time interval, as shown in the graphs in Fig. 8, 9, 10 and 11, where basic
variations in the proportions of emotions at different time intervals are also represented. Our methodology is
very useful for exploring micro-expressions.
Figure 8: (a)-(d) Expression of the happy face from the lowest level to the extreme level; (e) graphical
representation of expression percentages and of how other expressions vary while the expression level changes
from low to high.

Figure 9: (a)-(d) Expression of the sad face from the lowest level to the extreme level; (e) graphical
representation of expression percentages and of how other expressions vary while the expression level changes
from low to high.

Figure 10: (a)-(d) Expression of the surprise face from the lowest level to the extreme level; (e) graphical
representation of expression percentages and of how other expressions vary while the expression level changes
from low to high.

Figure 11: (a)-(d) Expression of the angry face from the lowest level to the extreme level; (e) graphical
representation of expression percentages and of how other expressions vary while the expression level changes
from low to high.
4 CONCLUSION
In this study, we investigate an FER technique based on a double-channel architecture that processes the
grayscale facial image and the HSOG facial image at the same time. The two image channels are complementary and
capture local and global information from the given grayscale and HSOG facial images, which enhances the
recognition capacity. A concatenation strategy is proposed to fully use the features extracted from both image
channels (VGG16_ft and the proposed CNN). VGG16_ft automatically extracts the features from the given grayscale
face images, and the proposed CNN is built to automatically extract the features from the HSOG facial images,
since no pre-trained model is trained on HSOG facial images. Furthermore, the concatenated features are
classified using SVM and KNN classifiers with different distance measures.

The capability of our proposed method is to recognize a person's facial expression using only partial
information from the whole face image. The proposed method is applied to the most informative regions of the
face, i.e., forehead, eyes, nose, and lips. It is observed that a combination of these regions is sufficient to
distinguish the facial expressions of different persons, or of the same person, in most cases. The results
obtained by the proposed method are comparable with most of the state-of-the-art methods.
REFERENCES
Dalal, N. and Triggs, B. (2005). Histograms of oriented gra-
dients for human detection. In Computer Vision and
Pattern Recognition, 2005. CVPR 2005. IEEE Com-
puter Society Conference on, volume 1, pages 886–
893. IEEE.
King, D. E. (2009). Dlib-ml: A machine learning
toolkit. Journal of Machine Learning Research,
10(Jul):1755–1758.
Liu, C. and Wechsler, H. (2002). Gabor feature based classi-
fication using the enhanced fisher linear discriminant
model for face recognition. IEEE Transactions on Im-
age processing, 11(4):467–476.
Lucey, P., Cohn, J. F., Kanade, T., Saragih, J., Ambadar,
Z., and Matthews, I. (2010). The extended cohn-
kanade dataset (ck+): A complete dataset for action
unit and emotion-specified expression. In Computer
Vision and Pattern Recognition Workshops (CVPRW),
2010 IEEE Computer Society Conference on, pages
94–101. IEEE.
Lyons, M., Akamatsu, S., Kamachi, M., and Gyoba,
J. (1998). Coding facial expressions with gabor
wavelets. In Automatic Face and Gesture Recognition,
1998. Proceedings. Third IEEE International Confer-
ence on, pages 200–205. IEEE.
Mase, K. (1991). Recognition of facial expression from
optical flow. IEICE TRANSACTIONS on Information
and Systems, 74(10):3474–3483.
Shikkenawis, G. and Mitra, S. K. (2016). On some vari-
ants of locality preserving projection. Neurocomput-
ing, 173:196–211.
Simard, P. Y., Steinkraus, D., and Platt, J. C. (2003). Best practices for convolutional neural networks applied
to visual document analysis. In Proceedings of the Seventh International Conference on Document Analysis and
Recognition (ICDAR), page 958. IEEE.
Simonyan, K. and Zisserman, A. (2014). Very deep con-
volutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556.
Sujata and Mitra, S. K. (2020). Dnnfg: Dnn based on fourier
transform followed by gabor filtering for the modular
fer. In ICPRAM, pages 212–219.
Sun, Y., Liang, D., Wang, X., and Tang, X. (2015).
Deepid3: Face recognition with very deep neural net-
works. arXiv preprint arXiv:1502.00873.
Valstar, M., Pantic, M., and Patras, I. (2004). Motion history
for facial action detection in video. In 2004 IEEE In-
ternational Conference on Systems, Man and Cyber-
netics (IEEE Cat. No. 04CH37583), volume 1, pages
635–640. IEEE.
Walecki, R., Rudovic, O., Pavlovic, V., and Pantic, M.
(2015). Variable-state latent conditional random fields
for facial expression recognition and action unit detec-
tion. In 2015 11th IEEE International Conference and
Workshops on Automatic Face and Gesture Recogni-
tion (FG), volume 1, pages 1–8. IEEE.
Zhang, T. (2017). Facial expression recognition based on
deep learning: A survey. In International Conference
on Intelligent and Interactive Systems and Applica-
tions, pages 345–352. Springer.
Zhang, W., Zhang, Y., Ma, L., Guan, J., and Gong, S.
(2015). Multimodal learning for facial expression
recognition. Pattern Recognition, 48(10):3191–3202.
Zhao, G., Huang, X., Taini, M., Li, S. Z., and Pietikäinen, M. (2011). Facial expression recognition from
near-infrared videos. Image and Vision Computing, 29(9):607–619.
Zhong, L., Liu, Q., Yang, P., Huang, J., and Metaxas, D. N.
(2014). Learning multiscale active facial patches for
expression analysis. IEEE transactions on cybernet-
ics, 45(8):1499–1510.