DNNFG: DNN based on Fourier Transform Followed by Gabor Filtering for the Modular FER
Sujata and Suman K. Mitra
Dhirubhai Ambani Institute of Information and Communication Technology, Gandhinagar, Gujarat, India
Keywords: CNN, DNN, VGG16, SVM, KNN.
Abstract:
The modular approach mimics the capability of the human brain to identify a person from a limited facial part.
In this article, we experimentally show that facial parts such as the eyes, nose, lips, and forehead contribute more
to the expression recognition task. A deep neural network, VGG16_ft, is proposed to automatically extract
features from the given facial images. Fine-tuning pre-trained models is very fruitful for FER (Facial Expression Recognition)
when sufficient facial images cannot be collected. Two preprocessing approaches, the Fourier
transform followed by Gabor filters and Data Augmentation (DA), are applied to the facial regions
used for FER. The features from the four facial regions are concatenated and
classification is done using SVM and KNN (with different distance measures). The experimental results show
that the proposed framework can recognize facial expressions such as happiness, anger, sadness, surprise, disgust, and
fear with high accuracy for benchmark datasets such as “JAFFE”, “VIDEO”, “CK+” and “Oulu-Casia”.
1 INTRODUCTION
Many of the existing techniques perform facial expression recognition on static images or image sequences
without considering temporal information, owing to the convenience of data handling and the easy availability
of training and testing material. Training deep neural networks on small FER datasets leads
to overfitting. To mitigate this issue, several studies use additional task-oriented data either to pre-train their
networks from scratch or to fine-tune existing pre-trained models such as VGG (Simonyan and Zisserman,
2014), GoogleNet (Szegedy et al., 2015) and AlexNet (Krizhevsky et al., 2012). Kahou et al. (Kahou et al.,
2013) and (Kaneko et al., 2016) demonstrated that the use of extra data enables high-capacity models to be
trained without overfitting, thereby improving FER performance.
Based on the traditional CNN architecture, several studies have proposed adding well-designed auxiliary
layers or blocks to improve the expression-related features that are learned. HoloNet (Yao et al., 2016)
was designed for FER; it is based on a new CNN architecture in which a residual structure is combined
with CReLU (Shang et al., 2016) to extend the depth of the network without
reducing efficiency. Another CNN-based network, SSE (Supervised Scoring Ensemble) (Hu et al., 2017),
extends the degree of supervision for FER.
An FSN (Feature Selection Network) incorporates a feature selection mechanism into AlexNet, which
automatically filters out irrelevant features and focuses on the relevant ones in the learned facial feature maps.
Previous analyses indicated that ensembles of several networks would outperform a single network.
Many existing FER networks specialize in one task and learn expression-sensitive features
without considering interactions with other factors.
In reality, however, FER is intertwined with other variables, for example head pose, subject
identity, and illumination. Reed et al. (Reed et al., 2014) developed a Boltzmann machine that
disentangles the factors of variation relevant to facial expressions.
The works of (Devries et al., 2014) and (Pons and Masip, 2018) suggested that performing FER jointly
with related tasks, such as locating facial landmarks and detecting facial action units (AUs) (Ekman
and Rosenberg, 1997), can mutually enhance FER performance. In (Zhao et al., 2015), deep belief
networks (DBNs) were first trained to detect faces and then to identify expression-related regions;
these parsed face components were then classified by a stacked autoencoder.
In (Rifai et al., 2012), CCNET was proposed to learn LTI (local translation-invariant) representations.
For pose-invariant expression recognition, Lai et al. (Lai and Lai, 2018) proposed
a GAN (Generative Adversarial Network)-based frontalization framework, in which the generator frontalizes
the input face images while preserving the identity and expression attributes, and the discriminator distinguishes
the original face images from the generated ones. Zhang et al. (Zhang et al., 2018) also proposed
a GAN model that generates images with various facial expressions under arbitrary poses
for multi-view FER.
The objective of this article is more basic, but also more general, namely: can recurrent connectivity
from associative areas to perceptive areas be useful for classifying expressive events? Our hypothesis
is that deep neural network connectivity offers an advantage in recognizing and anticipating
more ambiguous expressions, for example at the beginning of a sequence that evolves from a neutral
expression toward higher intensity. To validate this very general hypothesis using computational models, we
compare the simplest comparable types of deep neural networks to test the importance of recurrent
connections, with everything else kept as similar as possible (i.e., identical learning rate, synaptic weight correction,
training/testing procedure, etc.).
Humans have the capability to identify a person from a limited facial part. To extract these facial parts
from the face, we use the facial landmark detection algorithm offered by Dlib, an open-source
machine learning library. This algorithm is an implementation of the Ensemble of Regression Trees
and uses pixel intensity differences to directly estimate the landmark positions. The algorithm has a
very fast response rate and detects a set of 68 landmarks on a given face. The landmarks (key points) of
our interest are those that describe the forehead, eyes, nose, and lips. Using the landmarks of the eyes, eyebrows,
and nose, we find the upper patch between the two eyes, which we call the forehead. Using the
landmarks of the eyes we extract the eye region, and similarly the nose and lips. These patches
are cropped out of each face image and saved. These regions are selected because they provide most of the
information about the expressions, as shown in (Taheri et al., 2014).
Our proposed framework is primarily based on an architecture that processes the grayscale facial regions
of the input face image, as shown in Fig. 1. Some preprocessing steps, such as the Fourier transform followed
by Gabor filters and Data Augmentation (DA) (for increasing the number of training samples), are
essential for the given facial images. The proposed VGG16_ft uses the original parameters acquired from the pre-trained
VGG16, trained on the ImageNet dataset, to extract the expression-related features from the grayscale
facial images. The outputs from all the regions are concatenated into a large feature vector. Finally,
SVM and KNN with different distance measures are used for the classification, to predict the basic facial
expressions (happiness, anger, sadness, disgust, surprise, and fear).
To show its efficiency, the proposed framework is tested in a modular way on well-known facial expression databases
such as the JAFFE database (Lyons et al., 1998), the VIDEO database (Shikkenawis and Mitra, 2016), the CK+
database (Lucey et al., 2010), and Oulu-Casia (Zhao et al., 2011). This is similar in spirit to Ekman's FACS
(Facial Action Coding System) (Ekman and Friesen, 1976). The modular approach is another
main contribution of the present work: instead of the full face, only significant portions (forehead, eyes,
nose, and lips) of the face image are used.
The rest of the article is organized as follows. Sec-
tion 2 provides details of the proposed framework.
Section 3 shows the experiment results and analysis.
Section 4 concludes the study.
2 PROPOSED ARCHITECTURE
In this section, we discuss the premise of our technique and the proposed framework, which improves the
efficiency and accuracy of facial expression recognition. As mentioned earlier for the modular approach,
from now on we consider only the forehead, eyes, nose, and lips regions. Fig. 1 illustrates the procedure
of the proposed framework, which is divided into three phases: 1) preprocessing, 2) feature extraction, and
3) classification using SVM and KNN.
2.1 Preprocessing
We use preprocessing similar to that of the EMPATH model given by Dailey et al. (Dailey et al.,
2002). Before recognition, some image preprocessing needs to be done. Our preprocessing
starts with transforming the input facial image to grayscale, which minimizes the variation
among face images. Because the CNN described later expects a 3-channel input,
the grayscale facial image is replicated across the 3 channels. Subsequently, we run two procedures:
the Fourier transform followed by Gabor filters
(to improve speed and encode the edges) and Data Augmentation (to increase the number of face images in the
database). The subsequent sections describe each of these steps in detail.

Figure 1: Illustration of the proposed framework.
2.1.1 Fast Fourier Transform (FFT) and Gabor
Filtering
The Fast Fourier Transform can speed up our procedure considerably. Computing the 1-Dimensional
(1D) Fourier transform of N points directly requires on the order of N^2 addition/multiplication operations,
whereas the Fast Fourier Transform (FFT) fulfills the same task in O(N log N) operations. The 2D Fourier
transform is computed by the following equation:
F(p, q) = \frac{1}{MN} \sum_{r=0}^{M-1} \sum_{s=0}^{N-1} f(r, s)\, e^{-j 2\pi \left( \frac{pr}{M} + \frac{qs}{N} \right)}    (1)
The face images are transformed into the Fourier domain and filtered by 48 Gabor filters
(GFs) corresponding to 6 spatial frequencies, with one octave between the centers of two
consecutive spatial frequency channels, namely f_i = 5.41, 10.77, 21.60, 43.20, 86.40, 172.8 cycles per
face image, and eight distinct orientations, namely θ = 0, π/8, 2π/8, 3π/8, 4π/8, 5π/8, 6π/8, 7π/8 radians.
GFs can effectively express the characteristics of texture. They capture the most salient visual
properties and give very positive results in face recognition. GF kernels contain a real part and an
imaginary part, and they resemble the receptive field profiles of simple cortical cells, characterized by
spatial localization, orientation selectivity, and frequency selectivity. An image is processed by the
kernels to produce its corresponding frequency images, which are then used to
compute the Gabor features of the image.
Different experiments have demonstrated that GFs provide a good estimate of the
receptive fields of the simple cells of the primary visual cortex (Hubel and Wiesel, 1968), given
that the statistical analysis of the residual error between the response profiles of
V1 simple cells and Gabor filters is not distinguishable from chance (Jones and Palmer, 1987).
The face images are transferred to the Fourier domain to boost the speed and ease the mathematical
processing, and the GFs are applied to every thumbnail by means of multiplication in the spectral
domain (which corresponds to a convolution with the Gabor receptive fields in the spatial domain):

G(p, q) = \exp\left[ -\left( \frac{(p_\theta - f_i)^2}{2\sigma_u^2} + \frac{q_\theta^2}{2\sigma_v^2} \right) \right]    (2)

where p_\theta = p \cos\theta + q \sin\theta and q_\theta = q \cos\theta - p \sin\theta, and \sigma_u and \sigma_v
are the standard deviations (SDs) of the Gaussian envelope along p_\theta and q_\theta (i.e., along and
orthogonal to the orientation \theta), respectively. The outputs of the Gabor channels are the local energy
spectra obtained by multiplying the spectrum by the GF kernel. Since the GFs are applied to the images
in the Fourier domain, we obtain 48 filtered images for each given image, from the 6 spatial
frequencies and 8 orientation channels.
Figure 2: Framework for the modified VGG16_ft network, used for extraction of the expression features from the given facial images.
2.1.2 Data Augmentation
CNNs need massive amounts of data in order to be able to generalize to a given problem. However, publicly available
FER databases do not have sufficient images to handle the problem. Simard et al. (Simard et al., 2003)
suggested a data augmentation (DA) procedure to extend databases through the creation of synthetic face
images for every original face image. Inspired by this procedure, the following operations were used
for data augmentation: 1) flipping the image vertically and horizontally, 2) rotating each database image,
at right angles if the image is square and by 180° if the image is rectangular, and 3) adding random noise
to the landmarks so as to introduce small deformations of the faces.
2.2 Feature Extraction From Given
Facial Images
Our proposed framework utilizes a DNN (deep neural network) for feature extraction; for FER it relies
on the VGG network of Simonyan and Zisserman (Simonyan and Zisserman, 2014). They came up with
two versions of VGG: VGG-16 and VGG-19 (i.e., sixteen and nineteen layers, respectively). VGG16 is
chosen because of its effective performance in visual detection and its fast convergence. It has about
138 million parameters and contains 13 convolutional layers followed by 3 fully-connected layers
(FCs). The first two fully connected layers (FCs) have 4,096 outputs and the last layer has 2,622 outputs.
Since the VGG framework was not designed for FER tasks, we modified it according
to our requirements. Fig. 2 shows the essential module of the framework. Compared with the original
VGG16, our VGG16_ft (where "ft" means fine-tuning) is simplified by removing two dense layers.
The dimensions of the input data are 54 × 48 for the forehead, 39 × 117 for the eyes, 50 × 55 for the nose,
and 48 × 74 for the lips.
Table 1: Parameters set for the fifth block.

          conv5_1_ft   conv5_2_ft   conv5_3_ft   Maxpool5_ft
Filters   512          512          1024         -
Size      7×7          5×5          3×3          2×2
Stride    1            1            1            2
Pad       3            0            0            0
We keep the structures of the first four conv (convolution) blocks of VGG16_ft unchanged, but we change
the structure of the fifth conv block and rename each of its layers by adding "ft" at the end of the original
layer name, so that a layer of the fifth conv block is named, for example, conv5_1_ft. The parameters that
change the structure of these layers are shown in Table 1. Based on experiments, the last dense layer is
preserved and its dimension is set to 1 × 1024. This output is the extracted feature of the input image,
denoted as feature vector "fv_1" for the forehead, "fv_2" for the eyes, "fv_3" for the nose, and "fv_4" for
the lips. We reduce the learning rate of the layers belonging to the fifth conv block by a factor of 10 (the
learning rate for the fifth conv block is 0.001, while 0.01 is used for the other conv blocks) to ensure that
they learn more useful information. Finally, the initial portion of the network is initialized with the
VGG16 model weights trained on the ImageNet dataset. ReLU (Rectified Linear Unit) is applied
after every convolutional layer.
2.3 Concatenation of Different Outputs
and Classification
Fig. 1 shows our proposed framework. The expression feature fv is the concatenation of the feature
vectors coming from the forehead (fv_1), eyes (fv_2), nose (fv_3), and lips (fv_4). After obtaining the feature vector,
the next step is classification. In the classification process, the similarity between the extracted features
of the display set and the probe set is evaluated by the SVM and by the K-nearest-neighbor (K = 1, 2, 3) classifier
with various distance measures. The Euclidean distance, the Chi-square distance, and the histogram intersection
(HI) are used in our experiments; they are defined in Eqs. 3, 4, and 5:
d(x_1, y_1) = \sqrt{ \sum_{i=0}^{n} (x_{1i} - y_{1i})^2 }    (3)

\chi^2 = \sum_{i,j} \frac{(x_{1i,j} - y_{1i,j})^2}{x_{1i,j} + y_{1i,j}}    (4)

D_{HI}(x_1, y_1) = \sum_{i,j} \min(x_{1i,j}, y_{1i,j})    (5)
For computing the loss we use the MSE (mean squared error), which so far works best for the SVM and KNN
classification. It is defined as

Loss = \frac{1}{N} \sum_{i=1}^{N} \| O_i - O'_i \|^2    (6)

where N is the total number of input images and O_i and O'_i are the true and predicted outputs, respectively.
3 EXPERIMENTAL RESULTS
AND ANALYSIS
To validate the hypothetical conclusions of the proposed framework, experiments were performed on
four facial datasets. 1) The JAFFE database has 213 facial images of 10 Japanese female models showing 7 facial
expressions (6 basic facial expressions + 1 neutral). All face images are of size 256 × 256 and are
cut into the four regions discussed above. Out of 213 images, 140 randomly chosen images were used for training and
the remaining 73 for testing.
2) The VIDEO database has videos of 11 persons. Each video contains four different expressions: normal,
smiling, angry, and open mouth. Out of 6668 images, 70% were randomly chosen for training
and the remaining 30% were used for testing.
3) CK+ contains 593 sequences across 123 persons showing 8 facial expressions. This paper uses the
image sequences of 99 subjects with 7 facial expressions. The face images of CK+ are cut into the four informative
regions.
4) Oulu-Casia has 6 facial expressions (anger, happiness, surprise, fear, disgust, and sadness) from 80 different
subjects between 23 and 58 years of age; 73.8% of the subjects are male. Out of 3360 images, 70% were
randomly chosen for training and the remaining 30% were used for testing. The face images
of Oulu-Casia are cut into the four informative regions.
The convergence of the proposed methodology is assessed on the four benchmark datasets, and the outcomes
are shown in Figs. 3, 4, 5, and 6. Each figure shows the trends of accuracy and loss
as the number of iterations increases. Table 2 shows the comparison between the holistic and the modular approach
in our proposed framework with SVM and KNN as the classifiers for all datasets.
Figure 3: Curves of Accuracy and Loss during training and
testing phases for JAFFE dataset.
Figure 4: Curves of Accuracy and Loss during training and
testing phases for VIDEO dataset.
Figure 5: Curves of Accuracy and Loss during training and
testing phases for CK+ dataset.
Figure 6: Curves of Accuracy and Loss during training and
testing phases for Oulu-Casia dataset.
Table 2: Comparison between the holistic and modular approaches in our proposed framework with SVM and KNN as the classifier for all datasets (in terms of average accuracy (%) reported for 50 iterations; HI = Histogram Intersection).

                          Holistic                                     Modular
Datasets       SVM    KNN-Euclidean  KNN-Chi Sq.  KNN-HI    SVM    KNN-Euclidean  KNN-Chi Sq.  KNN-HI
JAFFE         93.02       90.32         82.42      80.02   95.87       93.42         90.32      87.27
VIDEO         92.47       88.23         80.98      78.02   96.67       92.50         89.41      85.49
CK+           91.45       88.71         86.41      83.54   96.78       91.24         87.79      89.64
OULU-CASIA    91.40       87.30         79.89      76.20   96.08       89.56         85.64      83.20
Table 3: Comparison with recognition accuracy reported in some state-of-the-art facial expression methods.

Database     Methods                                    Network Type         Additional Classifiers   Accuracy
CK+          Ouellet (Ouellet, 2014)                    CNN (AlexNet)        SVM                      94.40%
             Li et al. (Li and Lam, 2015)               RBM                  -                        95.04%
             Liu et al. (Liu et al., 2014)              DBN                  Adaboost                 96.07%
             Liu et al. (Liu et al., 2013)              CNN, RBM             SVM                      92.05%
             Khorrami et al. (Khorrami et al., 2015)    zero-bias CNN        -                        95.01%
             Ding et al. (Ding et al., 2017)            CNN with fine-tune   -                        96.08%
             Zeng et al. (Zeng et al., 2018)            DAE                  -                        93.78%
             Cai et al. (Cai et al., 2018)              CNN + loss layer     -                        90.66%
             Liu et al. (Liu et al., 2017)              CNN + loss layer     -                        96.10%
             Yang et al. (Yang et al., 2018)            GAN                  -                        96.00%
             Ours                                       DNNFG                                         96.78%
JAFFE        Hamester et al. (Hamester et al., 2015)    CNN, CAE             -                        95.8%
             Shan et al. (Shan et al., 2017)            CNN                  -                        76.74%
             Liu et al. (Liu et al., 2014)              DBN                  Adaboost                 91.8%
             Yang et al. (Yang et al., 2018)            WMDNN                -                        92.89%
             Su et al. (Su et al., 2017)                zero-bias CNN        -                        95.01%
             Ours                                       DNNFG                SVM                      95.87%
OULU-CASIA   Yang et al. (Yang et al., 2018)            WMDNN                -                        92.89%
             Salmam et al. (Salmam et al., 2018)        FDP                  NN                       84.70%
             Aly et al. (Aly et al., 2016)              HOG                  DKDA                     84.21%
             Lopes et al. (Lopes et al., 2017)          CNN                  -                        96.42%
             Ours                                       DNNFG                SVM                      96.08%
To assess the qualitative performance of the proposed framework, facial images were gathered from the
Internet for evaluation. Fig. 7 shows successful expression recognition, whereas Fig. 8 shows failed
expression recognition. Table 3 compares the recognition results of the proposed technique with those of a few
state-of-the-art neural-network-based facial expression recognition techniques.
4 CONCLUSION
This study investigates an FER technique primarily based on an architecture that processes the facial regions
of the given grayscale facial image and captures the local information of the face. VGG16_ft automatically
extracts the features from the given facial regions, and the features from all the regions of the face
are concatenated.
Fine-tuning is used to train the system starting from the original parameters obtained from the ImageNet
dataset.
Furthermore, classifiers such as SVM and KNN (with different distance measures) are used to classify the
concatenated features. The proposed technique for recognizing an individual's expression using partial
information from the whole face image is explored in this work. The proposed method is applied to the most
informative regions of the face, i.e., the forehead, eyes, nose, and lips. It is observed that a combination of these
regions is informative enough to distinguish the facial expressions of different persons, or of the same person, in most
cases. The evaluation was done on four datasets (JAFFE, VIDEO, CK+, and Oulu-Casia) to prove the
effectiveness of our framework in recognizing the basic expressions.

Figure 7: Successful recognition of face images taken from the Internet.

Figure 8: Unsuccessful recognition of face images taken from the Internet.
In the future, we will focus on simplifying the network in order to speed up the algorithm. Furthermore, we
intend to add more channels of facial images that can be utilized to improve the framework.
REFERENCES
Aly, S., Abbott, A. L., and Torki, M. (2016). A multi-
modal feature fusion framework for kinect-based fa-
cial expression recognition using dual kernel discrimi-
nant analysis (dkda). In 2016 IEEE Winter Conference
on Applications of Computer Vision (WACV), pages 1–
10. IEEE.
Cai, J., Meng, Z., Khan, A. S., Li, Z., O’Reilly, J., and
Tong, Y. (2018). Island loss for learning discrimina-
tive features in facial expression recognition. In Au-
tomatic Face & Gesture Recognition (FG 2018), 2018
13th IEEE International Conference on, pages 302–
309. IEEE.
Dailey, M. N., Cottrell, G. W., Padgett, C., and Adolphs,
R. (2002). Empath: A neural network that categorizes
facial expressions. Journal of cognitive neuroscience,
14(8):1158–1173.
Devries, T., Biswaranjan, K., and Taylor, G. W. (2014).
Multi-task learning of facial landmarks and expres-
sion. In 2014 Canadian Conference on Computer and
Robot Vision (CRV), pages 98–103. IEEE.
Ding, H., Zhou, S. K., and Chellappa, R. (2017).
Facenet2expnet: Regularizing a deep face recognition
net for expression recognition. In Automatic Face &
Gesture Recognition (FG 2017), 2017 12th IEEE In-
ternational Conference on, pages 118–126. IEEE.
Ekman, P. and Friesen, W. V. (1976). Measuring facial
movement. Environmental psychology and nonverbal
behavior, 1(1):56–75.
Ekman, P. and Rosenberg, E. L. (1997). What the face
reveals: Basic and applied studies of spontaneous
expression using the Facial Action Coding System
(FACS). Oxford University Press, USA.
Hamester, D., Barros, P., and Wermter, S. (2015). Face ex-
pression recognition with a 2-channel convolutional
neural network. In Neural Networks (IJCNN), 2015
International Joint Conference on, pages 1–8. IEEE.
Hu, P., Cai, D., Wang, S., Yao, A., and Chen, Y. (2017).
Learning supervised scoring ensemble for emotion
recognition in the wild. In Proceedings of the 19th
ACM International Conference on Multimodal Inter-
action, pages 553–560. ACM.
Hubel, D. H. and Wiesel, T. N. (1968). Receptive fields and
functional architecture of monkey striate cortex. The
Journal of physiology, 195(1):215–243.
Jones, J. P. and Palmer, L. A. (1987). An evaluation of the
two-dimensional gabor filter model of simple recep-
tive fields in cat striate cortex. Journal of neurophysi-
ology, 58(6):1233–1258.
Kahou, S. E., Pal, C., Bouthillier, X., Froumenty, P.,
Gülçehre, Ç., Memisevic, R., Vincent, P., Courville,
A., Bengio, Y., Ferrari, R. C., et al. (2013). Combin-
ing modality specific deep neural networks for emo-
tion recognition in video. In Proceedings of the 15th
ACM on International conference on multimodal in-
teraction, pages 543–550. ACM.
Kaneko, T., Hiramatsu, K., and Kashino, K. (2016). Adap-
tive visual feedback generation for facial expression
improvement with multi-task deep neural networks. In
Proceedings of the 2016 ACM on Multimedia Confer-
ence, pages 327–331. ACM.
Khorrami, P., Paine, T., and Huang, T. (2015). Do deep
neural networks learn facial action units when doing
expression recognition? In Proceedings of the IEEE
International Conference on Computer Vision Work-
shops, pages 19–27.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Im-
agenet classification with deep convolutional neural
networks. In Advances in neural information process-
ing systems, pages 1097–1105.
Lai, Y.-H. and Lai, S.-H. (2018). Emotion-preserving repre-
sentation learning via generative adversarial network
for multi-view facial expression recognition. In Auto-
matic Face & Gesture Recognition (FG 2018), 2018
13th IEEE International Conference on, pages 263–
270. IEEE.
Li, J. and Lam, E. Y. (2015). Facial expression recogni-
tion using deep neural networks. In Imaging Systems
and Techniques (IST), 2015 IEEE International Con-
ference on, pages 1–6. IEEE.
Liu, M., Li, S., Shan, S., and Chen, X. (2013). Au-aware
deep networks for facial expression recognition. In
FG, pages 1–6.
Liu, P., Han, S., Meng, Z., and Tong, Y. (2014). Fa-
cial expression recognition via a boosted deep belief
network. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages
1805–1812.
Liu, X., Kumar, B. V., You, J., and Jia, P. (2017). Adaptive
deep metric learning for identity-aware facial expres-
sion recognition. In CVPR Workshops, pages 522–
531.
Lopes, A. T., de Aguiar, E., De Souza, A. F., and Oliveira-
Santos, T. (2017). Facial expression recognition with
convolutional neural networks: coping with few data
and the training sample order. Pattern Recognition,
61:610–628.
Lucey, P., Cohn, J. F., Kanade, T., Saragih, J., Ambadar,
Z., and Matthews, I. (2010). The extended cohn-
kanade dataset (ck+): A complete dataset for action
unit and emotion-specified expression. In Computer
Vision and Pattern Recognition Workshops (CVPRW),
2010 IEEE Computer Society Conference on, pages
94–101. IEEE.
Lyons, M., Akamatsu, S., Kamachi, M., and Gyoba,
J. (1998). Coding facial expressions with gabor
wavelets. In Automatic Face and Gesture Recognition,
1998. Proceedings. Third IEEE International Confer-
ence on, pages 200–205. IEEE.
Ouellet, S. (2014). Real-time emotion recognition for gam-
ing using deep convolutional network features. arXiv
preprint arXiv:1408.3750.
Pons, G. and Masip, D. (2018). Multi-task, multi-label and
multi-domain learning with residual convolutional
networks for emotion recognition. arXiv preprint
arXiv:1802.06664.
Reed, S., Sohn, K., Zhang, Y., and Lee, H. (2014). Learn-
ing to disentangle factors of variation with manifold
interaction. In International Conference on Machine
Learning, pages 1431–1439.
Rifai, S., Bengio, Y., Courville, A., Vincent, P., and Mirza,
M. (2012). Disentangling factors of variation for facial
expression recognition. In Computer Vision–ECCV
2012, pages 808–822. Springer.
Salmam, F. Z., Madani, A., and Kissi, M. (2018). Emotion
recognition from facial expression based on fiducial
points detection and using neural network. Interna-
tional Journal of Electrical and Computer Engineer-
ing, 8(1):52.
Shan, K., Guo, J., You, W., Lu, D., and Bie, R. (2017). Au-
tomatic facial expression recognition based on a deep
convolutional-neural-network structure. In Software
Engineering Research, Management and Applications
(SERA), 2017 IEEE 15th International Conference on,
pages 123–128. IEEE.
Shang, W., Sohn, K., Almeida, D., and Lee, H. (2016). Un-
derstanding and improving convolutional neural net-
works via concatenated rectified linear units. In In-
ternational Conference on Machine Learning, pages
2217–2225.
Shikkenawis, G. and Mitra, S. K. (2016). On some vari-
ants of locality preserving projection. Neurocomput-
ing, 173:196–211.
Simard, P. Y., Steinkraus, D., and Platt, J. C. (2003). Best
practices for convolutional neural networks applied to
visual document analysis. In Seventh International Conference
on Document Analysis and Recognition (ICDAR 2003), page 958. IEEE.
Simonyan, K. and Zisserman, A. (2014). Very deep con-
volutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556.
Su, W., Chen, L., Wu, M., Zhou, M., Liu, Z., and Cao, W.
(2017). Nesterov accelerated gradient descent-based
convolution neural network with dropout for facial ex-
pression recognition. In Control Conference (ASCC),
2017 11th Asian, pages 1063–1068. IEEE.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.,
Anguelov, D., Erhan, D., Vanhoucke, V., and Rabi-
novich, A. (2015). Going deeper with convolutions.
In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 1–9.
Taheri, S., Qiu, Q., and Chellappa, R. (2014). Structure-
preserving sparse decomposition for facial expression
analysis. IEEE Transactions on Image Processing,
23(8):3590–3603.
Yang, H., Ciftci, U., and Yin, L. (2018). Facial expression
recognition by de-expression residue learning. In Pro-
ceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 2168–2177.
Yao, A., Cai, D., Hu, P., Wang, S., Sha, L., and Chen, Y.
(2016). Holonet: towards robust emotion recognition
in the wild. In Proceedings of the 18th ACM Interna-
tional Conference on Multimodal Interaction, pages
472–478. ACM.
Zeng, N., Zhang, H., Song, B., Liu, W., Li, Y., and
Dobaie, A. M. (2018). Facial expression recognition
via learning deep sparse autoencoders. Neurocomput-
ing, 273:643–649.
Zhang, F., Zhang, T., Mao, Q., and Xu, C. (2018). Joint
pose and expression modeling for facial expression
recognition. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages
3359–3368.
Zhao, G., Huang, X., Taini, M., Li, S. Z., and Pietikäinen,
M. (2011). Facial expression recognition from
near-infrared videos. Image and Vision Computing,
29(9):607–619.
Zhao, X., Shi, X., and Zhang, S. (2015). Facial expression
recognition via deep learning. IETE technical review,
32(5):347–355.