PPG and EMG Based Emotion Recognition using Convolutional Neural Network

Min Seop Lee 1, Ye Ri Cho 1, Yun Kyu Lee 1, Dong Sung Pae 1, Myo Taeg Lim 1 and Tae Koo Kang 2,a

1 School of Electrical Engineering, Korea University, Seoul, Republic of Korea
2 Department of Human Intelligence and Robot Engineering, Sangmyung University, Cheonan, Republic of Korea
a Corresponding author: Tae Koo Kang (Tel: +82-41-550-5355, E-mail: tkkang@smu.ac.kr)
Keywords: Valence, Arousal, Convolutional Neural Network, Physiological Signal, PPG, EMG.
Abstract: Emotion recognition is an essential part of human-computer interaction, and there are many possible sources for it. In this study, physiological signals, specifically the electromyogram (EMG) and photoplethysmogram (PPG), are used to detect emotion. To classify emotions in more detail, the existing emotion model, which represents emotion by valence and arousal, is subdivided into four levels per dimension. A convolutional neural network (CNN) is adopted for feature extraction and emotion classification. We measured EMG and PPG signals from 30 subjects using 32 selected videos, and our method is evaluated on the data acquired from these participants.
1 INTRODUCTION
Emotion-based research has affected many fields in modern society (Hudlicka, 2003), and many recent studies have considered human conditions such as human emotions (Cowie et al., 2001). Emotions can be recognized in various ways, including facial expressions, voice, and physiological signals. However, emotions conveyed by facial expressions can be hidden behind a poker face (Anagnostopoulos et al., 2015; Kim, 2017). Recent advances in emotion recognition technology have therefore used a number of physiological sensors to measure feelings (Yin et al., 2017). Because biometric-signal-based emotion recognition has been applied successfully in such systems, this paper also adopts a biometric-based emotion recognition method.
Physiological signals commonly used for recognizing emotions include electroencephalography (EEG), respiration (RSP), electromyography (EMG), photoplethysmography (PPG), and galvanic skin response (GSR). EEG measures voltage fluctuations in the brain, and RSP is the signal describing breathing. EMG measures muscle tension arising from activity or stress, and PPG measures the amount of blood flowing through the vessels (Lee et al., 2011). GSR captures the continuous variation in the electrical characteristics of the skin (Wu et al., 2010).
EEG signals have been used in many recent emotion recognition studies. In early research, hand-crafted features composed of statistical features were selected for feature extraction (Alarcao and Fonseca, 2017). As deep learning evolved, machine learning and deep learning algorithms came to be used in various studies (Tabar and Halici, 2016; Zhang et al., 2017). However, using the EEG signal is not efficient because it requires 32 channels for emotion classification, which means 32 kinds of physiological signals must be obtained. Therefore, we selected the EMG and PPG signals for emotion recognition and acquired these signals from 30 subjects.
Current emotion recognition is based on two emotion models, i.e., two ways of modeling emotions. The first is based on the six basic feelings of happiness, sadness, surprise, fear, anger, and disgust (An et al., 2017). The second is based on the two parameters of arousal and valence (Peng et al., 2018). The arousal-valence model is used more frequently than the classification of six emotions, so we selected this approach. Previous studies have used two-level classification of arousal and valence, meaning, for example, that valence can only be classified as high or low. This two-level classification is coarse, however, so we propose a four-level classification from PPG and EMG signals using a CNN model.
Figure 1: Arousal and valence model.
We recorded the PPG and EMG signals from 30 participants using video clips chosen according to the emotion they elicit. The videos were selected through a survey asking whether they stimulate the intended emotions. The PPG and EMG signals were then preprocessed and segmented to form the input data, and feature extraction and classification were conducted using the CNN approach.
The rest of the paper is organized as follows. Section 2 provides the fundamental theory behind the arousal-valence model. Section 3 proposes the data collection method and the feature extraction method based on the CNN model. Section 4 describes the four-level classification experiment with the obtained signals. Finally, the conclusion is given in Section 5.
2 AROUSAL AND VALENCE MODEL
As mentioned before, multiple parameters are used to label emotions. The most representative parameters are arousal and valence, which come from Russell's circumplex theory (Russell, 1980). Arousal represents the activation level, and valence expresses the degree of pleasantness. As shown in Figure 1, emotions can be represented in this two-dimensional space, and each parameter takes a value between 1 and 9. For example, joy has high valence and high arousal, whereas sadness has low valence and low arousal. Through these parameters, it is possible to express the level of an emotion.
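As a small illustration of the model (not part of the paper's method), the following Python sketch maps a (valence, arousal) rating pair on the 1-9 scale to its quadrant in Russell's circumplex; the example emotion names per quadrant are our own.

```python
# Illustrative helper: map a (valence, arousal) pair on the 1-9 scale to a
# quadrant of Russell's circumplex model, splitting each axis at its midpoint 5.
def russell_quadrant(valence: float, arousal: float) -> str:
    if arousal >= 5:
        return "high arousal / high valence (e.g., joy)" if valence >= 5 \
            else "high arousal / low valence (e.g., anger)"
    return "low arousal / high valence (e.g., calm)" if valence >= 5 \
        else "low arousal / low valence (e.g., sadness)"

print(russell_quadrant(8, 8))  # -> high arousal / high valence (e.g., joy)
print(russell_quadrant(2, 2))  # -> low arousal / low valence (e.g., sadness)
```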
Typically, emotion classification is carried out as binary classification of the valence and arousal values; in that case each dimension is split at a threshold value of 5. Because such a two-level scheme is simple, we propose a four-level classification of arousal and valence to express emotion in more detail.
3 EMOTION RECOGNITION SYSTEM
In this section, we introduce the complete emotion recognition system; the overall architecture is shown in Figure 2. First, we collected physiological signals using video clips. We then preprocessed the signals so they could be used as inputs to the deep learning model. Finally, we classified emotion by training the CNN model. Section 3.1 explains the data collection method with video stimuli, and Section 3.2 describes the subdivided emotion model. The preprocessing method is given in Section 3.3, and the feature extraction and classification model based on the CNN is covered in Section 3.4.
3.1 Data Collection
We used the EMG and PPG signals for emotion recognition. To collect these signals, we utilized a physiological recorder (P400, PhysioLab Inc.) (phy, ). The P400 measures various signals, including bioelectrical and physiological signals, and can run six measurement modules simultaneously over four channels. We used the base module to connect the physiological sensors; Table 1 shows the P400 base module characteristics and specifications. Additional sensors can be connected to the module to acquire other biological signals. A fingertip pulse oximeter sensor is connected to record the PPG signal; it illuminates the skin and measures the change of blood volume in the finger. Electrode patches are attached to the back of both shoulders to record the EMG signal. Figure 3 shows the recorded signals.
The participants consisted of 30 people aged between 20 and 30. Each participant watched 16 videos that were preselected as visual stimuli. One emotion elicitation video was shown to a subject on a single day in a quiet room, and to minimize noise the subjects remained motionless while the signals were recorded.
For emotion detection, an emotion stimulus is required to trigger emotions. Previous studies have selected various stimuli, including photos, videos, and music; videos were used as stimuli in the DECAF studies (Donahue et al., 2014).

Figure 2: Emotion recognition architecture.
Figure 3: PPG and EMG signals.
Figure 4: The total video number after video selection.
Music and photography can evoke emotions through hearing and sight, respectively (Wang and Huang, 2014). Triggering emotion from images without sound, or from sound without images, is not as efficient as using both simultaneously, which is why we chose video as the emotion stimulus.
We selected videos that stimulate both hearing and sight for recording the physiological signals. The videos were chosen by survey among subjects aged between 20 and 30; a total of 30 people were surveyed and participated in the experiment. We prepared a total of 80 videos, i.e., 5 videos for each of the 16 sections formed by the 4 levels of arousal and valence, as shown in Figure 4. After viewing the 80 five-minute videos, the participants selected the 2 most suitable videos for each section. As a result, we selected a total of 32 videos based on the highest survey scores, and the physiological signals were measured for these selected videos.
3.2 Subdivided Emotion Model
We propose a four-level emotion classification based on the arousal-valence emotion model described in the previous section. Russell's emotion model expresses emotion using the arousal and valence variables, which divide the space into four quadrants. Most research conducts binary classification of arousal and valence, for example high valence versus low valence. This method cannot represent emotion precisely, so we divide arousal and valence into 4 levels each (very high, high, low, very low). Figure 5 describes the 4 levels of arousal and valence: each dimension is divided into 4 classes according to the threshold values (3, 5, and 7).
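The following Python sketch shows one way to implement this labelling rule. The thresholds 3, 5, and 7 come from the text above; the exact boundary handling (whether a rating equal to a threshold falls in the lower or upper class) is not specified in the paper and is assumed here.

```python
# Map a rating on the 1-9 scale to one of the four levels using the
# thresholds 3, 5 and 7 described in Section 3.2. Boundary handling is assumed.
def four_level_class(rating: float) -> int:
    """Return 0, 1, 2 or 3 for very low, low, high, very high."""
    if rating < 3:
        return 0  # very low
    if rating < 5:
        return 1  # low
    if rating < 7:
        return 2  # high
    return 3      # very high

# Example: a valence rating of 6.2 maps to "high", an arousal rating of 2.4 to "very low".
print(four_level_class(6.2), four_level_class(2.4))  # -> 2 0
```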
3.3 Preprocessing
Videos are used as the stimuli to induce emotion in this research. We obtained the EMG and PPG signals from 30 subjects; there were 32 videos, each 5 minutes long. The sampling rate was 128 Hz, and the signals were passed through a high-pass filter to eliminate noise.
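The paper does not report the filter order or cutoff frequency, so the sketch below only illustrates this step: a zero-phase Butterworth high-pass filter applied to the 128 Hz recordings, with an assumed 0.5 Hz cutoff and order 4.

```python
# Sketch of the noise-removal step: zero-phase Butterworth high-pass filtering
# of the 128 Hz PPG/EMG recordings. Cutoff (0.5 Hz) and order (4) are assumptions.
import numpy as np
from scipy.signal import butter, filtfilt

FS = 128  # sampling rate reported in the paper (Hz)

def highpass(signal: np.ndarray, cutoff_hz: float = 0.5, order: int = 4) -> np.ndarray:
    b, a = butter(order, cutoff_hz, btype="highpass", fs=FS)
    return filtfilt(b, a, signal)  # zero-phase filtering avoids shifting the PPG peaks
```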
To build the dataset, we segmented the EMG and PPG signals as shown in Figure 6. We located a peak point of the PPG signal and formed an input segment with a length of 1000 samples. Based on this PPG peak point, we found the nearest peak of the EMG signal, which is marked with a yellow circle in Figure 6. As a result, the red line indicates the PPG input and the green line indicates the EMG input for the neural network. The next PPG peak occurring after these 1000 samples becomes the peak point of the next input.

To use the two signals at the same time, we concatenated them: the 1000 PPG samples are placed first and the EMG samples are appended after them, giving 2000 samples in total. Since the green-line EMG segment was not always 1000 samples long, the blank part was filled with zeros.
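A minimal sketch of this segmentation is given below, under our reading of the description above; the peak-detection parameters and the handling of edge cases are assumptions, not values taken from the paper.

```python
# Segment PPG/EMG into 2000-sample CNN inputs: 1000 PPG samples from a PPG
# peak, followed by up to 1000 EMG samples from the nearest EMG peak,
# zero-padded if the EMG segment is short (Section 3.3).
import numpy as np
from scipy.signal import find_peaks

SEG = 1000  # samples per signal segment

def make_inputs(ppg: np.ndarray, emg: np.ndarray) -> np.ndarray:
    ppg_peaks, _ = find_peaks(ppg, distance=64)  # assumed minimum peak spacing (~0.5 s at 128 Hz)
    emg_peaks, _ = find_peaks(np.abs(emg))
    if len(ppg_peaks) == 0 or len(emg_peaks) == 0:
        return np.empty((0, 2 * SEG))
    inputs, next_start = [], 0
    for p in ppg_peaks:
        if p < next_start or p + SEG > len(ppg):
            continue  # wait for the first peak after the previous segment
        ppg_seg = ppg[p:p + SEG]
        e = emg_peaks[np.argmin(np.abs(emg_peaks - p))]    # EMG peak nearest the PPG peak
        emg_seg = emg[e:e + SEG]
        emg_seg = np.pad(emg_seg, (0, SEG - len(emg_seg)), mode="constant")  # zero-fill a short segment
        inputs.append(np.concatenate([ppg_seg, emg_seg]))  # 2000-sample input vector
        next_start = p + SEG                               # next input starts after 1000 samples
    return np.asarray(inputs)
```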
3.4 Four-level Classification with CNN
A CNN model can be designed to suit the application, and the proposed emotion recognition model uses a CNN architecture to extract features representing emotional characteristics. Previous studies have highlighted that at least two CNN layers are required to reliably extract physiological signal characteristics. Since the input length is not long, we applied only two convolutional layers.

Figure 7 shows the details of the CNN model architecture. The input to the CNN is created by concatenating the PPG and EMG signals. The CNN parameters, including the number of convolution filters, their size, and the stride, are important for good performance: a larger number of filters enables more diverse characteristics to be learned. Therefore, we employed 32 filters for the first convolutional layer and 64 for the second.
Table 1: P400 base module characteristics and specifications.

Characteristics:
  4-channel physiological input (Bio, ECG, PPG, Bridge)
  Up to 2000 samples per second for each channel
  12-bit resolution
  Four physiological measurement modules for configuration
  Plug-in connector
  Direct PC connection for monitoring and analysis

Category        | Item                      | Specification
Input signal    | Number of input channels  | 4 channels
                | Input voltage range       | Module: 2.5 V; input signal: 5 V
                | Sampling rate             | Maximum 2000 SPS
Output signal   | ADC resolution            | 12 bit
Communication   | Communication method      | USB 1.1 (12 Mbps)
Power supply    | Input / Output            | Input: 80–240 V AC; Output: 12 V DC
                | Voltage / Current         | 12 V / 2 A
Figure 5: Four-level arousal and valence model.
The filter size was 3 × 1 for all layers, and the stride was set to 1. ReLU layers were used for non-linearity, with two subsequent max-pooling layers. After extracting the features, we used the softmax function for classification; as mentioned before, the classes consist of 4 levels for valence and for arousal.
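A hedged sketch of this architecture is given below in Keras. The paper does not state the deep learning framework, the pooling size, or whether valence and arousal are predicted by separate models, so those choices (Keras, pool size 2, one 4-class model per dimension) are assumptions.

```python
# Two-layer 1-D CNN over the concatenated 2000-sample PPG+EMG input
# (Section 3.4): Conv(32, 3x1, stride 1) -> ReLU -> MaxPool ->
# Conv(64, 3x1, stride 1) -> ReLU -> MaxPool -> softmax over 4 levels.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_emotion_cnn(input_length: int = 2000, num_classes: int = 4) -> tf.keras.Model:
    return models.Sequential([
        layers.Conv1D(32, kernel_size=3, strides=1, activation="relu",
                      input_shape=(input_length, 1)),
        layers.MaxPooling1D(pool_size=2),            # pool size assumed
        layers.Conv1D(64, kernel_size=3, strides=1, activation="relu"),
        layers.MaxPooling1D(pool_size=2),            # pool size assumed
        layers.Flatten(),
        layers.Dropout(0.5),                         # dropout rate from Section 4.1
        layers.Dense(num_classes, activation="softmax"),
    ])
```

Under this reading, one such model would be trained for the four valence levels and another for the four arousal levels.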
4 EXPERIMENTAL RESULTS
4.1 Experimental Environment
We measured the signals from the sensors directly for each participant and created the required dataset. The total number of samples for each class is 7200, of which we used 80% for training and 20% for testing. Table 2 summarizes the dataset.
The convolution and max-pooling filter sizes were set to the optimal values found during training, assessed by comparing performance.
Table 2: Training and test dataset.

Class                  | Training | Test
Arousal 1 / Valence 1  | 5760     | 1440
Arousal 2 / Valence 2  | 5760     | 1440
Arousal 3 / Valence 3  | 5760     | 1440
Arousal 4 / Valence 4  | 5760     | 1440
It is critical to avoid overfitting when training the model, so we used a dropout layer with a dropout rate of 0.5, which omits some neurons during training. For training, we set the maximum number of epochs to 100, the initial learning rate to 0.001, and the mini-batch size to 64, and used ReLU as the activation function.
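For concreteness, a sketch of this training configuration is shown below, continuing the build_emotion_cnn sketch from Section 3.4. The optimizer type is not stated in the paper, so Adam is an assumption, and the arrays x and y are placeholders for the real 2000-sample segments and their 4-level labels.

```python
# Training setup from Section 4.1: 80/20 split, learning rate 0.001,
# batch size 64, up to 100 epochs. Optimizer type (Adam) is assumed.
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

x = np.random.randn(1000, 2000, 1).astype("float32")  # placeholder segments
y = np.random.randint(0, 4, size=1000)                 # placeholder 4-level labels

model = build_emotion_cnn()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# 80% training / 20% test split, as described in Section 4.1.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
model.fit(x_train, y_train, epochs=100, batch_size=64,
          validation_data=(x_test, y_test))
```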
4.2 Accuracy Results
This section describes the experimental results on the obtained dataset. We used the CNN model to classify the four-level emotions of valence and arousal, whereas conventional arousal-valence models use a two-level categorization; this implies that we can distinguish 16 emotions by combining the 4 levels of valence and arousal. Figure 8 shows the accuracy of the experiment. The individual accuracies ranged from 90% to 96%, and the overall accuracy was 83%. The individual results were obtained by training on each individual's dataset, and the overall accuracy was obtained by training on the combined dataset. We compared our method with an artificial neural network (ANN) that used hand-crafted features (Yoo et al., 2018); those features were extracted from 5 physiological signals based on statistical approaches. Applied to our dataset, the ANN method reached 75.3% accuracy, which implies that our algorithm performs better than the comparative method.

Figure 6: The segmentation of the PPG and EMG signals.

Figure 7: The CNN architecture.
Figure 8: The individual result.
5 CONCLUSIONS
This paper introduced an emotion recognition method using PPG and EMG signals. To classify emotion in more detail, we subdivided valence and arousal into 4 levels, whereas existing methods divide each into 2 levels. For the experiment, we built our own dataset by recording signals from 30 subjects using video clips. We adopted a CNN architecture to extract features from the signals and classify valence and arousal, and to use the PPG and EMG signals as the deep learning input we segmented and concatenated them. The proposed method achieved individual accuracies of 90% to 96% and an overall accuracy of 83%.
REFERENCES
www.physiolab.co.kr.
Ahonen, T., Hadid, A., and Pietikainen, M. (2006). Face
description with local binary patterns: Application to
face recognition. IEEE Transactions on Pattern Anal-
ysis & Machine Intelligence, (12):2037–2041.
Alarcao, S. M. and Fonseca, M. J. (2017). Emotions recog-
nition using eeg signals: A survey. IEEE Transactions
on Affective Computing.
An, S., Ji, L.-J., Marks, M., and Zhang, Z. (2017). Two sides
of emotion: exploring positivity and negativity in six
basic emotions across cultures. Frontiers in psychol-
ogy, 8:610.
Anagnostopoulos, C.-N., Iliou, T., and Giannoukos, I.
(2015). Features and classifiers for emotion recog-
nition from speech: a survey from 2000 to 2011. Ar-
tificial Intelligence Review, 43(2):155–177.
Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis,
G., Kollias, S., Fellenz, W., and Taylor, J. G. (2001).
Emotion recognition in human-computer interaction.
IEEE Signal processing magazine, 18(1):32–80.
Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N.,
Tzeng, E., and Darrell, T. (2014). Decaf: A deep con-
volutional activation feature for generic visual recog-
nition. In International conference on machine learn-
ing, pages 647–655.
Hudlicka, E. (2003). To feel or not to feel: The role of affect
in human–computer interaction. International journal
of human-computer studies, 59(1-2):1–32.
Kim, W.-G. (2017). Emotional speaker recognition us-
ing emotional adaptation. CHONGI HAKHOE NON-
MUNJI, 66(7):1105–1110.
Lee, Y.-K., Kwon, O.-W., Shin, H. S., Jo, J., and Lee, Y.
(2011). Noise reduction of ppg signals using a par-
ticle filter for robust emotion recognition. In Con-
sumer Electronics-Berlin (ICCE-Berlin), 2011 IEEE
International Conference on, pages 202–205. IEEE.
Peng, S., Zhang, L., Ban, Y., Fang, M., and Winkler, S.
(2018). A deep network for arousal-valence emotion
prediction with acoustic-visual cues. arXiv preprint
arXiv:1805.00638.
Russell, J. A. (1980). A circumplex model of affect. Journal
of personality and social psychology, 39(6):1161.
Seyeditabari, A., Tabari, N., and Zadrozny, W. (2018).
Emotion detection in text: a review. arXiv preprint
arXiv:1806.00674.
Tabar, Y. R. and Halici, U. (2016). A novel deep learning
approach for classification of eeg motor imagery sig-
nals. Journal of neural engineering, 14(1):016003.
Wang, H.-M. and Huang, S.-C. (2014). Musical rhythms af-
fect heart rate variability: Algorithm and models. Ad-
vances in Electrical Engineering, 2014.
Wu, G., Liu, G., and Hao, M. (2010). The analysis of
emotion recognition from gsr based on pso. In Intel-
ligence Information Processing and Trusted Comput-
ing (IPTC), 2010 International Symposium on, pages
360–363. IEEE.
Yin, Z., Zhao, M., Wang, Y., Yang, J., and Zhang, J.
(2017). Recognition of emotions using multimodal
physiological signals and an ensemble deep learn-
ing model. Computer methods and programs in
biomedicine, 140:93–110.
Yoo, G., Seo, S., Hong, S., and Kim, H. (2018). Emo-
tion extraction based on multi bio-signal using back-
propagation neural network. Multimedia Tools and
Applications, 77(4):4925–4937.
Zhang, D., Yao, L., Zhang, X., Wang, S., Chen, W.,
and Boots, R. (2017). Eeg-based intention recogni-
tion from spatio-temporal representations via cascade
and parallel convolutional recurrent neural networks.
arXiv preprint arXiv:1708.06578.