Subject-independent Pain Recognition using Physiological Signals and
Para-linguistic Vocalizations
Nadeen Shoukry, Omar Elkilany, Patrick Thiam, Viktor Kessler and Friedhelm Schwenker
Ulm University, Institute of Neural Information Processing, 89081 Ulm, Germany
Keywords:
Machine Learning, Pain Recognition, Affective Computing, Biopotential Signals, Para-linguistic Data.
Abstract:
Pain is the result of a complex interaction among the various parts of the human nervous system. It plays
an important role in the diagnosis and treatment of patients. The standard method for pain recognition is
self-report; however, not all patients can communicate pain effectively. In this work, the task of automated
pain recognition is addressed using para-linguistic and physiological data. Hand-crafted and automatically
generated features are extracted and evaluated independently. Several state-of-the-art machine learning algo-
rithms are applied to perform subject-independent binary classification. The SenseEmotion dataset is used for
evaluation and comparison. Random forests trained on hand-crafted features from the physiological modal-
ities achieved an accuracy of 82.61%, while support vector machines trained on hand-crafted features from
the para-linguistic data achieved an accuracy of 63.86%. Hand-crafted features outperformed automatically
generated features.
1 INTRODUCTION
Pain is the result of a complex interaction among the
various parts of the nervous system and plays an im-
portant role in the diagnosis and treatment of patients.
It varies in duration, intensity, and meaning. For ex-
ample, the sudden pain associated with syndromes,
such as causalgia, can be contrasted with the progres-
sive pain related to some types of cancer (Turk et al.,
1983). Furthermore, there are physical and psycho-
logical causes of pain. While physical pain has con-
crete causes, psychogenic pain occurs without exter-
nal stimuli (Engel, 1959). These factors complicate
the pain assessment task.
The standard method for pain assessment is self-
reporting, which assumes that the subjects experi-
encing the pain are capable of comprehending the
task and communicating their feelings effectively
(Zwakhalen et al., 2006). This assumption, however,
does not hold with elderly patients suffering from de-
mentia or cognitive disabilities. For these patients,
pain may become a barrier to social inclusion (Velana
et al., 2016). Moreover, if it is not managed correctly,
it may induce pathological effects, such as higher
blood pressure and rapid heart rates, that may lead to
further complications (Turk and Gatchel, 2018). With
the growth in the proportion of the population above
the age of 65 in modern industrial societies, more reli-
able pain recognition and assessment methods are re-
quired, possibly not to replace but to aid self-reports
(Velana et al., 2016).
In this paper, two different approaches for au-
tomatic pain recognition are explored. The first
approach relies on physiological signals, collected
through sensors attached to the bodies of patients,
while the second approach depends on para-linguistic
vocalizations, such as moans and similar sounds,
which do not belong to a standardized language.
State-of-the-art machine learning algorithms are ap-
plied to perform binary, subject-independent classi-
fication, and their accuracies are compared briefly.
Leave-one-subject-out cross-validation is used as the
evaluation approach. Assessment is performed on the
recently recorded SenseEmotion dataset (Velana et al.,
2016).
The rest of this work is organized as follows: In
section 2, an overview of related work is presented.
In section 3, the dataset itself is described with more
focus on the physiological and para-linguistic modali-
ties. The data preprocessing and preparation steps are
described in section 4. The conducted experiments
and their results are provided in section 5. Finally,
this work is concluded in section 6.
2 RELATED WORK
This section provides an overview of the research re-
lated to the pain recognition task with more focus on
the two relevant modalities: physiological signals and
para-linguistic vocalizations.
Accurate pain recognition requires complex anal-
ysis and a consideration of the environment where the
pain occurs (Hammal and Cohn, 2014). With this
fact in mind, several databases have been developed
to support research on pain recognition. One of the
first of these databases is the UNBC-McMaster Shoul-
der Pain Expression Archive Database, which com-
prises data from patients suffering from shoulder pain
(Lucey et al., 2011). To collect the data, patients were
asked to perform exercises using their limbs, and their
facial expressions were captured and annotated. The
database was limited to this modality.
The EmoPain Dataset is another pain recognition
database, which focused on chronic pain and included
data from both patients and healthy individuals (Aung
et al., 2016). To collect the data, patients performed
exercises in a rehabilitation setting. This database
contained multi-modal data, including multiple-view
face videos, audio signals, 3D motion capture data,
and electromyographic signals.
The BioVid Heat Pain Database provided vi-
sual and biopotential data to advance the recogni-
tion of acute pain (Walter et al., 2013). Similarly,
the SenseEmotion dataset provided a multi-modal
database on acute pain, which was artificially in-
duced using heat stimulation in healthy participants (Ve-
lana et al., 2016).
2.1 Para-linguistic Data
Most previous machine learning work on automatic pain recognition relied on facial expressions, bio-potential signals, and in some cases the fusion of both data modalities (Thiam and Schwenker, 2017; Thiam et al., 2017; Sellner et al., 2018; Schwenker, 2018). In a single study, para-linguistic audio signals were used along with other data modalities to train pain recognition models (Thiam et al., 2019). However, para-linguistic audio signals have not previously been used on their own to recognize pain.
Environmental and background sound classifi-
cation projects used hand-crafted features of au-
dio signals, such as Mel Frequency Cepstral Coeffi-
cients (MFCC). Support vector machines and random
forests were trained using the audio features to per-
form the classification task (Saki and Kehtarnavaz,
2014; Wang et al., 2008; Lu et al., 2003).
There were previous attempts to use infant para-
linguistic vocalizations (cries) to recognize pain, non-
pain, fear, and hunger. Instead of manually extract-
ing audio features, deep neural networks were trained
on the audio signals to extract features automati-
cally and perform the classification (Chang and Li,
2016). Other studies used hand-crafted features to
train deep neural networks (Abdulaziz and Ahmad,
2010; Petroni et al., 1995). Another study used raw audio data to train a convolutional neural network (CNN) and manually extracted audio features to train a fuzzy support vector machine (FSVM); in that study, the FSVM outperformed the CNN and required significantly less data to achieve better results (Barajas-Montiel and Reyes-García, 2006).
2.2 Physiological Signals
Several medical studies have established correlations
between physiological responses produced by the au-
tonomic nervous system and the experience of pain
(Ledowski et al., 2009; Colloca et al., 2006; Log-
gia et al., 2011). Many of these responses have been
measured and used as potential indicators of affec-
tive state. They include heart rate, diastolic and sys-
tolic blood pressure, pupil dilation, pulse, respiration,
skin conductance and color, and temperature (Picard,
2000).
Several studies have used electrocardiography,
electromyography, and electrodermal activity to ad-
dress the pain recognition task (Gruss et al., 2015;
Thiam et al., 2019; Kächele et al., 2016). Sig-
nal processing and extensive feature extraction were
used to produce low-dimensional representations of
the information contained in these signals, and the
extracted features were fed to support vector ma-
chines or random forests for classification. These
three signals were also used to develop real-time pain
recognition systems, which are capable of outputting
personalized, continuous estimations of pain inten-
sity (Kächele et al., 2016). Furthermore, unaffected
by differences among the individuals experiencing
pain, electrodermal activity has been found to be the
best performing single modality for pain recognition
(Thiam et al., 2019).
3 THE DATASET
As described by Velana et al., the SenseEmotion
dataset comprises multi-modal data from 45 healthy
participants. Pain was elicited in these participants
using thermal stimuli ranging from 32 to 50.5 degrees
Celsius. A baseline temperature T_0 of 32 degrees Celsius was used commonly in all of the experiments. At the start of an experimental session, calibration was conducted to determine the pain threshold T_1 and the tolerance threshold T_3 of each individual participant, where the pain threshold marks the point where the participant perceives pain and the tolerance threshold marks the point where the participant can no longer bear the pain. The mean of T_1 and T_3 was used to define another level labelled T_2. The resulting dataset consists of 120 events per participant, 30 for each heat level (Velana et al., 2016).
In this paper, the no-pain state corresponds to the baseline temperature T_0, while the pain state corresponds to the pain tolerance threshold T_3.
3.1 The Para-linguistic Modality
The audio signals used consist of para-linguistic data,
which are non-verbal signals of speech, including
breathing, moaning, and sighing sounds, beyond the
pure transcriptional contents of spoken speech (Cai
et al., 2017).
The recording of the audio signals relied primar-
ily on a digital wireless headset microphone (Line6
XD-V75HS) combined with a directional microphone
(Rode M3). The wireless headset was crucial to al-
low free head movements and record any sounds pro-
duced by the participants. The directional micro-
phone was used to record the ambient acoustic noises
surrounding the participants. Another audio signal
was recorded from the Microsoft Kinect V2 inte-
grated microphone, which was able to capture ambi-
ent noises as well. All recordings were sampled at a
rate of 48kHz and synchronized with both the video
and bio-physiological streams using the SSI frame-
work (Wagner et al., 2013).
3.2 The Physiological Modalities
The physiological modalities were recorded using
the Social Signal Interpretation (SSI) framework syn-
chronously in real-time (Wagner et al., 2013). The
following subsections provide a brief description of
each signal.
3.2.1 Electrocardiography
Electrocardiography is used to trace the electric cur-
rent that the heart muscle generates during a heart-
beat. It provides information about the condition and
performance of the heart. This modality can be used
to extract features such as heart rate, inter-beat inter-
val, and heart rate variability. Heart rate has been
used to differentiate between positive and negative
emotions, with finer differentiation obtainable using
a measure of finger temperature (Kim and André, 2008).
3.2.2 Electromyography
Electromyography is the graphing of the electrical
characteristics of muscles. Three electrodes were
used to measure the activity of the upper-right trapez-
ius muscle and collect data for the SenseEmotion
dataset. Electrical muscle activity is a general in-
dicator of arousal. Increased muscle activity corre-
sponds to increased sympathetic nervous system ac-
tivity. Furthermore, high muscle tension is expected
to occur in the case of pain (Velana et al., 2016).
3.2.3 Electrodermal Activity
The psychogalvanic reflex, or galvanic skin response,
is an alteration in the electrical properties of the skin
that follows harmful or alerting stimuli. Because
sweat glands, which affect skin conductivity, are con-
trolled by the sympathetic branch of the autonomic
nervous system, the galvanic skin response is more
sensitive as an indicator of emotional arousal than
other physiological responses (Öhman et al., 1993).
3.2.4 Respiration
The respiration signal was captured using an elastic
belt, which was worn by the participants over their
clothing in the thorax area. It is suggested that respi-
ration patterns reflect relaxation and tension (Boiten
et al., 1994). Acute, cutaneous pain stimuli have been
shown to cause increases in inspiratory flow, which
affects inspiration intensity, even under general anaes-
thesia (Jafari et al., 2017). Therefore, the respiration
signal may provide valuable and consistent data for
the pain recognition task.
4 DATA PREPARATION
From the data of the 45 participants of the SenseEmo-
tion dataset, the data of 40 participants could be used.
The other five participants were excluded because their raw data were erroneous, with one or more sensor channels missing in certain time segments.
In this section, an overview of the data preparation
steps is presented.
4.1 Para-linguistic Data
The audio signals that were preprocessed and used
came from the wireless headset microphone alone be-
cause it was the only device able to capture sounds
produced by the participants at a satisfactory level;
it was placed at the nasolabial area, nearer than the
integrated Kinect V2 microphone and the directional
microphone, which were placed at a distance of about
one meter from the participants and captured ambient
noises only (Thiam et al., 2019).
There are several representations of the audio sig-
nals: the raw audio recordings, the audio spectro-
grams, and the hand-crafted features. The audio spec-
trogram is a visual representation of the spectrum of
frequencies of the audio signals as they vary with time
(Mallik et al., 2019). The spectrogram is able to rep-
resent many useful acoustic features, including fre-
quency, pitch, and intensity in dB (Chang and Li, 2016).
Each participant experienced 60 stimuli on each
forearm throughout the duration of the experiment.
During the application of each pain/non-pain stimu-
lus, 10 frames of spectrograms were recorded. Sam-
ples of the data spectrograms are presented in figures
1 and 2.
Figure 1: Sample of the data spectrograms (using the left
forearm recorded data of participant 20160404 14 w during
the application of stimulus 1 (pain stimulus)).
Figure 2: Sample of the data spectrograms (using the left
forearm recorded data of participant 20160404 14 w during
the application of stimulus 3 (non-pain stimulus)).
Obtaining hand-crafted features started with the
processing of the data. Several low-level descriptors
were extracted from the raw audio signal. Low-Level
Descriptors (LLDs) are parameters computed from
short time frames of a whole signal in order to significantly reduce the amount of data to be processed. LLDs were ob-
tained using the openSMILE feature extraction toolkit
(Thiam et al., 2019). The most commonly used LLDs
in speech processing are the Mel Frequency Cepstral
Coefficients (MFCCs).
MFCCs have been shown to work well for struc-
tured sounds, such as speech and music, but their per-
formance tends to drop in the presence of noise; there-
fore, other LLDs are analyzed along with the MFCCs.
Obtaining MFCCs is essentially the process of converting audio from the time domain to the frequency domain so that it can be more easily interpreted, analyzed, and processed. One can think of it as mimicking the cochlea of the human ear, which has few filters at high frequencies and more filters at low frequencies to allow for the detection of the quietest sounds possible (Dave, 2013).
MFCCs are known to provide clear insights about
sound data and outstanding results in speech recog-
nition, emotion recognition, and speaker identifica-
tion tasks (Dave, 2013). For the present work, 13
MFCCs were extracted, each combined with its first
and second order temporal derivatives, resulting in a
total of 39 MFCC-based LLDs. In addition to the
MFCCs, another set of LLDs was computed using the
Relative Spectral Perceptual Linear Predictive Cod-
ing (RASTA-PLP), which is an extension to the per-
ceptual linear predictive analysis that improves the
robustness of the computed coefficients against dis-
tortion. For the present work, 6 RASTA-PLP coef-
ficients were extracted, each in combination with its
first and second order temporal derivatives, resulting
in a total of 18 RASTA-PLP-based LLDs. Finally, a
third set of LLDs was computed which involved the
root mean square signal energy and logarithmic signal
energy, in combination with their first and second or-
der temporal derivatives. Additionally, the following
descriptors were extracted: the loudness contour, the
zero-crossing rate, the mean-crossing rate, the maxi-
mum absolute sample value, the minimum and max-
imum sample values, and the arithmetic mean of the
sample values. This last set represents a total of 13
LLDs (Thiam et al., 2019).
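As a rough illustration of the MFCC-based LLDs described above, the following sketch extracts 13 coefficients together with their first and second order temporal derivatives (39 values per frame) using librosa rather than openSMILE; the frame and hop sizes are assumptions and not the exact configuration used for the dataset.

```python
import librosa
import numpy as np

def mfcc_llds(path, n_mfcc=13):
    """Extract 39 MFCC-based LLDs (coefficients + deltas + delta-deltas) per frame."""
    y, sr = librosa.load(path, sr=48000)            # recordings were sampled at 48 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=1024, hop_length=512)  # assumed frame/hop sizes
    d1 = librosa.feature.delta(mfcc, order=1)       # first order temporal derivative
    d2 = librosa.feature.delta(mfcc, order=2)       # second order temporal derivative
    return np.vstack([mfcc, d1, d2])                # shape: (39, n_frames)
```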
Then the resulting signals were further processed
to substantially reduce the noise using band-pass fil-
tering, signal smoothing, and detrending. After these
operations, a set of high-level descriptors (HLDs)
was extracted from the previously processed sig-
nals. HLDs are extracted from segmenting the LLDs
based on a fixed window. In the current work, the
following set of 14 statistical functions is applied
on the segmented LLD signals for the extraction
of HLDs: mean, median, standard deviation, maxi-
mum, minimum, range, skewness, kurtosis, first and
second quartiles, inter-quartile range, 1%-percentile, 99%-
percentile, and range from 1%- to 99%-percentile
(Thiam et al., 2019).
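The mapping from one windowed LLD signal to the 14 statistical HLDs listed above can be sketched as follows; the exact definition of the inter-quartile range used in the original feature set is an assumption here.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def hlds(window):
    """Map one windowed LLD signal (1-D array) to the 14 statistical HLDs."""
    q1, q2, q3 = np.percentile(window, [25, 50, 75])
    p1, p99 = np.percentile(window, [1, 99])
    return np.array([
        np.mean(window), np.median(window), np.std(window),
        np.max(window), np.min(window), np.ptp(window),   # range
        skew(window), kurtosis(window),
        q1, q2,                                            # first and second quartiles
        q3 - q1,                                           # inter-quartile range (assumed Q3 - Q1)
        p1, p99, p99 - p1,                                 # percentiles and their range
    ])
```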
The MFCC-based feature vectors have a total di-
mensionality of 14 × 39 = 546. The RASTA-PLP-
based feature vectors have a total dimensionality of
14 × 18 = 252, and the last set of feature vectors
from the temporal domain has a total dimensionality
of 14 × 13 = 182. In the following step of preprocess-
ing, the HLDs were standardised individually and per
participant using the z-score (Thiam et al., 2019).
4.2 Physiological Data
Both hand-crafted and automatic features were used
to perform pain recognition using the physiological
modalities.
The hand-crafted features were engineered using
the same methodology presented in (Thiam et al.,
2019). The work presented in this paper made use of 335 hand-crafted features provided by the authors of that study. Out of these 335 features, 10 features were removed because they had zero variance across the dataset samples. Furthermore, 14 features
were dropped because more than 25% of their val-
ues were undefined, under the assumption that they
would not be beneficial to the learning task. The
median value was calculated independently for each
feature and used to replace what remained of miss-
ing or undefined values. Finally, before they were fed
to support vector machines and neural networks, the
remaining 311 features were standardized. The stan-
dardization step was skipped for random forests.
Standardization for feature x was done by subtracting the mean µ and dividing by the standard deviation σ to produce the standardized feature z as follows:

$$z = \frac{x - \mu}{\sigma} \qquad (1)$$
For the automatic extraction of features, Butter-
worth filters were used to process the raw signals of
the four modalities before the pain events were iso-
lated. A fifth-order Butterworth bandpass filter was
applied to the electrocardiography signals with a fre-
quency range of [0.4, 35] Hz. A third-order Butter-
worth bandpass filter was applied to the respiration
signals with a frequency range of [0.2,0.8] Hz. A
third-order Butterworth bandpass filter was applied to
the electromyographic signals with a frequency range
of [0.05,25] Hz, and a third-order Butterworth low-
pass filter was applied to the electrodermal activity
signals with a cut-off frequency of 0.2 Hz.
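A minimal sketch of these filters using scipy.signal is given below; the filter orders and cut-off frequencies follow the text, while the sampling rate fs is a placeholder that must be set to the actual rate of each modality.

```python
from scipy.signal import butter, filtfilt

def make_filters(fs):
    """Butterworth filters for the four physiological modalities (fs in Hz is an assumption)."""
    nyq = fs / 2.0
    return {
        "ecg":  butter(5, [0.4 / nyq, 35 / nyq], btype="bandpass"),   # 5th-order band-pass
        "resp": butter(3, [0.2 / nyq, 0.8 / nyq], btype="bandpass"),  # 3rd-order band-pass
        "emg":  butter(3, [0.05 / nyq, 25 / nyq], btype="bandpass"),  # 3rd-order band-pass
        "eda":  butter(3, 0.2 / nyq, btype="lowpass"),                # 3rd-order low-pass
    }

def apply_filter(signal, coeffs):
    b, a = coeffs
    return filtfilt(b, a, signal)   # zero-phase filtering of the raw signal
```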
To mitigate inter-subject variance, all the signals were normalized to fit in the range [−1, 1]. For the training set signals, the normalization was done as follows:

$$z_i = 2 \cdot \frac{x_i - x_{\min}}{x_{\max} - x_{\min}} - 1 \qquad (2)$$

where z is the output signal, x is the input signal, x_min is the minimum value of the input signal, x_max is the maximum value of the input signal, and the subscript i is used to index signal values.
The average maximum and minimum over all the training samples were calculated for every modality as follows:

$$\text{max}_{\text{avg}} = \frac{1}{m} \sum_{j=1}^{m} x_{\max, j} \qquad (3)$$

$$\text{min}_{\text{avg}} = \frac{1}{m} \sum_{j=1}^{m} x_{\min, j} \qquad (4)$$

where m is the number of samples in the training set, x_{max, j} is the maximum of the j-th sample, and x_{min, j} is the minimum of the j-th sample.
Finally, the test data was normalized as follows:

$$z_i = 2 \cdot \frac{x_i - \text{min}_{\text{avg}}}{\text{max}_{\text{avg}} - \text{min}_{\text{avg}}} - 1 \qquad (5)$$

where z is the output signal, x is the input signal, the subscript i is used to index signal values, and max_avg and min_avg are obtained from equations 3 and 4 respectively.
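The following numpy sketch implements equations (2) to (5): training signals are scaled with their own extremes, while test signals are scaled with the extremes averaged over the training samples of the same modality.

```python
import numpy as np

def normalize_train(signal):
    """Equation (2): scale a training signal to [-1, 1] with its own extremes."""
    x_min, x_max = signal.min(), signal.max()
    return 2.0 * (signal - x_min) / (x_max - x_min) - 1.0, x_min, x_max

def normalize_test(signal, train_mins, train_maxs):
    """Equations (3)-(5): scale a test signal with the averaged training extremes."""
    max_avg = np.mean(train_maxs)   # equation (3)
    min_avg = np.mean(train_mins)   # equation (4)
    return 2.0 * (signal - min_avg) / (max_avg - min_avg) - 1.0   # equation (5)
```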
Figure 3 shows a sample of the electrodermal ac-
tivity signal before any preprocessing, while figure 4
shows the same sample after the previously described
preprocessing and normalization methods are applied.
Figure 3: A sample from the electrodermal activity signal
of one of the dataset samples before any preprocessing.
Figure 4: The same sample from figure 3 after it went
through the preprocessing pipeline.
5 EXPERIMENTS AND RESULTS
In this section, the different experiments that were
conducted are presented along with their results.
All the experiments were conducted using the com-
bined data from the left and right forearms from the
SenseEmotion dataset.
Several algorithms were applied to the problem.
Support vector machines work by maximizing the
margin between the training instances and the clas-
sification boundary (Boser et al., 1992). A random
forest is an ensemble classification model, where de-
cision tree classifiers are trained on different subsets
of the training data. The predictions of the individ-
ual decision tree classifiers are then combined to-
gether through a voting scheme to produce the final
output of the random forest (Breiman, 2001). Neu-
ral networks are nonlinear models that are trained us-
ing back-propagation and that output predictions us-
ing the feed-forward operation (Schwenker, 2018).
5.1 Para-linguistic Data
The para-linguistic data were used to train a support
vector machine, a random forest, a naive Bayes clas-
sifier, a voting classifier, and a deep neural network.
Randomized search cross-validation was used to tune
the hyperparameters of the different models. The
data was evaluated using 10-fold cross-validation and
leave-one-subject-out cross-validation.
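A minimal sketch of the leave-one-subject-out protocol with scikit-learn is shown below, using the per-sample subject identifiers as groups; the classifier in the usage comment is a placeholder, and the assumption is that one fresh model is fit per held-out subject.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC

def loso_accuracy(clf, X, y, subjects):
    """Mean accuracy over leave-one-subject-out folds (subjects = per-sample subject IDs)."""
    logo = LeaveOneGroupOut()
    scores = []
    for train_idx, test_idx in logo.split(X, y, groups=subjects):
        model = clone(clf).fit(X[train_idx], y[train_idx])   # fresh model per held-out subject
        scores.append(model.score(X[test_idx], y[test_idx]))
    return np.mean(scores)

# e.g. loso_accuracy(SVC(kernel="rbf", C=2, gamma=1/3), X, y, subjects)
```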
5.1.1 Support Vector Machines
The three sets of hand-crafted features from the audio signals were combined and fed to a support vector machine. Randomized cross-validation search was used
to tune three hyperparameters. The resulting hy-
perparameters from this search were kernel = RBF,
C = 2 and γ = 1/3. These hyperparameters achieved a
mean accuracy of 64.94% over the 10 cross-validation
folds. The same model was evaluated using leave-
one-subject-out cross-validation and achieved a mean
accuracy of 63.86% over the 40 subjects.
5.1.2 Random Forests
The three sets of hand-crafted features from the audio signals were combined and fed to a random forest. Ran-
domized cross-validation search was used to tune
the hyperparameters. The resulting hyperparameters
from this search were number of estimators (trees) =
200, maximum features = 1/3, maximum depth =
17, minimum samples to split = 40, and minimum
samples per leaf = 2. This combination achieved a
mean accuracy of 63.52% over the 10 cross-validation
folds. This model achieved a mean accuracy of
61.20% with the same combination of hyperparam-
eters in the leave-one-subject-out cross-validation.
5.1.3 Naive Bayes Classifiers
The three sets of hand-crafted features from the audio signals
were combined and fed to a naive Bayes classifier.
It achieved a mean accuracy of 61.97% over the 10
cross-validation folds and a mean accuracy of 61.68%
over the 40 subjects.
5.1.4 Voting Classifier
A voting classifier formulates its classification decision by aggregating the votes of several constituent classifiers. The voting classifier
consisted of a support vector machine, a logistic re-
gression classifier, a Gaussian naive Bayes classifier,
and a random forest. The voting classifier achieved a
mean accuracy of 63.87% over the 10 cross-validation
folds and a mean accuracy of 63.28% in the leave-
one-subject-out cross-validation.
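A sketch of such a voting ensemble with scikit-learn is given below; the base-model hyperparameters and the hard-voting scheme are assumptions, since only the constituent model types are stated above.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# The four base models named in the text; hyperparameters and the hard majority
# vote are assumptions.
voting_clf = VotingClassifier(
    estimators=[
        ("svm", SVC(kernel="rbf", C=2, gamma=1/3)),
        ("logreg", LogisticRegression(max_iter=1000)),
        ("gnb", GaussianNB()),
        ("rf", RandomForestClassifier(n_estimators=200)),
    ],
    voting="hard",
)
# voting_clf.fit(X_train, y_train); voting_clf.score(X_test, y_test)
```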
5.1.5 Neural Network
A pre-trained model was used to perform automatic fea-
ture extraction from the audio spectrograms. The pre-
trained model used was VGG16, a 16-layer model
originally trained on the ImageNet database (Si-
monyan and Zisserman, 2014). The classification
was done using a bidirectional long short-term mem-
ory neural network (LSTM). A bidirectional LSTM
uses previous and upcoming information from the
sequence while learning (Graves and Schmidhuber,
2005). Dropout with a keep probability of 0.5 was
used to reduce overfitting (Srivastava et al., 2014).
The output layer used the sigmoid activation function.
The model was trained using binary cross-entropy
and the Adam optimizer (Kingma and Ba, 2014). It
was first trained for 20 epochs with a batch size of
64. It achieved 49.47% mean accuracy over the 10
cross-validation folds and 49.01% mean accuracy in
the leave-one-subject-out cross validation. After the
same model was trained for 50 epochs, it achieved a
mean accuracy of 50.35% over the 10 cross-validation
folds and 50.12% in the leave-one-subject-out cross-
validation.
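The architecture can be sketched in Keras as follows; the LSTM width, the spectrogram image size, and the frozen VGG16 backbone are assumptions beyond what is stated above.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Frozen VGG16 (ImageNet weights) turns each of the 10 spectrogram frames of an
# event into a feature vector; the LSTM width (64) and the 224x224 input size
# are assumptions.
base = VGG16(weights="imagenet", include_top=False, pooling="avg",
             input_shape=(224, 224, 3))
base.trainable = False

model = models.Sequential([
    layers.TimeDistributed(base, input_shape=(10, 224, 224, 3)),  # 10 frames per stimulus
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dropout(0.5),                        # dropout as described in the text
    layers.Dense(1, activation="sigmoid"),      # binary pain / no-pain output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(spectrogram_sequences, labels, epochs=20, batch_size=64)
```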
5.2 Physiological Data
The physiological data were used to train random
forests, support vector machines, neural networks, au-
toencoders, and convolutional neural networks. 5-
fold cross-validation was used to tune the hyperpa-
rameters of the different models. Then leave-one-subject-out cross-validation was used for the final evaluation. Table 1 provides a summary of the
achieved results.
5.2.1 Random Forests
The hand-crafted features from the four physiologi-
cal modalities were concatenated and fed to random
forests. Two searches of the hyperparameter space
were conducted. In both, entropy was used to grow
the trees. In total, 240 combinations of hyperparameters were tried. The best setting had a maximum depth
of 20 levels per tree. All features were examined at
every node, and a minimum of 10 samples was needed
to create child nodes in the trees. A total of 100 deci-
sion trees were trained. This setting achieved a mean
accuracy of 83.36% on the 5 cross-validation folds
and a mean accuracy of 82.61% on the 40 subjects
of the leave-one-subject-out cross-validation.
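Expressed with scikit-learn (the library actually used is an assumption), the best setting corresponds roughly to the following configuration.

```python
from sklearn.ensemble import RandomForestClassifier

# Best setting reported above: entropy splits, 100 trees, maximum depth 20,
# all features examined at every node, at least 10 samples required to split a node.
rf = RandomForestClassifier(
    n_estimators=100,
    criterion="entropy",
    max_depth=20,
    max_features=None,       # examine all features at every node
    min_samples_split=10,
    n_jobs=-1,
)
# rf.fit(X_train, y_train); rf.score(X_test, y_test)
```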
5.2.2 Support Vector Machines
The hand-crafted features from the four physiological
modalities were concatenated and fed to support vec-
tor machines with both polynomial and Radial-Basis
Function (RBF) kernels.
With the RBF kernel, 195 hyperparameter combi-
nations were evaluated in total. The best combination
was C = 2000 and γ = 2 × 10⁻⁵. This combination
achieved a mean accuracy of 81.94% over the 5 cross-
validation folds and a mean accuracy of 81.34% over
the 40 subjects.
With the polynomial kernel, 297 hyperparameter
combinations were evaluated in total. The best com-
bination of hyperparameters was degree = 3, C = 20,
and γ = 2 × 10⁻³. This combination achieved a mean
accuracy of 78.78% over the 5 cross-validation folds
and a mean accuracy of 78.04% over the 40 subjects.
5.2.3 Neural Networks
Several architectures of neural networks with fully
connected layers were tested. The number of hidden
layers ranged from 2 to 4. The Exponential Linear
Unit (ELU) function was used as the activation func-
tion for the hidden layers. The sigmoid function was
used as the activation function for the output layer.
Batch normalization and He initialization were used
(Ioffe and Szegedy, 2015; He et al., 2015). Dropout
with a keep probability of 0.5 was applied to the out-
puts of all the hidden layers in the networks (Srivas-
tava et al., 2014). Binary cross-entropy was mini-
mized using the Adam optimizer to train the networks
(Kingma and Ba, 2014).
The best performing architecture had 3 hidden
layers with 32, 16, and 4 units respectively. It
achieved a mean accuracy of 81.96% on the 5 cross-
validation folds and a mean accuracy of 81.57% on
the 40 subjects.
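A Keras sketch of this best-performing architecture is given below; the exact ordering of batch normalization, activation, and dropout within each hidden block is an assumption.

```python
from tensorflow.keras import layers, models

def build_mlp(input_dim=311):
    """Fully connected network with the 32-16-4 hidden layers described above."""
    model = models.Sequential([layers.Input(shape=(input_dim,))])
    for units in (32, 16, 4):                                  # the three hidden layers
        model.add(layers.Dense(units, activation="elu",
                               kernel_initializer="he_normal"))  # ELU + He initialization
        model.add(layers.BatchNormalization())
        model.add(layers.Dropout(0.5))
    model.add(layers.Dense(1, activation="sigmoid"))           # binary output
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```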
5.2.4 Autoencoders
A single autoencoder was trained on all four modal-
ities at once to perform automatic feature extraction.
The input to this autoencoder consisted of 10240 nor-
malized signal values. The autoencoder consisted of
three hidden layers with 128, 64, and 128 units re-
spectively. The 64-unit layer was used as the encod-
ing layer. Batch normalization and He initialization
were used. The ELU function was used as the activa-
tion for the hidden layers of the autoencoder. A small
neural network consisting of two layers with 32 and
1 units respectively was used on top of the encoding
layer to perform the classification. This architecture
achieved a mean accuracy of 70.00% on the 40 sub-
jects in leave-one-subject-out cross-validation.
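A Keras sketch of this architecture is shown below; the reconstruction loss (mean squared error) and the training procedure are assumptions, since only the layer sizes and activations are stated above.

```python
from tensorflow.keras import layers, models

inputs = layers.Input(shape=(10240,))                  # 10240 normalized signal values
h = layers.Dense(128, activation="elu", kernel_initializer="he_normal")(inputs)
h = layers.BatchNormalization()(h)
encoding = layers.Dense(64, activation="elu", kernel_initializer="he_normal")(h)
h = layers.BatchNormalization()(encoding)
h = layers.Dense(128, activation="elu", kernel_initializer="he_normal")(h)
reconstruction = layers.Dense(10240)(h)                # reconstruct the input

autoencoder = models.Model(inputs, reconstruction)
autoencoder.compile(optimizer="adam", loss="mse")      # reconstruction objective (assumed MSE)

# Two-layer classifier on top of the 64-unit encoding layer
c = layers.Dense(32, activation="elu")(encoding)
pain = layers.Dense(1, activation="sigmoid")(c)
classifier = models.Model(inputs, pain)
classifier.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```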
5.2.5 Convolutional Neural Networks
A 1D convolutional neural network with 3 convolu-
tional layers was trained on the normalized signals
of the four modalities to perform automatic feature
extraction and classification (Kiranyaz et al., 2019).
Each layer had 96 filters, and each filter was of size
3. The three layers were followed by a global averag-
ing layer, which computes the average over each fil-
ter output independently. Finally, a dense layer with
one unit was used to output the classification results.
The binary cross-entropy function was used as the
cost function to train the network. The Rectified Lin-
ear Unit (ReLU) function was used as the activation
function for the convolutional layers. The sigmoid
function was used as the activation function for the
output layer. The model was trained using stochastic
gradient descent. This architecture achieved a mean
accuracy of 76.90% on the 40 subjects in leave-one-
subject-out cross-validation.
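A Keras sketch of this network is given below; the input length and channel layout are assumptions, while the filter counts, kernel size, pooling, activations, and optimizer follow the description above.

```python
from tensorflow.keras import layers, models

def build_1d_cnn(timesteps=2560, channels=4):
    """Three 1-D convolutional layers (96 filters of size 3 each), global average
    pooling, and a single sigmoid output unit; the input shape is an assumption."""
    model = models.Sequential([layers.Input(shape=(timesteps, channels))])
    for _ in range(3):
        model.add(layers.Conv1D(96, kernel_size=3, activation="relu", padding="same"))
    model.add(layers.GlobalAveragePooling1D())       # average over time per filter
    model.add(layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```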
Table 1: A summary of the results of the six algorithms
that were applied to the physiological modalities to per-
form pain recognition. The mean accuracy of leave-one-
subject-out cross-validation is presented. Random forests,
support vector machines (SVM), and neural networks used
hand-crafted features, while autoencoders and 1D convolu-
tional neural networks (CNN) performed automatic feature
extraction. For the autoencoders approach, a 2-layer neural
network was used as the classifier.
Algorithm                  Accuracy
Random Forests             82.61 ± 10.74%
Neural Network             81.57 ± 10.58%
RBF Kernel SVM             81.34 ± 10.62%
Polynomial Kernel SVM      78.04 ± 10.57%
1D CNN                     76.90 ± 12.60%
Autoencoders               70.00 ± 12.00%
6 CONCLUSION
In this paper, several machine learning algorithms
have been evaluated and compared on the pain recog-
nition task using data from the SenseEmotion dataset.
On para-linguistic data, support vector machines
were the best performing models in the leave-one-
subject-out cross-validation. The naive Bayes classifier was also able to achieve comparable results with
less processing time. Analysis of the trained models
shows that the MFCC values were the most important
features for the classifiers.
The physiological signals achieved higher accu-
racies. This high performance can be attributed to
the use of the electrodermal activity signal, which has
been confirmed as an effective indicator of pain. The
random forests achieved the highest performance on
the hand-crafted features. Support vector machines
and neural networks followed closely. Error analy-
sis has shown that the four models that used hand-
crafted features made similar mistakes in their pre-
dictions. Automatic feature extraction achieved lower
accuracies but produced different classification errors.
Therefore, a combination of both hand-crafted and
automatic features can be tested in the future and may
lead to improvements in accuracy.
In both approaches, the use of hand-crafted fea-
tures outperformed automatic feature extraction. The
difference is especially clear in the case of the
para-linguistic data, where the VGG16 and bidirectional LSTM model performed like a random classifier. This dis-
parity can be attributed to the data size; the dataset
does not contain enough data samples to train auto-
matic feature extractors that can generalize well. Fur-
thermore, the inter-subject variance in the values of
the signals further complicates automatic extraction.
Data augmentation can be tested in the future as one
method to deal with these problems.
Pain recognition remains a difficult task. The fu-
ture collection of more data, especially from a re-
alistic clinical setting, can help improve the perfor-
mance of the state-of-the-art machine learning algo-
rithms and can allow for the use of deep learning for
feature extraction. Data collection should be coupled
with the development of new methods to deal with
inter-subject variance, which stands in the way of gen-
eralizable results.
REFERENCES
Abdulaziz, Y. and Ahmad, S. M. S. (2010). Infant cry
recognition system: A comparison of system perfor-
mance based on mel frequency and linear prediction
cepstral coefficients. In 2010 International Confer-
ence on Information Retrieval & Knowledge Manage-
ment (CAMP), pages 260–263. IEEE.
Aung, M. S. H., Kaltwang, S., Romera-Paredes, B., Mar-
tinez, B., Singh, A., Cella, M., Valstar, M., Meng, H.,
Kemp, A., Shafizadeh, M., Elkins, A. C., Kanakam,
N., de Rothschild, A., Tyler, N., Watson, P. J.,
d. C. Williams, A. C., Pantic, M., and Bianchi-
Berthouze, N. (2016). The automatic detection of
chronic pain-related expression: Requirements, chal-
lenges and the multimodal emopain dataset. IEEE
Transactions on Affective Computing, 7(4):435–451.
Barajas-Montiel, S. E. and Reyes-García, C. A. (2006).
Fuzzy support vector machines for automatic infant
cry recognition. In Intelligent Computing in Signal
Processing and Pattern Recognition, pages 876–881.
Springer.
Boiten, F. A., Frijda, N. H., and Wientjes, C. J. E. (1994).
Emotions and respiratory patterns: review and critical
analysis. International journal of psychophysiology,
17(2):103–128.
Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). A
training algorithm for optimal margin classifiers. In
Proceedings of the fifth annual workshop on Compu-
tational learning theory, pages 144–152. ACM.
Breiman, L. (2001). Random forests. Machine learning,
45(1):5–32.
Cai, D., Ni, Z., Liu, W., Cai, W., Li, G., Li, M., Cai, D.,
Ni, Z., Liu, W., and Cai, W. (2017). End-to-end deep
learning framework for speech paralinguistics detec-
tion based on perception aware spectrum. In INTER-
SPEECH, pages 3452–3456.
Chang, C.-Y. and Li, J.-J. (2016). Application of deep learn-
ing for recognizing infant cries. In 2016 IEEE Interna-
tional Conference on Consumer Electronics-Taiwan
(ICCE-TW), pages 1–2. IEEE.
Colloca, L., Benedetti, F., and Pollo, A. (2006). Repeata-
bility of autonomic responses to pain anticipation and
pain stimulation. European Journal of Pain, 10(7):659
– 665.
Dave, N. (2013). Feature extraction methods LPC, PLP and
MFCC in speech recognition. International journal
for advance research in engineering and technology,
1(6):1–4.
Engel, G. L. (1959). Psychogenic pain and the pain-
prone patient. The American journal of medicine,
26(6):899–918.
Graves, A. and Schmidhuber, J. (2005). Framewise
phoneme classification with bidirectional LSTM and
other neural network architectures. Neural networks,
18(5-6):602–610.
Gruss, S., Treister, R., Werner, P., Traue, H. C., Crawcour,
S., Andrade, A., and Walter, S. (2015). Pain intensity
recognition rates via biopotential feature patterns with
support vector machines. PloS one, 10(10):e0140330.
Hammal, Z. and Cohn, J. F. (2014). Towards multimodal
pain assessment for research and clinical use. In Pro-
ceedings of the 2014 Workshop on Roadmapping the
Future of Multimodal Interaction Research including
Business Opportunities and Challenges, pages 13–17.
ACM.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving
deep into rectifiers: Surpassing human-level perfor-
mance on imagenet classification. In 2015 IEEE In-
ternational Conference on Computer Vision (ICCV),
pages 1026–1034.
Ioffe, S. and Szegedy, C. (2015). Batch normalization: Ac-
celerating deep network training by reducing internal
covariate shift. arXiv preprint arXiv:1502.03167.
Jafari, H., Courtois, I., Van den Bergh, O., Vlaeyen, J. W.,
and Van Diest, I. (2017). Pain and respiration: a sys-
tematic review. Pain, 158(6):995–1006.
Kächele, M., Thiam, P., Amirian, M., Schwenker, F., and
Palm, G. (2016). Methods for person-centered contin-
uous pain intensity assessment from bio-physiological
channels. IEEE Journal of Selected Topics in Signal
Processing, 10(5):854–864.
Kim, J. and André, E. (2008). Emotion recognition based on
physiological changes in music listening. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence,
30(12):2067–2083.
Kingma, D. and Ba, J. (2014). Adam: A method for
stochastic optimization. International Conference on
Learning Representations.
Kiranyaz, S., Avci, O., Abdeljaber, O., Ince, T., Gabbouj,
M., and Inman, D. J. (2019). 1d convolutional neural
networks and applications: A survey.
Ledowski, T., Ang, B., Schmarbeck, T., and Rhodes, J.
(2009). Monitoring of sympathetic tone to assess post-
operative pain: skin conductance vs surgical stress in-
dex. Anaesthesia, 64(7):727–731.
Loggia, M. L., Juneau, M., and Bushnell, M. C. (2011).
Autonomic responses to heat pain: Heart rate, skin
conductance, and their relation to verbal ratings and
stimulus intensity. PAIN, 152(3):592 – 598.
Lu, L., Zhang, H.-J., and Li, S. Z. (2003). Content-based au-
dio classification and segmentation by using support
vector machines. Multimedia systems, 8(6):482–492.
Lucey, P., Cohn, J. F., Prkachin, K. M., Solomon, P. E., and
Matthews, I. (2011). Painful data: The UNBC-McMaster
shoulder pain expression archive database. In Face
and Gesture 2011, pages 57–64.
Mallik, S., Chowdhury, D., and Chttopadhyay, M. (2019).
Development and performance analysis of a low-cost
mems microphone-based hearing aid with three dif-
ferent audio amplifiers. Innovations in Systems and
Software Engineering, 15(1):17–25.
Öhman, A., Esteves, F., Flykt, A., and Soares, J. J. F. (1993).
Gateways to Consciousness: Emotion, Attention, and
Electrodermal Activity, pages 137–157. Springer US,
Boston, MA.
Petroni, M., Malowany, A. S., Johnston, C. C., and Stevens,
B. J. (1995). Identification of pain from infant cry
vocalizations using artificial neural networks (anns).
In Applications and Science of Artificial Neural Net-
works, volume 2492, pages 729–739. International
Society for Optics and Photonics.
Picard, R. W. (2000). Emotions Are Physical and Cognitive,
pages 21–45. MIT press, Cambridge, MA.
Saki, F. and Kehtarnavaz, N. (2014). Background noise
classification using random forest tree classifier for
cochlear implant applications. In 2014 IEEE Inter-
national Conference on Acoustics, Speech and Signal
Processing (ICASSP), pages 3591–3595. IEEE.
Schwenker, F. (2018). Multimodal affect classifica-
tion using deep neural networks. Studientexte
zur Sprachkommunikation: Elektronische Sprachsig-
nalverarbeitung 2018, pages 240–246.
Sellner, J., Thiam, P., and Schwenker, F. (2018). Visualizing
facial expression features of pain and emotion data. In
IAPR Workshop on Multimodal Pattern Recognition of
Social Signals in Human-Computer Interaction, pages
101–115. Springer.
Simonyan, K. and Zisserman, A. (2014). Very deep con-
volutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I.,
and Salakhutdinov, R. (2014). Dropout: a simple way
to prevent neural networks from overfitting. The Jour-
nal of Machine Learning Research, 15(1):1929–1958.
Thiam, P., Kessler, V., Amirian, M., Bellmann, P., Layher,
G., Zhang, Y., Velana, M., Gruss, S., Walter, S., Traue,
H. C., Kim, J., Schork, D., Andre, E., Neumann, H.,
and Schwenker, F. (2019). Multi-modal pain inten-
sity recognition based on the senseemotion database.
IEEE Transactions on Affective Computing.
Thiam, P., Kessler, V., and Schwenker, F. (2017). Hierar-
chical combination of video features for personalised
pain level recognition. In ESANN.
Thiam, P. and Schwenker, F. (2017). Multi-modal data fu-
sion for pain intensity assessment and classification.
In 2017 Seventh International Conference on Image
Processing Theory, Tools and Applications (IPTA),
pages 1–6. IEEE.
Turk, D. C. and Gatchel, R. J. (2018). Psychological ap-
proaches to pain management: A practitioner’s hand-
book. Guilford publications.
Turk, D. C., Meichenbaum, D., and Genest, M. (1983). Pain
and behavioral medicine: A cognitive-behavioral per-
spective, volume 1. Guilford Press.
Velana, M., Gruss, S., Layher, G., Thiam, P., Zhang, Y.,
Schork, D., Kessler, V., Meudt, S., Neumann, H.,
Kim, J., Schwenker, F., André, E., Traue, H. C., and
Walter, S. (2016). The senseemotion database: A
multimodal database for the development and system-
atic validation of an automatic pain- and emotion-
recognition system. In MPRSS 2016, pages 127–139.
Wagner, J., Lingenfelser, F., Baur, T., Damian, I., Kistler,
F., and André, E. (2013). The social signal interpretation (SSI) framework: Multimodal signal processing
and recognition in real-time. In Proceedings of the
21st ACM International Conference on Multimedia,
MM ’13, pages 831–834, New York, NY, USA. ACM.
Walter, S., Gruss, S., Ehleiter, H., Tan, J., Traue, H. C.,
Werner, P., Al-Hamadi, A., Crawcour, S., Andrade,
A. O., and Moreira da Silva, G. (2013). The BioVid heat pain database: Data for the advancement and sys-
tematic validation of an automated pain recognition
system. In 2013 IEEE International Conference on
Cybernetics (CYBCO), pages 128–131.
Wang, J.-C., Lee, H.-P., Wang, J.-F., and Lin, C.-B. (2008).
Robust environmental sound recognition for home au-
tomation. IEEE Transactions on Automation Science
and Engineering, 5(1):25–31.
Zwakhalen, S. M., Hamers, J. P., Abu-Saad, H. H., and
Berger, M. P. (2006). Pain in elderly people with
severe dementia: a systematic review of behavioural
pain assessment tools. BMC geriatrics, 6(1):3.