Sentiment Analysis from Sound Spectrograms via Soft BoVW

and Temporal Structure Modelling

George Pikramenos

1,4

, Georgios Smyrnis

, Ioannis Vernikos

3,4

, Thomas Konidaris

Evaggelos Spyrou

3,4

and Stavros Perantonis

Department of Informatics & Telecommunications, National Kapodistrian University of Athens, Athens, Greece

School of Electrical & Computer Engineering, National Technical University of Athens, Athens, Greece

Department of Computer Science and Telecommunications, University of Thessaly, Lamia, Greece

Institute of Informatics & Telecommunications, NSCR-“Demokritos”, Athens, Greece

Keywords:

Sentiment Analysis, Speech Analysis, Bag-of-Visual-Words.

Abstract:

Monitoring and analysis of human sentiments is currently one of the hottest research topics in the ﬁeld of

human-computer interaction, having many applications. However, in order to become practical in daily life,

sentiment recognition techniques should analyze data collected in an unobtrusive way. For this reason, ana-

lyzing audio signals of human speech (as opposed to say biometrics) is considered key to potential emotion

recognition systems. In this work, we expand upon previous efforts to analyze speech signals using computer

vision techniques on their spectrograms. In particular, we utilize ORB descriptors on keypoints distributed

on a regular grid over the spectrogram to obtain an intermediate representation. Firstly, a technique similar to

Bag-of-Visual-Words (BoVW) is used, where a visual vocabulary is created by clustering keypoint descriptors,

but instead a soft candidacy score is used to construct the histogram descriptors of the signal. Furthermore,

a technique which takes into account the temporal structure of the spectrograms is examined, allowing for

effective model regularization. Both of these techniques are evaluated in several popular emotion recognition

datasets, with results indicating an improvement over the simple BoVW method.

1 INTRODUCTION

The recognition of human emotional states consti-

tutes one of the most recent trends in the broader re-

search area of human computer interaction (Cowie

et al., 2001). Several modalities are typically used

towards this goal; information is extracted by sen-

sors, either placed on the subject’s body or within

the user’s environment. In the ﬁrst case, one may

use e.g., physiological and/or inertial sensors, while

on the latter case cameras and/or microphones. Even

though video is becoming the main public means of

self-expression (Poria et al., 2016), people still con-

sider cameras to be more invasive than microphones

(Zeng et al., 2017). Body sensors offer another rea-

sonably practical alternative but they have been criti-

cized for causing discomfort, when used for extended

periods of time. Therefore, for many applications,

microphones are the preferred choice of monitoring

emotion.

As such, a plethora of research works in the area

of emotion recognition is based on the scenario where

microphones are placed within the subject’s environ-

ment, capturing its vocalized speech, which is com-

prised of a linguistic and a non-linguistic compo-

nent. The ﬁrst consists of articulated patterns, as pro-

nounced by the speaker, while the second captures

how the speaker pronounces these patterns. In other

words, the linguistic speech is what the subject said

and the non-linguistic speech is how the subject pro-

nounced it (Anagnostopoulos et al., 2015) (i.e., the

acoustic aspect of speech).

For several years, research efforts were mainly

based on hand-crafted features, extracted from non-

linguistic content. Such features are e.g., the rhythm,

pitch intensity etc. (Giannakopoulos and Pikrakis,

2014). The main advantage of non-linguistic ap-

proaches is that they are able to provide language-

independent models since they do not require a speech

recognition step; instead, recognition is based only on

pronunciation. Of course, the problem remains chal-

lenging in this case, since cultural particularities can

majorly affect the non-linguistic content, even when

dealing with data originating from a single language;

Pikramenos, G., Smyrnis, G., Vernikos, I., Konidaris, T., Spyrou, E. and Perantonis, S.

Sentiment Analysis from Sound Spectrograms via Soft BoVW and Temporal Structure Modelling.

DOI: 10.5220/0009174503610369

In Proceedings of the 9th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2020), pages 361-369

ISBN: 978-989-758-397-1; ISSN: 2184-4313

361

there exists a plethora of different sentences, speak-

ers, speaking styles and rates (El Ayadi et al., 2011).

In this work, we extend previous work (Spyrou

et al., 2019) and we propose an approach for emo-

tion recognition from speech, which is based on the

generation of spectrograms. Visual descriptors are

extracted from these spectrograms and a visual vo-

cabulary is generated. Contrary to previous work,

each spectrogram is described using soft histogram

features. In addition, a sequential representation is

utilized to capture the temporal structure of the origi-

nal audio signal. The rest of this paper is organized as

follows: In Section 2 certain works related to our sub-

ject of study are presented. In Section 3 we present

the proposed methodology for emotion recognition

from audio signals. Experiments are presented and

discussed in Section 4 and conclusions are drawn in

Section 5.

2 RELATED WORK AND

CONTRIBUTIONS

Several research efforts have been conducted towards

the goal of emotion recognition from human speech.

Wang and Guan make use of both low level audio fea-

tures, namely prosodic features, MFCCs and formant

frequencies, as well as visual features extracted from

the facial image of the speaker, for this task, demon-

strating a high level of accuracy when both modalities

are used (Wang and Guan, 2008). In a similar vein,

while also studying purely audio signals, Nogueiras et

al. make use of low level audio features, in conjunc-

tion with HMMs, for the emotion recognition task

(Nogueiras et al., 2001). Their results demonstrate

the value of classiﬁcation methods relying on sequen-

tial data, as well as that of features extracted directly

from the auditory aspects of speech.

Research on emotion recognition has also been

conducted on text data, which is can be thought as

an alternative representation of speech. In particular,

Binali et al. study the task of positive and negative

emotion recognition through text derived from online

sources (Binali et al., 2010). This is done via syntac-

tic and semantic preprocessing of text, before using

an SVM for the classiﬁcation of the data. Such lexi-

cal features can also be utilized for vocalized speech

analysis, as demonstrated by Rozgi

c et al., who make

use of an ASR system to extract the corresponding

text, before using both acoustic and lexical features

for emotion recognition (Rozgi

c et al., 2012). A sim-

ilar fusion of low level acoustic and lexical features,

using an SVM as a classiﬁer is performed by Jin et al.,

where the text transcription is available beforehand,

rather than extracted (Jin et al., 2015).

The task at hand has also been studied via the anal-

ysis of spectrograms corresponding to the audio sig-

nals of speech. Spyrou et al make use of visual fea-

tures extracted from keypoints of spectrograms, and a

Bag-of-Visual-Words (BoVW) representation is used

to describe the entire speech signal. In this represen-

tation, clusters are created from the feature vectors

corresponding to keypoint descriptors, and then a his-

togram, where each keypoint is assigned to its clos-

est word/cluster, is used to represent the signal (Spy-

rou et al., 2019). However, this representation might

be slightly rigid, since each keypoint is forcefully as-

signed to a single visual word, even if it is closely

related to more than one. Moreover, a single feature

vector is assigned to the entire audio signal, so no in-

formation is extracted from the temporal relations en-

coded in the spectrogram.

The value of these temporal relations in the task of

emotion recognition can be seen from several works

which make use of recurrent classiﬁers to obtain regu-

larized models for sentiment analysis. W

ollmer et al.

make use of Bidirectional LSTMs to recognize emo-

tion based on features derived from both the speech

and facial image of the speaker, demonstrating the ca-

pacity of this architecture for such a task (W

ollmer

et al., 2010). Lim et al. also use both LSTMs and

CNNs when analysing spectrograms for speech, with

similarly good results regarding the accuracy of emo-

tion recognition (Lim et al., 2016). Moreover, Trentin

et al. make use of probabilistic echo state networks

(π-ESNs) for the given task (Trentin et al., 2015).

They also note that recurrent networks of this alter-

native type not only perform very well on the task of

emotion recognition based on acoustic features, but

are also able to handle unlabeled data, since their un-

supervised nature allows them to increase the number

of distinct emotions as necessary. Finally, this tem-

poral structure can also be modeled solely via the use

of CNNs, as performed by Mao et al. (Mao et al.,

2014). In that work, the authors analyze the spec-

trograms of the audio corresponding to speech, us-

ing convolutional layers over the input to extract data

which preserves the structure of the image. After-

wards, discriminative features are extracted from this

analysis, leading to a model with great capacity in

emotion recognition from the corresponding speech

signals.

In this context, the main contributions of this work

are the use of a modiﬁed BoVW model which utilizes

soft histograms for the frequency of visual words in a

spectrogram, as well as the examination of a sequen-

tial representation of the spectrograms, where we con-

sider a soft histogram of visual words at each “posi-

ICPRAM 2020 - 9th International Conference on Pattern Recognition Applications and Methods

362

EMO-DBEMOVOSAVEE

Anger HappinessFear Sadness Neutral

Figure 1: Spectrograms produced by applying DTSTFT with window size 40ms and step size 20ms to randomly selected audio

samples from each of the considered emotions and datasets. The used audio clips have length 2s and where appropriately

cropped when necessary. The vertical axis corresponds to frequency while the horizontal axis corresponds to time (Figure

best viewed in color).

tion” of the sliding window used in the Discrete-Time

Short-Time Fourier Transform (see section 3.2). It is

empirically shown that for practical numbers of words

the representation relying on these soft histograms

leads to better classiﬁcation results compared to nor-

mal histogram representations in the emotion recog-

nition task. Furthermore, by making use of a sequen-

tial representation for the audio signal, we take into

account its temporal structure, allowing us to build

better performing models.

3 METHODOLOGY

3.1 Target Variables in Emotion

Recognition

In this subsection we give a few remarks regarding

the labeling of data for sentiment analysis. Most ex-

isting datasets (e.g., (Burkhardt et al., 2005), (Jack-

son and ul haq, 2011), (Costantini et al., 2014)) with

categorical labels utilize in some way the basic emo-

tions featured in Plutchik’s theory (Plutchik, 1980).

These are joy, trust, expectation, fear, sadness, dis-

gust, anger, and surprise. The boundaries between

these may be thin and one can argue that many af-

fective states are not well represented by this dis-

cretization. For this reason other datasets consider

real-valued target variables to capture emotional state

(Busso et al., 2008). Such characterizations of

emotions may vary but typically rely on the PAD

model (Mehrabian, 1995), which analyzes emotion

into Pleasure-Arousal-Dominance, each represented

by a real number value. Nevertheless, in this work

we treat sentiment recognition as a classiﬁcation task

as we feel that this approach leads to more intuitive

results and more clear validation.

3.2 Spectrogram Generation &

Descriptors

In the ﬁrst step of building our representation, we start

from a subsampled sound signal {s(t

)}

N−1

k=0

and use

the Discrete-Time Short-Time Fourier Transform (Gi-

annakopoulos and Pikrakis, 2014), to obtain a two-

dimensional spectrogram given by,

S(k, n) =

N−1

∑

k=0

s(t

)w(t

− nT

−

2kπit

, (1)

where

w(t − τ) =

(

1, |t − τ| ≤ T

0, o/w

(2)

and N

is the number of samples in each window.

The window slides over the entire signal using a step

size T

. The resulting spectrogram is converted to a

Sentiment Analysis from Sound Spectrograms via Soft BoVW and Temporal Structure Modelling

363

grayscale image I. In our performed experiments, the

window size was set to 40ms while the step size was

20ms and each audio clip was cropped to be of length

2s.

The next step is to deﬁne a grid of keypoints,

each speciﬁed by its pixel coordinates and a scale

value. For each keypoint an ORB descriptor is ob-

tained (Rublee et al., 2011). This replaces the choice

of SIFT/SURF (Lowe, 2004), (Bay et al., 2006) in

previous work (Spyrou et al., 2019) and has the bene-

ﬁt that unlike SIFT and SURF, ORB is free, increas-

ing the potential for utilizing our method in a real

emotion recognition system.

3.3 Extracting Visual Words & Soft

Histogram Features

After extracting the ORB descriptors for each im-

age, a pool P = {w

}

of visual words is created us-

ing some clustering technique on the entire corpus of

ORB descriptors for all images and keypoints. The

size of P , (i.e., the number of clusters used/produced)

corresponds to the number of visual words that will

be subsequently used to construct the histograms. In

(Spyrou et al., 2019), for a given pool of words, a

histogram representation of an image is obtained by

matching each descriptor to its nearest word and then

counting the instances for each word.

In this work, we instead use a soft histogram rep-

resentation. For a given keypoint descriptor x, we

compute its l

-distance from each word w

, denoted

(x), and obtain a vector h

with its i

component

given by,

−d

(x)

∑

j∈P

−d

(x)

. (3)

The representation of an image I as a soft histogram

is then given by,

h(I) =

∑

x∈I

. (4)

can be thought of as a soft candidacy score of x

for each word i. This is to be contrasted with the hard

candidacy score used for each keypoint in the simple

histogram approach in (Spyrou et al., 2019). Note that

if only one word w

is close enough to a keypoint x,

then h

≈ 1 and h

≈ 0 for each j 6= i. For such key-

points, the contribution of x to the soft histogram is

similar to its contribution in the hard case. However,

keypoints that have similar distances to many words

are better described by their soft candidacy scores.

Our methodology for this section is described in

more detail in Procedure 1 and in Figure 2.

Procedure 1: Pseudocode for soft-histogram extraction.

Input: Input: Data of subsampled audio signals

D, parameter set λ;

Result: Soft histogram representation of

elements in D;

# Compute Spectrograms

S ← {},{};

for each signal S in D do

ˆs ← DT ST FT (s,λ(T

));

keypoints ← get grid( ˆs,λ(resolution)) ;

descriptors ← ORB(keypoints);

S ←

S ∪ { ˆs};

D ←

D ∪ {descriptors};

end

# Compute Visual Vocabulary

words ← Clustering(

D,λ(vocabulary size));

# Compute Soft Histograms

H ← {};

for each set of descriptors d in

D do

histo ←

for each descriptor d in d do

histo ← histo + soft score(d,words);

end

H ← H ∪ {histo};

end

Note that if enough words are used, simple his-

tograms may offer a sufﬁciently good description.

Nevertheless, increasing the number of words by a lot

increases both preprocessing and inference complex-

ity. Soft histogram representations are thus beneﬁcial

because they allow us to achieve better results with

fewer words.

We may equivalently think of this procedure as

performing a form of fuzzy clustering to obtain a

fuzzy vocabulary of visual words. In fact an alter-

native methodology would be to perform some fuzzy

clustering algorithm, e.g., gaussian mixture clustering

(Theodoridis and Koutroumbas, 1999), and directly

sum the candidacy scores of each keypoint in a given

spectrogram to obtain a soft histogram representation.

Obvious other alternatives like using a different soft

scoring function are also possible.

3.4 Modelling Temporal Structure

An additional limitation of the BoVW model is that

it ignores the temporal structure of matched visual

words. Recall that the spectrogram columns trace a

sliding window in time. This fact can be exploited to

build more robust models for the classiﬁcation task.

In particular, our approach in this work is to represent

the spectrograms as a sequence of soft histograms

ICPRAM 2020 - 9th International Conference on Pattern Recognition Applications and Methods

364

Figure 2: A visual overview of the proposed signal representation for our emotion recognition scheme. A spectrogram

is produced from the audio signal and a grid of keypoints characterized by ORB descriptors is constructed. Clustering

of keypoint descriptors is performed, leading to a vocabulary of visual words corresponding to the obtained clusters. A

soft histogram is obtained by summing the soft candidacy scores of keypoint descriptors to each cluster. The resulting

representation can be fed into a classiﬁer to recognise a speaker’s affective state. (Figure best viewed in color).

of the visual words appearing in each column of the

spectrogram. In more detail, after each descriptor in

the spectrogram is matched to a visual word, each col-

umn of descriptors (including keypoints at all scales)

is converted to a soft histogram. The spectrogram is

then represented as the sequence of soft-histograms

appearing in its columns.

This sequential representation consisting of soft

histograms, encodes more information than the sim-

ple BoVW representation, which is insensitive to ran-

dom permutations of the columns of the spectrogram.

Such permutations of the columns lead to a spectro-

gram corresponding to an entirely different audio sig-

nal than the original, yet, both will have the same

BoVW representation. In other words, information

about the temporal structure of the audio signal is pre-

served by our sequential representation. As such, our

procedure may potentially boost the performance of

classiﬁers.

Our methodology for this section is described in

more detail in Procedure 2 and in Figure 3. The

spectrogram descriptors and visual vocabulary are ob-

tained by following the same steps as in Procedure 1.

Recurrent neural network architectures have

shown great success in processing sequential data and

are thus a natural candidate model for for processing

Procedure 2: Pseudocode for sequential soft-histogram

representation of spectrograms.

Input: Data of subsampled audio signals D,

parameter set λ;

Result: Sequential soft histogram representation

of elements in D;

D,words ← Procedure 1(D,λ);

seqH ← {};

for each set of descriptors d in

D do

seq ← [ ];

for each column of descriptors c in d do

histo ←

for each descriptor d in c do

histo ← histo + soft score(d,words);

end

seq.append(histo)

end

seqH ← seqH ∪ {seq};

end

spectrograms in our sequential representation.

Among other architectures, in our methodology

and experiments we propose a Long Short-Term

Memory (LSTM) network for emotion classiﬁcation

(Hochreiter and Schmidhuber, 1997). These networks

Sentiment Analysis from Sound Spectrograms via Soft BoVW and Temporal Structure Modelling

365

contain memory cells which are deﬁned by the fol-

lowing equations, as presented by Sak et al.:

= σ(W

t−1

+ b

)

= σ(W

f x

f m

t−1

f c

t−1

+ b

)

= f

 c

t−1

+ i

 g (W

t−1

+ b

) (5)

= σ(W

t−1

+ b

)

= o

 h (c

)

= φ(W

+ b

)

which, when applied iteratively for a given input se-

quence (x

,...,x

), produce a corresponding output

sequence (y

,...,y

) (Sak et al., 2014). This archi-

tecture is capable of recognising long-term dependen-

cies between items in the sequence, while also being

superior to alternative recurrent architectures in this

aspect.

4 EXPERIMENTS

In this section we describe the experimental proce-

dure we followed to provide evidence for the effec-

tiveness of our proposed methods in the sentiment

recognition problem. Our experiments are split in

two rounds. In the ﬁrst round, our aim is to show

that using soft histograms in the usual bag of words

model (Spyrou et al., 2019), provides better results

with fewer visual words. In the second round our aim

is to combine the ideas from section 3.4 to create a

classiﬁer that outperforms the simple BoVW models,

even using soft histograms.

For all our experiments, visual words were ob-

tained from the spectrograms using the mini-batch k-

means algorithm (Sculley, 2010). In particular, the

entire corpus was used for each dataset to extract the

words, i.e., the validation and test set too. In prac-

tice this can always be done; once we have a dataset

on which we want to recognise emotion, we can con-

struct a new pool of words which utilizes information

(through an unsupervised algorithm) from the unla-

beled set to improve prediction performance.

4.1 Datasets

The datasets utilized in our experiments consist of

the following popular emotion recognition datasets:

EMO-DB (Burkhardt et al., 2005), SAVEE (Jackson

and ul haq, 2011) and EMOVO (Costantini et al.,

2014). The considered emotions in our experiments

were Happiness, Sadness, Anger, Fear and Neutral

which were common to all considered datasets. For

each dataset and sample corresponding to these emo-

tions, a spectrogram was constructed from the corre-

sponding audio ﬁle, and a 50-by-50 grid of keypoints

was used at three different scales for a total of 7500

descriptors per recording. The spectrograms which

were extracted from the audio ﬁles corresponding to

each dataset can be seen in Figure 1.

4.2 Experimental Procedure

4.2.1 Feature Representation Comparison

For the ﬁrst round of experiments, we evaluated

our soft histogram representation using three popular

classiﬁers: SVM (with an RBF kernel), Random For-

est and k-nearest neighbors with Euclidean distance

as a proximity measure. Each was trained in the emo-

tion recognition task on different representations of

the dataset; soft and simple histograms resulting from

different pools of visual words. We also evaluated the

regular BoVW representation, in order to have a base-

line for comparison with our method.

We performed a 90% − 10% train-test split on

each dataset. For each representation, the values

of hyperparameters for each of the considered clas-

siﬁers, were also selected through grid search and

cross-validation over the training set. In particular,

we used 5 folds for cross validation. For each combi-

nation of hyperparameters, the one attaining the high-

est average accuracy among all folds was chosen as

the best. After obtaining the hyperparameters, the re-

sulting classiﬁer was trained on the entire training set,

and evaluated on the test set. In addition, three differ-

ent numbers of visual words were used and the entire

evaluation was repeated for each. These were 250,

500 and 1000.

4.2.2 Classiﬁcation Benchmarking

In this section we describe our methodology for the

second round of experiments. We built a recur-

rent neural network model using Long-Short Term

Memory (LSTM) units (Hochreiter and Schmidhu-

ber, 1997) to classify spectrograms represented as

sequences of soft histograms (see section 3.4). A

simple evaluation procedure was repeated for each

dataset and each choice of the number of visual words

used. A bidirectional LSTM network with a single

layer was used as a feature extractor and and a multi-

label logistic regression model was used on top for

classiﬁcation. For each trial, an 80% − 10% − 10%

train-validation-test split was made and the model was

trained on the training set using an early-stopping

callback for regularization, which monitored the ac-

curacy on the validation set. In particular, the accu-

racy on the validation set was monitored with a pa-

tience of 15 epochs.

ICPRAM 2020 - 9th International Conference on Pattern Recognition Applications and Methods

366

Table 1: Results for the ﬁrst round of experiments, where the histogram and soft-histogram representations are compared.

Dataset W.C. Method SVM KNN Random

Forest

EMO-DB

250

SMH 47.22 50.00 38.89

BoVW 44.44 33.33 38.89

500

SMH 52.78 41.67 52.78

BoVW 52.78 38.89 50.00

1000

SMH 55.56 47.50 55.00

BoVW 52.50 47.50 52.50

SAVEE

250

SMH 31.25 31.25 25.00

BoVW 28.13 31.25 25.00

500

SMH 52.78 48.89 61.11

BoVW 52.78 44.44 61.11

1000

SMH 58.33 52.78 58.33

BoVW 55.56 36.11 55.56

EMOVO

250

SMH 35.00 32.50 35.00

BoVW 30.00 27.50 32.50

500

SMH 32.50 42.50 35.00

BoVW 27.50 27.50 32.50

1000

SMH 32.50 40.00 50.00

BoVW 27.50 17.50 45.00

Table 2: Results for classiﬁcation benchmarking experiment. Accuracies are given in percentages. W.C. represents word

count and B represents the benchmark obtained from the previous round of experiments.

Dataset W.C.

Accuracy

Mean Max Min

EMO-DB

250 59.76 60.98 58.54 50.00

500 62.93 63.17 60.54 52.78

1000 56.48 57.34 56.10 55.56

SAVEE

250 37.77 44.44 30.55 31.25

500 65.08 69.16 62.22 61.11

1000 68.77 70.55 66.39 58.33

EMOVO

250 40.91 42.86 38.90 35.00

500 50.98 57.14 47.62 42.50

1000 50.12 53.64 46.53 50.00

The resulting model was evaluated on the test set

and 10 trials were made for each word count-dataset

setup. The mean, maximum and minimum achieved

accuracies are listed as percentages and contrasted

with the benchmarks obtained in the ﬁrst round of ex-

periments.

4.3 Results

The observed results on both experiments are in line

with our expectations. For the ﬁrst round, the soft

histogram representation is indeed found to aid clas-

siﬁcation for all tested classiﬁers relative to the simple

histogram representation. The results are summarized

in Table 1. In the second round of experiments, we

observe the added value of the sequential represen-

tation of spectrograms, through the much better ob-

tained classiﬁers relative to the BoVW model from

experiment 1. The results are summarized in Table 2.

We observe that although there is a signiﬁcant in-

crease in performance when going from a vocabu-

lary size of 250 to 500 for both methods, the perfor-

mance change when increasing the vocabulary size

from 500 to 1000 is not signiﬁcant and may even

lead to decreased performance. This is possibly be-

cause the signal-to-noise ratio is decreased for too

large visual vocabularies and this provides additional

evidence that using soft histogram representations is

beneﬁcial, since it leads to increased performance for

fewer words.

5 CONCLUSIONS

There is an increasing interest in emotion recognition

from audio signals, as sound recordings can be col-

Sentiment Analysis from Sound Spectrograms via Soft BoVW and Temporal Structure Modelling

367

…………………

...

Figure 3: A visual overview of the proposed sequential representation for our emotion recognition scheme. Similar steps to

procedure 1 are taken to produce a visual vocabulary. Then each column of the spectrogram, corresponding to a temporal

“position” of the sliding window used in the DTSTFT transformation of the audio signal is converted to soft histogram in the

obvious manner. The sequence of soft histograms produced for a single audio signal is the passed to an LSTM classiﬁer to

recognise the speaker’s sentiment. (Figure best viewed in color).

lected without causing as much discomfort to the un-

derlying subjects as other methods (e.g., video, body

sensors e.t.c.). For this reason, we have explored

new methods for analyzing audio signals of vocalized

speech for the purpose of recognizing the affective

state exhibited by the speaker, which build on spec-

trogram analysis methods found in the literature.

We conclude that the proposed approaches to sen-

timent recognition from sound spectrograms offer a

signiﬁcant improvement over previous work. In par-

ticular, by combining the soft histogram representa-

tion of visual words along with temporal structure

modelling, we are able to obtain much better classi-

ﬁers in terms of accuracy. Another upside is that we

replaced the use of SIFT/SURF (which previous work

utilized) with ORB, which is a free alternative, with-

out loss in model quality.

Overall, the explored approach to emotion recog-

nition offers good performance on data collected

through a non-invasive method and can be found use-

ful in many HCI applications including assisted liv-

ing and personalizing content. Moreover, because our

method relies on low-level features (spectrograms) it

leads to language independent models and we have

empirically veriﬁed its good performance on two dif-

ferent languages (English and German) for the same

sentiment analysis tasks.

ACKNOWLEDGEMENTS

This research has been co-ﬁnanced by the European

Union and Greek national funds through the Oper-

ational Program Competitiveness, Entrepreneurship

and Innovation, under the call RESEARCH – CRE-

ATE – INNOVATE (project code: 1EDK-02070).

REFERENCES

Anagnostopoulos, C.-N., Iliou, T., and Giannoukos, I.

(2015). Features and classiﬁers for emotion recog-

nition from speech: a survey from 2000 to 2011. Ar-

tiﬁcial Intelligence Review, 43(2):155–177.

Bay, H., Tuytelaars, T., and Van Gool, L. (2006). Surf:

Speeded up robust features. In European conference

on computer vision, pages 404–417. Springer.

Binali, H., Wu, C., and Potdar, V. (2010). Computational

approaches for emotion detection in text. In 4th IEEE

International Conference on Digital Ecosystems and

Technologies, pages 172–177. IEEE.

Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W. F.,

and Weiss, B. (2005). A database of german emotional

speech. In Ninth European Conference on Speech

Communication and Technology.

Busso, C., Bulut, M., Lee, C.-C., Kazemzadeh, A., Mower,

E., Kim, S., Chang, J. N., Lee, S., and Narayanan,

ICPRAM 2020 - 9th International Conference on Pattern Recognition Applications and Methods

368

S. S. (2008). Iemocap: Interactive emotional dyadic

motion capture database. Language resources and

evaluation, 42(4):335.

Costantini, G., Iadarola, I., Paoloni, and Todisco, M.

(2014). Emovo corpus: an italian emotional speech

database.

Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis,

G., Kollias, S., Fellenz, W., and Taylor, J. G. (2001).

Emotion recognition in human-computer interaction.

IEEE Signal processing magazine, 18(1):32–80.

El Ayadi, M., Kamel, M. S., and Karray, F. (2011). Sur-

vey on speech emotion recognition: Features, classi-

ﬁcation schemes, and databases. Pattern Recognition,

44(3):572–587.

Giannakopoulos, T. and Pikrakis, A. (2014). Introduction

to audio analysis: a MATLAB

 approach. Academic

Press.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term

memory. Neural computation, 9(8):1735–1780.

Jackson, P. and ul haq, S. (2011). Surrey audio-visual ex-

pressed emotion (savee) database.

Jin, Q., Li, C., Chen, S., and Wu, H. (2015). Speech emo-

tion recognition with acoustic and lexical features.

In 2015 IEEE international conference on acoustics,

speech and signal processing (ICASSP), pages 4749–

4753. IEEE.

Lim, W., Jang, D., and Lee, T. (2016). Speech emotion

recognition using convolutional and recurrent neural

networks. In 2016 Asia-Paciﬁc Signal and Informa-

tion Processing Association Annual Summit and Con-

ference (APSIPA), pages 1–4. IEEE.

Lowe, D. G. (2004). Distinctive image features from scale-

invariant keypoints. International journal of computer

vision, 60(2):91–110.

Mao, Q., Dong, M., Huang, Z., and Zhan, Y. (2014). Learn-

ing salient features for speech emotion recognition us-

ing convolutional neural networks. IEEE transactions

on multimedia, 16(8):2203–2213.

Mehrabian, A. (1995). Framework for a comprehensive de-

scription and measurement of emotional states. Ge-

netic, social, and general psychology monographs.

Nogueiras, A., Moreno, A., Bonafonte, A., and Mari

no,

J. B. (2001). Speech emotion recognition using hid-

den markov models. In Seventh European Conference

on Speech Communication and Technology.

Plutchik, R. (1980). A general psychoevolutionary theory of

emotion. In Theories of emotion, pages 3–33. Elsevier.

Poria, S., Chaturvedi, I., Cambria, E., and Hussain, A.

(2016). Convolutional mkl based multimodal emo-

tion recognition and sentiment analysis. In 2016

IEEE 16th international conference on data mining

(ICDM), pages 439–448. IEEE.

Rozgi

c, V., Ananthakrishnan, S., Saleem, S., Kumar, R.,

Vembu, A. N., and Prasad, R. (2012). Emotion recog-

nition using acoustic and lexical features. In Thir-

teenth Annual Conference of the International Speech

Communication Association.

Rublee, E., Rabaud, V., Konolige, K., and Bradski, G. R.

(2011). Orb: An efﬁcient alternative to sift or surf. In

ICCV, volume 11, page 2. Citeseer.

Sak, H., Senior, A., and Beaufays, F. (2014). Long short-

term memory recurrent neural network architectures

for large scale acoustic modeling. In Fifteenth annual

conference of the international speech communication

association.

Sculley, D. (2010). Web-scale k-means clustering. In

Proceedings of the 19th international conference on

World wide web, pages 1177–1178. ACM.

Spyrou, E., Nikopoulou, R., Vernikos, I., and Mylonas, P.

(2019). Emotion recognition from speech using the

bag-of-visual words on audio segment spectrograms.

Technologies, 7(1):20.

Theodoridis, S. and Koutroumbas, K. D. (1999). Pattern

recognition. IEEE Trans. Neural Networks, 19:376.

Trentin, E., Scherer, S., and Schwenker, F. (2015). Emo-

tion recognition from speech signals via a probabilis-

tic echo-state network. Pattern Recognition Letters,

66:4–12.

Wang, Y. and Guan, L. (2008). Recognizing human emo-

tional state from audiovisual signals. IEEE transac-

tions on multimedia, 10(5):936–946.

ollmer, M., Metallinou, A., Eyben, F., Schuller, B., and

Narayanan, S. (2010). Context-sensitive multimodal

emotion recognition from speech and facial expres-

sion using bidirectional lstm modeling. In Proc. IN-

TERSPEECH 2010, Makuhari, Japan, pages 2362–

2365.

Zeng, E., Mare, S., and Roesner, F. (2017). End user se-

curity and privacy concerns with smart homes. In

Thirteenth Symposium on Usable Privacy and Secu-

rity ({SOUPS} 2017), pages 65–80.

Sentiment Analysis from Sound Spectrograms via Soft BoVW and Temporal Structure Modelling

369