Acoustic Detection of Violence in Real and Fictional Environments
Marta Bautista-Durán, Joaquín García-Gómez, Roberto Gil-Pita, Héctor Sánchez-Hevia,
Inma Mohino-Herranz and Manuel Rosa-Zurera
Signal Theory and Communications Department, University of Alcalá, 28805 Alcalá de Henares, Madrid, Spain
{marta.bautista, joaquin.garciagomez}@edu.uah.es
Keywords:
Violence Detection, Audio Processing, Feature Selection, Real Environment, Fictional Environment.
Abstract:
Detecting violence is an important task due to the number of people who suffer its effects daily. There is a
tendency to focus the problem either on real situations or on fictional ones, but both are useful in their own
right. To date there has been no clear effort to relate both environments. In this work we try to detect violent
situations on two different acoustic databases by crossing information from one of them into the other. The
system has been divided into three stages: feature extraction, feature selection based on genetic algorithms,
and classification to take a binary decision. Results focus on comparing the performance loss when a database
is evaluated with features selected on itself against a selection based on the other database. In general, complex
classifiers tend to suffer higher losses, whereas simple classifiers, such as linear and quadratic detectors, offer
less than a 10% loss in most situations.
1 INTRODUCTION
The term violence has a subjective connotation, but the World Health Organization defined violence as “the
intentional use of physical force or power, threatened or actual, against oneself, another person, or against
a group or community, which either results in or has a high likelihood of resulting in injury, death,
psychological harm, maldevelopment, or deprivation” (Krug et al., 2002). There are many other valid
definitions, such as “physical violence or accident resulting in human injury or pain” (Demarty et al., 2012),
“a series of human actions accompanying with bleeding” (Chen et al., 2011) or “any situation or action that
may cause physical or mental harm to one or more persons” (Giannakopoulos et al., 2006). In the context of
this work, the kinds of actions that will be considered as violence are shouting and hits.
Violence can take place in multiple environments and in multiple ways. It is important to obtain a method
capable of detecting violent situations in their early stages with the aim of stopping them or preventing them
from escalating.
Some related work in the literature is based on multimedia content, such as (Demarty et al., 2012),
(Xu et al., 2005), or (Nam et al., 1998), where the database is composed of audio and video signals extracted
from movies. With a mixed setup it is possible to detect violent content from ‘bloody’ scenes, or simply from
the behavior of people extracted from the video. If the task is to detect violence in real environments, using
cameras entails a privacy intrusion that can be avoided using audio alone. That is why our purpose is to
evaluate only the audio.
These studies have been done using pretended violence from films, which can distort the generalization of
the results when presented with actual violence. Violence detection is an emerging field related to smart
cities. For that reason, the objective of this work is to evaluate the results when data from both real and
pretended scenarios are combined.
In this paper two different kinds of violence have been considered. On the one hand, actual violent
situations, where the audio has been taken directly from real recordings. On the other hand, fictional
situations, where the data is composed of various audio clips extracted from film scenes. The possible
applications of violence detection in real scenarios have been explained in detail in (García-Gómez et al.,
2016). Fictional violence detection can be a useful tool for content tagging in videogames or movies, in
order to ensure child protection.
There are some reasons why we have distinguished between these two situations. One of them is that in real
scenarios the signals are not preprocessed, unlike fictional scenarios where the signals are heavily modified
by different factors. This preprocessing
modifies the properties of the audio signal. Another reason is the different situations in which they take
place. In real environments the sound is very different from that in fictional ones: hits or speech in real
environments may have background noise, and the speech loudness varies over time. Movies recreate sound as
much as the actions taking place on screen. Audio tracks are commonly composed of a series of carefully
chosen sounds with the main objective of being pleasing to the ear. The objective of this paper is to evaluate
the performance of a single violence detection system when exposed to sounds coming from these two sources.
This will be explained in detail below.
This paper is structured as follows. First, Section 2 introduces the implemented detector system, the
feature extraction (Subsection 2.1) and the feature selection (Subsection 2.2). Then, Section 3 describes the
experiments and results, including the description of the database (Subsection 3.1), the description of the
experiments (Subsection 3.2) and the discussion of the results (Subsection 3.3). Finally, Section 4 presents
the conclusions.
2 PROPOSED SYSTEM

The proposed system has the aim of resolving the violence detection problem in both real and fictional
environments separately, as well as comparing the performance when they are combined. As we previously
stated, the system will be based only on audio, which will be processed to extract useful information; the
data will then be classified to make a decision every T seconds. The scheme of the system is shown in Figure 1.

[Figure 1: Proposed system. Block diagram: feature extraction followed by violence detection.]
In this study three different classifiers will be tested: a Least Squares Linear Detector (LSLD), a
simplified version of the Least Squares Quadratic Detector (LSQD) and a Neural Network based Detector with
5 hidden neurons. All of them are explained in detail in (García-Gómez et al., 2016).
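The detectors themselves are specified in (García-Gómez et al., 2016). As a rough, non-authoritative
illustration of the idea, the sketch below fits a least-squares linear detector with NumPy and builds a
simplified quadratic variant by augmenting the feature vector with squared terms; the squared-feature
expansion is our assumption of what the simplification looks like, not the paper's definition.

```python
import numpy as np

def fit_lsld(X, y):
    """Least Squares Linear Detector: solve Xb @ w ~ y in the LS sense.

    X: (n_samples, n_features) feature matrix.
    y: (n_samples,) labels, 1 for violence and 0 otherwise.
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append bias column
    w, *_ = np.linalg.lstsq(Xb, y.astype(float), rcond=None)
    return w

def detect(X, w, threshold=0.5):
    """Score the samples and take the binary violence decision."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    scores = Xb @ w
    return scores > threshold, scores

def quadratic_expand(X):
    """Simplified quadratic detector input: the original features plus their
    squares (a common reduction of the full pairwise-product expansion)."""
    return np.hstack([X, X ** 2])
```

Training the quadratic variant then amounts to calling fit_lsld(quadratic_expand(X), y); the neural network
detector would replace the least-squares fit with backpropagation over 5 hidden neurons.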
2.1 Feature Extraction

The objective of the feature extraction is to process the input audio signal in order to obtain useful
information that helps the classification algorithm to properly detect violent situations. Features have been
evaluated in the frequency or time domain. In order to evaluate these features, the audio segments have been
divided into S frames of 400 ms length with an overlap of 95% (a sketch of this framing is given after the
feature list). The evaluated features are:
Mel-Frequency Cepstral Coefficients (MFCCs)
Mel-Frequency Cepstral Coefficients have been computed from the Short Time Fourier Transform (STFT). MFCCs
are commonly used in speech recognition due to the fact that the Mel scale divides the frequency bands in a
similar way to the human ear. The information provided by this feature is a compact representation of the
spectral envelope, so most of the signal energy is located in the first coefficients. We are using 25
coefficients. The statistics applied to this feature are: mean, standard deviation (std), maximum (max) and
median (the latter two only for some MFCCs).
Delta Mel-Frequency Cepstral Coefficients (ΔMFCCs)
This feature is extracted from the previous one, and represents the difference between two consecutive
MFCCs. The implementation details are presented in (Mohino et al., 2013). The statistics applied to this
feature are: mean and standard deviation.
Pitch
Also named fundamental frequency, this feature determines the tone of the speech and can be used to
distinguish between persons (Gil-Pita et al., 2015). In order to get this measure, the prediction error is
obtained by filtering the audio frames with the linear prediction coefficients, and then the autocorrelation
of the error is evaluated. If the highest peak of the autocorrelation exceeds 20% of its maximum value, the
frame is considered voiced; otherwise, unvoiced (Mohino et al., 2013). The statistics applied to this measure
are: mean and standard deviation.
Harmonic Noise Rate (HNR)
The Harmonic Noise Rate measures the relationship between the harmonic energy produced by the vocal cords
and the non-harmonic energy present in the signal (Mohino et al., 2011). The statistics applied to this
measure are: mean and standard deviation.
Short Time Energy (STE)
Short Time Energy is the energy of a short speech segment. This parameter is considered a good feature to
differentiate between voiced and unvoiced frames (Jalil et al., 2013). The statistics applied to this measure
are: mean and standard deviation.
Energy Entropy (EE)
The Energy Entropy expresses abrupt changes in the energy level of the audio signal. This is useful for
detecting violence due to rapid changes occurring in the tone of voice. To obtain this parameter the frames
are subdivided into small subframes. The statistics applied to this measure are: mean, standard deviation,
maximum, and the ratios of maximum to mean and maximum to median value.
Zero Crossing Rate (ZCR)
The Zero Crossing Rate shows how quickly the power spectrum of a signal frame is changing in relation to the
previous one (Giannakopoulos et al., 2006). The statistics applied to this measure are: mean and standard
deviation.
Spectral Flux (SF)
This feature is evaluated in the frequency domain. It represents the squared difference between the
normalized magnitudes of successive spectral distributions (Tzanetakis and Cook, 2002). The statistics
applied to this measure are: mean and standard deviation.
Spectral Rolloff (SR)
This measure represents the skewness of the spectral shape (Giannakopoulos et al., 2006). It is defined as
the frequency below which a given percentage of the magnitude distribution of the Discrete Fourier Transform
(DFT) coefficients is concentrated for each frame. Different information can be extracted from music, speech
or gunshots, so it might be interesting for violence detection (García-Gómez et al., 2016). The statistics
applied to this measure are: mean and standard deviation.
Spectral Centroid (SC)
The Spectral Centroid, studied in the frequency domain, is defined as the center of gravity of the magnitude
spectrum of the STFT (Tzanetakis and Cook, 2002). The statistics applied to this measure are: mean and
standard deviation.
Ratio of Unvoiced Time Frames (RUF)
This value is associated with the presence or absence of strong speech in the analyzed audio. For that, the
proportion of unvoiced frames is evaluated (García-Gómez et al., 2016).
Spectrum (SP)
This measure corresponds to the DFT of the signal (Doukas and Maglogiannis, 2011). The statistics applied to
this measure are: maximum and standard deviation.
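The following is a minimal sketch of the framing and of a few of the frame-level measures above (Short Time
Energy, Zero Crossing Rate and Energy Entropy), assuming NumPy. The number of subframes used for the Energy
Entropy and the exact statistic set are our illustrative choices, not values given in the paper.

```python
import numpy as np

FS = 22050                    # sampling frequency used in the paper (Hz)
FRAME_LEN = int(0.400 * FS)   # 400 ms frames
HOP = int(FRAME_LEN * 0.05)   # 95% overlap, i.e. a hop of 5% of a frame

def frame_signal(x, frame_len=FRAME_LEN, hop=HOP):
    """Split a 1-D audio segment into S overlapping frames."""
    n = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def short_time_energy(frames):
    return np.mean(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    return np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)

def energy_entropy(frames, n_sub=10):
    """Entropy of the energy distribution over subframes of each frame."""
    sub = frames.reshape(frames.shape[0], n_sub, -1)
    e = np.sum(sub ** 2, axis=2)
    p = e / (np.sum(e, axis=1, keepdims=True) + 1e-12)
    return -np.sum(p * np.log2(p + 1e-12), axis=1)

def feature_vector(x):
    """Per-segment feature vector: statistics over frame-level measures."""
    frames = frame_signal(x)
    feats = []
    for m in (short_time_energy(frames),
              zero_crossing_rate(frames),
              energy_entropy(frames)):
        feats += [m.mean(), m.std()]
    return np.array(feats)
```

The full 51-feature set of the paper adds, among others, the MFCC, ΔMFCC and pitch related measures computed
over the same frames.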
2.2 Selecting Features

In order to select the best features, we have resorted to a Genetic Algorithm (GA), which is based on the
random exchange of features between the individuals of a population. This population represents the possible
set of solutions for the problem. The GA involves four steps: creation of the population, individual
selection, crossover and mutation. After the first iteration, the algorithm goes back to the selection step
and repeats cyclically. The parameter to be optimized is the probability of detection for a given probability
of false alarm. In order to maximize this value, features are ranked according to their performance. By using
only the best features, the performance will increase and the computational cost of the implementation will
decrease. The classifiers used in the optimization process were the LSLD and the LSQD, as in (García-Gómez
et al., 2016), to soften the computational cost, which is far lower than that of neural networks.

The parameters used are the same as in (García-Gómez et al., 2016): 51 total features, 20 selected features,
100 individuals, 10 parents, 90 generated offspring, a mutation probability of 4%, 30 iterations and 10
repetitions of the GA.
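Under those parameters, a plausible reading of the selection loop is sketched below. The uniform crossover,
the elitist parent selection and the mask-repair step are our assumptions, since the paper does not spell
them out; score_fn stands for the fitness evaluation, i.e. the probability of detection at the working
false-alarm rate obtained with the LSLD or LSQD.

```python
import numpy as np

rng = np.random.default_rng(0)

def _repair(mask, n_sel):
    """Force a boolean mask to have exactly n_sel selected features."""
    on, off = np.flatnonzero(mask), np.flatnonzero(~mask)
    if on.size > n_sel:
        mask[rng.choice(on, on.size - n_sel, replace=False)] = False
    elif on.size < n_sel:
        mask[rng.choice(off, n_sel - on.size, replace=False)] = True
    return mask

def ga_select(score_fn, n_total=51, n_sel=20, n_pop=100,
              n_parents=10, n_children=90, p_mut=0.04, n_iter=30):
    # Creation of the population: random masks with n_sel active features.
    pop = [_repair(np.zeros(n_total, bool), n_sel) for _ in range(n_pop)]
    for _ in range(n_iter):
        # Individual selection: keep the best-scoring parents (elitism).
        scores = np.array([score_fn(m) for m in pop])
        parents = [pop[i] for i in np.argsort(scores)[::-1][:n_parents]]
        # Crossover and mutation to generate the offspring.
        children = []
        for _ in range(n_children):
            a, b = rng.choice(n_parents, 2, replace=False)
            child = np.where(rng.random(n_total) < 0.5, parents[a], parents[b])
            child ^= rng.random(n_total) < p_mut   # random bit flips
            children.append(_repair(child, n_sel))
        pop = parents + children
    scores = np.array([score_fn(m) for m in pop])
    return pop[int(np.argmax(scores))]
```

The paper runs 10 independent repetitions of this procedure; the occurrence percentages reported later in
Tables 5 and 6 count how often each feature appears across the repeated runs.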
3 EXPERIMENTS AND RESULTS

The main objective of the paper is to study the relation between actual violence and violence recreations
from movies. Because of that, a set of experiments has been carried out using two different databases, both
sampled at a frequency of 22,050 Hz and composed of audio segments of 5 seconds length. This segment length
was selected due to its performance when compared to other values.
3.1 Database Description

In order to carry out the experiments of this paper, we need two different databases: one composed of real
world audio and another built from films. The first one was developed in (García-Gómez et al., 2016), so we
will use the same one to ease comparison. The details of this database are summarized in Table 1.

The new database shares most of its important properties with the old one, such as the percentage of violence
(around 10%) and the sampling frequency (22,050 Hz). The film database is composed of small extracts from
films (between tens of seconds and a few minutes), labeled according to the kind of content so as to indicate
when a violent situation is taking place. Its summarized properties are detailed in Table 2.
Table 1: Summary of the real world database.
Parameters Value
Total duration 27,802 s
Violence duration 3,051 s
Percentage of violence 10.97%
Number of fragments 109
Minimum audio length 1.51 s
Maximum audio length 4,966 s
Table 2: Summary of the movie database.
Parameters Value
Total duration 15,701 s
Violence duration 1,466 s
Percentage of violence 9.34%
Number of fragments 902
Number of films 119
Minimum audio length 15 s
Maximum audio length 126.30 s
In order to get a database suitable for a general study, many film genres have been included, such as: action
(Aliens, The Avengers, The Dark Knight), comedy (Anchorman, Balls of Fury, You, Me and Dupree), fantasy
(Avatar, The Chronicles of Narnia, Harry Potter and the Half-Blood Prince), drama (Braveheart, Cast Away,
Gettysburg), horror (I Know What You Did Last Summer, The Ring, Red Dawn) and others.
3.2 Description of the Experiments

In this study two different kinds of experiments are considered. First the system is trained and tested with
one of the databases; then the databases are crossed. That is to say, the training step is performed with the
real database and the test step with the fictional one, or vice versa. The procedure when only one database
is used is explained in detail in (García-Gómez et al., 2016), Section 3. For this study, the same process
has been done over the film database.
Figure 2 shows a block diagram that describes the process carried out in the experiment using both databases.
In this case the process differs beyond the feature extraction stage: the training set is composed of one
database and the test set of the other one, so the classification process is completely different.
In this case, feature selection has been done with the whole database, while in the previous case we applied
k-fold cross-validation, so all but one of the subsets were used (García-Gómez et al., 2016).
[Figure 2: Block diagram of the experiments. Audio frames are extracted and features computed for both
databases; the best features are selected and the detector trained on one database, while the test is
performed on the other, producing the final decision.]

The way to extract the best features is the same in the two experiments: the k-fold cross-validation process
is done in both cases to avoid generalization loss, and the classification methods used are the same, LSLD
and LSQD. The division of the film database has been done in K subsets, where the set of signals of each film
corresponds to one subset.
The selected features are then applied to the training set (composed of the same database), and the test step
is done over the other database. This classification process differs from the previous one because there is
no k-fold cross-validation, due to the use of two databases. In the previous experiment only one database was
available and this step was useful to maintain generalization.
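A minimal sketch of this crossed evaluation is given below, reusing the hypothetical fit_lsld, detect and
ga_select helpers from the earlier sketches. Measuring the probability of detection by thresholding at the
score quantile that yields the target false-alarm rate is one common way to obtain the operating points
reported next, not necessarily the authors' exact procedure.

```python
import numpy as np

def pd_at_pfa(scores, labels, pfa=0.05):
    """Pd at a fixed Pfa: threshold at the (1 - pfa) quantile of the
    non-violent scores, then measure detections on the violent class."""
    thr = np.quantile(scores[labels == 0], 1.0 - pfa)
    return np.mean(scores[labels == 1] > thr)

def crossed_evaluation(X_train, y_train, X_test, y_test, mask, pfa=0.05):
    """Train on one full database and test on the other, using the
    feature subset selected by the GA on the training database."""
    w = fit_lsld(X_train[:, mask], y_train)
    _, scores = detect(X_test[:, mask], w)
    return pd_at_pfa(scores, y_test, pfa)

# Hypothetical usage, with (X_film, y_film) and (X_real, y_real) being the
# per-segment feature matrices and labels of the two databases:
# mask = ga_select(score_on_films)        # features selected on films
# pd = crossed_evaluation(X_film, y_film, X_real, y_real, mask, pfa=0.05)
```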
3.3 Results Discussion

This section shows the results obtained from the experiments explained in the previous sections. We will
mainly focus on two aspects: the probability of detection as a function of the probability of false alarm,
and the features selected for both databases.
Figure 3 shows the Probability of Detection versus the Probability of False Alarm obtained for the different
classifiers with the films database. The solid line corresponds to the movie-based training, while the dashed
line represents the training with real signals.
[Figure 3: Probability of Detection versus Probability of False Alarm obtained for the film database. Curves:
Linear FS and LSLD, Quadratic FS and LSQD, Linear FS and MLP, under both training conditions.]
As might be expected, the best results are obtained when the same database is used both in training and test
(solid line). The best performance corresponds to the Quadratic Feature Selection (FS) and LSQD, followed by
the Linear FS and LSLD. When the real world database is used to train, the relative performance of the
detectors is similar.
Nevertheless, the most important aspect is to compare the performance when using a single database against
crossing them. This is shown in Table 3, where the probability of detection is displayed as a function of
some low probabilities of false alarm (2%, 5% and 10%). We distinguish between training and testing with the
same database or crossing them. The loss parameter represents how much the performance decreases when using
different databases, and the average loss parameter is used to compare the different classifiers over the set
of probabilities of false alarm.
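As a worked reading of Table 3 for the linear detector (all numbers taken directly from the table):

```python
# Loss = Pd(same database) - Pd(crossed databases), in percentage points.
pd_same    = [24.12, 46.49, 67.54]  # films training / films test, Pfa = 2, 5, 10 %
pd_crossed = [18.42, 41.23, 57.46]  # real training / films test
losses = [s - c for s, c in zip(pd_same, pd_crossed)]  # [5.70, 5.26, 10.08]
avg_loss = sum(losses) / len(losses)                   # about 7.01, as reported
```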
In view of the results obtained here, the linear detector is the most resistant to database changes, since
its average loss over the set of probabilities of false alarm is approximately 7%. The quadratic detector and
the one based on Neural Networks behave similarly, with losses higher than 10%. If we focus on low
probabilities of false alarm (2% and 5%), the linear detector works better than the others, with losses of
5.70% and 5.26%. Interestingly, Neural Networks are the best option for higher probabilities of false alarm
(10%), with a loss of only 4.82%.
Now we will focus on the other crossover between databases. Figure 4 shows the Probability of Detection
versus the Probability of False Alarm obtained for the different classifiers with the real world database.
The solid line corresponds to training on the real world database, while the dashed line represents training
on the films database. The latter can be very interesting for situations where we do not have access to
violent data from the real world and we have to design the algorithm using films, videogames or other
substitutes.

[Figure 4: Probability of Detection versus Probability of False Alarm obtained for the real world database.
Curves: Linear FS and LSLD, Quadratic FS and LSQD, Linear FS and MLP, under both training conditions.]
As in movie-based violence, the best results are obtained when the same database is used for the training and
test steps (solid line). The best option is the Quadratic FS and LSQD, followed by the Linear FS and LSLD,
and the same happens when training with the other database. In this way, the results are essentially the same
in both cases.
Table 4 represents the performance loss on the real world database when using different databases for
training and test, in the same way as was done for the films database.
It is possible to see that the LSLD is the best detector again, with a loss of only 4.46% on average. The
LSQD also performs well, especially for low probabilities (a 1.29% loss for a 2% false alarm rate and a 3.87%
loss for a 5% one). Regarding the neural network based detector, overfitting makes the average loss higher
than 26%.
If we compare the previous figures and tables, it can be deduced that Neural Networks are not recommended
when databases are crossed between the training and test steps, because the results show poor performance. It
is more reliable to use linear or quadratic detectors, depending on the probability of false alarm we are
interested in and the database used for training.
In order to compare the feature selection with both databases, Tables 5 and 6 show the most selected features
in the films database for the linear FS process and the quadratic FS process, ranked by the occurrence
percentage (Occ. %). Shaded rows indicate the features repeated in both the real and films databases.
Table 3: Comparative results for different database usage during training (test on the films database).

Classifier                          |      Linear       |     Quadratic     |        MLP
Pfa (%)                             |   2     5     10  |   2     5     10  |   2     5     10
Pd (%) - Films training, films test | 24.12 46.49 67.54 | 28.95 53.95 70.18 | 23.68 46.05 61.40
Pd (%) - Real training, films test  | 18.42 41.23 57.46 | 22.37 41.23 58.77 | 10.96 30.70 56.58
Loss (%)                            |  5.70  5.26 10.08 |  6.58 12.72 11.41 | 12.72 15.35  4.82
Average loss (%)                    |        7.01       |       10.24       |       10.96
Table 4: Comparative results for different database usage during training (test on the real world database).

Classifier                          |      Linear       |     Quadratic     |        MLP
Pfa (%)                             |   2     5     10  |   2     5     10  |   2     5     10
Pd (%) - Real training, real test   | 32.42 55.97 75.48 | 31.61 59.68 81.45 | 37.58 59.52 76.13
Pd (%) - Films training, real test  | 25.48 54.35 70.65 | 30.32 55.81 67.74 | 17.74 31.29 44.52
Loss (%)                            |  6.94  1.62  4.83 |  1.29  3.87 13.71 | 19.84 28.23 31.61
Average loss (%)                    |        4.46       |        6.29       |       26.56
Considering this information, we can appreciate that the most useful features are quite different for the two
databases. This is especially remarkable with the linear FS, where only 6 of the 20 features match, while in
the quadratic FS this number increases to 12. From this data it can be inferred that the linear FS is more
dependent on the database used in the feature selection process than the quadratic FS, which can successfully
use 12 features for the two databases.
Focusing on common features, the robustness of some of the features can be appreciated, such as MFCCs and
ΔMFCCs, pitch, short time energy or energy entropy. Concerning MFCCs and ΔMFCCs, 3 features match in the
databases for the LSLD and 4 for the LSQD, representing a large amount of the total features. It is noted
that most of these are calculated as standard deviation statistics. Pitch features are very important because
in the two detectors the mean and/or standard deviation appear with a 100% percentage of occurrence. In
respect of energy features, short time energy is relevant for the two detectors, while energy entropy ranks
highly (7th and 8th) only in the LSQD.
It is also of interest to point out that the feature proposed in (García-Gómez et al., 2016) ranks at the top
of the LSQD list and 10th in the LSLD list. Furthermore, the appearance of features related to the Harmonic
Noise Rate and the Spectrum in the films database is remarkable, since they were not selected with the real
one. In addition, results show that features like the spectral centroid, spectral rolloff or spectral flux
are not useful in films, in contrast to the real world situations.
If we compare Tables 5 and 6, we can appreciate that 14 of the total features are the same in both tables,
exactly the same number as was obtained in (García-Gómez et al., 2016) for the real database. This
demonstrates once again that many of the statistics can be applied in both quadratic and linear detectors.
Table 5: Summary of the selected features for the LSLD.
No. Measure Statistic Occ. (%)
1 MFCC 4 Mean 100.00
2 Pitch Mean 100.00
3 MFCC 3 Std 98.20
4 ZCR Std 95.50
5 MFCC 1 Std 93.69
6 Pitch Std 93.69
7 MFCC 2 Mean 90.09
8 MFCC 2 Std 90.09
9 MFCC 5 Std 88.29
10 RUF - 86.49
11 SP Mean 84.68
12 MFCC 3 Mean 75.68
13 STE Mean 62.16
14 HNR Mean 61.26
15 MFCC 4 Std 60.36
16 EE Maximum 57.66
17 STE Std 38.74
18 MFCC 5 Mean 37.84
19 MFCC 3 Median 36.94
20 EE Max/Median 32.43
4 CONCLUSION

The purpose of this paper is to examine the viability of violence detection on real audio recordings with a
system trained using fictional data, and vice versa. This differentiation is made because the recording
conditions and the audio preprocessing differ from one environment to the other. This approach could be
interesting in cases where there is not enough available data for a given scenario. Another possible
application could be to transfer research efforts validated in one environment to another one.

The results with database crossover bring us to a similar conclusion: an increase in classifier complexity
implies a higher performance loss. Specifically, linear detectors work better than quadratic detectors, and
quadratic detectors better than those based on neural networks.
Table 6: Summary of the selected features for the LSQD.
No. Measure Statistic Occ. (%)
1 Pitch Mean 100.00
2 Pitch Std 100.00
3 RUF - 100.00
4 MFCC 4 Mean 99.10
5 MFCC 5 Std 97.30
6 HNR Mean 84.68
7 EE Std 84.68
8 EE Max/Median 83.78
9 SP Mean 83.78
10 MFCC 3 Mean 82.88
11 MFCC 1 Std 78.38
12 ZCR Max/Mean 78.38
13 STE Std 66.67
14 MFCC 5 Std 55.86
15 EE Mean 52.25
16 MFCC 1 Std 51.35
17 STE Mean 48.65
18 MFCC 3 Std 46.85
19 MFCC 1 Mean 43.24
20 MFCC 5 Mean 37.84
This can be explained by the fact that the loss of generalization is directly related to overfitting
tendencies. In that way, neural networks can work better for a specific environment (real or fictional), or
when a single database is used for training and test. However, they are not able to obtain good results when
the databases are crossed.
Future work will focus on using other types of classifiers and on testing the system with different databases
(e.g. videogames). The use of additional features and statistics will also be explored.
ACKNOWLEDGEMENTS

This work has been funded by the Spanish Ministry of Economy and Competitiveness (under project
TEC2015-67387-C4-4-R, funds Spain/FEDER) and by the University of Alcalá (under project CCG2015/EXP-056).
REFERENCES

Chen, L.-H., Hsu, H.-W., Wang, L.-Y., and Su, C.-W. (2011). Violence detection in movies. In Computer
Graphics, Imaging and Visualization (CGIV), 2011 Eighth International Conference on, pages 119–124. IEEE.

Demarty, C.-H., Penet, C., Gravier, G., and Soleymani, M. (2012). The MediaEval 2012 affect task: violent
scenes detection. In Working Notes Proceedings of the MediaEval 2012 Workshop.

Doukas, C. N. and Maglogiannis, I. (2011). Emergency fall incidents detection in assisted living environments
utilizing motion, sound, and visual perceptual components. IEEE Transactions on Information Technology in
Biomedicine, 15(2):277–289.

García-Gómez, J., Bautista-Durán, M., Gil-Pita, R., Mohino-Herranz, I., and Rosa-Zurera, M. (2016). Violence
detection in real environments for smart cities. In Ubiquitous Computing and Ambient Intelligence: 10th
International Conference, UCAmI 2016, San Bartolomé de Tirajana, Gran Canaria, Spain, November 29–December 2,
2016, Part II, pages 482–494. Springer.

Giannakopoulos, T., Kosmopoulos, D., Aristidou, A., and Theodoridis, S. (2006). Violence content
classification using audio features. In Hellenic Conference on Artificial Intelligence, pages 502–507.
Springer.

Gil-Pita, R., López-Garrido, B., and Rosa-Zurera, M. (2015). Tailored MFCCs for sound environment
classification in hearing aids. In Advanced Computer and Communication Engineering Technology, pages
1037–1048. Springer.

Jalil, M., Butt, F. A., and Malik, A. (2013). Short-time energy, magnitude, zero crossing rate and
autocorrelation measurement for discriminating voiced and unvoiced segments of speech signals. In
Technological Advances in Electrical, Electronics and Computer Engineering (TAEECE), 2013 International
Conference on, pages 208–212. IEEE.

Krug, E. G., Mercy, J. A., Dahlberg, L. L., and Zwi, A. B. (2002). The world report on violence and health.
The Lancet, 360(9339):1083–1088.

Mohino, I., Gil-Pita, R., and Alvarez, L. (2011). Stress detection through emotional speech analysis.
Springer.

Mohino, I., Goni, M., Alvarez, L., Llerena, C., and Gil-Pita, R. (2013). Detection of emotions and stress
through speech analysis. In Proceedings of Signal Processing, Pattern Recognition and Applications 2013,
Innsbruck, Austria, pages 12–14.

Nam, J., Alghoniemy, M., and Tewfik, A. H. (1998). Audio-visual content-based violent scene characterization.
In Image Processing, 1998. ICIP 98. Proceedings. 1998 International Conference on, volume 1, pages 353–357.
IEEE.

Tzanetakis, G. and Cook, P. (2002). Musical genre classification of audio signals. IEEE Transactions on
Speech and Audio Processing, 10(5):293–302.

Xu, M., Chia, L.-T., and Jin, J. (2005). Affective content analysis in comedy and horror videos by audio
emotional event detection. In 2005 IEEE International Conference on Multimedia and Expo, pages 4 pp. IEEE.