Fast Violence Detection in Video
Oscar Deniz 1, Ismael Serrano 1, Gloria Bueno 1 and Tae-Kyun Kim 2
1 VISILAB group, University of Castilla-La Mancha, E.T.S.I. Industriales,
Avda. Camilo Jose Cela s/n, Ciudad Real, 13071 Spain
2 Department of Electrical and Electronic Engineering, Imperial College,
South Kensington Campus, London SW7 2AZ, U.K.
Keywords: Action Recognition, Violence Detection, Fight Detection.
Abstract:
Whereas the action recognition problem has become a hot topic within computer vision, the detection of fights
or, in general, aggressive behavior has been comparatively less studied. Such capability may be extremely useful
in some video surveillance scenarios like in prisons, psychiatric centers or even embedded in camera phones.
Recent work has considered the well-known Bag-of-Words framework often used in generic action recognition
for the specific problem of fight detection. Under this framework, spatio-temporal features are extracted from
the video sequences and used for classification. Despite encouraging results, with accuracy rates of nearly 90%
achieved for this specific task, the computational cost of extracting such features is prohibitive for
practical applications, particularly in surveillance and media rating systems. The task of violence detection
may have, however, specific features that can be leveraged. Inspired by psychology results that suggest that
kinematic features alone are discriminant for specific actions, this work proposes a novel method which uses
extreme acceleration patterns as the main feature. These extreme accelerations are efficiently estimated by
applying the Radon transform to the power spectrum of consecutive frames. Experiments show that accuracy
improvements of up to 12% are achieved with respect to state-of-the-art generic action recognition methods.
Most importantly, the proposed method is at least 15 times faster.
1 INTRODUCTION
In recent years, the problem of human action recog-
nition from video has become tractable using com-
puter vision techniques; see for example the sur-
vey (Poppe, 2010). Despite its potential useful-
ness, the specific task of violent action detection has
been comparatively less studied. A violence detector
has, however, immediate applicability in the surveil-
lance domain. The primary function of large-scale
surveillance systems deployed in institutions such
as schools, prisons and psychiatric care facilities is
to alert authorities to potentially dangerous situ-
ations. However, human operators are overwhelmed
by the number of camera feeds, and manual response
times are slow, resulting in a strong demand for auto-
mated alert systems. Similarly, there is increasing de-
mand for automated rating and tagging systems that
can process the great quantities of video uploaded to
websites. Violence detection is becoming important
not only on an application level but also on a more sci-
entific level, because it has particularities that make
it different from generic action recognition. For all
these reasons the interest in violence detection has
been steadily growing, and different proposals are al-
ready being published in major journals and confer-
ences. Also, public datasets specifically designed for
this task are becoming increasingly available.
One of the first proposals for violence recognition
in video is that of Nam et al. (Nam et al., 1998), who
proposed recognizing violent scenes in videos using
flame and blood detection and capturing the degree
of motion, as well as the characteristic sounds of vi-
olent events. Cheng et al. (Cheng et al., 2003) rec-
ognize gunshots, explosions and car-braking in au-
dio using a hierarchical approach based on Gaussian
mixture models and Hidden Markov models (HMM).
Giannakopoulos et al. (Giannakopoulos et al., 2006)
also propose a violence detector based on audio fea-
tures. Clarin et al. (Clarin et al., 2005) present a sys-
tem that uses a Kohonen self-organizing map to detect
skin and blood pixels in each frame and motion inten-
sity analysis to detect violent actions involving blood.
Zajdel et al. (Zajdel et al., 2007) introduced the CAS-
SANDRA system, which employs motion features re-
lated to articulation in video and scream-like cues in
audio to detect aggression in surveillance videos.
More recently, Gong et al. (Gong et al., 2008)
propose a violence detector using low-level visual and
auditory features and high-level audio effects to iden-
tify potential violent content in movies. Chen et al.
(Chen et al., 2008) use binary local motion descriptors
(spatio-temporal video cubes) and a bag-of-words ap-
proach to detect aggressive behaviors. Lin and Wang
(Lin and Wang, 2009) describe a weakly-supervised
audio violence classifier combined using co-training
with a motion, explosion and blood video classifier
to detect violent scenes in movies. Giannakopou-
los et al. (Giannakopoulos et al., 2010) present a
method for violence detection in movies based on
audio-visual information that uses statistics of audio
features together with average motion and motion ori-
entation variance features in video, combined in a k-Nearest
Neighbor classifier to decide whether the given se-
quence is violent. Chen et al. (Chen et al., 2011) pro-
pose a method based on motion and on detecting faces
and nearby blood. Violence detection has even been
approached using static images (Wang et al., 2012).
Also recently, (Zou et al., 2012) approached the prob-
lem within the context of video sharing sites by us-
ing textual tags along with audio and video. Further
proof of the growing interest is the MediaEval Affect
Task, a competition that aims at detecting violence
in color movies (Demarty et al., 2012). In this case
the algorithms have access to additional information
such as audio, subtitles and previously-annotated con-
cepts. Moreover, no comparisons of computational
times are made.
In summary, a number of previous works require
audio cues for detecting violence or rely on color to
detect cues such as blood. In this respect, we note
that there are important applications, particularly in
surveillance, where audio and color are not available.
Besides, while explosions, blood and running may be
useful cues for violence in action movies, they are rare
in real-world situations. In any case, violence detec-
tion per se is an extremely difficult problem, since vi-
olence is a subjective concept. Fight detection, by
contrast, is a specific violence-related task that
may be tackled using action recognition techniques
and which has immediate applications.
Whereas there are a number of well-studied
datasets for action recognition, significant datasets
with violent actions (fights) have not been made avail-
able until the work (Bermejo et al., 2011). In that
work the authors demonstrated encouraging results
on violence detection, achieving 90% accuracy us-
ing MoSIFT features ((Chen et al., 2010)). MoSIFT
descriptors are obtained from salient points in two
parts: the first is an aggregated histogram of gra-
dients (HoG), which describes the spatial appearance.
The second part is an aggregated histogram of optical
flow (HoF) which indicates the movement of the fea-
ture point. Despite being considered within state-of-
the-art action recognition methods, the computational
cost of extracting these features is prohibitively large,
taking nearly 1 second per frame on a high-end laptop.
This precludes use in practical applications, where
many camera streams may have to be processed in
real-time. Such cost is also a major problem when the
objective is to embed a fight detection functionality
into a smart camera (i.e. going from the extant em-
bedded motion detection to embedded violent motion
detection).
Features such as MoSIFT encode both motion and
appearance information. However, research on hu-
man perception of other’s actions (using point-light
displays, see Figure 1) has shown that the kinematic
pattern of movement is sufficient for the perception
of actions (Blake and Shiffrar, 2007). This same idea
has also been supported by research on the computer
vision side (Oshin et al., 2011; Bobick and Davis,
1996). More specifically, empirical studies in the
field have shown that relatively simple dynamic fea-
tures such as velocity and acceleration correlate to
emotional attributes perceived from the observed ac-
tions (Saerbeck and Bartneck, 2010; Clarke et al.,
2005; Castellano et al., 2007; Hidaka, 2012), albeit
the degree of correlation varies for different emotions.
Thus, features such as acceleration and jerkiness tend
to be associated with emotions with high activation (e.g.
anger, happiness), whereas slow and smooth move-
ments are more likely to be judged as emotions with
low activation (e.g. sadness).
In this context, this work assumes that fights in
video can be reliably detected by such kinematic cues
that represent violent motion and strokes. Since ex-
treme accelerations play a key role, we propose a novel
method to infer them in an efficient way. The pro-
posed fight detector attains better accuracy rates than
state-of-the-art action recognition methods at much
less computational cost. The paper is organized as
follows. Section 2 describes the proposed method.
Section 3 provides experimental results. Finally, in
Section 4 the main conclusions are outlined.
2 PROPOSED METHOD
As mentioned above, the presence of large acceler-
ations is key in the task of violence recognition. In
this context, body part tracking can be considered,
as in (Datta et al., 2002), which introduced the so-
called Acceleration Measure Vectors (AMV) for vi-
olence detection. In general, acceleration can be in-
ferred from tracked point trajectories. However, we
have to note that extreme acceleration implies image
blur (see for example Figure 2), which makes tracking
less precise or even impossible.

Figure 1: Three frames in a point-light display movie de-
picting a karate kick.
Motion blur entails a shift in image content to-
wards low frequencies. Such behavior makes it possible to build
an efficient acceleration estimator for video. First,
we compute the power spectrum of two consecutive
frames. It can be shown that, when there is a sud-
den motion between the two frames, the power spec-
trum image of the second frame will depict an ellipse
(Barlow and Olshausen, 2004). The orientation of the
ellipse is perpendicular to the motion direction, with the
frequencies outside the ellipse attenuated; see
Figure 3. Most importantly, the eccentricity of this el-
lipse is dependent on the acceleration. Basically, the
proposed method aims at detecting the sudden pres-
ence of such an ellipse. In the following, the method is
described in detail.
Figure 2: Two consecutive frames in a fight clip from a
movie. Note the blur on the left side of the second frame.
Let I_{i-1} and I_i be two consecutive frames. Motion
blur is equivalent to applying a low-pass oriented filter
C:

\mathcal{F}(I_i) = \mathcal{F}(I_{i-1}) \cdot C    (1)

where \mathcal{F}(\cdot) denotes the Fourier Transform. Then:

C = \frac{\mathcal{F}(I_i)}{\mathcal{F}(I_{i-1})}    (2)

The low-pass oriented filter in C is the above-
mentioned ellipse.

Figure 3: Left: Sample image. Center: simulated camera
motion at 45°. Right: Fourier transform of the center image.
For each pair of consecutive frames, we compute
the power spectrum using the 2D Fast Fourier Trans-
form (in order to avoid edge effects, a Hanning win-
dow was applied before computing the FFT). Let us
call these power spectrum images P_{i-1} and P_i. Next,
we simply compute the image:

C = \frac{P_i}{P_{i-1}}    (3)
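As an illustration of Eq. (3), the spectral ratio image can be computed with a few lines of NumPy. The following is only a sketch written for this text, not the authors' implementation; the function name and the eps regularizer are our own additions, and frames are assumed to be 2D gray-scale float arrays.

import numpy as np

def spectral_ratio(frame_prev, frame_curr, eps=1e-8):
    # Both inputs are gray-scale frames of equal size (e.g. floats in [0, 1]).
    n, m = frame_curr.shape
    hann = np.outer(np.hanning(n), np.hanning(m))  # 2D Hanning window
    # Power spectra of the windowed frames (numerator and denominator of Eq. 3).
    p_prev = np.abs(np.fft.fftshift(np.fft.fft2(frame_prev * hann))) ** 2
    p_curr = np.abs(np.fft.fftshift(np.fft.fft2(frame_curr * hann))) ** 2
    # eps (our addition) avoids division by zero at strongly attenuated frequencies.
    return p_curr / (p_prev + eps)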
When there is no change between the two frames,
the power spectra will be equal and C will have a
constant value. When motion has occurred, an el-
lipse will appear in C. Our objective is then to detect
such an ellipse and estimate its eccentricity, which repre-
sents the magnitude of the acceleration. Ellipse detec-
tion can be reliably performed using the Radon trans-
form, which provides image projections along lines
with different orientations; see Figure 4.
After computing the Radon transform image R,
its vertical maximum projection vector vp is obtained
and normalized to maximum value 1 (see Figure 4-
bottom). When there is an ellipse in C, this vector
will show a sharp peak, representing the major axis of
the ellipse. The kurtosis K of this vector is therefore
taken as an estimate of the acceleration.
Note that kurtosis alone cannot be used as a mea-
sure, since it is obtained from a normalized vector
(i.e. it is dimensionless). Thus, the average power per
pixel P of image C is also computed and taken as an
additional feature. Without it, any two frames could
lead to high kurtosis even without significant motion.
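A possible sketch of this step, using scikit-image's radon and SciPy's kurtosis (our own illustrative code, not the paper's): it projects the ratio image C at a set of angles, takes the vertical maximum projection, normalizes it to a maximum of 1, and returns the kurtosis K together with the average power per pixel P. The angle step and the Pearson (fisher=False) kurtosis convention are assumptions.

import numpy as np
from scipy.stats import kurtosis
from skimage.transform import radon

def acceleration_features(C, angle_step=20.0):
    # Radon transform of the spectral ratio image at angles in [0, 180).
    theta = np.arange(0.0, 180.0, angle_step)
    R = radon(C, theta=theta, circle=False)   # one column per angle
    vp = R.max(axis=0)                        # vertical maximum projection
    vp = vp / vp.max()                        # normalize to maximum value 1
    K = kurtosis(vp, fisher=False)            # sharp peak -> large K (acceleration)
    P = C.mean()                              # average power per pixel of C
    return K, P, vp, theta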
VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications
480
Figure 4: Top: Radon transform image of Figure 3-left un-
der a simulated camera motion at 98°. The horizontal axis
represents angles between 0° and 180°. Bottom: vertical
projection of the Radon image.
The previous paragraphs have described a proce-
dure that obtains two features K and P for each pair
of consecutive frames. Deceleration was also con-
sidered as an additional feature, and it can be ob-
tained by swapping the consecutive frames and ap-
plying the same algorithm explained above. For video
sequences, we compute histograms of these features,
so that acceleration/deceleration patterns can be in-
ferred.
In a variant of the proposed method, ellipse eccen-
tricity can be estimated by first locating the position p
of the maximum of vp. This maximum is associated
with the major axis of the ellipse. The minor axis is
then located at position:

q = p + 90°   if (p + 90°) ≤ 180°
q = p - 90°   otherwise

The ratio of the two values may then be used as a
feature, instead of the kurtosis:

r = \frac{vp(p)}{vp(q)}    (4)
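The ratio-based variant is equally simple. Below is a hedged sketch, assuming vp and theta are the normalized projection and its angle grid (e.g. as returned by the previous sketch); the 90-degree offset is handled with modular index arithmetic, which is equivalent to the two cases above, and the rounding of 90°/∆θ and the eps guard are our own simplifications.

import numpy as np

def eccentricity_ratio(vp, theta, eps=1e-8):
    # Major axis: angle with the largest (normalized) projection value.
    p = int(np.argmax(vp))
    # Minor axis lies 90 degrees away, wrapping around at 180 degrees.
    step = float(theta[1] - theta[0])
    q = (p + int(round(90.0 / step))) % len(theta)
    return vp[p] / (vp[q] + eps)              # Eq. (4)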
Algorithm 1 shows the detailed steps of the pro-
posed method.
Since the proposed method does not involve track-
ing or optical-flow techniques, it is more suitable for
measuring extreme accelerations.
Input: S = (short) sequence of gray-scale images. Each image in S is
       denoted as f_{x,y,t}, where x = 1,2,...,N, y = 1,2,...,M and
       t = 1,2,...,T.
Result: 3 · n_bins discriminant features
for t = 1 to T do
   1. Apply a Hanning window to f_{x,y,t}:
         g_{x,y,t} = f_{x,y,t} · H_{x,y}
      where H_{x,y} = h(N) · h(M)^T and h is a column vector given by:
         h(L) = \frac{1}{2} \left[ 1 - \cos\left( 2\pi \frac{l}{L} \right) \right], for l = 1,2,...,L
   2. Apply the FFT to g_{x,y,t}: F_{v,w,t} = \mathcal{F}(g_{x,y,t})
   3. Compute C_{v,w} = F_{v,w,t} / F_{v,w,t-1}
   4. Compute the Radon transform of C: R_{d,\theta} = \mathcal{R}(C_{v,w})
   5. Compute the vertical max projection of R: p_\theta = max_d(R_{d,\theta})
   6. Normalize p_\theta = p_\theta / m, where m = max_\theta(p_\theta)
   7. Compute feature A_t = Kurtosis(p_\theta)
   8. Compute feature P_t = mean_{v,w}(C_{v,w})
   9. Compute feature D_t (deceleration) using the same steps above but
      swapping t and t-1
end
return Histogram(A, n_bins), Histogram(P, n_bins), Histogram(D, n_bins)
Algorithm 1: Algorithm for computing the main features in the proposed method.
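For completeness, the histogram pooling in the last line of Algorithm 1 could look like the sketch below (our own illustration). It assumes the per-pair features A, P and D have already been computed for a clip, for example with functions such as those sketched earlier; the length normalization and the fixed bin ranges are our choices, not values from the paper.

import numpy as np

def clip_descriptor(A, P, D, n_bins=4, ranges=((0, 50), (0, 10), (0, 50))):
    # A, P, D: 1D arrays with one value per pair of consecutive frames.
    # 'ranges' fixes the bin edges so that descriptors are comparable
    # across clips (the actual ranges here are placeholders).
    hists = []
    for values, value_range in zip((A, P, D), ranges):
        h, _ = np.histogram(values, bins=n_bins, range=value_range)
        hists.append(h / max(len(values), 1))  # length-normalized counts
    return np.concatenate(hists)               # 3 * n_bins features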
Lastly, it is important to note that global (i.e. camera) motion could
also cause blur in the image. In order to remove such
blur, it is necessary to perform a deconvolution pre-
processing step. The phase correlation technique is
first used to infer global motion between each pair of
consecutive frames. If global motion is detected, the
estimated angle and length of the displacement are used
to form a PSF with which to perform deconvolution
of the second frame (we used the Lucy-Richardson
iterative deconvolution method). This is intended to
remove the blur caused by global motion (camera mo-
tion), while any local blurs will remain. The method
described above is then applied to the pair of frames
as shown in Algorithm 1 above.
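A possible sketch of this preprocessing step, under our own assumptions: the inter-frame shift is estimated with plain FFT-based phase correlation, a line-shaped motion-blur PSF is built from the estimated displacement, and the second frame is deblurred with scikit-image's Richardson-Lucy routine. The PSF construction and the number of iterations are our choices, not taken from the paper.

import numpy as np
from skimage.restoration import richardson_lucy

def phase_correlation_shift(frame_prev, frame_curr):
    # Classic FFT-based phase correlation; returns the (dy, dx) shift in pixels.
    cross = np.fft.fft2(frame_prev) * np.conj(np.fft.fft2(frame_curr))
    corr = np.fft.ifft2(cross / (np.abs(cross) + 1e-8))
    dy, dx = np.unravel_index(np.argmax(np.abs(corr)), corr.shape)
    h, w = frame_prev.shape
    dy = dy - h if dy > h // 2 else dy   # map to signed displacements
    dx = dx - w if dx > w // 2 else dx
    return int(dy), int(dx)

def motion_blur_psf(dy, dx):
    # Line-shaped PSF along the estimated displacement (a simple choice).
    length = max(int(round(np.hypot(dy, dx))), 1)
    psf = np.zeros((2 * length + 1, 2 * length + 1))
    for i in range(length + 1):
        psf[length + int(round(i * dy / length)),
            length + int(round(i * dx / length))] = 1.0
    return psf / psf.sum()

def compensate_global_motion(frame_prev, frame_curr, num_iter=10):
    # frame_* are gray-scale floats in [0, 1].
    dy, dx = phase_correlation_shift(frame_prev, frame_curr)
    if dy == 0 and dx == 0:
        return frame_curr                # no global motion detected
    psf = motion_blur_psf(dy, dx)
    # Iteration count passed positionally: the keyword name differs
    # across scikit-image versions ('iterations' vs 'num_iter').
    return richardson_lucy(frame_curr, psf, num_iter)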
When backgrounds are relatively uniform and dis-
placements small, global motion estimation may still
fail. The typical failure mode is real global motion
that goes undetected, i.e. an incorrect (0,0) displace-
ment is estimated. Since the proposed method is
heavily affected by global motion,
FastViolenceDetectioninVideo
481
further measures must be taken in practice to at least
detect the presence of global motion versus local mo-
tion. The window function mentioned above restricts
processing to the inner part of the image. It is reason-
able to assume that, when motion is global, changes
in the outer part of the image will be relatively on
par with those in the inner part. Thus, an additional
’Outer Energy’ O feature was computed and used in
the same way as the others:
O_{x,y,t} = \frac{| f_{x,y,t} - f_{x,y,t-1} | \cdot (1 - H_{x,y})}{M \cdot N}    (5)
The mean and standard deviation of O are then
used as additional features.
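A minimal sketch of Eq. (5) and the two derived features follows (our own code; it reflects one reading of the equation, in which the weighted difference is aggregated over pixels for each frame pair and the mean and standard deviation are then taken over time).

import numpy as np

def outer_energy(frame_prev, frame_curr):
    # Absolute frame difference weighted by the complement of the Hanning
    # window, emphasizing the outer part of the image (Eq. 5), averaged
    # over the M x N pixels.
    n, m = frame_curr.shape
    hann = np.outer(np.hanning(n), np.hanning(m))
    return (np.abs(frame_curr - frame_prev) * (1.0 - hann)).sum() / (n * m)

def outer_energy_features(frames):
    # frames: list of consecutive gray-scale frames of a clip.
    O = [outer_energy(a, b) for a, b in zip(frames[:-1], frames[1:])]
    return np.mean(O), np.std(O)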
3 EXPERIMENTS
The work (Bermejo et al., 2011) introduced the first
two datasets explicitly designed for assessing fight de-
tection. The first dataset (“Hockey”) consists of 1000
clips at a resolution of 720x576 pixels, divided into
two groups, 500 fights (see Fig. 5 top) and 500 non-
fights, extracted from hockey games of the National
Hockey League (NHL). Each clip was limited to 50
frames and its resolution lowered to 320x240 pixels.
ond dataset (“Movies”) introduced in (Bermejo et al.,
2011) consists of 200 video clips, with the fight clips
extracted from action movies (see Figure 5 bottom).
The non-fight videos were extracted from public ac-
tion recognition datasets. Unlike the hockey dataset,
which was relatively uniform both in format and con-
tent, these videos depicted a wider variety of scenes
and were captured at different resolutions.
In the experiments, the Radon transform was com-
puted between 0° and 180° in steps of ∆θ = 20 degrees.
4-bin histograms were computed for each of the three
main features (acceleration, deceleration and power,
see the previous Section). The results measured us-
ing 10-fold cross-validation are shown in Table 1.
For convenience we also show the results reported in
(Bermejo et al., 2011), which used an SVM classifier.
In (Bermejo et al., 2011) STIP features performed
poorly on the Movie dataset and so MoSIFT was con-
sidered the best descriptor. MoSIFT’s superiority has
also been shown in other action recognition works.
The proposed method gives roughly equivalent accu-
racy and AUC for the Hockey dataset, whereas it im-
proves on the Movie dataset by 9%.
Since the proposed method is based on extreme
acceleration patterns, energetic actions may pose a
problem. However, the method performs quite well
in this respect, as evidenced in the Hockey dataset re-
sults.

Figure 5: Sample fight videos from the Hockey (top) dataset
and the action movie (bottom) dataset.

Although the Hockey dataset may represent the
most difficult dataset for a fight detector, in practice
we aim at separating fights from other actions. Con-
sequently, a more challenging dataset was also con-
sidered. The UCF101 (Soomro et al., 2012) is a data
set of realistic action videos collected from YouTube,
with 101 action categories. UCF101 (see Figure
6) offers the largest diversity in terms of actions and,
with its large variations in camera motion, object
appearance and pose, object scale, viewpoint, cluttered
backgrounds and illumination conditions, it is the most
challenging dataset to date. For
our case, it is even more challenging since it includes
50 actions from sports. To our knowledge, this is the
largest and most challenging dataset on which a fight
detection algorithm has been tested.
In the experiments with UCF101, for the fight
set we pooled the fight clips of both the Hockey
and Movies datasets plus two of the 101 UCF ac-
tions that actually represented fights (“Punching” and
“Sumo”). This gave a total of 1843 fight clips. Non-
fight clips were taken from the other 99 action cate-
gories (42278 non-fight clips, totaling approximately
2 million frames). In order to avoid unbalanced sets,
we used randomly chosen subsets of 500 fight and 500
non-fight clips. For each subset we performed a 10-
fold cross-validation. This was in turn repeated 10
times.
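The evaluation protocol just described (balanced random subsets of 500 clips per class, 10-fold cross-validation, repeated 10 times) can be reproduced along the lines of the following scikit-learn sketch; classifier hyperparameters were not specified above and are left at library defaults here.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

def evaluate(X_fight, X_nonfight, subset_size=500, repeats=10, seed=0):
    # X_fight / X_nonfight: NumPy arrays of per-clip descriptors (one row per clip).
    rng = np.random.default_rng(seed)
    scores = {"SVM": [], "AdaBoost": []}
    for _ in range(repeats):
        # Balanced random subsets of 'subset_size' clips per class.
        f = X_fight[rng.choice(len(X_fight), subset_size, replace=False)]
        n = X_nonfight[rng.choice(len(X_nonfight), subset_size, replace=False)]
        X = np.vstack([f, n])
        y = np.concatenate([np.ones(len(f)), np.zeros(len(n))])
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
        scores["SVM"].append(cross_val_score(SVC(), X, y, cv=cv).mean())
        scores["AdaBoost"].append(
            cross_val_score(AdaBoostClassifier(), X, y, cv=cv).mean())
    return {k: float(np.mean(v)) for k, v in scores.items()}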
VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications
482
Table 1: Results on the Hockey and Movies datasets, 5 runs
of 10-fold cross-validation. Note: A stands for accuracy,
DR stands for detection rate, FR stands for false positive
rate, AUC stands for area under the (ROC) curve. In bold
are shown the best accuracies for each dataset and classifier.

                                          Dataset
Features      Classifier  Measure   Movies          Hockey
BoW(STIP)     SVM         A         82.5 ± 1.12     88.6 ± 0.15
                          DR        83.4 ± 1.14     93.6 ± 0.22
                          FR        18.4 ± 1.14     16.5 ± 0.18
                          AUC       0.8844          0.9383
              Adaboost    A         74.3 ± 2.31     86.5 ± 0.19
                          DR        70.2 ± 3.70     88.7 ± 0.76
                          FR        21.6 ± 3.21     15.7 ± 0.52
                          AUC       0.8121          0.9220
BoW(MoSIFT)   SVM         A         84.2 ± 1.15     91.2 ± 0.24
                          DR        100 ± 0         92 ± 0.17
                          FR        31.6 ± 2.30     9.6 ± 0.41
                          AUC       0.9267          0.9547
              Adaboost    A         86.5 ± 1.58     89.5 ± 0.40
                          DR        99.6 ± 0.55     90.1 ± 0.88
                          FR        26.6 ± 3.05     11.1 ± 0.27
                          AUC       0.9518          0.9492
Proposed      SVM         A         85.4 ± 9.33     90.1 ± 0
                          DR        71.4 ± 19.42    80.2 ± 0
                          FR        0.8 ± 0.83      0 ± 0
                          AUC       0.7422          0.9480
              Adaboost    A         98.9 ± 0.22     90.1 ± 0
                          DR        97.8 ± 0.45     80.2 ± 0
                          FR        0.0 ± 0.0       0 ± 0
                          AUC       0.9999          0.9020
Table 2: Results on the UCF101 dataset. Note: A stands for
accuracy, DR stands for detection rate, FR stands for false
positive rate, AUC stands for area under the (ROC) curve.
In bold are shown the best accuracies for each classifier.

Features      Classifier  Measure   UCF101
BoW(STIP)     SVM         A         72 ± 1.78
                          DR        86.2 ± 1.83
                          FR        42.2 ± 3.26
                          AUC       0.7352
              Adaboost    A         63.4 ± 2.39
                          DR        75.3 ± 3.60
                          FR        48.5 ± 4.40
                          AUC       0.6671
BoW(MoSIFT)   SVM         A         81.3 ± 0.78
                          DR        90.8 ± 1.34
                          FR        28.1 ± 1.88
                          AUC       0.8715
              Adaboost    A         51.3 ± 0.32
                          DR        100 ± 0
                          FR        97.4 ± 0.64
                          AUC       0.5340
Proposed      SVM         A         93.4 ± 6.09
                          DR        87.3 ± 11.12
                          FR        0.45 ± 1.28
                          AUC       0.9439
              Adaboost    A         92.8 ± 6.29
                          DR        85.7 ± 12.53
                          FR        0.02 ± 0.06
                          AUC       0.9379
For BoW(MoSIFT), and even with the use of
parallel K-means, extracting vocabularies from the
whole dataset was infeasible. Therefore, a random
subset of samples was first selected (600 of each
class) and then a vocabulary of size 500 (the best
vocabulary size in (Bermejo et al., 2011)) was com-
puted. The results are shown in Table 2. Figure
7 shows the ROC curve obtained for both methods
with the SVM classifier. These results suggest that
the method may effectively work as a fight detector
for generic settings. Global motion estimation exper-
iments did not seem to improve results significantly
in this case either.
Note that the results show the higher detection rate
already hypothesized in Section 2. This is evidenced
by the resulting ROC curves lying closer to the vertical axis
for the proposed method.
Table 3 shows the number of features used for
classification and the measured computational cost
of feature extraction. The code for both STIP
and MoSIFT was compiled. The code for the pro-
posed method was interpreted and used no paralleliza-
tion. These results show an improvement in speed
of roughly 15 times with respect to the best previous
method (MoSIFT). The fact that only 14 features are
necessary (MoSIFT used 500) is an additional advan-
tage for practical implementations.
Table 3: Feature extraction times. Average times measured
with the non-fight videos in the UCF101 dataset, on an Intel
Xeon computer with 2 processors at 2.90 GHz.

Method             Secs/frame
MoSIFT             0.6615
STIP               0.2935
Proposed Method    0.0419
4 CONCLUSIONS
Based on the observation that kinematic information
may suffice for human perception of others' actions,
in this work a novel detection method is proposed
which uses extreme acceleration patterns as the main
discriminating feature. The method shows promise
for surveillance scenarios, and it also per-
forms relatively well when considering challenging
actions such as those that occur in sports. Accu-
racy improvements of up to 12% with respect to state-
of-the-art generic action recognition techniques were
achieved. We hypothesize that when motion is suffi-
cient for recognition, appearance not only requires sig-
nificant additional computation but may also con-
fuse the detector. Another interpretation is that a sort
of overfitting may be occurring in that case. In any
case, the extreme acceleration estimation proposed
seems to perform well, given that other methods may
fail because of the associated image blur.
The proposed method makes no assumptions about
the number of individuals (it can also be used to de-
tect vandalism) and does not require body part detec-
tion or salient point tracking. Besides, it is at least 15
times faster and uses only 14 features, which opens up
the possibility of practical implementations. When
maximum accuracy is needed, the method could also
act as a first attentional stage in a cascade framework
that also uses STIP or MoSIFT features.

Figure 6: The 101 actions in UCF101 shown with one sample frame.

Figure 7: ROC curve with the SVM classifier. Average of
10 experimental runs.
Future work will seek to speed up the method by ap-
proximating the Radon transform, which is the most
time-consuming stage. On a more basic level, we
shall investigate the implications with regard to the
relative importance of motion and appearance infor-
mation for the recognition of certain actions.
ACKNOWLEDGEMENTS
This work has been supported by research project
TIN2011-24367 from the Spanish Ministry of Econ-
omy and Competitiveness.
VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications
484
REFERENCES
Barlow, H. B. and Olshausen, B. A. (2004). Convergent ev-
idence for the visual analysis of optic flow through
anisotropic attenuation of high spatial frequencies.
Journal of Vision, 4(6):415–426.
Bermejo, E., Deniz, O., Bueno, G., and Sukthankar, R.
(2011). Violence detection in video using computer
vision techniques. In 14th Int. Congress on Computer
Analysis of Images and Patterns, pages 332–339.
Blake, R. and Shiffrar, M. (2007). Perception of Human
Motion. Annual Review of Psychology, 58(1):47–73.
Bobick, A. and Davis, J. (1996). An appearance-based rep-
resentation of action. In Proceedings of the 13th In-
ternational Conference on Pattern Recognition, vol-
ume 1, pages 307–312.
Castellano, G., Villalba, S., and Camurri, A. (2007). Recog-
nising human emotions from body movement and ges-
ture dynamics. In Paiva, A., Prada, R., and Picard, R.,
editors, Affective Computing and Intelligent Interac-
tion, volume 4738 of Lecture Notes in Computer Sci-
ence, pages 71–82. Springer Berlin Heidelberg.
Chen, D., Wactlar, H., Chen, M., Gao, C., Bharucha, A.,
and Hauptmann, A. (2008). Recognition of aggressive
human behavior using binary local motion descrip-
tors. In Engineering in Medicine and Biology Society,
pages 5238–5241.
Chen, L.-H., Su, C.-W., and Hsu, H.-W. (2011). Violent
scene detection in movies. IJPRAI, 25(8):1161–1172.
Chen, M.-y., Mummert, L., Pillai, P., Hauptmann, A., and
Sukthankar, R. (2010). Exploiting multi-level paral-
lelism for low-latency activity recognition in stream-
ing video. In MMSys ’10: Proceedings of the first
annual ACM SIGMM conference on Multimedia sys-
tems, pages 1–12, New York, NY, USA. ACM.
Cheng, W.-H., Chu, W.-T., and Wu, J.-L. (2003). Semantic
context detection based on hierarchical audio models.
In Proceedings of the ACM SIGMM workshop on Mul-
timedia information retrieval, pages 109–115.
Clarin, C., Dionisio, J., Echavez, M., and Naval, P. C.
(2005). DOVE: Detection of movie violence using
motion intensity analysis on skin and blood. Techni-
cal report, University of the Philippines.
Clarke, T. J., Bradshaw, M. F., Field, D. T., Hampson,
S. E., and Rose, D. (2005). The perception of emo-
tion from body movement in point-light displays of
interpersonal dialogue. Perception, 34:1171–1180.
Datta, A., Shah, M., and Lobo, N. D. V. (2002). Person-on-
person violence detection in video data. In Proceed-
ings of the 16th International Conference on Pattern
Recognition, volume 1, pages 433–438.
Demarty, C., Penet, C., Gravier, G., and Soleymani, M.
(2012). MediaEval 2012 affect task: Violent scenes
detection in Hollywood movies. In MediaEval 2012
Workshop Proceedings, Pisa, Italy.
Giannakopoulos, T., Kosmopoulos, D., Aristidou, A., and
Theodoridis, S. (2006). Violence content classifica-
tion using audio features. In Advances in Artificial In-
telligence, volume 3955 of Lecture Notes in Computer
Science, pages 502–507.
Giannakopoulos, T., Makris, A., Kosmopoulos, D., Peran-
tonis, S., and Theodoridis, S. (2010). Audio-visual fu-
sion for detecting violent scenes in videos. In 6th Hel-
lenic Conference on AI, SETN 2010, Athens, Greece,
May 4-7, 2010. Proceedings, pages 91–100, London,
UK. Springer-Verlag.
Gong, Y., Wang, W., Jiang, S., Huang, Q., and Gao, W.
(2008). Detecting violent scenes in movies by audi-
tory and visual cues. In Proceedings of the 9th Pa-
cific Rim Conference on Multimedia, pages 317–326,
Berlin, Heidelberg. Springer-Verlag.
Hidaka, S. (2012). Identifying kinematic cues for action
style recognition. In Proceedings of the 34th Annual
Conference of the Cognitive Science Society, pages
1679–1684.
Lin, J. and Wang, W. (2009). Weakly-supervised violence
detection in movies with audio and video based co-
training. In Proceedings of the 10th Pacific Rim Con-
ference on Multimedia, pages 930–935, Berlin, Hei-
delberg. Springer-Verlag.
Nam, J., Alghoniemy, M., and Tewfik, A. (1998). Audio-
visual content-based violent scene characterization. In
Proceedings of ICIP, pages 353–357.
Oshin, O., Gilbert, A., and Bowden, R. (2011). Capturing
the relative distribution of features for action recog-
nition. In 2011 IEEE International Conference on
Automatic Face and Gesture Recognition and Work-
shops (FG 2011), pages 111–116.
Poppe, R. (2010). A survey on vision-based human action
recognition. Image and Vision Computing,
28(6):976–990.
Saerbeck, M. and Bartneck, C. (2010). Perception of affect
elicited by robot motion. In Proceedings of the 5th
ACM/IEEE international conference on Human-robot
interaction, HRI ’10, pages 53–60, Piscataway, NJ,
USA. IEEE Press.
Soomro, K., Zamir, A., and Shah, M. (2012). UCF101: A
dataset of 101 human action classes from videos in the
wild. Technical Report CRCV-TR-12-01.
Wang, D., Zhang, Z., Wang, W., Wang, L., and Tan, T.
(2012). Baseline results for violence detection in still
images. In AVSS, pages 54–57.
Zajdel, W., Krijnders, J., Andringa, T., and Gavrila, D.
(2007). CASSANDRA: audio-video sensor fusion for
aggression detection. In IEEE Conference on Ad-
vanced Video and Signal Based Surveillance (AVSS
2007), pages 200–205.
Zou, X., Wu, O., Wang, Q., Hu, W., and Yang, J. (2012).
Multi-modal based violent movies detection in video
sharing sites. In IScIDE, pages 347–355.
FastViolenceDetectioninVideo
485