Crowd Event Detection in Surveillance Video
An Approach based on Optical Flow High-frequency Feature Analysis
Ana Paula G. S. de Almeida, Vitor de Azevedo Faria and Flavio de Barros Vidal
Department of Computer Science, University of Brasilia, Brasilia, Distrito Federal, 70.910-900, Brazil
Keywords:
Crowd Event Detection, Optical Flow, High-frequency Feature.
Abstract:
Many real-world actions often occur in crowded and dynamic environments. Video surveillance applications
use crowd analysis for the automatic detection of anomalies and the raising of alarms. In this position paper we propose a
crowd event detection technique based on optical flow high-frequency feature analysis to build a robust and
stable descriptor. The proposed system is designed to be used in surveillance videos for the automatic detection
of violent acts. Preliminary results show that the proposed methodology is able to perform the detection process
successfully and allows the development of an efficient recognition stage in further works.
1 INTRODUCTION
Many real-world actions often occur in crowded and
dynamic environments. Video surveillance applications
use crowd analysis for the automatic detection of
anomalies and alarms (Ke et al., 2007). The behavior
of crowds attracts many researchers' interest
because of its complexity and abstraction (Husni and
Suryana, 2010). There are several obstacles, such as
occlusion and illumination changes, that can influence
the detection process, and there are additional
difficulties in analyzing crowd events.
Nowadays, the use of surveillance systems is
growing in outdoor and indoor environments, in public
and private buildings, to maintain the safety of their
users. However, considering this increase in equipment
(cameras in general), it is difficult to check all the images
captured from the many cameras, and doing so requires
many people to monitor them. Therefore,
techniques that provide accurate and efficient
crowd event detection are very
important for this high-growth market.
Supported by the previous arguments, in this position
paper we propose the initial development of a robust
system for crowd event detection based on the analysis
of features from the high-frequency components
of the optical flow. Section 2 describes the main related
works on crowd event detection. Sections 3 and 4
present the proposed methodology and initial results,
respectively. Conclusions and further work
are discussed in Section 5.
2 CROWD EVENT DETECTION
According to (Liao et al., 2011; Ke et al., 2007),
crowd event detection is an important task in public
security. It has become a major issue in public places
such as subway stations, banks, squares, etc. (Li et al.,
2012) describes event analysis in crowd surveillance videos
as a critical research point. This area
mainly involves the following issues: first, it is
difficult to detect and track every individual independently,
because huge numbers of moving objects are
seriously occluded; second, it is difficult to describe
the spatial relations of the objects.
As discussed in (Garate et al., 2009), when a single object
is difficult to precisely detect and track in a
crowd scene, the common solutions rely directly
on low-level video feature analysis, through
motion region detection and motion feature extraction.
For crowd event detection, some recent
works describe automatic techniques to detect the exact
time when an event (such as a violent action) occurs.
For example, (Xu et al., 2014) proposes a Bag-of-Words
(BoW) classification model and a fusion between
BoW and the Motion SIFT (MoSIFT) algorithm to
improve the previous results shown in (Wang
et al., 2012). (Esen et al., 2013) proposes a new
motion feature, called the Motion Co-Occurrence Feature
(MCF), together with an energy-based model that uses
the velocity of the moving target estimated
by optical flow and an entropy approach for the disorder
features.
3 PROPOSED METHODOLOGY
For the fight detection process in surveillance videos, we
perform the following steps for each frame of the video,
as described in Figure 1. First, the Horn and
Schunck optical flow (Horn and Schunck, 1981) is applied.
From the dense estimated optical flow, the vertical
and horizontal components are evaluated. The
two-dimensional Discrete Fourier Transform is then applied
to the estimated motion field (horizontal and vertical);
it separates the motion field
into components of low and high frequencies located in
specific regions of the spectrum. Based on this, we propose
two different spatial slicing methods (rectangular and
squared) to extract the high- and low-frequency
components from the estimated motion field.
Figure 1: Flowchart of the proposed methodology.
The following Sections describe the main ideas
of each stage of the proposed methodology
and how they combine to produce the results of the proposed
approach.
3.1 Image Sequence Input
This approach uses only images from a general
surveillance video, i.e., video captured by the kind of
fixed camera found in many cities
nowadays (Kruegle, 2011). No other a priori information
about the scene is used, and all image
processing is performed on the current video frame,
defined by the temporal index k + 1 and compared with
the previous frame at time k.
3.2 Optical Flow
The optical flow approximates the image motion field
by representing the apparent motion of the image
brightness pattern on the image plane. In determin-
ing the optical flow, two aspects must be taken into
account. One is related to the accuracy level of data
concerning motion direction and intensity. The other
aspect encompasses certain properties related to the
computational load required for optical flow determi-
nation under minimal conditions of accuracy. The
compromise between these aspects depends on the sit-
uation and the expected results. The trade-offs be-
tween efficiency and accuracy in optical flow algo-
rithms are discussed by (Liu et al., 1998).
The methods for determining the optical flow can
be divided (Barron et al., 1994) into: a) differential
techniques; b) region-based matching; c) energy-based
methods; and d) phase-based techniques. In
this approach we consider the differential techniques.
Among them, one is of particular interest:
the method that uses spatiotemporal derivatives of the image
brightness intensity, defined by (Horn and Schunck, 1981).
3.2.1 Horn & Schunck Optical Flow Method
According to (Horn and Schunck, 1981), the optical
flow cannot be calculated at a point in the image inde-
pendently of neighboring points without introducing
additional constraints. This happens because the ve-
locity field at each image point has two components
while the change in brightness at that point due to mo-
tion yields only one constraint. Before describing the
method, certain conditions must be satisfied.
For convenience, it is assumed that the apparent
velocity of the brightness patterns can be directly identified
with the movement of surfaces in the scene. This
implies that the moving object surface presents little
or no brightness variation. This happens, for example,
with objects of radial symmetry, low global contrast,
and a high level of specular reflectance. It is further
assumed that the incident illumination is uniform
across the surface.
Denote by I(x, y, t) the brightness of the image point (x, y)
at time t. During motion, it is assumed
that the brightness of a particular point is constant,
which means

$$\frac{dI(x, y, t)}{dt} = 0 \qquad (1)$$
Expanding and rewriting Equation (1) gives
VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications
630
$$I_x u + I_y v + I_t = 0 \qquad (2)$$

where $I_x$, $I_y$ and $I_t$ represent the partial derivatives of the
brightness with respect to x, y and t, respectively, and u and v are the
x and y velocity components.
Considering that the brightness pattern can move
smoothly and independently of the rest of the scene,
it is possible to recover the velocity information.
The partial derivatives of image brightness are es-
timated from the discrete set of image brightness mea-
surements. To avoid problems caused by zero values
for the derivatives in the spatio-temporal directions,
the point of interest is located at the center of a cube
formed by eight measurements, as shown in Figure 2
(Horn and Schunck, 1981).
Figure 2: Estimating the image partial derivatives.
Each of the partial derivatives is estimated as the
average of the four first differences taken over adjacent
measurements:

$$I_x \approx \frac{1}{4}\left\{ I_{i,j+1,k} - I_{i,j,k} + I_{i+1,j+1,k} - I_{i+1,j,k} + I_{i,j+1,k+1} - I_{i,j,k+1} + I_{i+1,j+1,k+1} - I_{i+1,j,k+1} \right\}$$

$$I_y \approx \frac{1}{4}\left\{ I_{i+1,j,k} - I_{i,j,k} + I_{i+1,j+1,k} - I_{i,j+1,k} + I_{i+1,j,k+1} - I_{i,j,k+1} + I_{i+1,j+1,k+1} - I_{i,j+1,k+1} \right\}$$

$$I_t \approx \frac{1}{4}\left\{ I_{i,j,k+1} - I_{i,j,k} + I_{i+1,j,k+1} - I_{i+1,j,k} + I_{i,j+1,k+1} - I_{i,j+1,k} + I_{i+1,j+1,k+1} - I_{i+1,j+1,k} \right\} \qquad (3)$$
The additional constraint for the velocity calcula-
tion results from the assumption of smoothness of the
velocity field.
A weighting factor $\alpha^2$ is introduced to associate
the error magnitude with quantization errors and
noise.
The estimated values of the velocity components
$u^{k+1}$ and $v^{k+1}$ are obtained from

$$u^{k+1} = \bar{u}^{k} - I_x \frac{I_x \bar{u}^{k} + I_y \bar{v}^{k} + I_t}{\alpha^2 + I_x^2 + I_y^2} \qquad (4)$$

$$v^{k+1} = \bar{v}^{k} - I_y \frac{I_x \bar{u}^{k} + I_y \bar{v}^{k} + I_t}{\alpha^2 + I_x^2 + I_y^2} \qquad (5)$$
In Equations (4) and (5), $\bar{u}^{k}$ and $\bar{v}^{k}$ are the average
velocities estimated from the Laplacian of the brightness
pattern at iteration k, in which the neighboring
pixel values are weighted with the mask shown in
(Horn and Schunck, 1981).
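To make the iterative scheme concrete, a minimal NumPy sketch is shown below. It is not the authors' implementation: the derivative kernels follow the cube averaging of Equation (3), the neighborhood averaging uses the weighting mask of (Horn and Schunck, 1981), and all names and defaults (`alpha`, `n_iter`) are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import convolve

# Weighting mask of (Horn and Schunck, 1981) for the average velocities
# (4-neighbors weighted 1/6, diagonal neighbors 1/12).
AVG_KERNEL = np.array([[1/12, 1/6, 1/12],
                       [1/6,  0.0, 1/6 ],
                       [1/12, 1/6, 1/12]])

def horn_schunck(im1, im2, alpha=1.0, n_iter=100):
    """Dense optical flow between two grayscale frames (float arrays)."""
    im1, im2 = im1.astype(float), im2.astype(float)

    # Eq. (3): derivatives averaged over the 2x2x2 measurement cube.
    kx = 0.25 * np.array([[-1.0, 1.0], [-1.0, 1.0]])
    ky = 0.25 * np.array([[-1.0, -1.0], [1.0, 1.0]])
    kt = 0.25 * np.ones((2, 2))
    Ix = convolve(im1, kx) + convolve(im2, kx)
    Iy = convolve(im1, ky) + convolve(im2, ky)
    It = convolve(im2, kt) - convolve(im1, kt)

    u = np.zeros_like(im1)  # horizontal velocities (Vel_x)
    v = np.zeros_like(im1)  # vertical velocities (Vel_y)
    for _ in range(n_iter):
        u_bar = convolve(u, AVG_KERNEL)  # average neighbor velocities
        v_bar = convolve(v, AVG_KERNEL)
        # Shared term of Eqs. (4) and (5).
        p = (Ix * u_bar + Iy * v_bar + It) / (alpha**2 + Ix**2 + Iy**2)
        u = u_bar - Ix * p
        v = v_bar - Iy * p
    return u, v
```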
3.3 Difference Between Estimated Motion Fields
After evaluating the Horn and Schunck optical
flow, two velocity components are estimated, in the
horizontal ($Vel_x$) and vertical ($Vel_y$) directions. These
components have the same size as the input image and
describe the motion field in the image domain.
For the crowd fight detection process, we chose to
use the absolute difference between the flows estimated
at k + 1 and at k because, in many situations, strong
luminance changes in surveillance camera systems
directly affect the proposed optical flow technique; a minimal sketch of this step follows.
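The sketch below reflects our reading of the text, in which the difference is taken between motion fields estimated at consecutive time steps; the placeholder arrays stand in for real Horn-Schunck outputs.

```python
import numpy as np

# Placeholder flow fields standing in for Horn-Schunck outputs at
# consecutive time steps (illustrative random data, not real video).
rng = np.random.default_rng(0)
u_k,  v_k  = rng.normal(size=(2, 240, 320))   # flow estimated at time k
u_k1, v_k1 = rng.normal(size=(2, 240, 320))   # flow estimated at time k+1

vel_x = np.abs(u_k1 - u_k)  # horizontal component difference
vel_y = np.abs(v_k1 - v_k)  # vertical component difference
```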
3.4 Discrete Fourier Transform
For high-frequency feature extraction we introduce
information from the two-dimensional Discrete Fourier
Transform. The Discrete Fourier Transform (DFT)
(Oppenheim et al., 1999) represents any discrete function
as a sum of sines and cosines. Using the inverse
DFT, it is possible to return to
the original function without losing information. In
the image domain, it converts the image from the spatial
to the frequency domain. The two-dimensional
DFT is described in Equation (6).
$$F(k, l) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} f(i, j)\, e^{-i 2\pi \left( \frac{ki}{N} + \frac{lj}{N} \right)} \qquad (6)$$
According to Equation (6), the whole frequency spectrum (low
and high) is obtained after the transformation
from the sinusoidal components, independently
of the spatial layout of the image. In the raw
spectrum, the low frequencies are concentrated near the corners
(borders), while the high frequencies lie toward the center. To
support the later classification stage, after the DFT
all $Vel_x$ and $Vel_y$ spectra are therefore shifted so that the low
frequencies move to the center
(see Figure 3).
Figure 3: An example of the vertical ($Vel_y$) magnitude
components from the DFT of the estimated motion field.
CrowdEventDetectioninSurveillanceVideo-AnApproachbasedonOpticalFlowHigh-frequencyFeatureAnalysis
631
Shifting all magnitudes in this way spatially inverts
the positions of the high and low frequencies: the high-frequency
components end up at the borders of the magnitude
spectrum, while the low frequencies occupy its middle.
The DFT shift procedure is a very important stage for
the classification based on spatial features of the
high-frequency components; a minimal sketch of this stage is shown below.
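The sketch below uses NumPy's FFT routines; the input array is a placeholder for a real motion-field component, and the function name is our own.

```python
import numpy as np

def shifted_magnitude(component):
    """Centered DFT magnitude spectrum: Eq. (6) followed by the shift,
    so low frequencies sit at the center and high ones at the borders."""
    return np.abs(np.fft.fftshift(np.fft.fft2(component)))

# Placeholder motion-field component standing in for Vel_y.
vel_y = np.random.default_rng(0).normal(size=(240, 320))
mag_y = shifted_magnitude(vel_y)
```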
3.5 Extraction of High-frequency Features from the Estimated Motion Field
To perform crowd fight detection in surveillance
video systems, we assume the hypothesis that, when
a crowd fight occurs, many high-frequency
components of the motion field exhibit detectable
amplitude changes in their spatial dispersion
when compared to the original
spatial dispersion of these components.
In this stage we therefore propose a spatial feature
approach based solely on the analysis of this spatial
dispersion, accounting in a preliminary attempt for changes in the
regions where the high frequencies are located and discarding
the regions of the spectrum where the low-frequency elements
are probably located.
To perform this, a method for spatial
high-frequency feature analysis is built by
dividing the amplitude spectral region, in both the
$Vel_x$ and $Vel_y$ directions, into an odd number of slices,
as described in Figure 4.
Under great agitation (or any abnormal behavior),
the high-frequency amplitude coefficients should
increase while the low-frequency ones become
weaker.
Figure 4: Vertical Slice Method.
For the proposed vertical slice method (Figure 4),
H denotes the image height, W the image
width, and NoS the number of desired
slices; a sketch of this slicing is given below.
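The sketch below is written under our own assumptions: the spectrum of width W is split into NoS equal vertical slices, the middle slice (the centered low frequencies) is discarded, and the mean magnitude of the remaining slices is taken as the high-frequency energy. The exact energy measure is not specified in the text, so the mean magnitude is an illustrative choice.

```python
import numpy as np

def slice_energies(mag, nos=5):
    """Mean magnitude in each of the NoS vertical slices (width W / NoS)."""
    h, w = mag.shape                               # H x W spectrum
    edges = np.linspace(0, w, nos + 1).astype(int)  # slice boundaries
    return np.array([mag[:, edges[i]:edges[i + 1]].mean()
                     for i in range(nos)])

def high_freq_energy(mag, nos=5):
    """Drop the central (low-frequency) slice and average the border ones."""
    e = slice_energies(mag, nos)
    return np.delete(e, nos // 2).mean()

# Placeholder shifted magnitude spectrum standing in for |DFT(Vel_y)|.
mag_y = np.abs(np.fft.fftshift(np.fft.fft2(
    np.random.default_rng(1).normal(size=(240, 320)))))
print(high_freq_energy(mag_y, nos=5))
```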
4 PRELIMINARY RESULTS
The proposed methodology was tested on twenty different
video files, each lasting thirty seconds, split into two
classes: a class called violence acts (ten videos) and a
class called non-violence acts (ten videos). An example
of the non-violence class is a situation in which many
people are crossing the street. Ten videos containing
both class types, violence and non-violence
acts, are used to validate the proposed methodology.
Figures 5-(a) and (b) show screenshot images from
a violence and a non-violence acts video, respectively,
used to build the video database. All files were captured
from public videos available on the Internet¹.
(a) Violence acts video sequence.
(b) Non-violence acts video sequence.
Figure 5: Screenshot images from the video database used.
It is worth noting that a complete crowd event detection
system for surveillance video requires both a detection
and a recognition methodology. Therefore, the
preliminary results reported in this Section cover only
the minimum conditions necessary to build
a stable feature descriptor useful for the
recognition process.
Applying the methodology proposed in Section 3,
we can see variations in the high-frequency
components of videos with violent acts
when compared to non-violence videos.
¹The videos used in Figure 5-(a) are available at https://www.youtube.com/watch?v=dGGvCL9x6m8 and in Figure 5-(b) at https://www.youtube.com/watch?v=8Q6NmFb0KuU.
VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications
632
(a) Original non-violence video (screenshot).
(b) Horizontal DFT amplitude components in
non-violence video.
(c) Vertical DFT amplitude components in
non-violence video.
Figure 6: Frequency components extracted from a non-violence video.
When comparing Figures 6 and 7 (non-violence and
violence), many amplitude variations and changes
in the spatial dispersion of the high-frequency
coefficients (the border regions of the DFT amplitude
image) can be observed.
Figures 6 and 7 indicate that the high-frequency
components of the motion field can be
used to perform detection and, later, event recognition
in crowds. Two evaluation approaches were developed: one
varies the number of slices used,
and the other the slice shape (rectangular or square).
4.1 Slice Number Tests
For the slice number test, we calculate the
mean high-frequency energy (in the horizontal and vertical
directions) of the DFT components using 3, 5 and 11
rectangular slices, as described in Figure 8.
In Figure 8, the blue marks correspond to the non-violence
class and the red marks to the violence class. The x and y axes
describe the horizontal and vertical directions, respectively.
(a) Original violence video (screenshot).
(b) Horizontal DFT amplitude components in violence video.
(c) Vertical DFT amplitude components in violence video.
Figure 7: Frequency components extracted from a violence video.

The spatial distribution of the high-frequency components
of the motion field, after the DFT shift process,
is the main indicative information for
event detection in crowds. The results in Figure 8
clearly show one grouped cloud of points for the
non-violence class, while the points of the violence
class are spread across the graph.
It is observed that, with the slice shape used
(rectangular), we can build a descriptor
that discriminates between videos of the violence
and non-violence acts classes in crowds. The
evaluated descriptor gives evidence that classification is
feasible with simple classifiers, such as a Support Vector
Machine or an Artificial Neural Network; a hedged sketch of such a stage follows.
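As an illustration of what such a classification stage could look like (it is not part of the experiments reported here), the sketch below trains a Support Vector Machine on synthetic stand-ins for the per-video (horizontal, vertical) energy descriptors; all data, names, and hyperparameters are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Placeholder descriptors: one (horizontal, vertical) mean high-frequency
# energy pair per video; 1 = violence, 0 = non-violence (synthetic data).
X = np.vstack([rng.normal(3.0, 1.0, size=(10, 2)),   # violence-like spread
               rng.normal(1.0, 0.3, size=(10, 2))])  # non-violence cluster
y = np.array([1] * 10 + [0] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```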
5 CONCLUSIONS AND FURTHER
WORK
In this position paper we presented new strategies to
analyze how the high frequencies of the estimated motion
field, obtained through the Discrete Fourier Transform, can
give information about crowd event detection. From the
graphics and information shown in Section 4, there is
considerable evidence that a stable and reliable
classifier can be created to solve the proposed problem.
(a) 3 rectangular slices energy.
(b) 5 rectangular slices energy.
(c) 11 rectangular slices energy.
Figure 8: Slice number test - horizontal and vertical DFT energy.

The next research steps will explore slice shapes and slice quantities
to achieve better results for the proposed descriptor.
All efforts will then be concentrated on designing a robust
and efficient classifier for a full crowd event detection
system.
Further work may include the implementation
of the proposed strategies in a high-level programming
language to enable operation in real-time
scenarios (including timing analysis) on real
videos, as well as more comparisons with the
latest techniques available in crowd event detection
and recognition.
REFERENCES
Barron, J. L., Fleet, D. J., and Beauchemin, S. S. (1994).
Performance of optical flow techniques. In Inter-
national Journal of Computer Vision, number 12:1,
pages 43–77.
Esen, E., Arabaci, M., and Soysal, M. (2013). Fight
detection in surveillance videos. In Content-Based
Multimedia Indexing (CBMI), 2013 11th International
Workshop on, pages 131–135.
Garate, C., Bilinsky, P., and Bremond, F. (2009). Crowd
event recognition using hog tracker. In Performance
Evaluation of Tracking and Surveillance (PETS-
Winter), 2009 Twelfth IEEE International Workshop
on, pages 1–6.
Horn, B. K. P. and Schunck, B. G. (1981). Determining op-
tical flow. In Artificial Intelligence, number 17, pages
185–204.
Husni, M. and Suryana, N. (2010). Crowd event detec-
tion in computer vision. In Signal Processing Systems
(ICSPS), 2010 2nd International Conference on, vol-
ume 1, pages V1–444–V1–447.
Ke, Y., Sukthankar, R., and Hebert, M. (2007). Event de-
tection in crowded videos. In Computer Vision, 2007.
ICCV 2007. IEEE 11th International Conference on,
pages 1–8.
Kruegle, H. (2011). CCTV Surveillance: Video Practices
and Technology. CCTV Surveillance Series. Elsevier
Science.
Li, G., Chen, J., Sun, B., and Liang, H. (2012). Crowd
event detection based on motion vector intersection
points. In Computer Science and Information Pro-
cessing (CSIP), 2012 International Conference on,
pages 411–415.
Liao, H., Xiang, J., Sun, W., Feng, Q., and Dai, J. (2011).
An abnormal event recognition in crowd scene. In
Image and Graphics (ICIG), 2011 Sixth International
Conference on, pages 731–736.
Liu, H., Hong, T., Herman, M., Camus, T., and Chellappa,
R. (1998). Accuracy vs efficiency trade-offs in opti-
cal flow algorithms. In Computer Vision and Image
Understanding, number 72:3, pages 271–286.
Oppenheim, A. V., Schafer, R. W., and Buck, J. R. (1999).
Discrete-time Signal Processing (2nd Ed.). Prentice-
Hall, Inc., Upper Saddle River, NJ, USA.
Wang, D., Zhang, Z., Wang, W., Wang, L., and Tan, T.
(2012). Baseline results for violence detection in still
images. In Advanced Video and Signal-Based Surveil-
lance (AVSS), 2012 IEEE Ninth International Confer-
ence on, pages 54–57.
Xu, L., Gong, C., Yang, J., Wu, Q., and Yao, L. (2014).
Violent video detection based on mosift feature and
sparse coding. In Acoustics, Speech and Signal Pro-
cessing (ICASSP), 2014 IEEE International Confer-
ence on, pages 3538–3542.
VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications
634