Behavior Recognition in Mouse Videos using Contextual Features Encoded by Spatial-temporal Stacked Fisher Vectors

Zheheng Jiang (1), Danny Crookes (1), Brian Desmond Green (2), Shengping Zhang (3) and Huiyu Zhou (1)
(1) School of EEECS, Queen’s University Belfast, Belfast, U.K.
(2) School of Biological Sciences, Queen’s University Belfast, Belfast, U.K.
(3) School of Computer Science and Technology, Harbin Institute of Technology, Weihai, China
{zjiang01, d.crookes, b.green}@qub.ac.uk, shengping.zhang@gmail.com, h.zhou@ecit.qub.ac.uk
Keywords: Mouse Behavior Recognition, Spatial-temporal Stacked Fisher Vector, Gaussian Mixture Model, Contextual
Features, Spatio-temporal Interest Points.
Abstract: Manual measurement of mouse behavior is highly labor intensive and prone to error. This investigation aims
to efficiently and accurately recognize individual mouse behaviors in action videos and continuous videos. In
our system, each mouse action video is represented as a collection of interest points. We extract both
appearance and contextual features from the interest points collected from the training datasets, and then
obtain two Gaussian Mixture Model (GMM) dictionaries for the visual and contextual features. The two
GMM dictionaries are leveraged by our spatial-temporal stacked Fisher Vector (FV) to represent each mouse
action video. A neural network is used to classify mouse actions and is finally applied to annotate continuous
video. The novelty of our proposed approach is: (i) our method exploits contextual features from spatio-
temporal interest points, leading to enhanced performance, (ii) we encode contextual features and then fuse
them with appearance features, and (iii) location information of a mouse is extracted from spatio-temporal
interest points to support mouse behavior recognition. We evaluate our method against the database of Jhuang
et al. (Jhuang et al., 2010) and the results show that our method outperforms several state-of-the-art
approaches.
1 INTRODUCTION
Mice are extensively employed in biomedical science
and their responses to disease or therapy are
frequently detected by measurement of their behavior
patterns. In most cases this monitoring is performed
manually using video recordings. Recording of
diverse behaviors of home-cage mice generates a
large amount of information for researchers (Steele et
al., 2007; Roughan et al., 2009) in pathology,
psychology, ethology, neuroscience and medicine.
However, manual annotation of mouse recordings is
a highly labor intensive process which is error-prone
and subject to individual interpretation. Furthermore,
human observers may fail to detect behavioral events
that are very quick or very slow, and may miss
events because of a dwindling attention span.
In the literature, some systems which automatically recognize animal behaviors have been described.
For instance, Rousseau et al. (Rousseau et al., 2000)
were the first to show that the detection of specific
behaviors was possible. They applied neural network
techniques to recognize 9 solitary rat behaviors from
body shape and position, recorded from the side-
view. However, their method of tracking the nose is
not sufficiently developed to draw conclusions
concerning its sensitivity and reliability. In 2005
Dollár et al. (Dollár et al., 2005) recognized mouse
behavior using the classification of sparse spatio-
temporal features. However, they only considered
visual features of the interest points (e.g. image
gradient) without the contextual information such as
the spatial relationship between different interest
points. In 2010 Jhuang et al. (Jhuang et al., 2010) used
background subtraction to get a subwindow of the
mouse in each frame from the side-view. From the
mouse subwindow, the features that they used were
generated based on a computational model of motion
processing in the human brain (Jhuang et al., 2007),
followed by classification using a Hidden Markov
Model Support Vector Machine (SVMHMM). Their
method to locate the mouse is dependent on a good
background model, which it turns out can be
problematic. Recently, Burgos-Artizzu et al. (Burgos-
Artizzu et al., 2012) created a system for recognizing
the social behavior of mice, both from the top and side
views. They applied AdaBoost with spatio-temporal
and trajectory features to classify mouse behaviors.
As with the method of Dollár et al. (Dollár et al.,
2005), this method also ignored the spatio-temporal
contextual features. Furthermore, their trajectory
features are based on a tracking algorithm which was
not detailed in their paper.
A common feature of all of the above studies is
that the location information of mice is computed by
tracking (Rousseau et al., 2000); (Burgos-Artizzu et
al., 2012) or detection (Jhuang et al., 2007)
algorithms. Also, their extracted features are derived
from studies of human behavior recognition, such as
spatio-temporal, trajectory and shape features. Low-
level local features have become popular in action
recognition due to their robustness to background
noise and independence of the detection and tracking
algorithms. Among these local features, spatio-
temporal interest points (Dollár et al., 2005); (Laptev,
2005) and Improved Dense Trajectories (IDT) (Wang
and Schmid, 2013); (Wang et al., 2015) are widely
used because of their ease of use and good
performance. Spatio-temporal interest points are used
by some systems (Dollár et al., 2005); (Laptev, 2005)
to extract visual features around the interest points, but
the contextual relationships among interest points also
carry a large amount of information about the spatial
location and temporal changes of the mouse.
In our system, we propose to exploit contextual
features of interest points, which also potentially
describe mouse location without using an
independent tracking or detecting algorithm. These
features are then encoded as spatial-temporal stacked
Fisher vectors which are the input to the neural
network. The main contributions of this study are:
1. We improve upon the performance of Dollar’s
interest point detector, especially under illumination
change, using frame differencing and Laplacian of
Gaussian filtering.
2. We explore new contextual features from the
spatio-temporal interest points for behavior
recognition. This is the first attempt to encode these
contextual features rather than simply concatenating
them after appearance features as in (Jhuang et al.,
2010), (Burgos-Artizzu et al., 2012) and (Laptev,
2005). Our contextual features can characterize both
the spatial location and the temporal changes of a
mouse. We compute the absolute and relative
positions of each interest point and then concatenate
them to form the contextual features.
3. We compute spatial-temporal stacked Fisher
vectors for both contextual and visual features, which
helps improve behavior recognition accuracy. We
generate two GMM dictionaries for contextual and
visual features respectively and then compute
spatial-temporal stacked Fisher vectors for each of
them.
4. We conduct a comprehensive evaluation of the
proposed algorithm, and compare it with several
state-of-the-art techniques.
Figure 1: The proposed framework.
2 FRAMEWORK OF OUR
APPROACH
As shown in Fig. 1 the pipeline for our method
consists of five steps: (i) detection of interest points,
(ii) description of interest points, (iii) generation of
two Gaussian Mixture Model (GMM) dictionaries,
(iv) feature encoding with spatial-temporal stacked
Fisher vector (FV), and (v) classification with a
neural network (NN). In the following sections, we
will describe each step in more detail.
2.1 Detection of Interest Points
Interest points are local spatio-temporal features
considered to be salient or characteristic of the action
captured in a 3D spatio-temporal volume (see Fig. 2).
Spatio-temporal interest points are those points where
the local neighborhood has a significant variation in
both the spatial and the temporal domains. Laptev
(Laptev, 2005) extended the 2D Harris corner
detector to 3D. However the main drawback of this
method is the relatively small number of stable
interest points. Willems et al. (Willems et al., 2008)
measure saliency as the determinant of a 3D Hessian
matrix, which is faster and denser than Harris 3D but
less dense than Dollar’s detector. Another trend is to
use dense sampling (Wang and Schmid, 2013), which
extracts video blocks at regular positions and scales
in space and time. Obviously, dense sampling produces
many more interest points than the above detectors.
However, it is then more difficult to ensure that
all interest points lie on the object. Among various
interest point detection methods, the one proposed by
Dollar et al. (Dollár et al., 2005) is perhaps the most
suitable for mouse action recognition. They calculate
a response function to locate interest points. Their
response function has the form:

R = (I \ast g \ast h_{ev})^2 + (I \ast g \ast h_{od})^2            (1)

where g(x, y; \sigma) is the 2D Gaussian smoothing kernel, which
is applied only along the spatial dimensions, and h_{ev} and h_{od}
are a quadrature pair of 1D Gabor filters applied temporally,
defined as h_{ev}(t; \tau, \omega) = -\cos(2\pi t \omega)\, e^{-t^2/\tau^2} and
h_{od}(t; \tau, \omega) = -\sin(2\pi t \omega)\, e^{-t^2/\tau^2}.

Figure 2: Some examples of detected interest points
(depicted by red dots) in a 3D spatio-temporal volume.

Figure 3: Comparison between interest points detected
using our detector (bottom) and the Dollar detector (top)
under illumination change.
Despite this method’s popularity, since it uses
solely local information within a small region, it is
prone to false detection under illumination variation;
it also tends to generate spurious interest points
around highly textured background areas. Some
drawbacks are highlighted in the examples in Fig. 3.
To overcome these shortcomings, we propose
here a different interest point detector. In particular,
although the 1-D Gabor filter applied in the temporal
domain is effective for capturing the dynamics of
actions, it is sensitive to both illumination and highly
textured background. To overcome this problem, the
proposed detector explores different filters for
detecting salient spatio-temporal interest points.
More specifically, our detector consists of two steps:
1) Laplacian of the Gaussian filtering in the spatial
domain replacing single Gaussian in Dollar et al.
(Dollár et al., 2005) for reducing the influence of
illumination and 2) frame differencing for eliminating
spurious interest points on the background. This two-
step approach facilitates saliency detection in both the
temporal and spatial domains to give a combined
filter response. Hence our response function has the
form:

R = (D \ast L \ast g \ast h_{ev})^2 + (D \ast L \ast g \ast h_{od})^2            (2)

in which L is the Laplace operator applied in the spatial
domain (so that L \ast g is a Laplacian of Gaussian filter)
and D denotes the frame-differenced video. Owing to the
introduction of the Laplace operator, our interest point
detector can effectively reduce the influence of
illumination, while the frame differencing helps ensure that
all interest points fall on the mouse. Fig. 3 also shows that, under the same
illumination change, our detector can extract more
precise interest points on the mouse.
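
To make the two-step detector concrete, the following sketch shows one way the response of Eq. (2) could be computed for a grayscale video stored as a (T, H, W) NumPy array. It is an illustration under our assumptions, not the authors' released code: the function and parameter names are ours, and the Gabor frequency omega = 4/tau is an assumed default rather than a value stated above. Local maxima of R above a threshold would then be taken as interest points.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace, convolve1d

def detector_response(video, sigma=2.0, tau=3.0, omega=None):
    """Response of Eq. (2): frame differencing + spatial LoG + temporal Gabor quadrature pair.

    video: float array of shape (T, H, W) holding grayscale frames.
    Returns an array R of the same shape (boundary frames are approximate).
    """
    if omega is None:
        omega = 4.0 / tau  # assumed default frequency for the temporal Gabor pair
    video = video.astype(np.float64)

    # Frame differencing to suppress the static, possibly textured background.
    diff = np.empty_like(video)
    diff[1:] = video[1:] - video[:-1]
    diff[0] = diff[1]

    # Laplacian of Gaussian applied in the spatial dimensions only (replaces the plain Gaussian).
    log_filtered = np.stack([gaussian_laplace(frame, sigma=sigma) for frame in diff])

    # Quadrature pair of 1D Gabor filters applied along the temporal axis.
    t = np.arange(-2 * int(tau), 2 * int(tau) + 1, dtype=np.float64)
    h_ev = -np.cos(2 * np.pi * t * omega) * np.exp(-t ** 2 / tau ** 2)
    h_od = -np.sin(2 * np.pi * t * omega) * np.exp(-t ** 2 / tau ** 2)

    even = convolve1d(log_filtered, h_ev, axis=0, mode='nearest')
    odd = convolve1d(log_filtered, h_od, axis=0, mode='nearest')
    return even ** 2 + odd ** 2
```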
2.2 Contextual and Visual Features
2.2.1 Spatio-temporal Contextual Feature
Extraction
Most existing mouse behavior recognition systems
(Jhuang et al., 2010; Rousseau et al., 2000; Burgos-
Artizzu et al., 2012) extract position features from
established trackers. However these tracking
algorithms do not seem to be very reliable. For
example, Jhuang et al. (Jhuang et al., 2010) used
background subtraction to get a subwindow of the
mouse in each frame, but their foreground detection
algorithm assumes the background is constant, which
cannot be guaranteed in a real experiment. Besides,
trajectory features extracted by Burgos-Artizzu et al.
(Burgos-Artizzu et al., 2012) are from their
undetailed tracking algorithm, and the mouse nose
tracking algorithm applied by Rousseau et al.
(Rousseau et al., 2000) seems insensitive and
unreliable. Unlike their approach, we propose a novel
method to extract contextual information from the
detected interest points, which also imply the location
of the mouse without using any extra mouse tracking
or detection algorithm.
Our spatio-temporal contextual information of
interest points is an important action representation,
because they characterize both spatial location and
temporal changes of the mouse. There are two types
of features that are computed: the relative position,
and absolute spatial position of interest points. The
position of each interest point in the 3D spatio-
temporal volume is represented by its XYT
coordinates. Fig. 2 intuitively shows the distribution
of interest points. Suppose there are R interest points
detected in an action video. In order to compute
relative positions, we firstly compute a center interest
point defined by (x_c, y_c, t_c) = \frac{1}{R}\sum_{i=1}^{R}(x_i, y_i, t_i),
where (x_c, y_c, t_c) and (x_i, y_i, t_i) represent the
coordinates of the center and the i-th interest point
respectively in an action video. Consequently, the
relative position of the interest points is represented by
the coordinates of the R interest points relative to the
center interest point: (x_i - x_c, y_i - y_c, t_i - t_c), i = 1, 2, \ldots, R.
Using relative position efficiently describes the
distribution in the 3D spatio-temporal volume,
because it concentrates on different behavior patterns
while ignoring outliers. The absolute spatial position
of each interest point is able to characterize the place
where the action happens (which can be important for
location-based behaviors such as drinking). To
capture this information, we concatenate the XY
coordinates to the relative position. Overall, the
contextual feature vector of each interest point has the form
(x_i, y_i, x_i - x_c, y_i - y_c, t_i - t_c), i = 1, 2, \ldots, R.
2.2.2 Spatio-temporal Visual Feature
Extraction
After detecting the interest points, we extract the
visual features (see Fig. 4) from the cuboids around
the interest points in the 3D spatio-temporal volume.
For simplicity, we extract the brightness gradients
with three channels (G_x, G_y, G_t) from each
cuboid and flatten them into a vector as in (Dollár
et al., 2005). To eliminate noise and retain the
principal information, Principal Component Analysis
(PCA) is used to reduce the dimensionality of the
visual feature vector.
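
A minimal sketch of the visual descriptor is given below, assuming the cuboids around the interest points have already been cropped (e.g. with the default 13*13*19 size of Section 3.1.1); the function names are ours, and in practice the PCA would be fitted on training descriptors only.

```python
import numpy as np
from sklearn.decomposition import PCA

def cuboid_descriptor(cuboid):
    """cuboid: array of shape (T, H, W) cropped around one interest point.
    Returns the flattened brightness gradients (G_x, G_y, G_t)."""
    g_t, g_y, g_x = np.gradient(cuboid.astype(np.float64))
    return np.concatenate([g_x.ravel(), g_y.ravel(), g_t.ravel()])

def visual_features(cuboids, energy=0.98):
    """Stack one descriptor per cuboid and keep 98% of the energy with PCA."""
    descriptors = np.vstack([cuboid_descriptor(c) for c in cuboids])
    return PCA(n_components=energy).fit_transform(descriptors)
```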
2.3 Generation of GMM Dictionaries
for Contextual and Visual Features
The aim of dictionary generation is to describe the
local feature space and provide a partition for local
descriptors (Peng et al. 2014). In some existing mouse
behavior systems (Dollár et al., 2005; Burgos-Artizzu
et al., 2012), a mouse action is modeled as a bag of
independent and unordered visual words; however,
the spatio-temporal contextual information of interest
points is ignored. In these approaches, the k-means
clustering algorithm is used to construct the
dictionary. In our work, instead of k-means, we use
Gaussian Mixture Model (GMM), which is a
probabilistic model to characterize the distribution of
the given feature space.
For each type of dictionary, we suppose a K-component GMM in which Gaussian k has the form (Perronnin et al. 2010):

p_k(x) = \frac{1}{(2\pi)^{D/2} |\Sigma_k|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_k)^{\top} \Sigma_k^{-1} (x - \mu_k) \right)            (3)

where \mu_k and \Sigma_k are the D-dimensional mean vector and diagonal covariance matrix respectively, k = 1, \ldots, K. Then the GMM can be defined as:

p(x; \theta) = \sum_{k=1}^{K} w_k\, p_k(x)            (4)

where \theta = \{ w_k, \mu_k, \Sigma_k,\ k = 1, \ldots, K \}, and w_k is the mixture weight of Gaussian k, subject to w_k \geq 0 and \sum_{k=1}^{K} w_k = 1.

Given the feature set X = \{ x_1, \ldots, x_N \}, we apply
the Expectation-Maximization (EM) algorithm to
optimize the parameters of the GMM, which is learned
through maximum likelihood (Bishop, 2006).
There are two benefits in our approach with two
dictionaries: (1) The two dictionaries consider both
contextual and visual features of interest points. (2)
Unlike k-means, GMM delivers not only the mean
information of code words, but also the shape of their
distribution.
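
The two dictionaries can be learned with any EM-based GMM implementation; the sketch below uses scikit-learn with diagonal covariances, matching Eq. (3). The library choice and the variable names for the sampled descriptors are ours, for illustration only.

```python
from sklearn.mixture import GaussianMixture

def learn_gmm_dictionary(descriptors, n_components=20, seed=0):
    """Fit a diagonal-covariance GMM dictionary to sampled local descriptors via EM."""
    gmm = GaussianMixture(n_components=n_components, covariance_type='diag',
                          max_iter=200, random_state=seed)
    gmm.fit(descriptors)
    return gmm

# One dictionary per feature type (descriptors sampled from the training set):
# visual_gmm = learn_gmm_dictionary(sampled_visual_descriptors)
# contextual_gmm = learn_gmm_dictionary(sampled_contextual_descriptors)
```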
2.4 Feature Encoding and Fusion
Feature encoding aims to leverage the dictionary to
integrate all local descriptors into a feature vector,
which can ensure all video clips have the same
dimension of feature vector, and efficiently improve
classification performance. Although feature enco-
ding and fusion are very important procedures in
mouse action recognition, related papers discussing
this are rare. For example, (Dollár et al., 2005) and
(Laptev, 2005) only use the traditional encoding
method of Vector Quantization. For feature fusion,
some existing mouse behavior recognition systems
(Jhuang et al., 2010; Rousseau et al., 2000; Burgos-
Artizzu et al., 2012) simply append positional featu-
res after appearance features without encoding. In our
opinion, appearance features and contextual features
are two different kinds of feature and vary in value
range. So it is more reasonable to encode them
separately. In recent evaluations (Peng et al., 2014;
Chatfield et al., 2011; Wang et al., 2009), the Fisher
Vector performs consistently better than bag of
features and has become a popular feature encoding for
both image and video classification. We also apply
this encoding method and show that it can improve
the performance of our features as well (see section
4.1). Unlike bag of features, Fisher Vector leverages
GMM as its dictionary for encoding more information
than the mean of code words. It calculates the
gradient of the log-likelihood with respect to a
parameter of GMM, which can describe how that
parameter contributes to the process of generating a
particular example (Perronnin et al. 2010). Let
X = \{ x_n, n = 1, \ldots, N \} be the set of N descriptors of
interest points in an action video. Then this video can
be represented by the gradient vector of the log likelihood
(Jaakkola et al. 1999):

G_{\theta}^{X} = \nabla_{\theta} \log p(X; \theta)            (5)

where \log p(X; \theta) = \sum_{n=1}^{N} \log p(x_n; \theta) and \theta is the
parameter of this function. This is a generative model
which characterizes an action video with a gradient vector
derived from a probability density function. On the
basis of this generative model, Perronnin et al.
(Perronnin et al. 2010) introduced the GMM to replace
the probability density function p(x; \theta) and
developed an improved Fisher vector as follows:
,



(6)
,





(7)
where
has D dimensions and represents the
diagonal covariance matrices, i.e. the diagonal of
.
In other words,
,
and
,
are the D-dimensional
gradients with respect to the mean
and standard
deviation
of Gaussian k. Eqs. (6) and (7) are the
mathematical derivations of Eq. (4) replacing the

; of GMM. In addition,
is the weight
of
to the k
th
Gaussian:

;
,

;
,

(8)
If we suppose there are K Gaussians and D descriptor
dimensions after performing PCA in our system, then
the Fisher vector is the concatenation of G_{\mu,k}^{X}
and G_{\sigma,k}^{X} over all k, with a total of 2KD vector
dimensions; it describes how the parameters of the
generative model p(x; \theta) should be modified to
better fit the data X.
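
Eqs. (6)-(8) translate almost directly into code; the sketch below (our illustration) assumes a scikit-learn GaussianMixture with diagonal covariances fitted as in Section 2.3, and returns the 2KD-dimensional Fisher vector before stacking and normalization.

```python
import numpy as np

def fisher_vector(descriptors, gmm):
    """descriptors: (N, D) local descriptors of one video; gmm: fitted diagonal-covariance GMM.
    Returns the 2*K*D Fisher vector of Eqs. (6) and (7)."""
    X = np.asarray(descriptors, dtype=np.float64)
    N = X.shape[0]
    w, mu = gmm.weights_, gmm.means_              # shapes (K,) and (K, D)
    sigma = np.sqrt(gmm.covariances_)             # (K, D) standard deviations
    gamma = gmm.predict_proba(X)                  # (N, K) soft assignments, Eq. (8)

    diff = (X[:, None, :] - mu[None, :, :]) / sigma[None, :, :]            # (N, K, D)
    g_mu = np.einsum('nk,nkd->kd', gamma, diff) / (N * np.sqrt(w))[:, None]
    g_sigma = np.einsum('nk,nkd->kd', gamma, diff ** 2 - 1.0) / (N * np.sqrt(2.0 * w))[:, None]
    return np.concatenate([g_mu.ravel(), g_sigma.ravel()])
```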
Figure 4: Spatial-temporal stacking.
In our approach, as mentioned in Section 2.3, we have
two GMM dictionaries, one for visual and the other
for contextual features, so we can compute a Fisher
vector for each of them. Note that local sum-pooling,
which is in the form of Eqs. (6) and (7), is agnostic to the
relative location of the aggregated features. To capture
the spatial-temporal structure within each feature’s
neighborhood, inspired by the spatial stacking of
(Simonyan et al., 2013), we incorporate a stacking
sub-layer, which concatenates the spatial-temporal
adjacent features in the 2*2*2 cuboid that
encompasses all the detected interest points (Fig. 4).
After normalizing these spatial-temporal stacked
Fisher vectors by power and L2 normalization, we fuse
the contextual and appearance Fisher vectors to give the
input to the classifier. In particular, contextual and
appearance features are complementary, and they
jointly boost the performance of the classifier (see
Section 4.2).
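
The normalization and fusion steps can be written compactly; in the sketch below (an illustration, with the 2*2*2 stacking abstracted as a list of per-cell Fisher vectors) the signed power normalization and L2 normalization of (Perronnin et al. 2010) are applied before the two representations are concatenated.

```python
import numpy as np

def power_l2_normalize(v, alpha=0.5):
    """Signed power normalization followed by L2 normalization."""
    v = np.sign(v) * np.abs(v) ** alpha
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def stacked_fused_fv(visual_cell_fvs, contextual_cell_fvs):
    """Each argument: a list of Fisher vectors, one per cell of the 2*2*2 cuboid.
    Stacking concatenates the per-cell vectors; the two stacked FVs are then fused."""
    visual = power_l2_normalize(np.concatenate(visual_cell_fvs))
    contextual = power_l2_normalize(np.concatenate(contextual_cell_fvs))
    return np.concatenate([visual, contextual])
```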
2.5 Classification with a Neural
Network
In our study, the fused FV of contextual and visual
features is the final feature vector that needs to be
classified. Although FVs are designed to work well
with a linear classifier (because they correspond to
the explicit feature map of the Fisher Kernel), we still
wanted to understand whether FV classification could
be improved with non-linear classifiers. Perronnin et
al. (Perronnin et al., 2015) recently gave a positive
answer for image classification. In our experiment,
we also apply a non-linear classifier and find that a
combination of neural networks and Fisher vectors
can give better results than the conventional
combination of Fisher vector and linear support
vector machine (SVM). In our approach, a feed-
forward neural network was constructed with two
layers of sigmoid hidden neurons and softmax output
neurons. The feed-forward neural network allows the
one-way transmission of the data from input to
output. The hidden layer was fed with the fused and
normalized Fisher Vector of contextual and visual
features. In each layer, the input is weighted and
transformed by an activation function (sigmoid in the
hidden layer and softmax in the output layer) and is
then passed to the neurons in the next layer.
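
A network of this shape can be instantiated, for example, with scikit-learn (a sketch under our assumptions, not the Matlab setup used in the experiments): two logistic (sigmoid) hidden layers and a softmax output for the multi-class case.

```python
from sklearn.neural_network import MLPClassifier

# Two hidden layers of sigmoid units; for multi-class problems MLPClassifier applies a
# softmax output layer. The 100 hidden nodes per layer follow Section 3.2.
clf = MLPClassifier(hidden_layer_sizes=(100, 100), activation='logistic', max_iter=500)

# fused_fv_train: (n_clips, dim) fused Fisher vectors; y_train: behavior labels (hypothetical names).
# clf.fit(fused_fv_train, y_train)
# predictions = clf.predict(fused_fv_test)
```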
3 EXPERIMENTAL SETUP
3.1 Benchmark Experiment
To quantify the improvement obtained by our spatio-
temporal visual and contextual features, we compared
our method first to the state-of-the-art approach using
Trajectory Features proposed by Wang et al. (Wang
et al. 2015), because it has been a popular action
representation in recent years, albeit for human
behaviors.
3.1.1 Spatio-temporal Visual and
Contextual Features
The interest point detector used in this experiment
was proposed by Dollar et al. (Dollár et al., 2005). For
parameter setting, the spatial and temporal scale
parameters σ and τ are empirically set to 2 and 3,
respectively. After detecting the interest points, we
extract XYT relative and absolute locations of each
interest point. Afterwards we construct a visual
feature vector using brightness gradients from
cuboids, which are centered on the interest points and
have a default size of 13*13*19. To eliminate noise and
retain the principal information, Principal
Component Analysis (PCA) is then used to reduce the
dimensionality of the visual feature vector by preserving
98% of the energy.
3.1.2 Trajectory Features
Improved Dense Trajectories (IDT) (Wang et al.,
2015) is another widely used local feature. This
approach densely samples points in each frame and
tracks them using optical flow.
used the default trajectory length of 15 frames. For
each trajectory, we computed descriptors of
Trajectory, HOG, HOF and MBH (Wang et al.,
2013). The Trajectory descriptor describes its shape
by a sequence of displacement vectors. The other
descriptors are computed in the spatio-temporal volu-
me aligned with the trajectory. HOG represents the
static appearance information by the orientation of
image gradients. Both HOF and MBH measure mo-
tion information, and are based on optical flow. HOF
directly quantizes the orientation of flow vectors.
MBH quantizes the derivatives by splitting the optical
flow into horizontal and vertical components. The
final dimensions of the descriptors are 30 for Trajectory, 96 for HOG, 108 for HOF and 192 for MBH.
3.2 Feature Encoding and
Classification
To encode features, we compared bag of features and
Fisher vector. We used 1500 randomly sampled
features with k-means to train a codebook for each
descriptor type including HOG, HOF, MBH, spatio-
temporal visual and contextual features. The size of
the codebook is set to K=50. Unlike bag of features,
Fisher vector (Perronnin et al. 2010) encodes both
first and second order statistics between the video
descriptors and a Gaussian Mixture Model (GMM).
In order to estimate the GMM for each descriptor, we
randomly sample 1500 features from the training set
and set the number of Gaussians to K=20. Each
descriptor type thus has a 2KD-dimensional Fisher vector as
described in (Perronnin et al. 2010). To normalize a
Fisher vector, we apply power and L2 normalization
as in (Perronnin et al. 2010). Finally we concatenate
normalized Fisher vectors of different descriptor
types and compare the performance of different
combinations of them.
For classification, we use a neural network (NN), a
linear SVM, a radial basis function (RBF) SVM and K-nearest
neighbor (kNN) to compare the performance of the
trajectory features with our spatio-temporal features.
For the parameter settings of each classification method,
we fix the number of hidden nodes in the NN to 100, use
a one-against-the-rest strategy for multi-class SVM
classification, and set K=1 in kNN. For
the other parameters, we follow the default setup in
Matlab. After the experiments, we choose the best
results as the evidence of the comparison and
analysis. In all experiments we divided all datasets
into two parts: half is used for training and half for
testing. Additionally, to evaluate our system on
continuous videos, we used a leave-one-out procedure
with a frame-by-frame comparison against human ground
truth. In each fold of the leave-one-out procedure, all except
one video are used to train a neural network, and the
trained network is used to test the remaining video.
The procedure is repeated for all videos and the
average performance is reported.
3.3 Datasets
The Jhuang database (Jhuang et al., 2010) was used
for our experimental tests. The first, called the
‘clipped database’, contains 4200 clips in
which only the best instances of specific behaviors
are included. This dataset is the largest of the current
publicly available datasets. It consists of eight mouse
behavior classes: rear (399 cases), groom (1477), eat
(374), drink (61), hang (521), rest (868), walk (233)
and head (180). Each clip records a single mouse
from a side-view camera. The second database
denoted as the ‘full database’ involves 12 frame-by-
frame labeled videos lasting over 10 hours in total. In
order to make the recognition system more robust
during the training process, they varied the camera
angles and lighting conditions. They also used many
mice of different size, gender, and coat color in
their experiments. In this paper, the experiments in Sections
4.1 to 4.4 are conducted on the ‘clipped database’
using a half-by-half cross-validation procedure. The
‘full database’ is used to train and test our system,
evaluated with a leave-one-out strategy, in the last
experiment.
4 EXPERIMENTAL RESULTS
4.1 Comparison with Trajectory
Features
In this section, we evaluated the performance of our
visual features (VF) and contextual features (CF)
using different feature encoding methods, compared
with the state-of-the-art IDT features approach.
Table 1 compares the final performance of the
different features. In Table 1, we can observe that the
combined features achieve better accuracy than any single feature type.
However, for IDT features, trajectory shapes seem
not to be suitable for mouse behavior recognition. The
reason may be that differences between behaviors can
be subtle, and the trajectory shape may not give
enough fine detail. The results also show that IDT
features without trajectory shapes have better
performance than with trajectory shapes (93.4% vs
92.6%). Furthermore, a Fisher vector representation
always results in a better performance than bag of
features for each type of feature and combined
features. Taking all the results together, it is clear that
the visual and contextual features give the best results,
and their combination provides the best overall
accuracy (95.9%, compared with 93.4% for IDT
features).
Table 1: Comparison of the performance (accuracy %) of
IDT features and spatio-temporal features.

Features                        BOF+NN   FV+linear SVM   FV+NN
IDT:
  Trajectory                    69.1%    73.6%           73.3%
  HOG                           84.8%    91.6%           91.9%
  HOF                           77.2%    83.2%           84.9%
  MBH                           79.3%    87.9%           89.5%
  Combined with trajectory      85.5%    91.9%           92.6%
  Combined without trajectory   88.5%    92.3%           93.4%
Spatio-temporal:
  Visual features               87.3%    91.4%           91.3%
  Contextual features           89.4%    92.2%           93.0%
  Combined                      93.1%    95.4%           95.9%
4.2 Evaluation of Spatio-temporal
Visual and Contextual Features on
Specific Behaviors
Table 2 compares the performance of spatio-temporal
visual and contextual features for specific behaviors.
This experiment is tested on the same feature
encoding and classification (FV+NN). In Table 2, we
see that, except for “walk”, “head” and “groom”, the
contextual features seem to result in better accuracy.
The possible explanation is that the contextual
features are more effective for distinguishing
behaviors which are more localized, such as “eat”,
“drink”, “rear” and “hang”. These often happen near
the feeder, tube, wall and ceiling respectively.
Although the interest points of “groom”, “walk”,
“rest” and “head” can occur at any place
except the ceiling, each behavior has a particular
distribution in both the spatial and temporal domains,
so this contextual distribution can also contain
evidence to help distinguish behaviors. However, in
the ROC curves of the contextual features (see Fig. 6), the
performance for “head” is obviously worse than for the
other behaviors: “head” is easily confused with
behaviors that have similar spatio-temporal contextual
and visual information. The small proportion of “drink” in the
dataset also influences the final accuracy; it is
reasonable to suppose that if we had more “drink”
action videos for training (see Section 4.4), the
accuracy would be greatly improved. We also note
that the combined features achieve higher accuracy
for most behaviors than either the contextual or
visual features on their own.
Fig. 5 shows the confusion matrix for the combined
features in more detail.
Table 2: Comparison of the performance (accuracy %) of
visual features, contextual features and their combination.

Action   Visual features   Contextual features   Combined features
rear     83.1%             84.0%                 94.9%
groom    96.2%             96.2%                 97.4%
eat      76.8%             87.5%                 95.7%
drink    56.3%             84.8%                 72.4%
hang     93.6%             96.3%                 97.6%
rest     98.8%             99.1%                 99.5%
walk     98.2%             96.5%                 98.3%
head     64.5%             61.5%                 69.8%
all      91.3%             93.0%                 95.9%
Figure 5: The confusion matrix for the combination of visual and contextual features. The diagonal cells show the number
and percentage of correct classifications. The off-diagonal cells contain the number and percentage of incorrectly classified
behaviors. The proportion of each actual behavior that was correctly or incorrectly predicted is shown in the bottom row.
The proportion of each predicted behavior that was correct or incorrect is shown in the rightmost column. Overall, the
proportion of correct predictions is shown in the bottom right corner.
Figure 6: The ROC curves of visual features,
contextual features and their combination.
4.3 Evaluation of Classification
In this section we compare our combined spatio-
temporal (ST) features with combined IDT features
using different encoding methods and classifiers.
Table 3 shows that our combined ST features always
outperform the IDT features using different
combinations of encoding methods and classifiers.
The table also shows that the combination of FV and
NN or linear SVM appears to achieve higher accuracy
for both combined IDT features and combined ST
features. Moreover, the combinations of FV+NN,
FV+linear SVM and BOF+NN give better results
than the others. The results also suggest that the
performance of FV+NN is a little better than
FV+linear SVM which is used to classify IDT
features in (Wang et al. 2015). Typically the selection
of the SVM kernel is based on experience.
Table 3: Comparison of results (accuracy %) using different
encoding methods and classifiers.

Classification     IDT (no trajectory)   IDT (with trajectory)   Our ST
FV + NN            93.4%                 92.6%                    95.9%
FV + linear SVM    92.3%                 91.9%                    95.4%
FV + RBF SVM       90.7%                 88.6%                    91.1%
BOF + NN           88.5%                 85.5%                    93.1%
BOF + kNN          79.4%                 78.4%                    90.9%
BOF + linear SVM   87.1%                 86.8%                    92.4%
BOF + RBF SVM      85.9%                 84.7%                    92.7%
However, NN seems to be more robust to different
encoding methods, because regardless of the features
and encoding method used in our experiment the NN
generally outperforms the other classifiers.
4.4 Comparison with State-of-the-art
In this section we compare our method to the method
proposed by Dollar et al. (Dollár et al., 2005) and
Wang et al. (Wang et al. 2015) for each specific
mouse behavior. We use the same validation strategy
(half-by-half) for each state-of-the art method and
compare the results in Table 4. Interestingly, all
methods, including ours, struggle to recognize “drink”
and “head”. In particular, the method proposed by
(Wang et al. 2015) achieves very low accuracy (5%) on “drink”.
The most likely reason is that “drink” and “head”
account for only a small proportion of the training set (1.5%
and 4.3% respectively). We also see that the
trajectory features (including trajectory shapes and
descriptors) used by (Wang et al. 2015) cannot
correctly represent the “drink” behavior, because their
interest point detection method (Improved Dense
Trajectories) struggles to detect useful feature points
from the drinking mouse, which maintains its posture
and uses only its mouth (see Fig. 2).

Table 4: Comparison of accuracy (%) with state-of-the-art methods.

Action   Dollar   Wang    Our method   Jhuang
rear     57.9%    89.7%   94.9%        -
groom    88.4%    96.2%   97.4%        -
eat      69.0%    88.8%   95.7%        -
drink    41.0%    5.0%    72.4%        -
hang     80.8%    96.9%   97.6%        -
rest     98.8%    95.8%   99.5%        -
walk     96.1%    97.0%   98.3%        -
head     32.2%    64.8%   69.8%        -
all      82.2%    92.3%   95.9%        93%

Overall, our method significantly outperforms the current
state-of-the-art methods for each specific mouse behavior. In
terms of the final accuracy, our method has an
improvement of 13.7%, 3.6% and 2.9% over (Dollár
et al., 2005), (Wang et al. 2015) and (Jhuang et al.,
2010), respectively.
4.5 Continuous Video Annotation
To annotate continuous videos, a sliding window is
centered at each frame and both appearance features
and contextual features are computed inside it.
Once the spatio-temporal features have been computed for all
the sliding windows, a Fisher vector is computed
for each frame from the sliding window
centered on that frame. These Fisher vectors are
finally classified by a trained neural network, and the
classification results are taken as the labels of the
frames. To explore the optimal sliding window size,
we compare the percentage agreement with human
annotation for different sliding window sizes, as
illustrated in Fig. 7.
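
The frame-by-frame annotation loop can be sketched as follows (an illustration only; encode_window is a hypothetical callable standing in for the feature extraction and Fisher vector encoding steps of Section 2).

```python
import numpy as np

def annotate_video(num_frames, window_size, encode_window, classifier):
    """Label every frame from a sliding window centered on it.

    encode_window(start, end) is assumed to return the fused, normalized Fisher vector
    computed from the spatio-temporal features inside frames [start, end).
    """
    half = window_size // 2
    labels = []
    for f in range(num_frames):
        start = max(0, f - half)
        end = min(num_frames, f + half + 1)
        fv = encode_window(start, end)
        labels.append(classifier.predict(fv.reshape(1, -1))[0])
    return np.array(labels)
```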
Figure 7: Continuous video annotation with different
window sizes.
5 CONCLUSION
This paper has presented a new approach to
automatically recognizing specific mouse behaviors.
We have shown that our interest point detector is stable under
illumination change. Our fused appearance and contextual
features, encoded by spatial-temporal stacked Fisher
vectors, significantly outperform the other state-of-the-art
features. Also, the combination of Fisher vectors
and neural networks improves the performance of our
system and gives higher accuracy than the other
state-of-the-art systems. Overall, our method achieves an
average accuracy of 95.9%, compared to the previous
best result of 93%. Final experiments on the annotation of
continuous video also obtain agreement (72.9%)
on a par with human annotation, which was evaluated
as 71.6% in (Jhuang et al., 2010). Future work will
include exploring more discriminative features,
combining a temporal model and extending the range
of behaviors. We also plan to study social behavior
between multiple mice.
ACKNOWLEDGEMENT
This project is supported by UK EPSRC under Grant
EP/N011074/1 and National Natural Science
Foundation of China under Grant 61300111.
REFERENCES
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
Burgos-Artizzu, X. P., Dollár, P., Lin, D., Anderson, D. J., and Perona, P. (2012). Social behavior recognition in continuous video. In IEEE Conference on Computer Vision and Pattern Recognition.
Chatfield, K., Lempitsky, V. S., Vedaldi, A., and Zisserman, A. (2011). The devil is in the details: an evaluation of recent feature encoding methods. In British Machine Vision Conference (Vol. 2, No. 4, p. 8).
Dankert, H., Wang, L., Hoopfer, E. D., Anderson, D. J., and Perona, P. (2009). Automated monitoring and analysis of social behavior in Drosophila. Nature Methods, 6(4), 297-303.
Dollár, P., Rabaud, V., Cottrell, G., and Belongie, S. (2005). Behavior recognition via sparse spatio-temporal features. In 2nd Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.
Jaakkola, T. S., and Haussler, D. (1999). Exploiting generative models in discriminative classifiers. In Advances in Neural Information Processing Systems (pp. 487-493).
Jhuang, H., Garrote, E., Yu, X., Khilnani, V., Poggio, T., Steele, A. D., and Serre, T. (2010). Automated home-cage behavioural phenotyping of mice. Nature Communications, 1, 68.
Jhuang, H., Serre, T., Wolf, L., and Poggio, T. (2007). A biologically inspired system for action recognition. In IEEE 11th International Conference on Computer Vision (pp. 1-8).
Laptev, I. (2005). On space-time interest points. International Journal of Computer Vision, 64(2-3), 107-123.
Peng, X., Wang, L., Wang, X., and Qiao, Y. (2014). Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice. arXiv preprint arXiv:1405.4506.
Peng, X., Wang, L., Qiao, Y., and Peng, Q. (2014). A joint evaluation of dictionary learning and feature encoding for action recognition. In 22nd International Conference on Pattern Recognition (pp. 2607-2612).
Perronnin, F., and Larlus, D. (2015). Fisher vectors meet neural networks: A hybrid classification architecture. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3743-3752).
Perronnin, F., Sánchez, J., and Mensink, T. (2010). Improving the Fisher kernel for large-scale image classification. In European Conference on Computer Vision (pp. 143-156). Springer Berlin Heidelberg.
Roughan, J. V., Wright-Williams, S. L., and Flecknell, P. A. (2009). Automated analysis of postoperative behaviour: assessment of HomeCageScan as a novel method to rapidly identify pain and analgesic effects in mice. Laboratory Animals, 43(1), 17-26.
Rousseau, J. B. I., Van Lochem, P. B. A., Gispen, W. H., and Spruijt, B. M. (2000). Classification of rat behavior with an image-processing method and a neural network. Behavior Research Methods, Instruments, and Computers, 32(1), 63-71.
Sánchez, J., Perronnin, F., Mensink, T., and Verbeek, J. (2013). Image classification with the Fisher vector: Theory and practice. International Journal of Computer Vision, 105(3), 222-245.
Simonyan, K., Vedaldi, A., and Zisserman, A. (2013). Deep Fisher networks for large-scale image classification. In Advances in Neural Information Processing Systems (pp. 163-171).
Steele, A. D., Jackson, W. S., King, O. D., and Lindquist, S. (2007). The power of automated high-resolution behavior analysis revealed by its application to mouse models of Huntington's and prion diseases. Proceedings of the National Academy of Sciences, 104(6), 1983-1988.
Wang, H., Kläser, A., Schmid, C., and Liu, C. L. (2013). Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision, 103(1), 60-79.
Wang, H., and Schmid, C. (2013). Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision (pp. 3551-3558).
Wang, H., Ullah, M. M., Kläser, A., Laptev, I., and Schmid, C. (2009). Evaluation of local spatio-temporal features for action recognition. In British Machine Vision Conference (pp. 124-1).
Wang, H., Oneata, D., Verbeek, J., and Schmid, C. (2015). A robust and efficient video representation for action recognition. International Journal of Computer Vision, 1-20.
Willems, G., Tuytelaars, T., and Van Gool, L. (2008). An efficient dense and scale-invariant spatio-temporal interest point detector. In European Conference on Computer Vision (pp. 650-663). Springer Berlin Heidelberg.