Long-term Correlation Tracking using Multi-layer Hybrid Features in Dense Environments
Nathanael L. Baisa¹, Deepayan Bhowmik² and Andrew Wallace¹
¹Department of Electrical, Electronic and Computer Engineering, Heriot-Watt University, Edinburgh, U.K.
²Department of Computing, Sheffield Hallam University, Sheffield, U.K.
Keywords: Visual Tracking, Correlation Filter, CNN Features, Hybrid Features, Online Learning, GM-PHD Filter.
Abstract: Tracking a target of interest in crowded environments is a challenging problem, not yet successfully addressed in the literature. In this paper, we propose a new long-term algorithm, learning a discriminative correlation filter and using an online classifier, to track a target of interest in dense video sequences. First, we learn a translational correlation filter using a multi-layer hybrid of convolutional neural network (CNN) and traditional hand-crafted features. We combine the advantages of both the lower convolutional layer, which retains better spatial detail for precise localization, and the higher convolutional layer, which encodes semantic information for handling appearance variations. This is integrated with traditional features formed from a histogram of oriented gradients (HOG) and color-naming. Second, we include a re-detection module for overcoming tracking failures due to long-term occlusions, training an incremental (online) SVM on the most confident frames using hand-engineered features. This re-detection module is activated, to generate high-score detection proposals, only when the correlation response of the object falls below a pre-defined threshold. Finally, we incorporate a Gaussian mixture probability hypothesis density (GM-PHD) filter to temporally filter the high-score detection proposals generated from the learned online SVM, finding the detection proposal with the maximum weight as the target position estimate and removing the other detection proposals as clutter. Extensive experiments on dense data sets show that our method significantly outperforms state-of-the-art methods.
1 INTRODUCTION
Visual target tracking is one of the most important
and active research areas in computer vision with a
wide range of applications like surveillance, robotics
and human-computer interaction (HCI). Although it
has been studied extensively during the past decades, as recently surveyed in (Smeulders et al., 2014), object tracking is still a difficult problem due to many challenges that cause significant appearance changes of targets, such as illumination changes, occlusion, pose variation, deformation, abrupt motion, and background clutter. Tracking a target of interest in dense or crowded environments is an important task in some security applications. However, it is very challenging due to heavy occlusions, high target densities, cluttered scenes and significant appearance variations of targets. Robust representation of target appearance is important to overcome these challenges.
Recently, convolutional neural network (CNN) features have demonstrated outstanding results on various recognition tasks (Girshick et al., 2014; Simonyan and Zisserman, 2015). Motivated by this, a few deep-learning-based trackers (Wang and Yeung, 2013; Ma et al., 2015a; Wang et al., 2015) have been developed. In addition, discriminative correlation filter-based trackers have achieved state-of-the-art results, as surveyed in (Chen et al., 2015), in terms of both efficiency and robustness, for three reasons. First, efficient correlation operations are performed by replacing exhaustive circular convolutions with element-wise multiplications in the frequency domain, which can be computed at very high speed using the fast Fourier transform (FFT). Second, thousands of negative samples around the target's environment can be efficiently incorporated through circular shifting with the help of a circulant matrix. Third, training samples are regressed to the soft labels of a Gaussian function (Gaussian-weighted labels) instead of binary labels, alleviating sampling ambiguity. In fact, regression with class labels can be seen as classification.
In addition, the Gaussian mixture probability hy-
pothesis density (GM-PHD) filter (Vo and Ma, 2006)
has the in-built capability of removing clutter while
filtering targets with very efficient speed without the
need for explicit data association. Though this filter
is designed for multi-target filtering, it is even prefer-
able for single target filtering in scenes with challeng-
ing background clutter, as well as clutter that comes
from other targets not currently of interest.
In this work, we mainly focus on long-term tracking of a target of interest in crowded environments, where an unknown target is initialized by a bounding box and then tracked in subsequent frames. Without imposing any constraint on the video scene, we develop an online tracking algorithm that tracks a target of interest in dense scenes by exploiting the advantages of a correlation filter, a hybrid of multi-layer CNN and hand-crafted features, an online support vector machine (SVM) classifier and a Gaussian mixture probability hypothesis density (GM-PHD) filter.
We make the following three contributions. First, we integrate a hybrid of multi-layer CNN and traditional features to learn a translation correlation filter, extending ridge regression to multi-layer features. Second, we include a re-detection module, learning an incremental (online) SVM to generate high-score detection proposals. Third, we temporally filter the generated high-score detection proposals using a GM-PHD filter to find the detection proposal with maximum weight as the target position estimate, removing clutter in dense environments and re-initializing the tracker in case of tracking failures.
2 RELATED WORK
Various visual tracking algorithms have been pro-
posed over the past decades to cope with tracking
challenges, and they can be categorized into gen-
erative and discriminative methods depending on
the learning strategy. Generative methods describe
the target appearance using generative models and
search for target regions that best-match the mod-
els. Various generative target appearance modelling
algorithms have been proposed such as online den-
sity estimation (Han et al., 2008), sparse represen-
tation (Zhang et al., 2012), and incremental sub-
space learning (Ross et al., 2008). On the other
hand, discriminative methods build a model that dis-
tinguishes the target from the background. These al-
gorithms typically learn classifiers based on online
boosting (Grabner et al., 2008), multiple instance
learning (Babenko et al., 2011), P-N learning (Kalal
et al., 2012), structured output SVMs (Hare et al.,
2011) and combining multiple classifiers with differ-
ent learning rates (Zhang et al., 2014). Discriminative
methods are most competitive to the work presented
here since they include background information, al-
though hybrid generative and discriminative models
can also be used (Dinh et al., 2014). However, sampling ambiguity in discriminative tracking methods results in drifting, which is a significant problem. Recently, correlation filters (Henriques et al., 2012; Henriques et al., 2015; Danelljan et al., 2014) have been introduced for online target tracking that can alleviate this sampling ambiguity.
There are three tracking scenarios that are important to consider: short-term tracking, long-term tracking, and tracking in a crowded scene. If objects are visible over the whole course of the sequences, short-term model-free tracking algorithms are sufficient to track a single object, though they cannot re-initialize the trackers once they fail due to long-term occlusion and confusion from background clutter (Han et al., 2008; Danelljan et al., 2014). Long-term tracking algorithms are important for target tracking in a video stream that runs indefinitely long, handling long-term occlusions. A Tracking-Learning-Detection (TLD) algorithm has been developed in (Kalal et al., 2012) which explicitly decomposes the long-term tracking task into tracking, learning and detection. However, it is sensitive to background clutter, although it works well in very sparse video. Long-term correlation tracking (LCT), developed in (Ma et al., 2015b), learns three different discriminative correlation filters (translation, appearance and scale) using hand-crafted features; however, it is not robust to long-term occlusions and background clutter.
Tracking of a target of interest in a crowded scene
is very challenging due to heavy occlusion, high tar-
get densities and clutter, and significant appearance
variations. Person detection and tracking in crowds is
formulated as a joint energy minimization problem by
combining crowd density estimation and localization
of an individual person in (Rodriguez et al., 2011).
Though this approach does not require manual initialization, it has low performance for tracking a generic target of interest, as it was mainly developed for track-
ing human heads. The method developed in (Kratz
and Nishino, 2012) trained Hidden Markov Models
(HMMs) on motion patterns within a scene to capture
spatial and temporal variations of motion in the crowd
which is used for tracking individuals. However, this
approach is limited to a crowd with a structured pat-
tern. The algorithm developed in (Idrees et al., 2014)
used visual information (prominence) and spatial con-
text (influence from neighbours) to develop online
tracking in a crowded scene. This algorithm performs well in crowded scenes but has low performance in medium density scenes, as the influence from neighbours (spatial context) decreases in such scenes.
Our proposed tracking algorithm tracks a target of interest in dense environments without imposing any constraint on the video scene, using a correlation filter, sophisticated features and a re-detection scheme, and is robust to occluded and densely cluttered scenes.
3 OVERVIEW OF OUR ALGORITHM
CNN features have recently demonstrated outstanding results on various recognition tasks, though traditional hand-engineered features are still important. Similarly, correlation filters give strong results on online tracking problems in both efficiency and accuracy. Besides, the GM-PHD filter is efficient in removing clutter that originates from both the background scene and other targets not of interest. Having observed these factors, we develop a long-term online tracking algorithm that can be applied to track a target of interest in densely cluttered environments by learning a correlation filter using a hybrid of multi-layer CNN and hand-crafted features, as well as including a re-detection module using an incremental SVM and a GM-PHD filter.
Accordingly, we first learn a translation correlation filter (w_t) using a hybrid of multi-layer CNN features from VGG-Net (Simonyan and Zisserman, 2015) and robust traditional hand-crafted features.
For the CNN part, we combine features from both a lower convolutional layer, which retains more spatial detail for precise localization, and a higher convolutional layer, which encodes semantic information for handling appearance variations. These form layers 1 and 2 of our multi-layer features, with multiple channels (512 dimensions) in each layer. Since the spatial resolution of the extracted features gradually reduces with increasing CNN layer depth due to the pooling operators, it is crucial to resize each feature map to a fixed size using bilinear interpolation.
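To make this step concrete, the sketch below resizes multi-channel feature maps from different layers to a common spatial size with bilinear interpolation. The layer shapes and the target size are illustrative placeholders, and scipy is used here only as one convenient implementation; this is not the authors' MATLAB code.

```python
# Sketch: resize multi-channel feature maps from several CNN layers to a
# common spatial size using bilinear interpolation (order=1). Layer
# shapes and the target size are illustrative placeholders.
import numpy as np
from scipy.ndimage import zoom

def resize_feature_map(feat, out_h, out_w):
    """Bilinearly resize an (H, W, D) feature map to (out_h, out_w, D)."""
    h, w, _ = feat.shape
    return zoom(feat, (out_h / h, out_w / w, 1), order=1)

# Example: conv4-4 and conv5-4 style maps with different spatial resolutions.
conv4 = np.random.rand(28, 28, 512)   # finer spatial resolution
conv5 = np.random.rand(14, 14, 512)   # coarser, more semantic
layers = [resize_feature_map(f, 56, 56) for f in (conv4, conv5)]
print([f.shape for f in layers])      # [(56, 56, 512), (56, 56, 512)]
```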
For the traditional features part, we use the histogram of oriented gradients (HOG), in particular Felzenszwalb's variant (Felzenszwalb et al., 2010), and color-naming (van de Weijer et al., 2009) features to capture image gradients and color information, respectively. Color-naming is the linguistic color label assigned by a human to describe a color; hence, the mapping method in (van de Weijer et al., 2009) is employed to convert the RGB space into the color name space, an 11-dimensional color representation providing the perception of a target color. By aligning the feature size of the HOG variant with 31 dimensions and color-naming with 11 dimensions, they are integrated to make a 42-dimensional feature which forms the 3rd layer in our hybrid multi-layer features.
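A minimal sketch of this channel stacking, assuming the HOG and color-naming maps are computed on the same cell grid (both maps here are random placeholders standing in for the real extractors):

```python
# Sketch: stack a 31-channel HOG map and an 11-channel color-naming map,
# computed on the same cell grid, into the 42-channel hand-crafted layer.
import numpy as np

hog_map = np.random.rand(56, 56, 31)  # Felzenszwalb HOG variant, 31 channels
cn_map = np.random.rand(56, 56, 11)   # color-naming probabilities, 11 channels
handcrafted_layer = np.concatenate([hog_map, cn_map], axis=2)
print(handcrafted_layer.shape)        # (56, 56, 42)
```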
Second, we incorporate a re-detection module by learning an incremental SVM from the most confident frames, determined by the maximal value of the correlation response map. This module uses HOG, LUV color and normalized gradient magnitude features to generate high-score detection proposals, which are filtered using the GM-PHD filter to re-acquire the target in case of tracking failures. The flowchart of our method is given in Figure 1 and the outline of our proposed algorithm is given in Algorithm 1.
4 PROPOSED ALGORITHM
This section describes our proposed tracking algorithm, which has three distinct functional parts: 1) correlation filters formulated for multi-layer hybrid features, 2) an online SVM detector developed for generating high-score detection proposals, and 3) a GM-PHD filter for finding the detection proposal with maximum weight to re-initialize the tracker in case of tracking failures, removing the other detection proposals as clutter.
4.1 Correlation Filters for Multi-layer Features

To track a target using correlation filters, the appearance of the target should be modeled using a correlation filter w which can be trained on a feature vector x of size M × N × D extracted from an image patch, where M, N, and D indicate the width, height and number of channels, respectively. This feature vector can be extracted from multiple layers, for example CNN features and/or traditional hand-crafted features; we therefore denote it as x^{(l)} to designate the layer l from which it is extracted. All the circular shifts of x^{(l)} along the M and N dimensions are considered as training examples, where each circularly shifted sample x^{(l)}_{m,n}, m ∈ {0, 1, ..., M−1}, n ∈ {0, 1, ..., N−1}, has a Gaussian function label y(m,n) given by

\[ y(m,n) = e^{-\frac{(m - M/2)^2 + (n - N/2)^2}{2\sigma^2}}, \quad (1) \]

where σ is the kernel width; hence, y(m,n) is a soft label rather than a binary label.
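A small sketch of this label construction follows; the grid size and the scaling of σ to the grid are illustrative assumptions:

```python
# Sketch: generate the soft Gaussian label map y(m, n) of equation (1)
# for an M x N feature grid; sigma is the kernel width.
import numpy as np

def gaussian_labels(M, N, sigma):
    m = np.arange(M).reshape(-1, 1)   # row indices
    n = np.arange(N).reshape(1, -1)   # column indices
    return np.exp(-((m - M / 2) ** 2 + (n - N / 2) ** 2) / (2 * sigma ** 2))

y = gaussian_labels(56, 56, sigma=0.1 * 56)  # bandwidth scaled to the grid
print(y.shape, y.max())                      # (56, 56), peak of 1.0 at center
```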
To learn the correlation filter w^{(l)} for layer l, with the same size as x^{(l)}, we extend ridge regression (Rifkin et al., 2003), developed for a single-layer feature vector, to a multi-layer hybrid feature vector with layer l, x^{(l)}, as
Figure 1: The flowchart of the proposed algorithm. It consists of two main parts: translation estimation and re-detection. Given a search window, we extract multi-layer hybrid features (in the frequency domain) and then estimate the target position (x_k) using a translation correlation filter (w_t). This estimated position (x_k) is used as a measurement (z_k) for updating the GM-PHD filter without refining x_k, just to update its weight for later use during re-detection. Re-detection is activated if the maximum of the response map (R_m) falls below a pre-defined threshold (T_rd). Then, we generate high-score detection proposals which are filtered by the GM-PHD filter to estimate the detection with the maximum weight as the target position (x_{rk}), removing the others as clutter. If the response map around x_{rk} (R_{md}) is greater than T_rd, the target position x_k is updated by the re-detected position x_{rk}. In frame 1, we only train the correlation filter and SVM classifier using the initialized target; no detection is performed.
\[ \min_{w^{(l)}} \sum_{m,n} \left| \Phi\big(x^{(l)}_{m,n}\big) \cdot w^{(l)} - y(m,n) \right|^2 + \lambda \big\| w^{(l)} \big\|^2, \quad (2) \]
where Φ denotes the mapping to a kernel space and λ is a regularization parameter (λ ≥ 0). The solution w^{(l)} can be expressed as

\[ w^{(l)} = \sum_{m,n} a^{(l)}(m,n)\, \Phi\big(x^{(l)}_{m,n}\big). \quad (3) \]
This alternative representation makes the dual space coefficients a^{(l)} the variables under optimization instead of the primal space w^{(l)}.
Training phase: The training phase is performed in the Fourier domain, using the fast Fourier transform (FFT) to compute the coefficients A^{(l)} as

\[ A^{(l)} = \mathcal{F}(a^{(l)}) = \frac{\mathcal{F}(y)}{\mathcal{F}\big(\Phi(x^{(l)}) \cdot \Phi(x^{(l)})\big) + \lambda}, \quad (4) \]
where F denotes the FFT operator.
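As an illustration, here is a minimal sketch of this training step for one layer, assuming the linear kernel of (9) and NumPy's FFT; the normalization by the number of elements is an implementation assumption, not taken from the paper:

```python
# Sketch: train the per-layer coefficients A^(l) of eq. (4) with a linear
# kernel, entirely in the Fourier domain. Illustrative only, not the
# authors' MATLAB implementation.
import numpy as np

def train_layer(x, y, lam=1e-4):
    """x: (M, N, D) feature map for one layer; y: (M, N) Gaussian labels."""
    X = np.fft.fft2(x, axes=(0, 1))                  # per-channel 2-D FFT
    kxx = np.sum(np.conj(X) * X, axis=2) / x.size    # linear kernel, eq. (9)
    A = np.fft.fft2(y) / (kxx + lam)                 # eq. (4)
    return A, X                                      # coefficients and template
```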
Detection phase: The detection phase is performed on the new frame, given an image patch (search window) which is used as spatial context, i.e. the search window is larger than the target. If a feature vector z^{(l)} of size M × N × D is extracted from this image patch, the response map r^{(l)} is computed as

\[ r^{(l)} = \mathcal{F}^{-1}\Big( \tilde{A}^{(l)} \odot \mathcal{F}\big( \Phi(z^{(l)}) \cdot \Phi(\tilde{x}^{(l)}) \big) \Big), \quad (5) \]

where \tilde{A}^{(l)} and \tilde{x}^{(l)} = \mathcal{F}^{-1}(\tilde{X}^{(l)}) denote the learned target appearance model for layer l, the operator ⊙ is the Hadamard (element-wise) product, and \mathcal{F}^{-1} is the inverse FFT. Now, the response maps of all layers are summed element-wise according to their weights γ(l) as
\[ r(m,n) = \sum_{l} \gamma(l)\, r^{(l)}(m,n). \quad (6) \]
The new target position is estimated by finding the maximum value of r(m,n) as

\[ (\hat{m}, \hat{n}) = \arg\max_{m,n} r(m,n). \quad (7) \]
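The following sketch puts (5)-(7) together for the multi-layer case; A_tilde and X_tilde are the learned per-layer models (e.g. from the training sketch above), and gammas are the per-layer weights:

```python
# Sketch: per-layer responses of eq. (5) with the linear kernel, fused
# with the layer weights of eq. (6), then peak localization as in eq. (7).
import numpy as np

def detect(z_layers, A_tilde, X_tilde, gammas):
    """z_layers: list of (M, N, D) feature maps from the search window."""
    response = 0.0
    for z, A, X, g in zip(z_layers, A_tilde, X_tilde, gammas):
        Z = np.fft.fft2(z, axes=(0, 1))
        kzx = np.sum(np.conj(X) * Z, axis=2) / z.size   # cross kernel, eq. (9)
        response = response + g * np.real(np.fft.ifft2(A * kzx))  # eqs. (5), (6)
    m_hat, n_hat = np.unravel_index(np.argmax(response), response.shape)  # eq. (7)
    return (m_hat, n_hat), response
```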
Model update: The model is updated by training a new model at the new target position and then linearly interpolating the obtained values of the dual space coefficients A^{(l)}_k and the base data template X^{(l)}_k = \mathcal{F}(x^{(l)}_k) with those from the previous frame, to make the tracker more adaptive to target appearance variations:

\[ \tilde{X}^{(l)}_k = (1 - \eta)\, \tilde{X}^{(l)}_{k-1} + \eta\, X^{(l)}_k, \quad (8a) \]
\[ \tilde{A}^{(l)}_k = (1 - \eta)\, \tilde{A}^{(l)}_{k-1} + \eta\, A^{(l)}_k, \quad (8b) \]

where k is the index of the current frame, and η is the learning rate.
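A one-function sketch of this update for a single layer:

```python
def update_model(A_tilde, X_tilde, A_new, X_new, eta=0.01):
    """Linear interpolation of eq. (8) for one layer; eta = learning rate."""
    X_tilde = (1 - eta) * X_tilde + eta * X_new   # eq. (8a)
    A_tilde = (1 - eta) * A_tilde + eta * A_new   # eq. (8b)
    return A_tilde, X_tilde
```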
The mappings to the kernel space (Φ) used in (4) and (5) can be expressed using the kernel function as K(x^{(l)}_i, x^{(l)}_j) = Φ(x^{(l)}_i) · Φ(x^{(l)}_j) = Φ(x^{(l)}_i)^T Φ(x^{(l)}_j). If the computation is performed in the frequency domain, the normal transpose should be replaced by the Hermitian transpose, i.e. Φ(X^{(l)}_i)^H = (Φ(X^{(l)}_i)^*)^T, where the star (*) denotes the complex conjugate. A linear kernel is used and is given as

\[ K(x^{(l)}_i, x^{(l)}_j) = (x^{(l)}_i)^T x^{(l)}_j = \mathcal{F}^{-1}\Big( \sum_{d} \big(X^{(l)}_{i,d}\big)^* \odot X^{(l)}_{j,d} \Big), \quad (9) \]

where X^{(l)}_i = \mathcal{F}(x^{(l)}_i).
This is a generic formulation for multiple-channel features from multiple layers, as in the case of our multi-layer hybrid features, i.e. X^{(l)}_{i,d} with d ∈ {1, ..., D} and l ∈ {1, ..., L}. It is an extended version of the formulation given in (Henriques et al., 2015) that takes into account features from multiple layers. The linearity of the FFT allows us to simply sum the individual dot-products for each channel d ∈ {1, ..., D} in each layer l ∈ {1, ..., L}.
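A sketch of this multi-channel, multi-layer kernel evaluation; assuming all layers have been resized to the same spatial grid, the per-channel dot-products are summed over channels and layers:

```python
# Sketch: the linear kernel of eq. (9) for multi-channel, multi-layer
# features, evaluated in the frequency domain.
import numpy as np

def linear_kernel_freq(Xi_layers, Xj_layers):
    """Xi_layers, Xj_layers: lists of per-layer (M, N, D) FFTs, same M x N."""
    kf = sum(np.sum(np.conj(Xi) * Xj, axis=2)          # sum over channels d
             for Xi, Xj in zip(Xi_layers, Xj_layers))  # and over layers l
    return np.real(np.fft.ifft2(kf))                   # back to spatial domain
```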
4.2 Online Detection
We include a re-detection module, D_r, to generate high-score detection proposals in case of tracking failures due to long-term occlusion. Instead of using a correlation filter to scan across the entire frame, which is computationally expensive, we learn an incremental (online) SVM (Diehl and Cauwenberghs, 2003) by generating a set of samples in the search window around the estimated target position from the most confident frames, and scan through the window when the module is activated to generate high-score detection proposals. The most confident frames are determined by the maximum translation correlation response in the current frame, i.e. if the maximum correlation response of an image patch is above the train detector threshold (T_td), we generate samples around this image patch and train the detector. The detector is activated to generate high-score detection proposals if the maximum of the correlation response falls below the activate detector threshold (T_ad). We use HOG (particularly Felzenszwalb's variant (Felzenszwalb et al., 2010)), LUV color and normalized gradient magnitude features to train this online SVM classifier. These visual features are different from the ones we use to learn the correlation filter; since we can select the feature representation for each module independently (Danelljan et al., 2014; Ma et al., 2015b), this greatly reduces the computational cost.
We want to update the weight vector w of the SVM by providing a set of samples with associated labels, {(\acute{x}_i, \acute{y}_i)}, obtained from the current results. The label \acute{y}_i of a new example \acute{x}_i is given by

\[ \acute{y}_i = \begin{cases} +1, & \text{if } IOU(\acute{x}_i, \ddot{x}_t) \ge \delta_p \\ -1, & \text{if } IOU(\acute{x}_i, \ddot{x}_t) < \delta_n \end{cases} \quad (10) \]

where IOU(·) is the intersection over union (overlap ratio) of a new example \acute{x}_i and the estimated target bounding box in the current most confident frame \ddot{x}_t.
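A sketch of this labeling rule, with boxes as (x, y, w, h) tuples; discarding samples whose overlap falls between the two thresholds is an assumption here, since (10) leaves them unlabeled:

```python
# Sketch: label generation following eq. (10). delta_p and delta_n use
# the paper's settings (0.9 and 0.3); in-between samples return 0 and
# are simply not used for training (an assumption, not from the paper).
def iou(a, b):
    ix = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def label_sample(sample_box, target_box, delta_p=0.9, delta_n=0.3):
    o = iou(sample_box, target_box)
    if o >= delta_p:
        return +1
    if o < delta_n:
        return -1
    return 0  # ambiguous overlap: discarded
```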
SVM classifiers of the form f(x) = w · Φ(x) + b are learned from the data {(x_i, y_i) ∈ R^m × {−1, +1}, ∀i ∈ {1, ..., N}} by minimizing

\[ \min_{w,b,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i^p \quad (11) \]

for p ∈ {1, 2}, subject to the constraints

\[ y_i \big(w \cdot \Phi(x_i) + b\big) \ge 1 - \xi_i, \quad \xi_i \ge 0 \quad \forall i \in \{1, ..., N\}. \quad (12) \]
Hinge loss (p = 1) is preferred over quadratic loss (p = 2) due to its improved robustness to outliers. Thus, the offline SVM learns a weight vector w = (w_1, w_2, ..., w_N)^T by solving this quadratic convex optimization problem (QP), which can be expressed in its dual form as

\[ \min_{0 \le a_i \le C} W = \frac{1}{2} \sum_{i,j=1}^{N} a_i Q_{ij} a_j - \sum_{i=1}^{N} a_i + b \sum_{i=1}^{N} y_i a_i, \quad (13) \]
where {a_i} are Lagrange multipliers, b is the bias, C is a regularization parameter, and Q_{ij} = y_i y_j K(x_i, x_j). The kernel function K(x_i, x_j) = Φ(x_i) · Φ(x_j) is used to implicitly map into a higher-dimensional feature space and compute the dot product. It is not straightforward for conventional QP solvers to handle the optimization problem in (13) for online tracking tasks, as the training data are provided sequentially, not all at once.
The incremental SVM (Diehl and Cauwenberghs, 2003) is tailored to such cases: it retains the Karush-Kuhn-Tucker (KKT) conditions on all the existing examples while updating the model with a new example, so that the exact solution at each increment of the dataset is guaranteed. The KKT conditions are the first-order necessary conditions for the optimal unique solution of the dual parameters {a, b} minimizing (13), and are given by

\[ \frac{\partial W}{\partial a_i} = \sum_{j=1}^{N} Q_{ij} a_j + y_i b - 1 \;
\begin{cases} > 0, & \text{if } a_i = 0 \\ = 0, & \text{if } 0 \le a_i \le C \\ < 0, & \text{if } a_i = C, \end{cases} \quad (14) \]

\[ \frac{\partial W}{\partial b} = \sum_{j=1}^{N} y_j a_j = 0. \quad (15) \]
Based on the partial derivative m_i = \partial W / \partial a_i, which is related to the margin of the i-th example, each training example can be categorized into one of three sets: margin support vectors S_1 lying on the margin (m_i = 0), error support vectors S_2 lying inside the margin (m_i < 0), and the remaining reserve vectors R (non-support vectors). During incremental learning, new examples with m_i ≤ 0 eventually become margin (S_1) or error (S_2) support vectors. The rest of the new training examples become reserve vectors, as they do not enter the solution; the Lagrange multipliers (a_i) are estimated while retaining the KKT conditions. Given the updated Lagrange multipliers, the weight vector w is given by

\[ w = \sum_{i \in S_1 \cup S_2} a_i y_i \Phi(x_i). \quad (16) \]
It is important to keep only a fixed number of sup-
port vectors with the smallest margins for efficiency
during online tracking.
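For illustration only, the sketch below approximates this online detector with scikit-learn's SGDClassifier (hinge loss) updated via partial_fit; this is a simple stand-in, not the exact incremental SVM of (Diehl and Cauwenberghs, 2003) with its KKT bookkeeping and support-vector budget:

```python
# Sketch: online updating of a linear SVM-style detector. SGDClassifier
# with hinge loss is used here as a stand-in for the exact incremental
# SVM; features/labels come from each confident frame.
import numpy as np
from sklearn.linear_model import SGDClassifier

detector = SGDClassifier(loss="hinge", alpha=1e-4)

def update_detector(features, labels):
    """features: (n_samples, n_dims) array; labels: +1 / -1 per sample."""
    detector.partial_fit(features, labels, classes=np.array([-1, 1]))

def score_proposals(features):
    """Signed distances to the hyperplane; higher = more target-like."""
    return detector.decision_function(features)
```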
Thus, using the trained incremental SVM, we gen-
erate high score detections as detection proposals dur-
ing the re-detection stage which are filtered using the
GM-PHD filter to find the best possible detection that
can re-initialize the tracker.
4.3 Temporal Filtering using GM-PHD Filter

Once we generate high-score detection proposals using the online SVM classifier during the re-detection stage, we need to find the most probable detection proposal for the target state (position) estimate by finding the detection proposal with maximum weight
using the GM-PHD filter (Vo and Ma, 2006). Though the GM-PHD filter is designed for multi-target filtering under linear Gaussian assumptions, in our problem (re-detecting a target in a cluttered scene) it is used to remove clutter that comes from the background and from other targets not of interest, a capability with which it is inherently equipped. Besides, it provides motion information for the tracking algorithm. More importantly, using the GM-PHD filter to find the detection with the maximum weight among the generated high-score detection proposals is more robust than relying only on the maximum score of the classifier.
The detected position of the target in each frame is filtered using the GM-PHD filter, but without refining the position states, until the re-detection module is activated. This updates the weight of the GM-PHD filter component corresponding to the target of interest, giving sufficient prior information to be picked up during re-detection among candidate high-score detection proposals. If the re-detection module is activated (the correlation response of the target falls below a pre-defined threshold), we generate high-score detection proposals (in this case 5) from the trained SVM classifier, which are filtered using the GM-PHD filter. The Gaussian component with maximum weight is selected as the position estimate, and if the correlation response at this estimated position is greater than the pre-defined threshold, the estimated position of the target is refined.
The GM-PHD filter has two steps: prediction and update. Before stating these two steps, certain assumptions are needed. 1) Each target follows a linear Gaussian model:

\[ y_{k|k-1}(x|\zeta) = \mathcal{N}(x; F_{k-1}\zeta, Q_{k-1}), \quad (17) \]
\[ f_k(z|x) = \mathcal{N}(z; H_k x, R_k), \quad (18) \]

where \mathcal{N}(·; m, P) denotes a Gaussian density with mean m and covariance P; F_{k−1} and H_k are the state transition and measurement matrices, respectively, and Q_{k−1} and R_k are the covariance matrices of the process and measurement noise, respectively.
2) A current-measurement-driven birth intensity, inspired by but not identical to (Ristic et al., 2012), is introduced at each time step, removing the need for prior knowledge (specification of birth intensities) or a random model, with a non-informative zero initial velocity. The intensity of the spontaneous birth RFS is a Gaussian mixture of the form

\[ \gamma_k(x) = \sum_{v=1}^{V_{\gamma,k}} w^{(v)}_{\gamma,k}\, \mathcal{N}\big(x; m^{(v)}_{\gamma,k}, P^{(v)}_{\gamma,k}\big), \quad (19) \]

where V_{γ,k} is the number of birth Gaussian components, w^{(v)}_{γ,k} is the weight accompanying Gaussian component v, m^{(v)}_{γ,k} is the current measurement (with zero initial velocity) used as the mean, and P^{(v)}_{γ,k} is the birth covariance of Gaussian component v. In our case, V_{γ,k} equals 1 except in the re-detection stage, where it becomes 5, as we generate 5 high-score detection proposals to be filtered.
3) The survival and detection probabilities are independent of the target state: p_{s,k}(x_k) = p_{s,k} and p_{D,k}(x_k) = p_{D,k}.
Prediction: It is assumed that the posterior intensity at time k − 1 is a Gaussian mixture of the form

\[ D_{k-1}(x) = \sum_{v=1}^{V_{k-1}} w^{(v)}_{k-1}\, \mathcal{N}\big(x; m^{(v)}_{k-1}, P^{(v)}_{k-1}\big), \quad (20) \]

where V_{k−1} is the number of Gaussian components of D_{k−1}(x); this equals the number of Gaussian components after pruning and merging at the previous iteration. Under these assumptions, the predicted intensity at time k is given by

\[ D_{k|k-1}(x) = D_{S,k|k-1}(x) + \gamma_k(x), \quad (21) \]

where

\[ D_{S,k|k-1}(x) = p_{s,k} \sum_{v=1}^{V_{k-1}} w^{(v)}_{k-1}\, \mathcal{N}\big(x; m^{(v)}_{S,k|k-1}, P^{(v)}_{S,k|k-1}\big), \]
\[ m^{(v)}_{S,k|k-1} = F_{k-1}\, m^{(v)}_{k-1}, \]
\[ P^{(v)}_{S,k|k-1} = Q_{k-1} + F_{k-1}\, P^{(v)}_{k-1}\, F^T_{k-1}, \]

and γ_k(x) is given by (19).
Since D_{S,k|k-1}(x) and γ_k(x) are Gaussian mixtures, D_{k|k-1}(x) can be expressed as a Gaussian mixture of the form

\[ D_{k|k-1}(x) = \sum_{v=1}^{V_{k|k-1}} w^{(v)}_{k|k-1}\, \mathcal{N}\big(x; m^{(v)}_{k|k-1}, P^{(v)}_{k|k-1}\big), \quad (22) \]

where w^{(v)}_{k|k-1} is the weight accompanying predicted Gaussian component v, and V_{k|k-1} is the number of predicted Gaussian components, equal to the number of born targets (1, except in the case of re-detection, where it is 5) added to the number of persistent components, i.e. the number of Gaussian components after pruning and merging in the previous iteration.
Update: The posterior intensity (updated PHD) at time k is also a Gaussian mixture and is given by

\[ D_{k|k}(x) = (1 - p_{D,k})\, D_{k|k-1}(x) + \sum_{z \in Z_k} D_{D,k}(x; z), \quad (23) \]

where

\[ D_{D,k}(x; z) = \sum_{v=1}^{V_{k|k-1}} w^{(v)}_k(z)\, \mathcal{N}\big(x; m^{(v)}_{k|k}(z), P^{(v)}_{k|k}\big), \]
\[ w^{(v)}_k(z) = \frac{p_{D,k}\, w^{(v)}_{k|k-1}\, q^{(v)}_k(z)}{c^s_k(z) + p_{D,k} \sum_{l=1}^{V_{k|k-1}} w^{(l)}_{k|k-1}\, q^{(l)}_k(z)}, \]
\[ q^{(v)}_k(z) = \mathcal{N}\big(z; H_k m^{(v)}_{k|k-1}, R_k + H_k P^{(v)}_{k|k-1} H^T_k\big), \]
\[ m^{(v)}_{k|k}(z) = m^{(v)}_{k|k-1} + K^{(v)}_k \big(z - H_k m^{(v)}_{k|k-1}\big), \]
\[ P^{(v)}_{k|k} = \big[I - K^{(v)}_k H_k\big] P^{(v)}_{k|k-1}, \]
\[ K^{(v)}_k = P^{(v)}_{k|k-1} H^T_k \big[H_k P^{(v)}_{k|k-1} H^T_k + R_k\big]^{-1}. \]
The clutter intensity due to the scene, c^s_k(z), in (23) is given by

\[ c^s_k(z) = \lambda c(z) = \lambda_c A c(z), \quad (24) \]

where c(·) is the uniform density over the surveillance region A, and λ_c is the average number of clutter returns per unit volume, i.e. λ = λ_c A. We set the clutter rate, or false positives per image (fppi), to λ = 4 in our experiments.
After the update, weak Gaussian components with weight w^{(v)}_k < 10^{-5} are pruned, and Gaussian components with Mahalanobis distance less than U = 4 pixels from each other are merged. These pruned and merged Gaussian components are predicted as existing (persistent) targets in the next iteration. Finally, when the re-detection module is activated, the Gaussian component of the posterior intensity whose mean corresponds to the maximum weight is selected as the target position estimate.
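To summarize the filtering cycle, the sketch below implements a minimal GM-PHD prediction and update for a constant-velocity state (x, y, vx, vy) with position measurements, plus the selection of the maximum-weight component; the motion and noise matrices are illustrative placeholders, and birth components (eq. (19)) would be appended from the current measurements with zero initial velocity before the update:

```python
# Sketch: minimal GM-PHD predict/update (eqs. (20)-(24)) for temporal
# filtering of detection proposals. All matrices below are placeholders.
import numpy as np

dt = 1.0
F = np.array([[1, 0, dt, 0], [0, 1, 0, dt], [0, 0, 1, 0], [0, 0, 0, 1.0]])
H = np.array([[1, 0, 0, 0], [0, 1, 0, 0.0]])
Q = 0.5 * np.eye(4)          # process noise covariance (placeholder)
R = 4.0 * np.eye(2)          # measurement noise covariance (placeholder)
p_s, p_d = 0.99, 0.98        # survival / detection probabilities
clutter = 4.0 / (768 * 576)  # lambda_c * c(z): fppi = 4 over an
                             # illustrative image area

def predict(components):
    """components: list of (w, m, P). Persistent-target prediction, eq. (21)."""
    return [(p_s * w, F @ m, Q + F @ P @ F.T) for (w, m, P) in components]

def update(components, measurements):
    """GM-PHD update, eq. (23); measurements are (x, y) proposal arrays."""
    updated = [((1 - p_d) * w, m, P) for (w, m, P) in components]
    for z in measurements:
        terms = []
        for (w, m, P) in components:
            S = R + H @ P @ H.T
            K = P @ H.T @ np.linalg.inv(S)
            q = np.exp(-0.5 * (z - H @ m) @ np.linalg.inv(S) @ (z - H @ m)) \
                / np.sqrt((2 * np.pi) ** 2 * np.linalg.det(S))
            terms.append((p_d * w * q, m + K @ (z - H @ m),
                          (np.eye(4) - K @ H) @ P))
        norm = clutter + sum(t[0] for t in terms)
        updated += [(tw / norm, tm, tP) for (tw, tm, tP) in terms]
    return updated

def best_position(components):
    """Mean of the maximum-weight component = re-detected target position."""
    w, m, _ = max(components, key=lambda c: c[0])
    return m[:2], w
```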
5 IMPLEMENTATION DETAILS
The main steps of our proposed algorithm are presented in Algorithm 1. Parameter settings are given as follows. To learn the translation correlation filter, we extract features from VGG-Net (Simonyan and Zisserman, 2015), trained on a large set of object recognition data (ImageNet) (Deng et al., 2009), by first removing the fully-connected layers. In particular, we use the outputs of the conv4-4 and conv5-4 convolutional layers as features (l ∈ {1, 2} and d ∈ {1, ..., D}), i.e. the outputs of the rectified linear units (the inputs of the pooling layers), to keep more spatial resolution. Hence, the CNN features we use have 2 layers (L = 2) with multiple channels (D = 512) for the conv4-4 and conv5-4 layers. For the hand-crafted features, the HOG variant with 31 dimensions and color-naming with 11 dimensions are integrated to make a 42-dimensional feature, which forms the 3rd layer in our hybrid multi-layer representation.
Algorithm 1: Proposed tracking algorithm.
Input: Image I_k, previous target position x_{k-1}, previous correlation filter w^{(l)}_{t,k-1}, previous SVM detector D_r
Output: Estimated target position x_k = (x_k, y_k), updated correlation filter w^{(l)}_{t,k}, updated SVM detector D_r
repeat
    Crop out the search window in frame k according to (x_{k-1}, y_{k-1}), extract multi-layer hybrid features and resize them to a fixed size;
    // Translation estimation
    foreach layer l do
        compute response map r^{(l)} using w^{(l)}_{t,k-1} and (5);
    end
    Sum up the response maps of all layers element-wise according to their weights γ(l) to get r(m,n) using (6);
    Estimate the new target position (x_k, y_k) by finding the maximum response of r(m,n) using (7);
    // Apply GM-PHD filter
    Update the GM-PHD filter using the estimated target position (x_k, y_k) as a measurement, but without refining it, just to update the weights of the GM-PHD filter for later use;
    // Target re-detection
    if max(r(m,n)) < T_ad then
        Use the detector D_r to generate detection proposals Z_k from high scores of the incremental SVM;
        // Filtering using GM-PHD filter
        Filter the generated candidate detections Z_k using the GM-PHD filter and select the detection with maximum weight as the re-detected target position (x_{rk}, y_{rk}). Then crop out the search window at this re-detected position, compute its response map using (5) and (6), and call it r_{rd}(m,n);
        if max(r_{rd}(m,n)) ≥ T_ad then
            (x_k, y_k) = (x_{rk}, y_{rk}), i.e. refine by the re-detected position;
        end
    end
    // Translation correlation model update
    Crop out a new patch centered at (x_k, y_k), extract multi-layer hybrid features and resize them to a fixed size;
    foreach layer l do
        Update translation correlation filter w^{(l)}_{t,k} using (8);
    end
    // Update detector D_r
    if max(r(m,n)) ≥ T_td then
        Generate positive and negative samples around (x_k, y_k), extract HOG, LUV color and normalized gradient magnitude features, and train the incremental SVM, updating its weight vector using (16);
    end
until end of the video sequence;
Given an image frame, with a search window of size M̃ × Ñ that is about 2.8 times the target size to provide some context, we resize the multi-layer hybrid features to a fixed spatial size of M × N, where M = M̃/4 and N = Ñ/4. These hybrid features from each layer are weighted by a cosine window (Henriques et al., 2015) to remove the boundary discontinuities, and then combined in (6), for which we set γ to 1, 0.4 and 0.1 for the conv5-4, conv4-4 and hand-crafted features, respectively. We set the regularization parameter of the ridge regression in (2) to λ = 10^{-4}, and the kernel bandwidth of the Gaussian function label in (1) to σ = 0.1. The learning rate for the model update in (8) is set to η = 0.01. We use a linear kernel (9) to learn the translation correlation filter.
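A sketch of the cosine windowing applied to each layer's feature map:

```python
# Sketch: the cosine (Hann) window used to suppress boundary
# discontinuities before the FFT, applied to every channel.
import numpy as np

def apply_cosine_window(feat):
    """feat: (M, N, D) feature map; returns the windowed map."""
    M, N = feat.shape[:2]
    window = np.outer(np.hanning(M), np.hanning(N))
    return feat * window[:, :, None]
```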
HOG, LUV color and normalized gradient magnitude features are used to train an incremental (online) SVM classifier for the re-detection module. For the objective function given in (13), we use a Gaussian kernel for Q_{ij} = y_i y_j K(x_i, x_j), and the regularization parameter C is set to 2. Empirically, we set the activate detector threshold to T_ad = 0.15 and the train detector threshold to T_td = 0.40. The parameters in (10) are set as δ_p = 0.9 and δ_n = 0.3. For negative samples, we randomly sample 3 times the number of positive samples satisfying δ_n = 0.3 within a maximum search area of 4 times the target size. In the re-detection phase, we generate 5 high-score detection proposals from the trained online SVM around the estimated position, within a maximum search area of 6 times the target size; these are filtered using the GM-PHD filter to find the detection with the maximum weight, removing the others as clutter.
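For reference, the parameter values stated in this section, collected as they might appear in a configuration for a re-implementation (the dictionary itself is illustrative):

```python
# Sketch: the paper's stated parameter settings in one place.
PARAMS = {
    "layer_weights": {"conv5-4": 1.0, "conv4-4": 0.4, "hand-crafted": 0.1},
    "lambda_ridge": 1e-4,    # regularization in eq. (2)
    "sigma_label": 0.1,      # Gaussian label bandwidth in eq. (1)
    "eta": 0.01,             # model update learning rate in eq. (8)
    "C_svm": 2.0,            # SVM regularization in eq. (13)
    "T_ad": 0.15,            # activate detector threshold
    "T_td": 0.40,            # train detector threshold
    "delta_p": 0.9,          # positive IoU threshold in eq. (10)
    "delta_n": 0.3,          # negative IoU threshold in eq. (10)
    "n_proposals": 5,        # high-score proposals during re-detection
    "clutter_rate": 4.0,     # fppi for the GM-PHD filter, eq. (24)
}
```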
6 EXPERIMENTAL RESULTS
We evaluate our proposed tracking algorithm in dense environments (the medium and dense PETS 2009 data sets¹) and compare its performance with state-of-the-art trackers, using the same parameter values for all the sequences. We quantitatively evaluate the robustness of the trackers using two metrics, average precision and success rate, based on center location error and bounding box overlap ratio respectively, using the one-pass evaluation (OPE) setting: running the trackers throughout a test sequence with initialization from the ground truth position in the first frame. The center location error computes the average Euclidean distance between the center locations of the tracked targets and the manually labeled ground truth positions over all frames, whereas the bounding box overlap ratio computes the intersection over union of the tracked target and ground truth bounding boxes.
¹ http://www.cvg.reading.ac.uk/PETS2009/a.html

We label the upper part (head + neck) of representative targets in both the medium and dense PETS 2009 data sets to analyze our proposed tracking algorithm. In this experiment, our goal is to determine whether our method and others can successfully track a target of interest in occluded and cluttered environments. Accordingly, we compare our proposed tracking algorithm with 6 state-of-the-art trackers, namely CF2 (Ma et al., 2015a), LCT (Ma et al., 2015b), MEEM (Zhang et al., 2014), DSST (Danelljan et al., 2014), KCF (Henriques et al., 2015) and SAMF (Li and Zhu, 2015), as well as 4 more top trackers included in the benchmark of (Wu et al., 2013), particularly SCM (Zhong et al., 2012), ASLA (Jia et al., 2012), CSK (Henriques et al., 2012) and IVT (Ross et al., 2008), both quantitatively and qualitatively.
Quantitative Evaluation: The precision (top) and success (bottom) plots, based on center location error and bounding box overlap ratio respectively, are shown in Figure 2. Our proposed tracking algorithm, denoted by LCMHT, outperforms the state-of-the-art trackers by a large margin on the PETS 2009 data sets in both precision and success rate measures. The rankings are given in the legends: distance precision at a threshold score of 20 pixels and overlap success as the AUC score of each tracker.
The second and third ranked trackers are CF2 (Ma et al., 2015a) and MEEM (Zhang et al., 2014) for the precision plots, respectively, and vice versa for the success plots. Attention is focussed on the performance of LCT: it performs least well on the precision plots and second from the lowest on the success plots on these data sets. This is surprising, since that algorithm learns three different discriminative correlation filters and even includes a re-detection module for long-term tracking problems. Its performance in occluded and cluttered environments such as the PETS 2009 data sets is poor because it uses visual features that are less robust in such environments. Even CF2, which uses CNN features, has low performance compared to our proposed algorithm on these data sets. In contrast, since our proposed tracking algorithm integrates a hybrid of multi-layer CNN and traditional features to learn the translation correlation filter, and uses a GM-PHD filter to temporally filter the high-score detection proposals generated during the re-detection phase to remove clutter, it outperforms all the available trackers significantly.
Figure 2: Distance precision (top) and overlap success (bottom) plots on PETS 2009 data sets using one-pass evaluation (OPE). The legend for distance precision contains threshold scores at 20 pixels, while the legend for overlap success contains the AUC score of each tracker; the larger, the better.

Qualitative Evaluation: Figure 3 presents the performance of our proposed tracker qualitatively, compared to the state-of-the-art trackers. In this case, we show the comparison of four representative trackers in addition to our proposed algorithm: CF2 (Ma et al., 2015a), MEEM (Zhang et al., 2014), LCT (Ma et al., 2015b), and KCF (Henriques et al., 2015).
On the medium density data set (left column), LCT and KCF lose the target within the first 16 frames. Though the CF2 and MEEM trackers track the target well, they could not re-detect the target after occlusion; only our proposed tracking algorithm tracks the target till the end of the sequence, by re-initializing the tracker after the occlusion. We show the cropped and enlarged re-detection just after occlusion in Figure 4. On the dense data set (right column), all trackers track the target for the first 20 frames, but LCT and KCF lose the target before frame 73. As on the medium density data set, the CF2 and MEEM trackers track the target until they lose it due to occlusion. Only our proposed tracking algorithm, LCMHT, re-detects the target and tracks it till the end of the sequence in such a dense environment, for two reasons. First, it incorporates both lower and higher
CNN layers in combination with traditional features
(HOG and color-naming) in a multi-layer framework
to learn the translation correlation filter that is ro-
bust to appearance variations of targets. Second, it
includes a re-detection module which generates high
score detection proposals during a re-detection phase
Figure 3: Qualitative results of our proposed LCMHT algorithm, CF2 (Ma et al., 2015a), MEEM (Zhang et al., 2014),
LCT (Ma et al., 2015b) and KCF (Henriques et al., 2015) on PETS 2009 medium (left column) and dense (right column) data
sets.
Figure 4: Qualitative results of our proposed LCMHT algorithm, CF2 (Ma et al., 2015a), MEEM (Zhang et al., 2014),
LCT (Ma et al., 2015b) and KCF (Henriques et al., 2015) on PETS 2009 medium (left, frame 78) and dense (right, frame 85)
data sets, just after occlusion by cropping and enlarging.
and then filters them using the GM-PHD filter to remove clutter due to the background and other targets, so that it can re-detect the target of interest.
Our proposed tracking algorithm is implemented in MATLAB on 4 cores of a 3.0 GHz Intel Xeon CPU E5-1607 with 16 GB RAM. We also use the MatConvNet toolbox (Vedaldi and Lenc, 2015) for CNN feature extraction, with its forward propagation computation transferred to an NVIDIA Quadro K5000; our tracker runs at 1 fps in this setting. The re-detection step and the forward propagation for feature extraction are the main computational loads of our tracking algorithm.
7 CONCLUSIONS
We have developed a novel long-term visual tracking
algorithm by learning a discriminative correlation fil-
ter and an incremental SVM classifier for tracking a
target of interest in dense environments. We learn
the translation correlation filter for which we combine
a hybrid of multi-layer CNN (both lower and higher
convolutional layers) and traditional (HOG and color-
naming) features in proper proportion. We also in-
clude a re-detection module using HOG, LUV color
and normalized gradient magnitude features for re-
initializing the tracker in the case of tracking failures
due to long-term occlusion by training an incremen-
tal SVM from the most confident frames. When ac-
tivated, the re-detection module generates high score
detection proposals which are temporally filtered us-
ing a GM-PHD filter for removing clutter. Extensive
experimental results on PETS 2009 data sets show
that our proposed algorithm outperforms the state-of-
the-art trackers in terms of both accuracy and robust-
ness. We conclude that learning a correlation filter using an appropriate combination of CNN and traditional features, as well as including a re-detection module using an incremental SVM and a GM-PHD filter, can give better results than many existing approaches.
ACKNOWLEDGMENT
We would like to acknowledge the support of the
Engineering and Physical Sciences Research Council
(EPSRC), grant references EP/K009931, EP/J015180
and a James Watt Scholarship.
REFERENCES
Babenko, B., Yang, M. H., and Belongie, S. (2011). Robust
object tracking with online multiple instance learning.
IEEE Transactions on Pattern Analysis and Machine
Intelligence, 33(8):1619–1632.
Chen, Z., Hong, Z., and Tao, D. (2015). An experimen-
tal survey on correlation filter-based tracking. CoRR,
abs/1509.05520.
Danelljan, M., Hager, G., Shahbaz Khan, F., and Felsberg,
M. (2014). Accurate scale estimation for robust visual
tracking. In Proceedings of the British Machine Vision
Conference. BMVA Press.
Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., and Fei-Fei,
L. (2009). ImageNet: A large-scale hierarchical im-
age database. In Computer Vision and Pattern Recog-
nition, 2009. CVPR 2009. IEEE Conference on, pages
248–255.
Diehl, C. P. and Cauwenberghs, G. (2003). SVM in-
cremental learning, adaptation and optimization. In
Neural Networks, 2003. Proceedings of the Interna-
tional Joint Conference on, volume 4, pages 2685–
2690 vol.4.
Dinh, T. B., Yu, Q., and Medioni, G. (2014). Co-trained
generative and discriminative trackers with cascade
particle filter. Comput. Vis. Image Underst., 119:41–
56.
Felzenszwalb, P. F., Girshick, R. B., McAllester, D., and
Ramanan, D. (2010). Object detection with discrim-
inatively trained part based models. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence,
32(9):1627–1645.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014).
Rich feature hierarchies for accurate object detection
and semantic segmentation. In Computer Vision and
Pattern Recognition.
Grabner, H., Leistner, C., and Bischof, H. (2008). Semi-
supervised on-line boosting for robust tracking. In
Proceedings of the 10th European Conference on
Computer Vision: Part I, ECCV ’08, pages 234–247.
Han, B., Comaniciu, D., Zhu, Y., and Davis, L. S. (2008).
Sequential kernel density approximation and its ap-
plication to real-time visual tracking. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence,
30(7):1186–1197.
Hare, S., Saffari, A., and Torr, P. H. S. (2011). Struck:
Structured output tracking with kernels. In 2011 Inter-
national Conference on Computer Vision, pages 263–
270.
Henriques, J. F., Caseiro, R., Martins, P., and Batista, J.
(2012). Exploiting the circulant structure of tracking-
by-detection with kernels. In Proceedings of the 12th
European Conference on Computer Vision - Volume
Part IV, ECCV’12, pages 702–715.
Henriques, J. F., Caseiro, R., Martins, P., and Batista, J.
(2015). High-speed tracking with kernelized corre-
lation filters. Pattern Analysis and Machine Intelli-
gence, IEEE Transactions on.
Idrees, H., Warner, N., and Shah, M. (2014). Tracking
in dense crowds using prominence and neighborhood
motion concurrence. Image and Vision Computing,
32(1):14 – 26.
Jia, X., Lu, H., and Yang, M. H. (2012). Visual tracking
via adaptive structural local sparse appearance model.
In Computer Vision and Pattern Recognition (CVPR),
2012 IEEE Conference on, pages 1822–1829.
Kalal, Z., Mikolajczyk, K., and Matas, J. (2012). Tracking-
learning-detection. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 34(7):1409–1422.
Kratz, L. and Nishino, K. (2012). Tracking pedestrians us-
ing local spatio-temporal motion patterns in extremely
crowded scenes. IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, 34(5):987–1002.
Li, Y. and Zhu, J. (2015). A Scale Adaptive Kernel Corre-
lation Filter Tracker with Feature Integration, chapter
Computer Vision - ECCV 2014 Workshops: Zurich,
Switzerland, September 6-7 and 12, 2014, Proceed-
ings, Part II, pages 254–265. Cham.
Ma, C., Huang, J. B., Yang, X., and Yang, M. H. (2015a).
Hierarchical convolutional features for visual track-
ing. In 2015 IEEE International Conference on Com-
puter Vision (ICCV), pages 3074–3082.
Ma, C., Yang, X., Zhang, C., and Yang, M. H. (2015b).
Long-term correlation tracking. In 2015 IEEE Con-
ference on Computer Vision and Pattern Recognition
(CVPR), pages 5388–5396.
Rifkin, R., Yeo, G., and Poggio, T. (2003). Regularized
least-squares classification. Nato Science Series Sub
Series III Computer and Systems Sciences, 190:131–
154.
Ristic, B., Clark, D. E., Vo, B.-N., and Vo, B.-T. (2012).
Adaptive target birth intensity for PHD and CPHD fil-
ters. IEEE Transactions on Aerospace and Electronic
Systems, 48(2):1656–1668.
Rodriguez, M., Sivic, J., Laptev, I., and Audibert, J.-Y.
(2011). Density-aware person detection and tracking
in crowds. In Proceedings of the International Con-
ference on Computer Vision (ICCV).
Ross, D. A., Lim, J., Lin, R.-S., and Yang, M.-H. (2008).
Incremental learning for robust visual tracking. Inter-
national Journal of Computer Vision, 77(1):125–141.
Simonyan, K. and Zisserman, A. (2015). Very deep con-
volutional networks for large-scale image recognition.
ICLR.
Smeulders, A. W. M., Chu, D. M., Cucchiara, R., Calder-
ara, S., Dehghan, A., and Shah, M. (2014). Visual
tracking: An experimental survey. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence,
36(7):1442–1468.
van de Weijer, J., Schmid, C., Verbeek, J., and Larlus, D.
(2009). Learning color names for real-world applica-
tions. Trans. Img. Proc., 18(7):1512–1523.
Vedaldi, A. and Lenc, K. (2015). MatConvNet: convolutional neural networks for MATLAB. In Proceedings
of the 25th annual ACM international conference on
Multimedia.
Vo, B.-N. and Ma, W.-K. (2006). The Gaussian mixture
probability hypothesis density filter. Signal Process-
ing, IEEE Transactions on, 54(11):4091–4104.
Wang, L., Ouyang, W., Wang, X., and Lu, H. (2015). Visual
tracking with fully convolutional networks. In 2015
IEEE International Conference on Computer Vision
(ICCV), pages 3119–3127.
Wang, N. and Yeung, D.-Y. (2013). Learning a deep
compact image representation for visual tracking. In
Burges, C. J. C., Bottou, L., Welling, M., Ghahra-
mani, Z., and Weinberger, K. Q., editors, Advances
in Neural Information Processing Systems 26, pages
809–817.
Wu, Y., Lim, J., and Yang, M. H. (2013). Online object
tracking: A benchmark. In Computer Vision and Pat-
tern Recognition (CVPR), 2013 IEEE Conference on,
pages 2411–2418.
Zhang, J., Ma, S., and Sclaroff, S. (2014). MEEM: robust
tracking via multiple experts using entropy minimiza-
tion. In Proc. of the European Conference on Com-
puter Vision (ECCV).
Zhang, T., Ghanem, B., Liu, S., and Ahuja, N. (2012).
Robust visual tracking via multi-task sparse learning.
In Computer Vision and Pattern Recognition (CVPR),
2012 IEEE Conference on, pages 2042–2049.
Zhong, W., Lu, H., and Yang, M. H. (2012). Robust ob-
ject tracking via sparsity-based collaborative model.
In Computer Vision and Pattern Recognition (CVPR),
2012 IEEE Conference on, pages 1838–1845.