REAL-TIME TEMPLATE BASED TRACKING WITH GLOBAL MOTION COMPENSATION IN UAV VIDEO

Yuriy Luzanov and Todd Howlett

Air Force Research Laboratory, USAF, 525 Brooks Rd, Rome NY, 13441, USA

Mark Robertson

Keywords: Kalman filter, frame alignment, target tracking, UAV video.

Abstract:

In this paper we describe a combination of a Kalman filter with global motion estimation between consecutive frames, implemented to improve target tracking in the presence of the rapid camera motions encountered in human-operated, UAV-based video surveillance systems. The global motion estimation allows the Kalman filter to retain the localization of the tracked targets. The original target template is selected by the operator. An SSD error measure is used to find the best match for the template in the video frames.

1 INTRODUCTION

Unmanned Aerial Vehicles (UAVs) are used by the military and law enforcement for surveillance and reconnaissance missions. Several types of sensors are available on UAV platforms, but video cameras are the most common information-gathering sensor. At this time, real-time detection and tracking of ground targets in UAV video is accomplished manually; automatic target detection and tracking algorithms do not possess the level of reliability required for real-time operation in the field.

There are many challenges in automatic real-time tracking of ground targets in video from UAVs. To name a few: the background is continuously changing due to the motion of the airframe and of the pan/tilt/zoom camera; the quality of the video is poor; and camera assemblies tend to undergo sudden rapid motions, primarily due to human operators, but also due to turbulence and airframe maneuvering.

A popular technique for tracking targets in a video stream is template matching. This paper discusses an improvement to a real-time, template-matching-based ground target tracking algorithm for UAV video.

A typical target tracking system consisting of a template matching module and a Kalman filter was implemented. The template matching module locates the target in each successive video frame. The Kalman filter predicts the location of the target based on its previous location in the image and the target's motion model. The search region is restricted to a small window centered at the predicted location, which allows for tight localization of the target and a reduction of false positive matches. However, when sudden and rapid changes occur in the video, the predictions of the targets' locations can have significant errors, often resulting in the failure of the tracking algorithm.

To account for such rapid motions in the video, it is proposed to use a frame alignment algorithm to determine the global motion between consecutive frames. The global motion is applied as a control input to the Kalman filter. As a result, the prediction of the location of the target is significantly improved, increasing tracking performance. The system presented in this paper achieves real-time performance of 24 ms of computational time per frame on 640x480 video, on a Pentium IV class workstation.

The paper is organized as follows. First, the template matching techniques used in the implementation of the tracking algorithm are discussed in Section 2. Then, the implementation of the Kalman filter is discussed in Section 3. Section 4 presents the details of the frame alignment algorithm and describes its integration with the Kalman filter. Finally, the results and conclusions are presented in Section 5. Values for all of the parameters used in the implementation are presented in the Appendix.

Luzanov Y., Howlett T. and Robertson M. (2007). In Proceedings of the Second International Conference on Computer Vision Theory and Applications - IU/MTSV, pages 515-518. Copyright SciTePress.

2 TEMPLATE MATCHING

Template matching is used to determine the location of the target in each frame of the video sequence. The main assumptions made with this technique are that there are no significant occlusions of the tracked target and that the target's appearance changes gradually.

Objects in the video change appearance over time, so the initial template selected by the operator in the first frame will become inaccurate after a period of time. A simple approach to overcome this problem is to update the template after a certain number of frames have been processed. The main difficulty is finding the proper update rate. If it is too high, the template is updated too often, and the error accumulates quickly, resulting in a drift off of the object originally selected by the operator. On the other hand, when the update rate is too low, the target will be lost due to dissimilarities between its current appearance and the template. Several techniques have been proposed to determine when and how to update the template (Matthews et al., 2004). Here the template is simply replaced every p frames by the best-matched region. The update rate, 1/p, is determined experimentally.

Let I(x, y, t) denote the pixel intensity at location (x, y) in the video frame of size N × M pixels at time t, where x ∈ [0, N − 1] and y ∈ [0, M − 1]. The template is initialized by the operator at time t = 0. Let S(i, j, l) be the pixel intensity at location (i, j) of a template extracted from the (lp)-th frame, I(x, y, lp), where l and p are nonnegative integers. The size of the template S(i, j, l) is K × K pixels, so that i, j ∈ [0, K − 1]. To eliminate any ambiguities in determining the center of S(i, j, l), K is picked to be an odd positive integer.

Using the Sum of Squared Differences (SSD) error measure, the error between the template S(i, j, l) and the image I(x, y, t) at point (a, b) can be written as follows:

e = ∑_{i=0}^{K−1} ∑_{j=0}^{K−1} [ S(i, j, l) − I(a − K/2 + i, b − K/2 + j, t) ]²   (1)

where K/2 is rounded down to the nearest integer.

By computing the error in equation (1) for every pixel of the frame at time t and finding the minimum error, the best-matched region could be determined. However, this approach would result in long computation times and poor localization of the target due to false positive matches. One solution is to define a region R_s of size W × W pixels, centered at the most probable location of the target; the search for the best match is then carried out only within R_s. In the implementation W ranges between W_min and W_max and is determined by how fast the target moves in the image.
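The window-restricted SSD search described above can be sketched as follows. The function name and array conventions are illustrative, not from the paper; grayscale frames are assumed to be 2D numpy arrays indexed [row, column]:

```python
import numpy as np

def ssd_match(frame, template, center, window):
    """Find the best match for `template` inside a window of size
    `window` x `window` in `frame`, centered at `center` = (x, y).
    Returns ((x, y) of the best-matching template center, SSD error)."""
    K = template.shape[0]        # template is K x K, K odd
    h = K // 2                   # floor(K/2), as in equation (1)
    M, N = frame.shape           # M rows (y), N columns (x)
    tpl = template.astype(float)
    x0, y0 = center
    best = (None, np.inf)
    # Scan only the search region R_s, clipped to valid template positions.
    for b in range(max(h, y0 - window // 2), min(M - h, y0 + window // 2 + 1)):
        for a in range(max(h, x0 - window // 2), min(N - h, x0 + window // 2 + 1)):
            patch = frame[b - h:b + h + 1, a - h:a + h + 1]
            e = np.sum((tpl - patch) ** 2)   # SSD error, equation (1)
            if e < best[1]:
                best = ((a, b), e)
    return best
```

The center of the best match returned here serves as the measurement fed to the Kalman filter in Section 3.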

3 KALMAN FILTER

In order to approximate the next location of the target in the video, a Kalman filter is employed (Welch and Bishop, 2004). A basic Kalman filter was found to be sufficient for this application. When the operator initializes the template in the first frame, by selecting a point inside the target region, a Kalman filter is also initialized. The point originally selected by the operator is tracked through the video. The discrete-time state-space representation of the linear process whose state is being estimated can be expressed as:

x_k = A x_{k−1} + B u_{k−1} + w_{k−1}   (2)

z_k = H x_k + v_k   (3)

In equation (2) the state vector, x ∈ R^4, contains the position and the velocity of the 2D point being tracked. The state transition matrix, A ∈ R^{4×4}, relates the previous state, x_{k−1}, to the current state, x_k, where k is a nonnegative integer. The control input matrix, B ∈ R^{4×2}, couples the control input vector, u ∈ R^2, to the state. The control input is the 2D global displacement of the image. In equation (3), z ∈ R^2 is the measurement vector; the measurement is the 2D position of the target's point. The observation matrix, H ∈ R^{2×4}, relates the state to the measurement. The random variable w represents the process noise and the random variable v represents the measurement noise. They are independent, white, and normally distributed:

w ∼ N(0, Q) (4)

v ∼ N(0, R), (5)

where Q ∈ R^{4×4} is the process noise covariance matrix and R ∈ R^{2×2} is the measurement noise covariance matrix. Both Q and R are assumed to be constant.

A Kalman filter consists of two stages: the prediction and the correction. In the prediction stage an estimate of the current state, x̂_k, and an estimate of the current error covariance matrix, P̂_k, are made:

x̂_k = A x_{k−1} + B u_{k−1}   (6)

P̂_k = A P_{k−1} A^T + Q,   (7)

where the error covariance matrix P ∈ R^{4×4}. The correction stage uses the measurement z_k to refine the prediction and compute the Kalman gain K_k, the current state x_k, and the error covariance P_k:

K_k = P̂_k H^T (H P̂_k H^T + R)^{−1}   (8)

x_k = x̂_k + K_k (z_k − H x̂_k)   (9)

P_k = (I − K_k H) P̂_k   (10)

For each frame the following computations are performed. First, a prediction of the target's location is made using (6) and (7). Then the search region R_s is initialized with the predicted target position from x̂_k. Finally, template matching is performed within the region R_s, and the center of the best match is used as the measurement z_k to correct the filter with equations (8), (9) and (10).

The motion of the object in the video is usually linear, except when the camera undergoes sudden and rapid movements. When that happens, the assumption of linearity is violated and the Kalman filter fails.
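A minimal sketch of one predict/correct cycle, equations (6)-(10), using the constant-velocity matrices listed in the Appendix. Function names are ours, and numpy is assumed:

```python
import numpy as np

A = np.array([[1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.],
              [0., 0., 0., 1.]])        # state transition, equation (18)
B = np.array([[1., 0.], [0., 1.], [0., 0.], [0., 0.]])  # control coupling
H = np.array([[1., 0., 0., 0.], [0., 1., 0., 0.]])      # observation, (19)
Q = 0.01 * np.eye(4)                    # process noise covariance, (20)
R = np.eye(2)                           # measurement noise covariance, (20)

def predict(x, P, u):
    """Prediction stage: u is the global displacement (dx, dy)."""
    x_hat = A @ x + B @ u               # equation (6)
    P_hat = A @ P @ A.T + Q             # equation (7)
    return x_hat, P_hat

def correct(x_hat, P_hat, z):
    """Correction stage: z is the center of the best template match."""
    K = P_hat @ H.T @ np.linalg.inv(H @ P_hat @ H.T + R)  # equation (8)
    x = x_hat + K @ (z - H @ x_hat)     # equation (9)
    P = (np.eye(4) - K @ H) @ P_hat     # equation (10)
    return x, P
```

Each frame, `predict` is called with the estimated global motion as `u`, the search region is centered at the first two components of `x_hat`, and `correct` is called with the match location.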

4 ROBUST FRAME ALIGNMENT

Frame alignment is used to compute the global motion and supply the control input to the Kalman filter. Section 4.1 provides the formulation of the alignment algorithm, after which Section 4.2 shows how to make the formulation robust to outliers and to data not well modeled by the original formulation.

4.1 Aligning Background

We denote the intensity of pixel (x, y) at time t as I(x, y, t). To relate the pixels of two frames together, one can apply the intensity constraint

I(x, y, t) = I(x + ∆x, y + ∆y, t − 1),   (11)

which, if approximated with the linear terms of a Taylor expansion, yields

∆x ∂I(x, y, t)/∂x + ∆y ∂I(x, y, t)/∂y − ∂I(x, y, t)/∂t ≈ 0   (12)

The motion at pixel (x, y) is (∆x, ∆y). For notational convenience, we define f_x, f_y, and f_t as the partial derivatives in (12). These partial derivatives are implemented as digital approximations of the horizontal, vertical, and temporal gradients. The various quantities are subscripted with an "i" to denote their values at the i-th pixel.

By imposing a global motion model at each pixel location (x_i, y_i) in the image, we can parameterize the motion to make a solution at each pixel more tractable. A two-parameter motion model that accounts for translation in the horizontal and vertical directions would be

∆x_i = a_13,   ∆y_i = a_23   (13)

A weighted LS derivation yields the following system of equations:

[ ∑ w_i f_{x,i}²         ∑ w_i f_{x,i} f_{y,i} ] [ a_13 ]   [ ∑ w_i f_{x,i} f_{t,i} ]
[ ∑ w_i f_{x,i} f_{y,i}  ∑ w_i f_{y,i}²        ] [ a_23 ] = [ ∑ w_i f_{y,i} f_{t,i} ]   (14)

where w_i is the weight for the i-th pixel position. These equations are easily solved for the two motion parameters. More complex global motion models are also possible, but at the cost of additional computational time.
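Equation (14) can be solved directly. The sketch below (illustrative names, numpy assumed) builds the 2 × 2 normal-equation system from per-pixel gradient values and weights:

```python
import numpy as np

def solve_translation(fx, fy, ft, w):
    """Solve equation (14): return (a13, a23) minimizing
    sum_i w_i * (fx_i * a13 + fy_i * a23 - ft_i)^2."""
    fx, fy, ft, w = (np.ravel(v).astype(float) for v in (fx, fy, ft, w))
    # Left-hand 2x2 matrix of weighted gradient products.
    M = np.array([[np.sum(w * fx * fx), np.sum(w * fx * fy)],
                  [np.sum(w * fx * fy), np.sum(w * fy * fy)]])
    # Right-hand side of equation (14).
    b = np.array([np.sum(w * fx * ft), np.sum(w * fy * ft)])
    return np.linalg.solve(M, b)        # (a13, a23)
```

In practice fx, fy, ft would be gradient images computed with digital approximations (e.g. finite differences between the two frames).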

4.2 Making Alignment Robust

Frame alignment as described in the previous subsection works well when the two frames are well modeled by the particular global motion model in use. However, even when the underlying background is well modeled, large errors can sometimes still result. For example, if there are objects moving independently in the scene, their motion will not match that of the background, and large errors will occur at these positions. Similarly, if there is burned-in metadata on the frames, it will not move according to the background, and large errors will result as well. To make the frame alignment procedure more robust to such errors, we employ an iterative reweighted LS scheme (Odobez and Bouthemy, 1994). On iteration k = 0, when there is no knowledge about errors, all weights are chosen as one, w_i^(0) = 1. The motion parameters (for whichever model is being used) are solved to yield motion estimates for the k-th iteration, (∆x_i^(k), ∆y_i^(k)). The residual error for location i at iteration k is then computed as

r_i^(k) = I(x_i, y_i, t) − I(x_i + ∆x_i^(k), y_i + ∆y_i^(k), t − 1)   (15)

The residual error represents the difference between the i-th pixel value of frame t and the pixel value at the location in frame t − 1 to which we believe (at iteration k) it has moved. If we use the Tukey biweight function (Odobez and Bouthemy, 1994), we then pick the weight for the next iteration k + 1 as:

√(w_i^(k+1)) = { C² − [r_i^(k)]²   if |r_i^(k)| < C
              { 0                 otherwise          (16)

Figure 1: Tracking a car in consecutive frames, left to right.

This weight is then used in the weighted LS formulation to get an estimate of the motion at iteration k + 1, after which the procedure continues iteratively until a convergence criterion is met or a fixed number of iterations has completed. A similar weighting scheme to (16) is given as

√(w_i^(k+1)) = { C − |r_i^(k)|   if |r_i^(k)| < C
              { 0               otherwise,           (17)

which makes use of an absolute value rather than a square, and may be slightly faster on some architectures.

Reweighting according to (16) or (17) is equivalent to a formulation that minimizes not the sum of the squares of the errors, as in LS, but rather the sum of a function of the errors.

Once the global motion, (∆x, ∆y), between the last and the current frames is estimated, it is applied to the Kalman filter prediction step as the control input, u_{k−1} = [∆x, ∆y]^T for k ≥ 1, in equation (6).
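The iterative reweighting loop of Section 4.2 can be sketched as below for the two-parameter model. Note two simplifications for illustration: the residual used here is the linearized error f_x a_13 + f_y a_23 − f_t rather than the warped image difference of equation (15), and the absolute-value weight rule (17) is used. Names and the iteration count are illustrative:

```python
import numpy as np

def robust_translation(fx, fy, ft, C=40.0, iters=5):
    """Iterative reweighted LS estimate of (a13, a23), equations (14)-(17)."""
    fx, fy, ft = (np.ravel(v).astype(float) for v in (fx, fy, ft))
    w = np.ones_like(fx)                     # iteration 0: all weights = 1
    a = np.zeros(2)
    for _ in range(iters):
        # Weighted LS solve, equation (14).
        M = np.array([[np.sum(w * fx * fx), np.sum(w * fx * fy)],
                      [np.sum(w * fx * fy), np.sum(w * fy * fy)]])
        b = np.array([np.sum(w * fx * ft), np.sum(w * fy * ft)])
        a = np.linalg.solve(M, b)
        # Linearized residual (stand-in for equation (15)).
        r = fx * a[0] + fy * a[1] - ft
        # Weight rule (17): sqrt(w) = C - |r| if |r| < C, else 0.
        sqrt_w = np.where(np.abs(r) < C, C - np.abs(r), 0.0)
        w = sqrt_w ** 2
    return a     # (dx, dy), applied as u_{k-1} in equation (6)
```

Pixels whose residual exceeds C (independently moving objects, burned-in metadata) receive zero weight and stop influencing the background motion estimate.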

5 RESULTS AND CONCLUSIONS

The algorithm was implemented on a 3.0 GHz Intel Xeon workstation in a single process, without using any special instructions, and real-time performance was achieved. On average, 24 ms was required to process a single 640x480 frame of video. This includes the computation of the global motion between the current and the previous frame, the Kalman filter prediction and correction, and template matching.

As part of the testing, sample UAV videos from the DARPA-sponsored VIVID program were used (Collins et al., 2005). Figure 1 illustrates the algorithm tracking a car in three consecutive frames in the presence of significant camera motion.

As a result of using the global motion compensation, the tracker was able to maintain the track of the target using a small search region.

There are certain advantages to employing global motion compensation. This is most apparent in situations with complex scenes, e.g. urban scenes, where multiple objects in the video frame have similar appearance; in order to successfully track a target, a small search window must be used to avoid false positive matches.

The disadvantage of using global motion compensation is its computational cost. If the complexity of the scene permits, i.e. the target has a unique appearance, it may be faster to use a larger search area rather than to compute the global motion.

REFERENCES

Collins, R., Zhou, X., and Teh, S. K. (2005). An open source tracking testbed and evaluation web site. In IEEE International Workshop on Performance Evaluation of Tracking and Surveillance.

Matthews, I., Ishikawa, T., and Baker, S. (2004). The template update problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(4):810-815.

Odobez, J.-M. and Bouthemy, P. (1994). Robust multiresolution estimation of parametric motion models applied to complex scenes. Publication Interne IRISA 788, Institut de Recherche en Informatique et Systèmes Aléatoires.

Welch, G. and Bishop, G. (2004). An introduction to the Kalman filter. Technical Report TR 95-041, University of North Carolina at Chapel Hill.

APPENDIX

The following are the values for the parameters used in the implementation of the algorithms.

For template matching, 11 × 11 templates were used, thus K = 11 and ⌊K/2⌋ = 5. The size of the search area R_s was between W_min = 10 and W_max = 32 pixels. The template was updated every 15 frames, thus p = 15.

In the implementation of the Kalman filter the following state-space matrices were employed:

    [ 1 0 1 0 ]        [ 1 0 ]
A = [ 0 1 0 1 ],   B = [ 0 1 ]   (18)
    [ 0 0 1 0 ]        [ 0 0 ]
    [ 0 0 0 1 ]        [ 0 0 ]

H = [ 1 0 0 0 ]
    [ 0 1 0 0 ]   (19)

The process noise and measurement noise covariance matrices were:

Q = 0.01 I_{4×4},   R = I_{2×2}   (20)

The filter is initialized with x_{k−1}, P_{k−1} and u_{k−1} at k = 0; thus the initial values are x_{−1}, P_{−1} and u_{−1}, and they are:

x_{−1} = [ x_init  y_init  0  0 ]^T   (21)

P_{−1} = 10 I_{4×4}   (22)

u_{−1} = 0   (23)

In equation (21) the values x_init and y_init are provided by the operator's initial selection of the target.

The Tukey biweight parameter was C = 40.