Robust Real-time Tracking Guided by Reliable Local Features
Marcos D. Zuniga^1 and Cristian M. Orellana^2
^1 Electronics Department, Universidad Tecnica Federico Santa Maria, Av Espana 1680, 2390123, Valparaiso, Chile
^2 Department of Computer Science, Universidad Tecnica Federico Santa Maria, Av Espana 1680, 2390123, Valparaiso, Chile
Keywords: Multi-target Tracking, Feature Tracking, Local Descriptors, Segmentation, Background Subtraction, Reliability Measures.
Abstract: This work presents a new light-weight approach for robust real-time tracking in difficult environments, for situations including occlusion and varying illumination. The method increases the robustness of tracking based on reliability measures from the segmentation phase, for improving the selection and tracking of reliable local features for overall object tracking. The local descriptors are characterised by colour, structural and segmentation features, to provide a robust detection, while their reliability is characterised by descriptor distance, spatial-temporal coherence, contrast, and illumination criteria. These reliability measures are utilised to weight the contribution of the local features in the decision process for estimating the real position of the object. The proposed method can be adapted to any visual system that performs an initial segmentation phase based on background subtraction, and multi-target tracking using dynamic models. First, we present how to extract pixel-level reliability measures from algorithms based on background modelling. Then, we present how to use these measures to derive feature-level reliability measures for mobile objects. Finally, we describe the process to utilise this information for tracking an object in different environmental conditions. Preliminary results show good capability of the approach for improving object localisation in presence of low illumination.
1 INTRODUCTION
Real-world problems often preclude manual initialisation for properly obtaining a reliable first model of an object. Many tracking algorithms require a robust initial object model to perform tracking, often obtained with manual procedures (Kalal et al., 2011; Yang et al., 2014). These methods often fail when dealing with problems such as severe illumination changes or lack of contrast, or perform expensive procedures to keep the coherence of tracking in these complex situations. Also, these tracking approaches are focused on moving camera applications, so they neglect the utilisation of background subtraction to determine the regions of interest in the scene.
A wide variety of applications can be addressed with a fixed camera setup (e.g. video-surveillance, remote health-care, behaviour analysis, traffic monitoring). This kind of setup allows the inexpensive utilisation of background subtraction approaches to detect potential regions of interest in the scene. This work focuses on this kind of application, in particular on solving the problem of robust tracking of multiple unknown (uninitialised) objects, independently of the scene illumination conditions, in real-time. Tracking is thus performed without manual intervention.
Segmentation is commonly the early stage of any vision system, prior to tracking and higher-level analysis stages, where regions of interest are extracted from the video sequence. Background subtraction approaches present several issues, such as low contrast, poor illumination, gradual and sudden illumination changes, superfluous movement, and shadows, among others (Toyama et al., 1999). Any error emerging from this stage is propagated to the subsequent stages. A way to deal with these issues is to determine the quality of the segmentation process, in order to activate control mechanisms that mitigate those errors in later stages.
Assuming that we do not know the model of the objects present in the scene, we initially use a bounding box representation extracted from segmented blobs using background subtraction methods. This representation is general enough to track any object in real-time, and serves as the initial region of interest for
applying more complex object models.

(Zuniga, M. and Orellana, C. Robust Real-time Tracking Guided by Reliable Local Features. DOI: 10.5220/0005727600590069. In Proceedings of the 11th Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2016) - Volume 4: VISAPP, pages 59-69. ISBN: 978-989-758-175-5. Copyright (c) 2016 by SCITEPRESS Science and Technology Publications, Lda. All rights reserved.)

Nevertheless, as the segmented blobs are obtained from background subtraction, they are sensitive to changes in
contrast and illumination. This sensitivity affects the object tracking process by incorporating noise (in terms of false positives and negatives) into the system.
In order to control the effect of noisy information in tracking, we propose a local feature tracking approach, which reinforces the tracking of the bounding box associated with the object. We extract a contrast map from segmentation, to obtain reliability measures which allow us to characterise the local features in terms of illumination and contrast conditions. The local descriptors are obtained from a multi-criteria approach, considering colour (through HSV histograms), structural (through a binary descriptor), and segmentation region (through foreground mask and contrast maps) features. Then, the most reliably tracked local features are utilised, together with the tracked bounding box and the foreground information associated with the tracked object, to adjust the estimation of the bounding box in the current frame.
This paper is organised as follows. First, Section 2 presents the state of the art, in order to clearly establish the contribution of the proposed approach. Then, Section 3 provides a complete description of the approach. Next, Section 4 presents the results obtained on several benchmark videos. Finally, Section 5 presents the conclusions and future work.
2 STATE OF THE ART
In the context of segmentation quality measures, the most recent approach is presented in (Troya-Galvis et al., 2015). The authors propose a metric to quantify segmentation quality for remote sensing, in terms of over-segmentation and under-segmentation. In order to detect under- or over-segmentation, they use a similarity function to evaluate the quality of the segmentation. A good segmentation is obtained if a segment is well separated from its neighbouring segments. Errors can occur, like splitting a segment into two similar segments (over-segmentation) or merging two distinct segments (under-segmentation). Using the similarity function, the authors are able to measure over-segmentation and under-segmentation for each segment in the image. That information is then utilised to improve the segmentation, applying the corresponding mechanisms to the erroneous segment (e.g. splitting a segment with an under-segmentation problem).
In (Correia and Pereira, 2003) the authors review video segmentation quality. They identify that quality measurements can be object-based (individual) or global (a measure of the overall segmentation). These measurements can also be classified as relative, when the segmentation mask is compared with ground-truth, or stand-alone, when the evaluation is made without using a reference image. Other classifications are subjective evaluation, using human judgement, or objective evaluation, using a set of a priori expected properties. For our scope, we are interested in an individual stand-alone objective quality measurement. In the same article, the features describing this kind of measure are intra-object metrics, such as shape regularity, spatial uniformity, temporal stability and motion uniformity; or inter-object metrics, like local contrast or neighbouring objects' feature difference. The authors propose measures for two classes of content: stable content and moving content. The first is temporally stable and has a regular shape, while the second has strong and uniform motion. These measures take into account the characteristics of each content class to produce a unique quality value for the object.
In (Erdem et al., 2004) the authors propose three disparity metrics: local boundary contrast, temporal color histogram difference, and motion difference along the object boundary. The local boundary contrast is focused on determining the quality of the boundary by comparing internal features (inside the object) with external features (outside the object). Figure 1 depicts this metric.
Figure 1: Spatial color contrast along boundary metric from (Erdem et al., 2004). (a) detected object; (b) boundary with normal lines; (c) zoom-in of a normal line, where each cross represents a pixel inside (P_I) or outside (P_O) the object.
To determine the quality of the boundary, a pixel P_I from the object is compared with a pixel of its neighbourhood P_O, both at distance L from the boundary. The comparison considers the average color in the square of size M centered on each pixel P, as shown in Figure 1 (c). In this sense, good quality segmentation is achieved when there is a high difference between internal and external features. Special care must be taken with the meaning of the value, because
a good boundary can be represented by a high quality value, but a high quality value does not necessarily mean a good quality boundary. The second metric tries to measure the temporal stability of the color histogram distribution, by comparing the current object histogram with a smoothed version generated as an average of the k previous histograms. Good temporal color stability is obtained if both histograms are similar. The third metric models the quality of the movement by estimating how the points P change from one frame to another. The movement metric considers the difference of the motion vectors from both points (P_I and P_O), and a reliability factor defined by the precision of the motion estimation and the color consistency of the points in the square. The authors propose a combined metric to determine the quality of the object segmentation. They can also determine whether a particular segment of the boundary has poor quality, using a combination of the local boundary contrast and motion metrics. If the combined value is higher than a predefined threshold, the related segment is considered as low quality. This threshold is obtained as a factor of the standard deviation of the mean object quality.
In the context of local descriptor-based trackers, some similar approaches are presented in the literature. In (Lee and Horio, 2013) a reliable appearance model (RAM) is proposed, which uses a local descriptor (HOG) to learn the object shape and histogram. This appearance model effectively incorporates color and edge information as discriminative features. However, it is necessary to obtain a reliable first model to perform the training of the Adaboost learner, leaving this approach as semi-automatic, as with many other approaches (Yang et al., 2014; Wang et al., 2013; Sun and Liu, 2011; Kalal et al., 2011; Adam et al., 2006).

In (Wang et al., 2013) the authors propose a weighted histogram that gives a higher weight to foreground pixels, in order to make target features more prominent. The weighting component is based on the pixel's degree of belonging to the foreground. The way of producing the weighted histogram is very similar to our weighted histogram from Equation (3), but it does not incorporate the illumination reliability R_i(Y), which defines how illumination affects color-based features.
The authors in (Sun and Liu, 2011) combine a local descriptor (SIFT) with a global representation (PCA). In contrast to classical PCA, where pixels are weighted uniformly, they add a higher weight to pixels close to a SIFT descriptor's position. The tracking phase depends on how reliable the descriptor matches are. This reliability is obtained based on how well the descriptor has been matched previously. Also, the number of reliable descriptors is used to determine whether occlusion is present in the frame. There are three modes of tracking: 1) if there are enough matched descriptors and they are reliable, tracking is performed by approximating the affine matrix that describes the movement from the previous frame's descriptors to the current descriptors; 2) if there are reliably matched descriptors but they are scarce, a translation model (position and velocity) is calculated instead; 3) if there are no reliable matches, previous information is used to estimate the object's movement. In our case, the reliability of the descriptors comes from the reliability map, but the idea of using previous information when there is no reliable match of the descriptors remains. Another tracker that uses reliability is presented in (Breitenstein et al., 2009). In this case, the reliability is based on a self-incorporated object detector (trained off-line). In order to obtain good tracking performance, it is necessary to properly weight the information of the tracking history and the classifier, otherwise drifting problems may arise.
Fragtrack is proposed in (Adam et al., 2006). It uses local patches to avoid partial occlusion problems. If a patch is occluded, other patches can be used to predict the bounding box position (they assume that at least 25% of the patches are visible). Each of these patches has an associated histogram and its relative position within the bounding box. The estimation of the bounding box in the next frame is done by a voting scheme: each patch's histogram is searched in a neighbourhood and votes for a possible position of the bounding box, so the estimated bounding box position is the one with the most votes. As the method relies heavily on the use of histograms, they use integral histogram matching to perform real-time tracking. This also allows searching at different scales without greatly increasing the computational cost.
We summarise the contributions of the proposed approach as:

- A reliability model for background subtraction methods (or methods with similar behaviour: background modelling, comparing the current frame with the background model, and applying a threshold to classify pixels into foreground or background). This is a pixel-level reliability model, which we refer to as the reliability map.

- A way to convert a reliability map into attribute-level reliability. The attributes depend on the object representation; in our case, we use a 2D bounding box and local features as object representation.

- A multi-target tracking approach incorporating attribute-level reliability measures for weighting the contribution of detected local features to the object model. The idea is to prevent the incorporation of information that could negatively affect the estimation of the object model, and to focus on the most reliable information to reduce the effect of noise.

Figure 2: General schema of the proposed tracking approach.
3 RELIABLE LOCAL FEATURE
TRACKING
The proposed tracking approach is depicted in Figure 2. For each new frame of the video sequence, a background subtraction algorithm is applied to obtain the foreground mask, the reliability map (see Section 3.1 for details), and the regions of interest (ROI), represented as a set of bounding boxes obtained with a connected components algorithm. Also, the new frame is converted to the YUV color space.
For the first frame where a new object appears (a new bounding box not associated with any previously tracked object), a set of tracked patches is initialised, according to the procedure described in Section 3.2.

For the subsequent frames, a ROI (or merge of partial ROIs), determined with a Multi-Hypothesis Tracking (MHT) algorithm (Zuniga et al., 2011), is associated with the object as input to the robust patch tracking approach, and the following procedure is applied:
1. If a patch is considered unreliable in terms of positioning, an optimal association to the patch is searched for in the current frame, considering the information of the ROI displacement and dimension change compared to the previously associated ROI. This optimal association is determined using a global reliability measure, which integrates temporal coherence, structural, colour, and contrast measures (see Section 3.4). If a set of patches has been reliably tracked from previous frames, this information is utilised to determine the displacement of all the patches for the current frame, according to the procedure detailed in Section 3.3.
2. Then, according to the global reliability measure calculated at the previous step, the highest reliability patches are classified as highly reliable, while the patches with low reliability are classified as unreliable and marked for elimination (see Section 3.3 for details).

3. Next, unreliable patches are eliminated and new patches are added in positions not properly covered by the remaining tracked patches. The construction of these patches follows the same procedure as the patch initialisation phase (Section 3.2).

4. If a significant number of patches is classified as reliable, they are utilised for adjusting the estimation of the object model bounding box for the current frame. If this number is not significant, the object model bounding box is obtained from the input ROI and the bounding box estimated from the object model dynamics (see Section 3.5 for details).
5. Finally, the object dynamics model is updated with the current object model bounding box (see Section 3.5 for details). The bottom image of Figure 3 depicts the result of the tracking process.
Figure 3: The top figure shows the current frame. The center figure depicts the reliability map as a thermal map, where high reliability is red. The bottom figure shows the result of the tracking process; red boxes represent the bounding boxes from segmentation, the blue box represents the estimated bounding box of the tracked object, the dots represent the tracked patches coloured according to reliability in thermal scale, and the blue segments represent the object trajectory.
3.1 Reliability Map from Background
Subtraction
The key factor for good tracking is how distinguishable the object of interest is from its surroundings. If we are working in a background subtraction scheme, we interpret the surroundings of the object as the background model, and how distinguishable the object is as the degree of difference between the current image and the background model. If we have a significant difference, we have a certain margin of error in defining the threshold, and the segmentation algorithm will still be able to perform a good classification. Nevertheless, if that difference is low, we have to define the threshold value accurately to avoid a misclassification. In this sense, the latter case is less reliable, because it is more prone to produce a wrong classification.
Based on the previous idea, we propose a method that can model the reliability of any background subtraction technique, through the following steps:

1. Determine the key value (the pixel-wise difference) that the background subtraction algorithm utilises to perform segmentation at each pixel. Applying a classification threshold to this value, we can classify the pixel into foreground or background. Some algorithms use more than a single difference criterion to perform segmentation, so we are interested in the mixture of these differences, just before applying the classification threshold.

2. Define a range [inf, sup] for the difference. We are interested in generating a reliability image representation with different degrees of reliability. If we consider the whole range, it can sometimes generate a binary image (just low and high reliability) that is not useful for our interest. This range is defined empirically.
3. Apply the scaling function from Equation (1) to every pixel distance determined in step 1, to convert difference values into reliability measures:

S(D) = { 0%    if D < inf
       { f(D)  if inf <= D <= sup      (1)
       { 100%  if D > sup,

where D is the pixel distance, inf and sup are the values defined in step 2, and f(D) is an increasing function (we use a linear function).
At the end of these steps we can generate a pixel-level representation of the reliability, which we call the reliability map. This map is internally represented as a gray-scale image, but for proper visualisation we transform it into thermal scale, as shown in Figure 4.

Usually, several post-processing functions are applied to the segmentation mask in order to reduce the noise. These operations should also be applied to the reliability map, to maintain the coherence of its representation with the foreground mask. Figure 5 shows an example of applying morphology operations to the foreground image and the reliability map (considering gray-scale morphological operators).
Figure 4: Reliability map visualisation. Left image: current
image frame, right image: thermal scale reliability map.
Blue color means a low difference between modelled back-
ground and current frame. Red color means a high differ-
ence.
Figure 5: Example of applying morphology operations to
foreground mask and reliability map. The top images show
the foreground mask and the reliability map with noise. The
bottom images show the results after applying the morpho-
logical operation (binary morphology for foreground mask
and gray-scale morphology for reliability map).
We illustrate how this method works using naive background subtraction (McIvor, 2000): this model computes the difference between the current image and a background image (an image without any object of interest). Our implementation uses the sum of squared differences as the distance value, before applying the classification threshold. The sum of squared differences, shown in Equation (2), is a common metric to measure the distance between the current pixel and the background pixel in the RGB color space:
D = (R_bg - R_i)^2 + (G_bg - G_i)^2 + (B_bg - B_i)^2,      (2)

where the subindex i refers to the current image pixel and bg refers to the background pixel.
Applying the proposed scheme to this method, using a range of [1, 400], we obtain the image shown in Figure 6.
Figure 6: Reliability map using naive background subtrac-
tion. Left image: current image, right image: reliability
map from naive background subtraction.
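As an illustration, the naive scheme above (Equations 1 and 2) can be sketched in a few lines. This is a minimal numpy sketch, not the authors' implementation; the function name is ours, and the range [1, 400] follows the example in the text.

```python
import numpy as np

def reliability_map(frame, background, inf=1.0, sup=400.0):
    """Pixel-level reliability from naive background subtraction.

    D is the RGB sum of squared differences of Equation (2); the scaling
    function of Equation (1) maps it linearly into [0, 1] over the
    empirically chosen range [inf, sup].
    """
    diff = frame.astype(np.float64) - background.astype(np.float64)
    d = np.sum(diff ** 2, axis=-1)      # per-pixel distance D
    s = (d - inf) / (sup - inf)         # increasing linear f(D)
    return np.clip(s, 0.0, 1.0)         # 0 below inf, 1 above sup
```

The gray-scale morphological post-processing discussed above (e.g. an opening) would then be applied to this map with the same structuring element used on the binary foreground mask.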
3.2 Patch Initialisation Phase
The first step is to find patches of size patchSize x patchSize on the contour of the object (defined by the foreground mask), in such a way that no two patches overlap each other. Then, the strongest point inside the patch, obtained by the FAST algorithm (Rosten and Drummond, 2006) from the Y channel of the current frame converted to YUV color space, is added as a new patch position if no other existing patch is near this position.
Then, each candidate patch stores the following information:

- The central patch position (x, y).

- The 512-bit FREAK descriptor (Alahi et al., 2012), generated using the reliability map, representing the structural information of the patch.

- A normalised colour histogram, using the chroma channels U and V from the YUV current frame, considering only pixels belonging to the foreground mask in the analysed patch. Considering H_UV(i, j) as a bin of a 2D histogram of the UV channels, with i, j in [0..BinsNumber], Equation (3) represents the way this histogram is calculated:

H_UV(i, j) = [ sum_{p in Q} F(p) R_m(p) R_i(Y(p)) ] / [ sum_{p in P} F(p) R_m(p) R_i(Y(p)) ],      (3)

with

Q = { p in P : floor(U(p)/binSize) = i  and  floor(V(p)/binSize) = j },      (4)
where Y(p), U(p), and V(p) correspond to the channel level in [0..255] at pixel position p of the current frame in YUV color space, P is the set of pixel positions inside the analysed patch, and Q is the set of patch positions where the values U(p) and V(p) fall inside the bin H_UV(i, j). For each pixel a weighted value is added, where: F(p) = 1 if the pixel p corresponds to the foreground, and 0 otherwise; R_m(p) in [0, 1] is the reliability map value at position p, where a value of 1 corresponds to maximum contrast reliability (see Section 3.4 for details); and R_i(Y(p)) corresponds to the illumination reliability, accounting for the pertinence of colour information given different illumination levels, according to the gray-scale level in channel Y in [0..255] at pixel position p. The reliability measure R_i considers maximum reliability near the value 128 (medium illumination) and decays to 0 near the extremes of the interval. Equation (5) formulates this reliability and Figure 7 depicts the reliability function:
R_i(Y) = { 0                        if Y <= 128 - gamma
         { (Y + gamma - 128)/beta   if 128 - gamma < Y < 128 - alpha
         { 1                        if 128 - alpha <= Y <= 128 + alpha      (5)
         { (128 + gamma - Y)/beta   if 128 + alpha < Y < 128 + gamma
         { 0                        otherwise,

where alpha and beta are predefined parameters, and gamma = alpha + beta.
Figure 7: Illumination reliability function.
- A colour histogram reliability measure, accounting for the reliability of the colour information (Equation 6):

R_colour = [ sum_{p in P} F(p) R_m(p) R_i(Y(p)) ] / N_pix,      (6)

where N_pix is the number of foreground pixels in the patch.
- A normalised gray-scale histogram of NumBins bins, accumulating channel Y of the current image in YUV color space, for those pixels inside the patch which belong to the foreground.

All this information is utilised to properly characterise the patch, in order to match it with potential patches in future frames. These patches then initialise the patch tracking buffers for future processing.
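Equations (3), (5) and (6) can be sketched compactly as below. The parameter values alpha = 32 and beta = 64 are illustrative examples (the paper leaves them as predefined parameters), and plain Python loops are used for clarity rather than speed.

```python
import numpy as np

def illumination_reliability(y, alpha=32.0, beta=64.0):
    """R_i(Y) of Equation (5): maximal near medium illumination (Y = 128),
    decaying linearly to 0 towards the extremes; gamma = alpha + beta."""
    gamma = alpha + beta
    if y <= 128 - gamma or y >= 128 + gamma:
        return 0.0
    if 128 - alpha <= y <= 128 + alpha:
        return 1.0
    if y < 128 - alpha:
        return (y + gamma - 128) / beta
    return (128 + gamma - y) / beta

def weighted_uv_histogram(u, v, y, fg, r_map, bins=8):
    """Weighted, normalised UV histogram of Equation (3), plus the colour
    reliability R_colour of Equation (6). u, v, y are the patch channels,
    fg the binary foreground mask, r_map the reliability map values."""
    bin_size = 256 // bins
    hist = np.zeros((bins, bins))
    total_w, n_fg = 0.0, 0
    for p in np.ndindex(u.shape):
        if not fg[p]:
            continue                                # F(p) = 0
        n_fg += 1
        w = r_map[p] * illumination_reliability(y[p])
        hist[u[p] // bin_size, v[p] // bin_size] += w
        total_w += w
    if total_w > 0:
        hist /= total_w                             # normalisation
    r_colour = total_w / n_fg if n_fg else 0.0      # Equation (6)
    return hist, r_colour
```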
3.3 Patch Tracking Phase
Given a set of patches S from the previous frame, the patch tracking process is described below:
- Consider S_H as the set of tracked patches considered highly reliable from the previously processed frame. A highly reliable patch is one which shows coherent movement with the mobile object, and high contrast, colour, and structural accumulated reliabilities (as described in Section 3.4). These patches are then considered able to estimate the behaviour of less reliable patches near them. For this reason, tracking becomes more exhaustive for these patches, but in a reduced region. The reliable patches are tracked in the following way:
1. The displacement vector (dx, dy) is determined from the displacement vector inferred from the associated patch tracking buffer.

2. The search window is determined from the accumulated difference (x_d, y_d) between the accumulated object center movement vector and the accumulated movement vector of the patch, considering all the patches in the tracking buffer. The window is centered at (x_W, y_W) = (x_p + dx, y_p + dy), where (x_p, y_p) is the position of the patch in the previous frame.
3. Then, the patch position with minimal global distance D_global to the previous patch is associated to the current reliable patch position, following Equation (7):

(x*, y*) = arg min_{(x,y) in W_H} D_global(p_t(x,y), p_{t-1}),      (7)

with

W_H = { (x,y) : |x - x_W| <= x_d  and  |y - y_W| <= y_d },      (8)

where p_t(x,y) is the current patch at position (x,y), and p_{t-1} is the patch at the previous frame. The distance measure D_global globally calculates the patch distance, considering the structural, colour, segmentation and gray-scale information. This measure is described in detail in Section 3.4.
- If the patch buffer has been built just in the previous frame (previous initialisation step), or the patch is not highly reliable, the positioning of the patch is determined in the following way:

1. If the size of the set S_H is adequate, the displacement vector (dx, dy) for the patch is determined from the displacement vectors of the highly reliable patches, each weighted by the proximity of the highly reliable patch to the analysed patch in the previous frame, and by the R_global reliability measure.

2. The window is determined in a similar way as for highly reliable patches, but, as the patch is less reliable, it will normally have a bigger search window. For this reason, the FAST algorithm is applied to the search window to obtain candidate positions.

3. Then, the maximal reliability patch is determined in a similar way as in Equation (7), but from the set of FAST points detected in the window.
- Then, according to the global reliability measure R_global, the tracked patches are classified as highly reliable if they pass a high threshold T_H. Patches with reliability below a low threshold T_U are classified as unreliable and eliminated.

- As the object can then be represented by fewer patches, new patches are added in positions not properly covered by the remaining tracked patches, using the same procedure described in Section 3.2.
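The window-and-match logic of Equations (7) and (8) can be sketched as follows. Here `distance` stands for the D_global measure of Section 3.4, passed in as a function, and all names are illustrative placeholders rather than the authors' API.

```python
def search_window(x_w, y_w, x_d, y_d):
    """W_H of Equation (8): integer positions within the accumulated
    difference bounds (x_d, y_d) around the predicted center (x_W, y_W)."""
    return [(x, y)
            for x in range(int(x_w - x_d), int(x_w + x_d) + 1)
            for y in range(int(y_w - y_d), int(y_w + y_d) + 1)]

def best_match(prev_patch, candidates, patch_at, distance):
    """Equation (7): among candidate positions, pick the one whose patch
    minimises the global distance to the previous patch."""
    return min(candidates,
               key=lambda pos: distance(patch_at(pos), prev_patch))
```

For the less reliable patches, `candidates` would be the FAST points detected in the window instead of every position in it.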
3.4 Patch Distance and Reliability
Measures
To match two patches, the distance between them in terms of their different attributes must be calculated. We propose the distance measure D_global, described in Equation (9):

D_global = [ w_st D_st + w_fg D_fg + w_co D_co + w_gs D_gs ] / [ w_st + w_fg + w_co + w_gs ],      (9)
with

D_st(p_1, p_2) = || Freak[p_1] ; Freak[p_2] ||_H / 512,      (10)

D_fg(p_1, p_2) = | #FG[p_1] - #FG[p_2] | / max(#FG[p_1], #FG[p_2]),      (11)

D_co(p_1, p_2) = D_rcol(p_1, p_2) || H_UV[p_1] ; H_UV[p_2] ||_B,      (12)

D_rcol(p_1, p_2) = | R_colour(p_1) - R_colour(p_2) |, and      (13)

D_gs(p_1, p_2) = || H_Y[p_1] ; H_Y[p_2] ||_B,      (14)

where ||.;.||_H is the Hamming distance for binary descriptors, and ||.;.||_B is the Bhattacharyya distance (Bhattacharyya, 1943) for histograms. Freak[p] corresponds to the FREAK descriptor, #FG[p] is the number of foreground pixels, H_UV[p] is the colour histogram, and H_Y[p] is the gray-scale histogram, of patch p. D_rcol(.,.) accounts for the difference in R_colour, considering that histograms are more comparable under similar conditions in terms of illumination and contrast reliability.
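A minimal sketch of Equations (9)-(14), representing the 512-bit FREAK descriptor as a Python integer and the histograms as numpy arrays; the patch field names, the equal default weights, and the reading of Equation (12) as a product are our own choices.

```python
import numpy as np

def bhattacharyya(h1, h2):
    """Bhattacharyya distance between two normalised histograms."""
    bc = float(np.sum(np.sqrt(h1 * h2)))     # Bhattacharyya coefficient
    return (max(0.0, 1.0 - bc)) ** 0.5

def d_global(p1, p2, w_st=1.0, w_fg=1.0, w_co=1.0, w_gs=1.0):
    """Global patch distance of Equation (9); patches are dicts holding
    the fields used by Equations (10)-(14)."""
    d_st = bin(p1['freak'] ^ p2['freak']).count('1') / 512.0     # Eq. (10)
    d_fg = (abs(p1['n_fg'] - p2['n_fg'])
            / max(p1['n_fg'], p2['n_fg']))                       # Eq. (11)
    d_rcol = abs(p1['r_colour'] - p2['r_colour'])                # Eq. (13)
    d_co = d_rcol * bhattacharyya(p1['h_uv'], p2['h_uv'])        # Eq. (12)
    d_gs = bhattacharyya(p1['h_y'], p2['h_y'])                   # Eq. (14)
    return ((w_st * d_st + w_fg * d_fg + w_co * d_co + w_gs * d_gs)
            / (w_st + w_fg + w_co + w_gs))
```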
It has been previously discussed that we need a measure to account for the reliability of the tracked patches in the scene, in order to determine the usefulness of the patch information in contributing to a more robust object tracking. This reliability measure is R_global, described in Equation (15), considering a tracked patch buffer B_p = {p_1, .., p_N}, where p_1 is the current patch and N is the buffer size, and the object bounding box buffer B_I = {I_1, .., I_N}, where I_j is the bounding box in buffer position j:
R_global(B_p) = [ R_pos(B_p) + R_c(B_p) + R_g(B_p) ] / 3,      (15)

with

R_pos(B_p) = [ sum_{i=1}^{N-1} (N - i) || c[p_i] - c[p_{i+1}] ; c[I_i] - c[I_{i+1}] ||_M ] / [ sum_{i=1}^{N-1} (N - i) ],      (16)

|| c1 ; c2 ||_M = | x[c1] - x[c2] | + | y[c1] - y[c2] |,      (17)

R_c(B_p) = [ sum_{i=1}^{N} (N - i + 1) C(x[p_i], y[p_i]) ] / [ sum_{i=1}^{N} (N - i + 1) ],      (18)

C(x_p, y_p) = [ sum_{x=x_p-L/2}^{x_p+L/2} sum_{y=y_p-L/2}^{y_p+L/2} G(x - x_p, y - y_p) FG(x, y) R_m(x, y) ] / [ sum_{x=x_p-L/2}^{x_p+L/2} sum_{y=y_p-L/2}^{y_p+L/2} G(x - x_p, y - y_p) FG(x, y) ],      (19)

R_g(B_p) = 1 - [ sum_{i=1}^{N-1} (N - i) D_global(p_i, p_{i+1}) ] / [ sum_{i=1}^{N-1} (N - i) ].      (20)
The three components of R_global are calculated weighting by the novelty of the information. R_pos is the position coherence reliability, which takes into account the displacement coherence between the history of the patch (measured as the displacement vector of the patch centres, c[p_i] − c[p_{i+1}]) and the history of the central position of the object model bounding box (c[I_i] − c[I_{i+1}]), using the Manhattan distance between displacement vectors at the different frames. R_c accumulates the contrast reliability measure C(x, y), which accumulates the values of the reliability map R_m, weighted by a Gaussian function G centred at (x, y) and considering only foreground pixels (with FG(x, y) the foreground image, taking value 1 for foreground pixels and 0 for background). R_g accumulates the reliability on the similarity of the patches in the buffer.
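The novelty-weighted averages in Equations 16 to 20 can be sketched directly (a hedged Python/NumPy illustration; buffer entries are plain tuples and arrays, names such as `patch_centers` and `descriptor_distances` are placeholders for the quantities c[p_i], c[I_i], D_global and R_m defined above, and the Gaussian width `sigma` is an assumed parameter the text does not specify):

```python
import numpy as np

def novelty_weights(n):
    # Weights (n-1, ..., 1) favouring the most recent buffer entries.
    return np.arange(n - 1, 0, -1, dtype=float)

def manhattan(v1, v2):
    # Equation 17: Manhattan distance between two displacement vectors.
    return abs(v1[0] - v2[0]) + abs(v1[1] - v2[1])

def r_pos(patch_centers, box_centers):
    """Equation 16: coherence between patch and bounding-box displacements."""
    n = len(patch_centers)
    w = novelty_weights(n)
    d = [manhattan(
            (patch_centers[i][0] - patch_centers[i + 1][0],
             patch_centers[i][1] - patch_centers[i + 1][1]),
            (box_centers[i][0] - box_centers[i + 1][0],
             box_centers[i][1] - box_centers[i + 1][1]))
         for i in range(n - 1)]
    return float(np.dot(w, d) / w.sum())

def r_g(descriptor_distances):
    """Equation 20: one minus the novelty-weighted mean descriptor distance
    D_global between consecutive patches (distances assumed in [0, 1])."""
    w = novelty_weights(len(descriptor_distances) + 1)
    return 1.0 - float(np.dot(w, descriptor_distances) / w.sum())

def contrast(xp, yp, fg, rm, L, sigma=1.0):
    """Equation 19: Gaussian-weighted mean of the reliability map rm over
    foreground pixels (fg == 1) in an L x L window centred at (xp, yp)."""
    num = den = 0.0
    half = L // 2
    for y in range(yp - half, yp + half + 1):
        for x in range(xp - half, xp + half + 1):
            if 0 <= y < fg.shape[0] and 0 <= x < fg.shape[1] and fg[y, x]:
                g = np.exp(-((x - xp) ** 2 + (y - yp) ** 2)
                           / (2.0 * sigma ** 2))
                num += g * rm[y, x]
                den += g
    return num / den if den > 0 else 0.0
```

R_c in Equation 18 then applies the same weighting scheme (shifted by one) to `contrast()` evaluated at each buffered patch centre.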
VISAPP 2016 - International Conference on Computer Vision Theory and Applications
66
3.5 Adjustment of Object Model
Finally, if the current input bounding box is significantly different in dimensions compared to the previous frame, or several reliable patches present a low contrast reliability for the current frame together with a relevant change in patch mean illumination from the previous frame (inferred from the Y channel), the bounding box is recalculated based on the information provided by the remaining reliable patches. The displacement of each bound of the bounding box (Left, Right, Bottom, Top) is obtained from the weighted mean of the patch displacements from the previous frame, weighted by the distance to the bound and the reliability of the patches.
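The per-bound adjustment can be sketched as follows (a hedged Python illustration; the exact weighting function combining distance-to-bound and patch reliability is not specified in the text, so an inverse-distance form is assumed):

```python
def adjust_bound(bound_pos, patches, axis, eps=1.0):
    """Shift one bounding-box bound (e.g. Left: axis=0, Top: axis=1) by the
    weighted mean of patch displacements, weighted by patch reliability and
    by the inverse of the patch distance to the bound (assumed form).

    patches: list of (position, displacement, reliability), with
             position and displacement as (x, y) tuples.
    """
    num = den = 0.0
    for pos, disp, rel in patches:
        # Closer, more reliable patches pull the bound more strongly.
        w = rel / (abs(pos[axis] - bound_pos) + eps)
        num += w * disp[axis]
        den += w
    return bound_pos + (num / den if den > 0 else 0.0)
```

When all reliable patches agree on a displacement, the bound follows it exactly; disagreeing patches are arbitrated by reliability and proximity.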
If no reliable patches are available, the bounding box projected from the object dynamics model is considered as input. We utilise a dynamics model similar to the Kalman filter (Zuniga et al., 2011). If the current input bounding box is similar in size to that of the previous frame, this bounding box is considered as the object model for the current frame. Then, the dynamics model is updated with the current object model.
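The decision logic of this subsection can be summarised in a short sketch (hypothetical Python; `size_tol` and the size-similarity test are assumptions, since no explicit thresholds are given, and `recalc` stands for the patch-based recalculation described above):

```python
def update_object_model(input_bbox, prev_bbox, predicted_bbox,
                        reliable_patches, recalc, size_tol=0.2):
    """Choose the object model for the current frame.

    Boxes are (x, y, w, h) tuples; predicted_bbox is the projection
    from the Kalman-like dynamics model (Zuniga et al., 2011).
    """
    def similar_size(a, b):
        # Relative width/height change below the (assumed) tolerance.
        return (abs(a[2] - b[2]) <= size_tol * b[2]
                and abs(a[3] - b[3]) <= size_tol * b[3])

    if similar_size(input_bbox, prev_bbox):
        return input_bbox                 # accept the measurement as-is
    if reliable_patches:
        return recalc(reliable_patches)   # bound-wise patch-based update
    return predicted_bbox                 # fall back to the dynamics model
```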
4 EXPERIMENTAL VALIDATION
The visual coherence of the estimation was first tested on three short sequences of diverse contrast. The results are shown in Figure 8.
To evaluate the approach, two videos presenting changing contrast situations have been tested. Both videos have ground-truth segmentation, in order to obtain the ideal track of the analysed objects. The first video consists of a single football player sequence (27 frames), where a player goes from a light to a dark zone of the pitch. This video is a zoomed short sequence extracted from the Alfheim Stadium dataset¹. The second video consists of a sequence (51 frames) where a rodent explores a confined space with better illumination in the centre. The sequence is part of a set of sequences provided by the Interdisciplinary Center of Neuroscience of Valparaiso². These sequences are intended to study the behaviour of the degu, a rodent which commonly develops Alzheimer's disease.
The experiment consists of performing object tracking using the new dynamics model with and without considering the proposed reliability measures, and comparing the obtained tracks with the ideal tracks obtained from the ground-truth segmentation. The results are summarised in Table 1.

¹ Open dataset extracted from Alfheim Stadium, the home arena for Tromsø IL (Norway). Available from: http://home.ifi.uio.no/paalh/dataset/alfheim/
² Interdisciplinary Center of Neuroscience of Valparaiso, Chile. http://cinv.uv.cl/en/
Table 1: Results for evaluation sequences with respect to ground-truth sequences, as accumulated distance in pixels. The column Imp.% is the percentage of improvement obtained utilising the proposed approach.

Sequence           No Rel.    Patch Rel.   Imp.%
Football (T=15)      602.2        579.5     3.8%
Football (T=20)      640.7        570.8    10.9%
Rodent (T=10)        600.4        581.6     3.1%
Rodent (T=15)        506.7        491.4     3.0%
Rodent (T=20)       1086.8       1011.5     6.9%
Rodent (T=25)       1071.1       1023.0     4.5%
The results for the first experiment are exemplified in Figure 9. Figures 9 (b) and (c) show the core motivation of this work: considering different reliability measures for the tracked attributes allows a finer control of the trade-off between the estimated state and the measurement in the update process. In the example, the patch tracking algorithm was able to properly down-weight unreliable data so that it did not considerably affect the dynamics model, and the legs of the player were not lost (Figure 9 (c)).
For the second experiment, the challenge is to follow a rodent with quick acceleration changes under non-homogeneous illumination conditions. Also, poor segmentation occurs due to the sudden changes of speed. The sequence was tested for different segmentation thresholds (T ∈ {10, 15, 20, 25}). From these results, we are able to state that a more robust tracking can be achieved utilising the bound reliability measure, with an improvement higher than 3% in precision. Examples of these results are depicted in Figure 10.
Video sequences of these results can be found at: http://profesores.elo.utfsm.cl/~mzuniga/videos/
5 CONCLUSIONS
To address real-world applications, computer vision techniques must properly handle noisy data. In this direction, we have proposed a new tracking schema considering local features and reliability measures, which has shown promising results for improving the dynamics updating process of the tracking phase. The reliability measures were utilised to control the uncertainty in the obtained information, through a direct interpretation of the criteria utilised by the segmentation phase to determine the foreground regions. In this sense, this approach can be
Robust Real-time Tracking Guided by Reliable Local Features
67
(a) (b) (c) (d)
Figure 8: Resulting tracking for three soccer player sequences with different levels of contrast. Figures (a), (b), and (c)
show the result for low, medium, and high contrast situations, respectively. Figure (d) is a control case for ground-truth
segmentation. The segmentation blob bounding boxes are colored red, the merged bounding box for the object hypothesis
colored yellow, and the estimated bounding box from the dynamics model colored cyan. The central object position trajectory
is depicted with blue squares.
(a) (b) (c)
Figure 9: Example of the effect of utilising the patch reliability in the tracking process (T = 20). Figure (a), from left to right, shows the current, segmentation, and contrast map images, respectively. Figure (b) shows the tracking result without considering the patch reliability measures (every reliability is set to 1). Figure (c) shows the result of using the patch reliability measure. Note the difference in the tracking bounding box, where the feet of the player are more properly incorporated into the object. The boxes are colored the same way as in the previous images. The central object position trajectory is depicted with green squares, the ground-truth positions with cyan squares, and the distance between them is represented with a yellow line.
(a) (b) (c)
Figure 10: Example of the effect of utilising the patch reliability in the tracking process (T = 25). Figure (a), from top to bottom, shows the current, segmentation, and contrast map images, respectively. Figures (b) and (c) show the tracking result without and with the patch reliability measures, respectively.
applied to other segmentation algorithms to improve
the tracking phase in the same way.
In particular, the proposed global patch reliability
measure, considering a diverse range of features, has
shown one of the many possible ways of integrating
segmentation phase data to object modelling. In the
present work, no a priori knowledge has been consid-
ered about the objects to be tracked. The integration
of the data from the segmentation phase with more
complex object models can also improve the tracking
phase, by better determining the objects of interest for
a context or application. At the same time, these relia-
bility measures can help these object models to better
determine their parameters, subject to noisy measure-
ments.
The preliminary evaluation yielded promising results in both tracking robustness and processing speed. Nevertheless, extensive testing is required to fully validate the approach.
This work can be extended in several ways: the approach can be tested with different types of interest point detectors and local feature descriptors, as well as with different background subtraction approaches. An extensive parameter sensitivity evaluation is also still needed. As local features are utilised, this approach could be naturally extended to deal with dynamic occlusion situations.
ACKNOWLEDGEMENTS
This research has been supported, in part, by Fonde-
cyt Project 11121383, Chile.
REFERENCES
Adam, A., Rivlin, E., and Shimshoni, I. (2006). Ro-
bust fragments-based tracking using the integral his-
togram. In Computer Vision and Pattern Recogni-
tion, 2006 IEEE Computer Society Conference on,
volume 1, pages 798–805.
Alahi, A., Ortiz, R., and Vandergheynst, P. (2012). Freak:
Fast retina keypoint. In Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition
(CVPR 2012), pages 510–517.
Bhattacharyya, A. (1943). On a measure of divergence be-
tween two statistical populations defined by probabil-
ity distributions. Bulletin of the Calcutta Mathemati-
cal Society, 35:99–110.
Breitenstein, M., Reichlin, F., Leibe, B., Koller-Meier,
E., and Van Gool, L. (2009). Robust tracking-by-
detection using a detector confidence particle filter. In
Computer Vision, 2009 IEEE 12th International Con-
ference on, pages 1515–1522.
Correia, P. L. and Pereira, F. (2003). Objective evaluation of
video segmentation quality. Image Processing, IEEE
Transactions on, 12(2):186–200.
Erdem, C¸ . E., Sankur, B., et al. (2004). Performance mea-
sures for video object segmentation and tracking. Im-
age Processing, IEEE Transactions on, 13(7):937–
951.
Kalal, Z., Matas, J., and Mikolajczyk, K. (2011). Tracking-
learning-detection. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 34(7):1409–1422.
Lee, S. and Horio, K. (2013). Human tracking using parti-
cle filter with reliable appearance model. In SICE An-
nual Conference (SICE), 2013 Proceedings of, pages
1418–1424.
McIvor, A. (2000). Background subtraction techniques. In
Proceedings of the Conference on Image and Vision
Computing (IVCNZ 2000), pages 147–153, Hamilton,
New Zealand.
Rosten, E. and Drummond, T. (2006). Machine learning
for high-speed corner detection. In Proceedings of
the IEEE European Conference on Computer Vision
(ECCV’06), volume 1, pages 430–443.
Sun, L. and Liu, G. (2011). Visual object tracking based
on combination of local description and global repre-
sentation. Circuits and Systems for Video Technology,
IEEE Transactions on, 21(4):408–420.
Toyama, K., Krumm, J., Brumitt, B., and Meyers, B.
(1999). Wallflower: principles and practice of back-
ground maintenance. In Proceedings of the Interna-
tional Conference on Computer Vision (ICCV 1999),
pages 255–261. doi:10.1109/ICCV.1999.791228.
Troya-Galvis, A., Gancarski, P., Passat, N., and Berti-
Equille, L. (2015). Unsupervised quantification of
under- and over-segmentation for object-based remote
sensing image analysis. Selected Topics in Applied
Earth Observations and Remote Sensing, IEEE Jour-
nal of, 8(5):1936–1945.
Wang, L., Yan, H., yu Wu, H., and Pan, C. (2013).
Forward-backward mean-shift for visual tracking
with local-background-weighted histogram. Intelli-
gent Transportation Systems, IEEE Transactions on,
14(3):1480–1489.
Yang, F., Lu, H., and Yang, M. (2014). Robust superpixel
tracking. IEEE Transactions on Image Processing,
23(4):1639–1651.
Zuniga, M. D., Bremond, F., and Thonnat, M. (2011).
Real-time reliability measure driven multi-hypothesis
tracking using 2d and 3d features. EURASIP Jour-
nal on Advances in Signal Processing, 2011(1):142.
doi:10.1186/1687-6180-2011-142.