Robust Background Modeling and Foreground Detection
using Dynamic Textures
M. Sami Zitouni, Harish Bhaskar and Mohammed Al-Mualla
Department of Electrical and Computer Engineering, Khalifa University of Science,
Technology and Research, Abu Dhabi, U.A.E.
Keywords:
Background Modeling, Foreground Detection, Dynamic Texture, Gaussian Mixture Model.
Abstract:
In this paper, a dynamic background modeling and hence foreground detection technique using a Gaussian Mixture Model (GMM) of spatio-temporal patches of dynamic texture (DT) is proposed. Existing methods for background modeling cannot adequately distinguish the movements of the background from those of the foreground that characterize any dynamic scene. Therefore, in most of these methods, the separation of the background from the foreground requires precise tuning of parameters or an a priori model of the foreground. The proposed method differentiates global from local motion by describing the video with spatio-temporal patches of DT modeled within a typical GMM framework. In addition to alleviating the aforementioned limitations, the proposed method can cope with complex dynamic scenes without the need for training or parameter tuning. Qualitative and quantitative analysis of the method against competing baselines demonstrates its superiority and its robustness to dynamic variations in the background.
1 INTRODUCTION
Background modeling and hence foreground detection are essential steps in visual surveillance, particularly for moving object detection and target tracking. Conventional background modeling techniques such as (Stauffer and Grimson, 1999; Stauffer and Grimson, 2000; Bhaskar et al., 2007) assume limited changes in the background, making them unsuitable for capturing the dynamics of the environment caused either by background movements or by motion of the sensor. In addition, background modeling is complicated by the motion dynamics of moving targets, for example stoppages during motion, appearance changes of targets, lighting variations, noise and clutter, which are typical of real-world outdoor scenarios.
1.1 Related Work
A considerable amount of effort has been devoted to developing adaptive background modeling methods, exemplified in the work of (Stauffer and Grimson, 1999) using GMM, and its extensions exploiting various properties such as global consistency (Dalley et al., 2008), local image neighborhoods (Heikkila and Pietikainen, 2006), and density clustering (Bhaskar et al., 2007). Additionally, for handling dynamic background motion, (Zhong et al., 2009) proposed a background subtraction technique based on GMM using a multi-resolution framework, while (Zhang et al., 2009) proposed a spatial-temporal nonparametric background subtraction approach. Furthermore, (Zhang et al., 2009) proposed using an adaptive local-patch GMM as the dynamic background model with Support Vector Machine (SVM) classification for shadow removal applications.
Despite advances, several issues concerning dynamic backgrounds remain challenges for the background modeling community. In recent years, saliency detection in the motion and appearance of objects has contributed to significant progress in handling such inadequacies in background modeling. For example, it has been shown in (Tian and Hampapur, 2008) that accurate foreground detection can be facilitated by distinguishing salient motion from background motion. Similarly, the integration of background learning and object detection through Detecting Contiguous Outliers in Low-rank Representation (DECOLOR) in (Zhou et al., 2013) has been shown to accommodate global variations.
Figure 1: A brief block diagram of the proposed spatio-temporal Gaussian Mixture Model of Dynamic Texture.
Specifically, background modeling under dynamic background variations has been addressed in the work on DT in (Doretto et al., 2003), where motion is modeled as a linear dynamical system. Several variations of the DT model, as in (Mumtaz et al., 2014; Chan et al., 2011; Chan and Vasconcelos, 2009; Zhong and Sclaroff, 2003), have gained recognition within this context; in particular, the work of (Chan et al., 2011) has shown how a generalized formulation of the (Stauffer and Grimson, 1999) algorithm can be accomplished using mixture components of DT with an online learning algorithm. Further, the use of DT along with a Kalman filter has been proposed in (Zhong and Sclaroff, 2003) for foreground detection. A layered implementation in (Chan and Vasconcelos, 2009) has been used to model a video as stochastic layers of appearance and dynamics, each modeled by a separate DT. In (Chan and Vasconcelos, 2009), it has been demonstrated that over-segmentation can occur in mixed dynamic backgrounds, as each layer's segment corresponds to a single motion. Similarly, in the work of (Mumtaz et al., 2014), DT is used to jointly model both the foreground and the background. Such techniques have been used to model a video in (Zhong and Sclaroff, 2003) considering an image frame as a whole, or as spatial patches extracted from the video (Mumtaz et al., 2014; Chan et al., 2011). However, a majority of these techniques for foreground detection require scene-specific parameterization in addition to a priori training before classification.
1.2 Novelty & Contributions
Modeling using mixtures of DT as in (Mumtaz et al., 2014) forces the assumption that the motion encapsulated by the DT is an inherent representation of the background model. However, this limits the method to those dynamic scenes where the DT is a strong representation of the background; for example, in sequences where the DT is a stronger cue for foreground motion, detection fails. The novelty of the proposed approach is the definition of a GMM of DT that provides a generic formulation for modeling dynamic motion either as background or as foreground. In addition, the proposed method is implemented as a classification of spatio-temporal patches of DT using GMM in a manner that preserves spatio-temporal homogeneity, ensuring smoother boundaries, avoiding under- or over-segmentation and providing computational advantages. Furthermore, the treatment of DT as a feature space reduces the effect of noise and illumination changes without the need for training or parameter tuning. Finally, the paper presents a multi-resolution analysis of the spatio-temporal patches, exploring its effect on accuracy and its relationship with the learning rate of the GMM scheme.
2 SPATIO-TEMPORAL GMM OF DT
In this section, the spatio-temporal GMM of DT method is proposed and formulated as a stochastic model using probability density functions (PDFs) corresponding to the foreground and the background. The block diagram in Fig. 1 illustrates the process flow of the foreground detection model proposed in this paper.
According to Fig. 1, the detection process begins by splitting the video frames into spatio-temporal texture patches using DT. Further, a decision on whether each patch represents the foreground or the background is formulated using a conventional GMM according to (Stauffer and Grimson, 1999).
2.1 Spatio-temporal Representation
The first step in this approach is to reduce the dimensionality of the video analysis into spatio-temporal blocks. The process begins with the original non-processed video, which is treated as a three-dimensional (3D) array of gray pixels $I_{x,y,t}$ consisting of two spatial dimensions $(x,y)$ and one temporal dimension $t$. The spatio-temporal blocks are represented as $N$-dimensional vectors $\mathbf{b}_{X,Y,t}$, where each block spans $(2T+1)$ frames and contains $N_b$ pixels in each spatial direction per frame, hence producing $N = (2T+1) \times N_b \times N_b$. Here, the block vectors $\mathbf{b}_{X,Y,t}$ can be formally defined according to (Pokrajac and Latecki, 2003) as

$$\mathbf{b}_{X,Y,t} = \left[\, I_{x,y,t} \,\right]_{i=(N_b-1)(X-1)+1,\; j=(N_b-1)(Y-1)+1,\; t=t-T}^{\;i=N_b X,\; j=N_b Y,\; t=t+T} \qquad (1)$$
The key advantage of such a representation is the flexibility to exploit the square of the linear block size of the vectors to reduce the dimension in such a manner that maximum information is preserved. Dimensionality reduction is typically performed using Principal Component Analysis (PCA); however, in this paper, this reduction is imposed during the computation of the DT, which facilitates estimating low-level appearance and motion features locally and thus allows studying their impact on global background estimation.
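As an illustration of this block representation, the following minimal NumPy sketch gathers spatio-temporal blocks from a gray-level video. It is only a sketch under simplifying assumptions: the function name is illustrative, disjoint blocks are used rather than the exact index bounds of Eq. (1), and the experiments reported later were implemented in MATLAB rather than Python.

```python
import numpy as np

def extract_blocks(video, Nb=10, T=1):
    """Gather spatio-temporal blocks from a grayscale video (cf. Eq. (1)).

    video : array of shape (num_frames, height, width) holding gray pixels I_{x,y,t}
    Nb    : spatial block size, Nb x Nb pixels per frame
    T     : temporal half-window, so every block spans (2T + 1) frames

    Returns a dict mapping (X, Y, t) to the flattened block vector b_{X,Y,t}
    of length N = (2T + 1) * Nb * Nb.
    """
    F, H, W = video.shape
    blocks = {}
    for t in range(T, F - T):              # centre frame of the temporal window
        for X in range(H // Nb):           # block row index
            for Y in range(W // Nb):       # block column index
                cube = video[t - T:t + T + 1,
                             X * Nb:(X + 1) * Nb,
                             Y * Nb:(Y + 1) * Nb]
                blocks[(X, Y, t)] = cube.reshape(-1)   # N-dimensional vector
    return blocks
```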
2.2 Dynamic Textures
The model of DT can be written as a linear dynamical system and generated for an image sequence or, in this case, for each spatio-temporal patch. Let the array of gray pixels $I_t$ be decomposed into $q$ blocks $\mathbf{b}^i_t$, where $0 < i \leq q$. The linear system contains two stochastic processes: the dynamics as a state process evolving over time, $\mathbf{b}^i_t \in \mathbb{R}^n$, and the corresponding appearance $\mathbf{d}^i_t \in \mathbb{R}^m$ as a function of the current state process and observation noise. The system is defined by:

$$\mathbf{b}^i_t = A\,\mathbf{b}^i_{t-1} + \nu_t, \qquad \mathbf{d}^i_t = C\,\mathbf{b}^i_t + \omega_t \qquad (2)$$

where $A \in \mathbb{R}^{n \times n}$ and $C \in \mathbb{R}^{m \times n}$ are the state transition matrix and the observation matrix, respectively. The state noise is modeled as a Gaussian process $\nu_t \sim \mathcal{N}(0, Q)$, as is the observation noise $\omega_t \sim \mathcal{N}(0, R)$. According to (Doretto et al., 2003), the system parameters are calculated with a least squares algorithm.
Given a spatio-temporal patch, for example $D^i_{1:\tau} = [\mathbf{d}^i_1, \ldots, \mathbf{d}^i_\tau]$, the estimated temporal mean of the patch is:

$$\bar{\mathbf{d}}^i = \frac{1}{\tau} \sum_{t=1}^{\tau} \mathbf{d}^i_t \qquad (3)$$

which is used to get the mean-subtracted sequence:

$$\tilde{D}^i_{1:\tau} = D^i_{1:\tau} - \bar{D}^i = [\tilde{\mathbf{d}}^i_1, \ldots, \tilde{\mathbf{d}}^i_\tau] \qquad (4)$$

where $\bar{D}^i$ is a matrix with $\tau$ replications of the mean $\bar{\mathbf{d}}^i$. For the parameter estimation, singular value decomposition (SVD) is performed on the mean-subtracted sequence:

$$\tilde{D}^i_{1:\tau} = U^i S^i {V^i}' \qquad (5)$$
The $n$ principal components corresponding to the largest singular values are used to estimate the observation matrix, assuming that the diagonal entries of $S$ are ordered in decreasing value. Then $\hat{C}^i = [\mathbf{u}_1, \ldots, \mathbf{u}_n]$ and the state space is estimated as:

$$\hat{B}^i_{1:\tau} = {\hat{C}^{i\,\prime}} \tilde{D}^i_{1:\tau} = [\hat{\mathbf{b}}^i_1, \ldots, \hat{\mathbf{b}}^i_\tau] \qquad (6)$$
The initial state of the block $\mathbf{b}^i_t$ is assumed to be $\mathbf{b}^i_1$. Then, the least squares estimate of the transition matrix $A$ is calculated as:

$$\hat{A}^i = \hat{B}^i_{2:\tau} \left( \hat{B}^i_{1:\tau-1} \right)^{+} \qquad (7)$$

given that the Moore-Penrose pseudoinverse of $B$ is $B^{+} = B'(BB')^{-1}$. The state space prediction error is used to estimate the state noise:

$$\hat{V}^i_{1:\tau-1} = \hat{B}^i_{2:\tau} - \hat{A}^i \hat{B}^i_{1:\tau-1} \qquad (8)$$

$$\hat{Q}^i = \frac{1}{\tau - 1}\, \hat{V}^i_{1:\tau-1} \left( \hat{V}^i_{1:\tau-1} \right)' \qquad (9)$$
Likewise, the reconstruction error is used to estimate the observation noise:

$$\hat{W}^i_{1:\tau} = D^i_{1:\tau} - \hat{C}^i \hat{B}^i_{1:\tau} \qquad (10)$$

$$\hat{R}^i = \frac{1}{\tau}\, \hat{W}^i_{1:\tau} \left( \hat{W}^i_{1:\tau} \right)' \qquad (11)$$

This suboptimal approach to LDS parameter estimation of the DT is performed $q$ times, once for the temporal concatenation of each spatial patch.
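For concreteness, the sketch below outlines this suboptimal least-squares estimation for a single patch, following Eqs. (3)-(11). The function and variable names are illustrative; it assumes the patch has already been flattened into an observation matrix whose columns are the vectors $\mathbf{d}^i_t$, and it is not the authors' MATLAB implementation.

```python
import numpy as np

def estimate_dt_params(D, n):
    """Suboptimal least-squares DT (LDS) parameter estimation for one patch.

    D : observation matrix of shape (m, tau); column t is the patch vector d_t
    n : number of state dimensions (principal components) to retain

    Returns (A, C, Q, R, d_mean) following Eqs. (3)-(11).
    """
    m, tau = D.shape
    d_mean = D.mean(axis=1, keepdims=True)            # Eq. (3): temporal mean
    D_tilde = D - d_mean                              # Eq. (4): mean-subtracted sequence

    U, s, Vt = np.linalg.svd(D_tilde, full_matrices=False)   # Eq. (5): SVD
    C = U[:, :n]                                      # top-n components give C-hat
    B = C.T @ D_tilde                                 # Eq. (6): estimated state sequence

    # Eq. (7): least-squares transition matrix via the Moore-Penrose pseudoinverse
    A = B[:, 1:] @ np.linalg.pinv(B[:, :-1])

    V = B[:, 1:] - A @ B[:, :-1]                      # Eq. (8): state prediction error
    Q = (V @ V.T) / (tau - 1)                         # Eq. (9): state-noise covariance

    W = D - C @ B                                     # Eq. (10): reconstruction error (w.r.t. raw D)
    R = (W @ W.T) / tau                               # Eq. (11): observation-noise covariance
    return A, C, Q, R, d_mean
```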
2.3 Gaussian Mixture Model Framework
As described above, an input frame $I_t$ at time instant $t$ of a given video sequence is decomposed into $q$ patches $\mathbf{d}^i_t$, where $0 < i \leq q$, using the aforementioned DT algorithm. Henceforth, the decision of whether each texture patch $\mathbf{d}^i_t$ represents foreground (FG) or background (BG) can be formulated as the following ratio of pdfs,

$$\frac{p(BG \mid \mathbf{d}^i_t)}{p(FG \mid \mathbf{d}^i_t)} = \frac{p(\mathbf{d}^i_t \mid BG)\, p(BG)}{p(\mathbf{d}^i_t \mid FG)\, p(FG)} \qquad (12)$$

where $\mathbf{d}^i_t = \{d^i_{1,t}, \ldots, d^i_{N_b,t}\}$ characterizes a DT patch $\mathbf{d}^i$ consisting of $N_b$ pixels, such that the image frame at time $t$ is represented as $I_t = \bigcup_{i=1}^{q} \mathbf{d}^i_t = \{\mathbf{d}^1_t, \ldots, \mathbf{d}^q_t\}$. While $p(BG \mid \mathbf{d}^i_t)$ represents the pdf of the background modelled using the DT on patch $\mathbf{d}^i_t$, $p(FG \mid \mathbf{d}^i_t)$ is the pdf of the foreground for the same DT patch $\mathbf{d}^i_t$. Here, $p(\mathbf{d}^i_t \mid BG)$ denotes the background model, whereas $p(\mathbf{d}^i_t \mid FG)$ is the appearance model of the foreground object. The decision of whether a DT patch $\mathbf{d}^i_t$ represents the background is made according to:

$$p(\mathbf{d}^i_t \mid BG) > \frac{p(\mathbf{d}^i_t \mid FG)\, p(FG)}{p(BG)} \qquad (13)$$
The background and foreground in the input video are modeled with the GMM framework, where the DT of the patches is used as the feature for a GMM with $K$ Gaussian distributions. The probability of a DT patch $\mathbf{d}^i_t$ at time $t$ is represented as:

$$p(\mathbf{d}^i_t) = \sum_{k=1}^{K} w_k\, \mathcal{N}(\mathbf{d}^i_t; \mu_k, \Sigma_k) \qquad (14)$$

where $w_k$ is the weight of the $k$-th Gaussian component, and $\mathcal{N}(\mathbf{d}^i_t; \mu_k, \Sigma_k)$ is the Normal distribution of the $k$-th component given as:

$$\mathcal{N}(\mathbf{d}^i_t; \mu_k, \Sigma_k) = \frac{1}{|2\pi\Sigma_k|^{1/2}}\, e^{-\frac{1}{2}(\mathbf{d}^i_t - \mu_k)^T \Sigma_k^{-1} (\mathbf{d}^i_t - \mu_k)} \qquad (15)$$
where $\mu_k$ is the mean and $\Sigma_k$ is the covariance of the $k$-th component. The distributions are ordered according to the value of $w_k / \Sigma_k$, and the first $Bg$ distributions are used to initialize the background model of the scene, where $Bg$ is estimated by:

$$Bg = \arg\min_{b} \left( \sum_{l=1}^{b} w_l > T \right) \qquad (16)$$

The decision threshold $T$ is the minimum prior probability of the background; in this method, its value is obtained by applying Otsu's method to the mean of a gray version of the input video frames, adding the texture features that contribute to distinguishing the background. The Gaussian components initializing the background model are updated through an adaptive learning procedure using:

$$\hat{w}^{\,t+1}_k = (1 - \alpha)\, \hat{w}^{\,t}_k + \alpha\, \hat{p}(\omega_k \mid \mathbf{d}^i_{t+1}) \qquad (17)$$
$$\hat{\mu}^{\,t+1}_k = (1 - \rho)\, \hat{\mu}^{\,t}_k + \rho\, \mathbf{d}^i_{t+1} \qquad (18)$$

$$\hat{\Sigma}^{\,t+1}_k = (1 - \rho)\, \hat{\Sigma}^{\,t}_k + \rho\, \left(\mathbf{d}^i_{t+1} - \hat{\mu}^{\,t+1}_k\right)\left(\mathbf{d}^i_{t+1} - \hat{\mu}^{\,t+1}_k\right)' \qquad (19)$$

$$\rho = \alpha\, \mathcal{N}\!\left(\mathbf{d}^i_{t+1}; \mu^{t}_k, \Sigma^{t}_k\right) \qquad (20)$$

where $\omega_k$ is the $k$-th Gaussian component, and $\hat{p}(\omega_k \mid \mathbf{d}^i_{t+1})$ is either 1, when $\omega_k$ is the first match, or 0 otherwise. The value of $\alpha$ determines the learning rate corresponding to the changes in the texture patches.
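To make the update rules concrete, the following sketch maintains the mixture of Eqs. (14)-(20) for a single patch location, simplified to a scalar DT feature per patch with scalar variances. The class and parameter names are illustrative, and the matching rule of 2.5 standard deviations is an assumption carried over from the standard (Stauffer and Grimson, 1999) formulation rather than a value stated in this paper.

```python
import numpy as np

class PatchGMM:
    """Per-patch Gaussian mixture following Eqs. (14)-(20), kept scalar for brevity."""

    def __init__(self, K=3, alpha=1e-4, T=0.7, init_var=30.0 ** 2):
        self.K, self.alpha, self.T = K, alpha, T
        self.w = np.full(K, 1.0 / K)       # component weights w_k
        self.mu = np.zeros(K)              # component means mu_k
        self.var = np.full(K, init_var)    # component variances Sigma_k

    def update(self, x, match_sigmas=2.5):
        """Update the mixture with feature x; return True if x is labelled background."""
        std = np.sqrt(self.var)
        dist = np.abs(x - self.mu) / std
        matched = dist < match_sigmas

        if matched.any():
            # best matching component among the matched ones (largest w/sigma)
            k = int(np.argmax(np.where(matched, self.w / std, -np.inf)))
            rho = self.alpha * np.exp(-0.5 * dist[k] ** 2) / np.sqrt(2 * np.pi * self.var[k])  # Eq. (20)
            self.w = (1 - self.alpha) * self.w                 # Eq. (17), match indicator 0 ...
            self.w[k] += self.alpha                            # ... and 1 for the matched component
            self.mu[k] = (1 - rho) * self.mu[k] + rho * x      # Eq. (18)
            self.var[k] = (1 - rho) * self.var[k] + rho * (x - self.mu[k]) ** 2  # Eq. (19)
        else:
            # no match: re-initialise the weakest component on the new observation
            k = int(np.argmin(self.w / std))
            self.mu[k], self.var[k], self.w[k] = x, 30.0 ** 2, 0.05
        self.w /= self.w.sum()

        # Eq. (16): the first Bg components, ordered by w/sigma, model the background
        order = np.argsort(-self.w / np.sqrt(self.var))
        cum = np.cumsum(self.w[order])
        Bg = order[: int(np.searchsorted(cum, self.T)) + 1]
        return matched.any() and k in Bg
```

In the full method, one such mixture would be maintained per patch location and fed the DT feature of that patch at every frame.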
3 EXPERIMENTS
In this section, experiments conducted to validate the performance of the proposed model and benchmark it against competing baseline methods are described. The chosen dataset (FBDynSyn from (Mumtaz et al., 2014)) includes video sequences that encapsulate all the challenges of dynamic background variations; in addition, an even more challenging video sequence dubbed Sailing2 from (Chan et al., 2011), showing the complex motion of a boat on water, has also been used for validation. The FBDynSyn dataset consists of 7 video sequences with dynamic backgrounds such as water, fountains and trees, and moving targets of interest such as people and boats in the foreground. The video sequences contain between 210 and 601 frames at resolutions ranging from 120x190 to 300x600.
All sequences are annotated with ground truth, which is further used for the qualitative and quantitative evaluations with a range of metrics including True Positive Rate (TPR), False Positive Rate (FPR), Accuracy (ACC), Variation of Information (VI) and the Rand Index (RI). All experiments were conducted using MATLAB on an Intel i5 2.6 GHz machine with 8 GB RAM. During evaluation, each video frame was segmented into patches of size 10x10x3 and the dynamic texture components were extracted for each patch. The learning rate of the GMM for all video sequences was fixed at 0.0001. The method of (Otsu, 1979) was incorporated to automatically estimate the decision threshold within the proposed detection framework.
detection framework. Finally, the proposed method is
compared against state-of-the-art methods of (Mum-
taz et al., 2014), (Zhou et al., 2013) and (Chan and
Vasconcelos, 2009) both quantitatively and qualita-
tively as described in (Mumtaz et al., 2014) using
the same dataset.
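As a reference for how such mask-level scores can be computed, the short sketch below derives TPR, FPR and accuracy from binary foreground masks; it is a plain illustration of the metric definitions and may differ in detail from the evaluation protocol of (Mumtaz et al., 2014).

```python
import numpy as np

def detection_metrics(pred_mask, gt_mask):
    """Per-frame TPR, FPR and accuracy from binary foreground masks.

    pred_mask, gt_mask : boolean arrays of identical shape,
                         True where a pixel is labelled foreground.
    """
    tp = np.logical_and(pred_mask, gt_mask).sum()
    fp = np.logical_and(pred_mask, ~gt_mask).sum()
    fn = np.logical_and(~pred_mask, gt_mask).sum()
    tn = np.logical_and(~pred_mask, ~gt_mask).sum()

    tpr = tp / (tp + fn)                  # true positive rate (recall)
    fpr = fp / (fp + tn)                  # false positive rate
    acc = (tp + tn) / (tp + fp + fn + tn)
    return tpr, fpr, acc
```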
In Fig. 2, example frames from each video sequence comparing the proposed method to the ground truth and to the results of the state-of-the-art methods are presented for qualitative evaluation through visual inspection. It is clear that the detections of the proposed method are more accurate and precise than those of the other methods and, moreover, closely match their ground-truth counterparts. Table 1 summarizes the performance comparison of the proposed method against the baseline methods (Mumtaz et al., 2014) and (Zhou et al., 2013) using the TPR and FPR metrics estimated at the operating point (OP) of the receiver operating characteristic (ROC) curves.
Figure 2: Results of background segmentation of a selected frame from different sequences (across columns: B1P1, B1P2, B2, F1P2, F2P2, P2T1, S2) using the proposed method (row 3), compared to the input frame (row 1), the ground truth (row 2) and other baseline methods including (a) (Mumtaz et al., 2014), (b) (Zhou et al., 2013) and (c) (Stauffer and Grimson, 1999).
Table 1: Quantitative evaluation of the proposed method against state-of-the-art algorithms (a) (Mumtaz et al., 2014) and (b) (Zhou et al., 2013) at their operating point (OP).

Data    |            TPR              |                  FPR
        | Proposed  (a) at OP   (b)   | Proposed    (a)    (a) at OP    (b)
B1P1    |  0.997     0.973      0.967 |  0.003     0.004     0.019     0.007
B1P2    |  0.989     0.919      0.977 |  0.010     0.009     0.022     0.018
Boat2   |  0.996     0.955      0.931 |  0.004     0.004     0.022     0.008
F1P2    |  0.988     0.972      0.791 |  0.011     0.034     0.055     0.007
F2P2    |  0.995     0.892      0.946 |  0.005     0.064     0.038     0.086
P2T1    |  0.992     0.953      0.967 |  0.008     0.030     0.056     0.017
S2      |  0.976     0.968      0.947 |  0.023     0.016     0.040     0.164
Average |  0.990     0.947      0.932 |  0.009     0.023     0.036     0.044
The TPR achieved by the proposed method on all videos is higher than that of both the (Mumtaz et al., 2014) and (Zhou et al., 2013) methods, with a comparative average of 0.990 against 0.947 and 0.932. Likewise, the measured FPR of the proposed method is lower than that of (Mumtaz et al., 2014) at its OP and of (Zhou et al., 2013), with an average of 0.009 against 0.036 and 0.044. However, (Mumtaz et al., 2014), at a lower TPR, attains a similar FPR on the Boat1Person1, Boat1Person2 and Boat2 videos, and a lower one on Sailing2 (0.023 vs 0.016).
Further, the proposed method is compared to both
(Mumtaz et al., 2014) and (Zhou et al., 2013) using
the StopPerson1 video sequence. The scenario in this
sequence has the target-of-interest (person) stopping
for a short duration of time, making it extremely
challenging for conventional background modeling
techniques to cope with.
In Table 2, the comparison of the proposed method to these baseline methods on the StopPerson1 sequence is listed. The results on this sequence indicate the superiority of the proposed technique, both in terms of a higher TPR and a lower FPR, over both (Mumtaz et al., 2014) and (Zhou et al., 2013).
Table 2: Evaluation on the stop-case video StopPerson1 against the (Mumtaz et al., 2014) and (Zhou et al., 2013) methods at their operating points.

       Proposed   (Mumtaz et al., 2014)   (Zhou et al., 2013)
TPR     0.996            0.945                  0.642
FPR     0.004            0.026                  0.003
Table 3: Evaluation of the proposed method against the state of the art, (a) (Mumtaz et al., 2014) and (b) (Zhou et al., 2013), using the Rand Index (RI).

           B1P1     B1P2     B2       F1P2     F2P2     P2T1     SP1      Avg.     S2
Proposed   0.9681   0.9465   0.9589   0.9535   0.9707   0.9825   0.9612   0.9631   0.9314
(a)        0.9632   0.9428   0.9610   0.9156   0.9388   0.9270   0.9482   0.9424
(b)        0.9524   0.7021   0.7986   0.7769   0.3833   0.8646   0.8668   0.7635
The qualitative comparison of results between the compared techniques supports the claim of the quantitative results. The superiority in performance can be mainly attributed to the more robust background model built by the proposed algorithm, whereas (Mumtaz et al., 2014) tends to over-segment and (Zhou et al., 2013) under-segments. For further evaluation of the motion segmentation, RI is calculated comparing the proposed method against the (Mumtaz et al., 2014) and (Chan and Vasconcelos, 2009) methods in Table 3. On all videos, the proposed method outperforms LDT, with an average of 0.9631 vs 0.7635. In the cases of Boat1Person1 (B1P1) and Boat1Person2 (B1P2), the proposed method achieves a similar or slightly higher RI, and a slightly lower one on Boat2 (B2) compared to (Mumtaz et al., 2014), as some parts of the target are lost due to smoothing throughout the modeling process. In the other cases, as in the Fountain1Person2 (F1P2), Fountain2Person2 (F2P2) and StopPerson1 (SP1) sequences, where the background is more complex, the proposed method outperforms (Mumtaz et al., 2014). On average, the performance of the proposed method is higher than that of (Mumtaz et al., 2014), with a Rand Index of 0.9631 vs 0.9424. Moreover, in Table 4, additional quantitative results using precision, ACC and VI are presented. The average precision, ACC and VI achieved are 0.8446, 0.9774 and 0.2107, respectively.
The results described in this section so far consider DT constituting the majority of the background region, as in the FBDynSyn dataset. However, without loss of generality, it is equally possible for DT to represent foreground regions, as in the case of crowd motion.
Table 4: Quantitative results of the proposed technique using precision, accuracy (ACC) and variation of information (VI).
Data Precision ACC VI
Boat1Person1 0.8823 0.9837 0.1628
Boat1Person2 0.8982 0.9695 0.2659
Boat2 0.8311 0.9789 0.2046
Fountain1Person2 0.8187 0.9724 0.2464
Fountain2Person2 0.8977 0.9818 0.1673
Person2Tree1 0.7966 0.9867 0.1070
StopPerson1 0.9495 0.9849 0.1828
Sailing2 0.6850 0.9614 0.3488
Average 0.8446 0.9774 0.2107
In Fig. 3, a plot of the variations of the DT values of a chosen block from two sequences is illustrated: one where DT represents the background (the Boat1Person1 sequence) and the other where DT represents the foreground (the S1 L1 13-57 PETS crowd sequence from (Ferryman and Shahrokni, 2009)). It can be observed that the variations in DT for the Boat1Person1 sequence, shown as red dashed lines, match the characteristics of the DT changes in the crowd sequence, shown as a yellow solid line.
Finally, further validation has been performed by testing the proposed method on additional dynamic background sequences from the changedetection dataset of (Goyette et al., 2012), which includes the boats, fall, fountain01 and fountain02 sequences. In order to demonstrate that the proposed model can handle dynamic motion characteristics of either the background or the foreground, the framework was also tested on different scenarios from the PETS dataset.
Figure 4: Segmentation examples using the proposed method on the changedetection dataset (boats, fall, fountain01, fountain02) and the PETS dataset (S1 L1 13-57, S1 L1 13-59, S1 L2 14-06, S3 MF 14-37).
Figure 3: Variations of the DT values (in uint8, per frame) of chosen blocks from the background (BG) and foreground (FG) of the B1P1 sequence and the S1 L1 13-57 PETS crowd sequence.
Fig. 4 displays qualitative results on the changedetection and PETS datasets.
3.1 Patch Resolution Analysis
One key parameter of the proposed method is the size of the block patches. During the empirical study, a close relationship between the patch size and the learning rate used within the GMM model was observed. In order to formalize this relationship, the results of changing the patch size against different learning rates, in terms of the recall of the detection process, are presented in Fig. 5. The results in Fig. 5 have been generated using, as an example, the Boat1Person1 video sequence from the FBDynSyn dataset. It can be observed that at low learning rates of the GMM, smaller patch sizes produce better recall; nevertheless, as higher learning rates are used, this is no longer the case. A qualitative assessment of the impact of the learning rate and the patch size using sample frames from the Boat1Person1 sequence is illustrated in Fig. 6.
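The sweep behind Fig. 5 can be reproduced with a simple grid evaluation such as the hypothetical sketch below, where run_detector is a caller-supplied stand-in for the full DT + GMM pipeline, since its exact interface is not specified in the paper.

```python
import numpy as np

def recall_grid(run_detector, video, gt_masks,
                patch_sizes=(5, 10, 20), alphas=(1e-4, 5e-3, 1e-2)):
    """Measure recall (TPR) for every (patch size, learning rate) pair, as in Fig. 5.

    run_detector : callable standing in for the full DT + GMM pipeline; it should
                   return a boolean foreground-mask stack shaped like gt_masks.
    """
    recalls = {}
    for nb in patch_sizes:
        for alpha in alphas:
            pred = run_detector(video, patch_size=nb, learning_rate=alpha)
            tp = np.logical_and(pred, gt_masks).sum()
            fn = np.logical_and(~pred, gt_masks).sum()
            recalls[(nb, alpha)] = tp / (tp + fn)
    return recalls
```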
Figure 5: Relation between the GMM learning rate and recall for various patch sizes (5x5, 10x10, 20x20) using the Person1Boat1 video (the peaks are indicated with points).
It can be noticed that with small patches at high learning rates, the amount of false positive detections increases; however, at lower learning rates and small patch sizes, the detection is more accurate. For the Boat1Person1 sequence, the processing time was found to be 655 ms per frame at patch size 10x10, while it was 2251 ms per frame at patch size 5x5. Despite the better performance at patch size 5x5, the computational overhead of performing background modeling at that patch size is significantly higher than at 10x10, which sacrifices only a small amount of accuracy.
Figure 6: Segmentation examples for the Boat1Person1 video with different patch sizes (5x5, 10x10, 20x20) and GMM learning rates (0.0001, 0.005, 0.01).
4 CONCLUSIONS
The proposed method of a spatio-temporal GMM of DT is an accurate and robust mechanism for background modeling and foreground detection. The evaluated performance follows the hypothesis underpinning the theoretical model. The relationships observed between the key parameters of patch size and learning rate indicate that when the patch size is decreased, the number of patches and hence the number of DT components increases, thus yielding higher detection accuracy even at a fixed learning rate. On the other hand, with a decrease in the patch size, the amount of motion information encapsulated within each patch is reduced, thereby causing slower recognition of the motion patterns at that chosen learning rate.
REFERENCES
Bhaskar, H., Mihaylova, L., and Maskell, S. (2007). Back-
ground modeling using adaptive cluster density esti-
mation for automatic human detection. Informatics 2,
pages 130–134.
Chan, A., Mahadevan, V., and Vasconcelos, N. (2011). Generalized Stauffer–Grimson background subtraction for dynamic scenes. Machine Vision and Applications, 22(5):751–766.
Chan, A. and Vasconcelos, N. (2009). Layered dynamic
textures. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 31(10):1862–1879.
Dalley, G., Migdal, J., and Grimson, W. (2008). Back-
ground subtraction for temporally irregular dynamic
textures. WACV.
Doretto, G., Chiuso, A., Wu, Y. N., and Soatto, S. (2003).
Dynamic textures. International Journal of Computer
Vision, 51:91–109.
Ferryman, J. and Shahrokni, A. (2009). PETS2009: Dataset and challenge. Twelfth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance.
Goyette, N., Jodoin, P., Porikli, F., Konrad, J., and Ishwar, P.
(2012). Changedetection.net: A new change detection
benchmark dataset. In IEEE Computer Society Con-
ference on Computer Vision and Pattern Recognition
Workshops (CVPRW), pages 1–8.
Heikkila, M. and Pietikainen, M. (2006). A texture-based
method for modeling the background and detecting
moving objects. IEEE TPAMI, 28(4):657–662.
Mumtaz, A., Zhang, W., and Chan, A. B. (2014). Joint mo-
tion segmentation and background estimation in dy-
namic scenes. CVPR, pages 368–375.
Otsu, N. (1979). A threshold selection method from gray-
level histograms. IEEE TSMC, 9(1):62–66.
Pokrajac, D. and Latecki, L. J. (2003). Spatiotemporal
blocks-based moving objects identification and track-
ing. IEEE Visual Surveillance and Performance Eval-
uation of Tracking and Surveillance (VS-PETS), pages
70–77.
Stauffer, C. and Grimson, E. (2000). Learning patterns of activity using real-time tracking. IEEE TPAMI, 22(8):747–757.
Stauffer, C. and Grimson, W. (1999). Adaptive background
mixture models for real-time tracking. CVPR, 2.
Tian, Y.-L. and Hampapur, A. (2008). Robust salient mo-
tion detection with complex background for real-time
video surveillance. WACV/MOTIONS.
Zhang, S., Yao, H., and Liu, S. (2009). Spatial-temporal
nonparametric background subtraction in dynamic
scenes. ICME, pages 518–521.
Zhong, B., Liu, S., Yao, H., and Zhang, B. (2009). Multi-resolution background subtraction for dynamic scenes. pages 3193–3196.
Zhong, J. and Sclaroff, S. (2003). Segmenting foreground objects from a dynamic textured background via a robust Kalman filter. ICCV, 1:44–50.
Zhou, X., Yang, C., and Yu, W. (2013). Moving object
detection by detecting contiguous outliers in the low-
rank representation. IEEE TPAMI, 35(3).