Region Extraction of Multiple Moving Objects
with Image and Depth Sequence
Katsuya Sugawara¹, Ryosuke Tsuruga², Toru Abe² and Takuo Suganuma²
¹NTT Resonant Inc., Granparktower, 3-4-1 Shibaura, Minato-ku, Tokyo 108-0023, Japan
²Cyberscience Center, Tohoku University, 2-1-1 Katahira, Aoba-ku, Sendai 980-8577, Japan
Keywords:
Region Extraction, Multiple Moving Objects, Image Sequence, Depth Sequence.
Abstract:
This paper proposes a novel method for extracting the regions of multiple moving objects with an image and
a depth sequence. In addition to image features, diverse types of features, such as depth and image-depth-
derived 3D motion, have been used in existing methods for improving the accuracy and robustness of object
region extraction. Most of the existing methods determine individual object regions according to the spatial-
temporal similarities of such features, i.e., they regard a spatial-temporal area of uniform features as a region
sequence corresponding to the same object. Consequently, the depth features in a moving object region, where
the depth varies with frames, and the motion features in a nonrigid or articulated object region, where the
motion varies with parts, cannot be effectively used for object region extraction. To deal with these difficulties,
our proposed method extracts the region sequences of individual moving objects according to depth feature
similarity adjusted by each object movement and motion feature similarity computed only in adjacent parts.
Through the experiments on scenes where a person moves a box, we demonstrate the effectiveness of the
proposed method in extracting the regions of multiple moving objects.
1 INTRODUCTION
Determining the regions of individual moving objects
is a requisite preprocessing step for various applica-
tions including visual surveillance, control (e.g. man-
aging something by object movements), and analysis
(e.g. diagnosing something from object movements).
Thus, many object region extraction methods have
been proposed for use in such applications. Most of
the existing methods determine individual object re-
gions in an image sequence according to the spatial
and temporal similarity of image features, i.e., these
methods regard a spatial-temporal area of uniform im-
age features as a region sequence corresponding to the
same object (Grundmann et al., 2010; Galasso et al.,
2012; Xu and Corso, 2012).
Recently, along with the popularization of image
and range sensing cameras such as Kinect (Microsoft,
2015), not only image sequences but also depth se-
quences can be easily acquired. Consequently, be-
sides image features, depth features have been used in
several methods for improving the accuracy and ro-
bustness of object region extraction (Fernández and Aranda, 2000; Çiğla and Alatan, 2008; Xia et al., 2011; Abramov et al., 2012; Bergamasco et al., 2012).
As is the case in image features, the spatial-temporal
similarity of depth features is usually used for deter-
mining individual object regions. However, as shown
in Figure 1, the depth in a moving object region varies
with frames; therefore, depth features cannot be ef-
fectively used in such cases for determining a region
sequence corresponding to the same object.
Figure 1: Example of (a) Image sequence and (b) Depth sequence acquired by Kinect (frames t = T and t = T + 15; depth shown on a 0–3.0 m scale).
Meanwhile, 2D motion (optical flow) features are
derived from an image sequence, and 3D motion
(scene flow) features are derived from an image and a
depth sequence as shown in Figure 2. The similarities
of such motion features can also be used as clues for
extracting object regions. Unlike the 2D motion fea-
tures, the 3D motion features are less subject to the
difference in the distances between a camera and ob-
jects; therefore the similarity of 3D motion features
is more suitable than that of 2D motion features for
determining a region sequence corresponding to the
same object. However, in a nonrigid or articulated object region, the motion varies with parts, which decreases the effectiveness of both 2D and 3D motion features for object region extraction.
To deal with these difficulties, this paper proposes
a novel method for extracting the region sequences of
multiple moving objects with an image and a depth
sequence. The proposed method employs the simi-
larities of image, depth, and image-depth-derived 3D
motion features as clues for extracting object regions.
In this method, depth feature similarity is adjusted by
each object movement for adapting to the change of
depth features in moving object regions, and 3D mo-
tion feature similarity is computed only in adjacent
parts for adapting to the nonuniformity of motion fea-
tures in nonrigid or articulated object regions.
The remainder of this paper is organized as fol-
lows. Section 2 presents the existing object region
extraction methods based on several types of features.
Figure 2: Example of (a) Image sequence, (b) Depth sequence, and (c) 3D motion features derived from them (plotted against width, height, and depth in metres; frames t = T and t = T + 1).
Section 3 explains the details of our proposed method
for extracting the region sequences of multiple mov-
ing objects with an image and a depth sequence. Sec-
tion 4 presents the results of object region extraction
experiments. Finally, Section 5 concludes the paper.
2 RELATED WORK
Many methods have been proposed for extracting ob-
ject regions from an image. Most of them regard
a uniform area in the image as a subregion corre-
sponding to the same object, and determine individ-
ual object regions according to the spatial similar-
ity of image features, such as intensities and colors
in the image (Comaniciu and Meer, 1999; Felzen-
szwalb and Huttenlocher, 2004; Achanta et al., 2012).
To improve the accuracy and robustness of object re-
gion extraction, diverse types of features are used to-
gether with the image features. For example, 2D mo-
tion features (Çiğla and Alatan, 2008), which can be computed from an image sequence, and depth features (Fernández and Aranda, 2000; Çiğla and Alatan, 2008; Bergamasco et al., 2012), which can be acquired by a range sensing camera, are widely used.
Subregions extracted in each individual image frame are not associated with those in the other frames. Thus, various methods have been proposed
for extracting a subregion sequence corresponding
to the same object in a frame sequence. Their ap-
proaches are classified into two main types. One ap-
proach performs subregion extractions in each frame,
and then carries out subregion matching between suc-
cessive frames based on the temporal similarity of
features (Xia et al., 2011; Abramov et al., 2012; Cou-
prie et al., 2013). The other approach regards the
frame sequence as 3D data, and extracts a 3D region
(volume) corresponding to the same object accord-
ing to the spatial-temporal similarity of features (De-
Menthon and Megret, 2002; Grundmann et al., 2010).
In both approaches, to improve the accuracy and
the robustness, 2D motion features (DeMenthon and
Megret, 2002; Grundmann et al., 2010), depth fea-
tures (Xia et al., 2011; Abramov et al., 2012), and
3D motion features (Xia et al., 2011) are also used in
combination with image features.
The entire region of an object is commonly com-
posed of different subregions with dissimilar proper-
ties. For such an object region, a number of different
subregion sequences are constructed. Accordingly,
several methods have been proposed for merging sub-
region sequences and extracting the sequence of an
entire object region (Lezama et al., 2011; Trichet and
Nevatia, 2013). Those methods use mainly motion
features for subregion sequence merging.
As mentioned above, in the existing methods,
spatial-temporal similarities are computed from di-
verse types of features, and employed for extracting
object region sequences. However, some types of fea-
tures cannot contribute to the object region extraction
in particular conditions. For example, the depth in a
moving object region varies with frames; therefore,
even if a series of subregions corresponds to the same
moving object, its depth features are not always tem-
porally similar under such a condition. Furthermore, the motion in a nonrigid or articulated object region varies with parts; therefore, even within the same object region, the spatial similarity of motion features is not always maintained.
3 REGION EXTRACTION OF
MULTIPLE MOVING OBJECTS
We propose a novel method for extracting the region
sequences of multiple moving objects with an image
and a depth sequence. A brief overview of the pro-
posed method is depicted in Figure 3.
Figure 3: Overview of the proposed method (input image and depth sequence; subregion extracting, subregion matching, and sequence merging; output object region sequences).
Our proposed method firstly extracts subregions from each frame, secondly constructs subregion sequences through subregion matching between successive frames, and finally merges subregion sequences
into the region sequences of individual moving ob-
jects. To effectively make use of depth features
and motion features in these processes, the proposed
method employs depth feature dissimilarity adjusted
by each object movement and motion feature dissim-
ilarity computed only in adjacent parts.
3.1 Subregion Extracting
To extract subregions from each frame, our pro-
posed method uses a graph-based image segmentation
method (Felzenszwalb and Huttenlocher, 2004) with
image and depth features. An example of subregion extraction from an image frame and a depth frame is shown in Figure 4. In this figure, (c) shows extracted subre-
gions using (a) and (b), where each extracted subre-
gion is assigned a random color.
In graph-based image segmentation methods, an entire image is represented by a graph G = (V, E) with vertices v_k ∈ V corresponding to pixels and edges (v_k, v_l) ∈ E corresponding to pairs of neighboring vertices. Each edge (v_k, v_l) has a weight w_S(v_k, v_l) based on some property of the neighboring vertices v_k and v_l. A segmentation S is a partition of V into components (subregions) C_i ∈ S such that each C_i corresponds to a connected component in a graph G' = (V, E'), where E' ⊆ E.
The method of (Felzenszwalb and Huttenlocher, 2004) uses the dissimilarity of features between neighboring vertices v_k and v_l as their edge weight w_S(v_k, v_l). From such edge weights, the internal difference Int(C_i) of a component C_i ⊆ V is determined to be the largest edge weight in the minimum spanning tree MST(C_i, E) of C_i as
$$\mathrm{Int}(C_i) = \max_{(v_k, v_l) \in MST(C_i, E)} w_S(v_k, v_l), \qquad (1)$$
and the difference Dif(C_i, C_j) between two components C_i, C_j ⊆ V is determined to be the minimum edge weight connecting C_i and C_j as
$$\mathrm{Dif}(C_i, C_j) = \min_{v_k \in C_i,\, v_l \in C_j,\, (v_k, v_l) \in E} w_S(v_k, v_l). \qquad (2)$$
If Dif(C_i, C_j) is small compared with Int(C_i) and Int(C_j), then those C_i and C_j are merged into one
component. Through the iteration of this procedure, an image is segmented into final components, each of which is an area of spatially uniform features; these components are the subregions extracted from a frame.
Figure 4: Example of (a) Image frame, (b) Depth frame, and (c) Extracted subregions using (a) and (b).
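To make the merging rule concrete, here is a minimal sketch of the Kruskal-style procedure of (Felzenszwalb and Huttenlocher, 2004) over precomputed edge weights. The relaxation term k/|C| comes from that paper's threshold function; the constant k, the edge-list input format, and the union-find bookkeeping are assumptions of this illustration rather than details fixed by our method.

```python
# Minimal sketch of graph-based segmentation (Felzenszwalb and Huttenlocher, 2004).
# Components are tracked with union-find; int_[c] holds Int(C), the largest edge
# weight in the component's minimum spanning tree so far, and size[c] holds |C|.

def segment(num_vertices, edges, k=300.0):
    """edges: list of (weight, vk, vl) tuples; returns a component id per vertex."""
    parent = list(range(num_vertices))
    size = [1] * num_vertices
    int_ = [0.0] * num_vertices             # Int(C): Equation (1), built incrementally

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    for w, vk, vl in sorted(edges):         # process edges by increasing weight
        ci, cj = find(vk), find(vl)
        if ci == cj:
            continue
        # Merge only if the connecting edge (Dif, Equation (2)) is small compared
        # with both internal differences, each relaxed by k / |C| for small components.
        if w <= min(int_[ci] + k / size[ci], int_[cj] + k / size[cj]):
            parent[cj] = ci
            size[ci] += size[cj]
            int_[ci] = w                    # w is the largest MST edge of the merged component
    return [find(v) for v in range(num_vertices)]
```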
Our proposed method defines each edge weight w_S(v_k, v_l) between neighboring pixels v_k and v_l from the dissimilarities of their image features I (RGB intensities) and depth features D by
$$w_S(v_k, v_l) = \|I(v_k) - I(v_l)\| + \sigma_S \|D(v_k) - D(v_l)\|, \qquad (3)$$
where σ_S is a coefficient for the dissimilarity of depth features.
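As a minimal sketch, the 4-neighbor edge weights of Equation (3) could be computed from an RGB frame and a registered depth frame as below; the array shapes and the value σ_S = 3.0 (the setting reported in Section 4) are assumptions of this example.

```python
import numpy as np

def edge_weights(rgb, depth, sigma_s=3.0):
    """Equation (3): w_S(vk, vl) = ||I(vk) - I(vl)|| + sigma_s * |D(vk) - D(vl)|
    for horizontally and vertically neighboring pixels.
    rgb: (H, W, 3) array of RGB intensities, depth: (H, W) array of depth values."""
    rgb = rgb.astype(np.float64)
    depth = depth.astype(np.float64)
    # horizontal neighbors: vk at (y, x), vl at (y, x + 1)
    w_h = (np.linalg.norm(rgb[:, :-1] - rgb[:, 1:], axis=2)
           + sigma_s * np.abs(depth[:, :-1] - depth[:, 1:]))
    # vertical neighbors: vk at (y, x), vl at (y + 1, x)
    w_v = (np.linalg.norm(rgb[:-1, :] - rgb[1:, :], axis=2)
           + sigma_s * np.abs(depth[:-1, :] - depth[1:, :]))
    return w_h, w_v
```

These weights would then be handed to a segmentation routine such as the merge sketch above.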
3.2 Subregion Matching
For constructing subregion sequences, the proposed
method carries out graph-based matching (Couprie
et al., 2013) on extracted subregions in successive
frames. An example of subregion matching is shown in
Figure 5, where random colors are assigned to each
extracted subregion in (a) and each constructed se-
quence (moving object sequence only) in (b).
In the method of (Couprie et al., 2013), two successive frames t and t + 1 are represented by a graph G = (V, E). A set V of vertices comprises two vertex sets V(t) and V(t + 1); vertices v_i ∈ V(t) correspond to subregions extracted in the frame t, and vertices v_j ∈ V(t + 1) correspond to those in the frame t + 1. Edges (v_i, v_j) ∈ E correspond to pairs of vertices in different frames, and each (v_i, v_j) has a weight w_T(v_i, v_j) associated with the dissimilarity of features between v_i and v_j. Using a minimum spanning forest algorithm based on these edge weights w_T(v_i, v_j), the correspondences between vertices v_i ∈ V(t) and v_j ∈ V(t + 1) are determined. Through successively applying this procedure to all frames, subregion sequences, each of which is a series of subregions with temporally uniform features, are constructed.
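The following is a simplified reading of this matching step, not the exact data structures of (Couprie et al., 2013): since every frame-t subregion acts as a seed and all edges cross between the two frames, scanning edges in increasing weight order attaches each frame t + 1 subregion to the frame-t subregion reachable through its lightest edge. The dictionary interface is an assumption for illustration.

```python
def match_subregions(weights):
    """weights: dict {(i, j): w_T} for subregion i in frame t and j in frame t + 1.
    Returns {j: i}, attaching each frame t + 1 subregion to one frame t subregion.
    Simplified minimum-spanning-forest view: edges are scanned in increasing
    weight order and each j is claimed by the first (lightest) edge that reaches it."""
    assignment = {}
    for (i, j), w in sorted(weights.items(), key=lambda kv: kv[1]):
        if j not in assignment:
            assignment[j] = i
    return assignment
```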
Figure 5: Example of (a) Subregions extracted in each frame, and (b) Subregion sequences constructed from (a), at t = T, T + 15, T + 30.
The method of (Couprie et al., 2013) defines each edge weight w_T(v_i, v_j) from the differences in shape, position (centroid), and appearance (image) features between subregions v_i and v_j. In addition to those
differences, for improving the accuracy and robust-
ness of subregion matching, our proposed method
also takes into account the differences in depth fea-
tures, and determines each edge weight w_T(v_i, v_j) by
$$w_T(v_i, v_j) = d_s(v_i, v_j)\, d_p(v_i, v_j) + \sigma_{Ta}\, d_a(v_i, v_j) + \sigma_{Td}\, d_d(v_i, v_j), \qquad (4)$$
where d_s, d_p, d_a, and d_d represent the differences in shape, position, appearance, and depth features, respectively, while σ_Ta and σ_Td denote coefficients for d_a and d_d. The feature differences in Equation (4) are defined as
$$d_s(v_i, v_j) = \frac{|v_i| + |v_j|}{|v_i \cap v_j|}, \qquad (5)$$
$$d_p(v_i, v_j) = \|(C(v_i) + M2(v_i)) - C(v_j)\|, \qquad (6)$$
$$d_a(v_i, v_j) = \|I(v_i) - I(v_j)\|, \qquad (7)$$
$$d_d(v_i, v_j) = \|(D(v_i) + M3_d(v_i)) - D(v_j)\|. \qquad (8)$$
In Equation (5), |v_i|, |v_j|, and |v_i ∩ v_j| represent the number of pixels of subregions v_i, v_j, and the overlapping area between v_i and v_j, respectively. In Equations (6)–(8), C, I, and D denote the centroid of pixels, the mean of image features (RGB intensities), and the mean of depth features in each subregion, respectively.
As shown in Equation (8), the proposed method adds M3_d(v_i) to the depth feature D(v_i), and determines the depth feature difference d_d(v_i, v_j) from D(v_i) + M3_d(v_i) in the frame t and D(v_j) in the frame t + 1. Here, M3_d(v_i) denotes the depth directional component of M3(v_i), which is the median of 3D motion (scene flow) in v_i. This is intended to adapt to depth feature changes in moving object regions and thereby use depth features effectively for subregion matching. In the same way, as shown in Equation (6), the centroid C(v_i) is adjusted by adding M2(v_i), which is the median of 2D motion (optical flow) in v_i, and used for determining the position feature difference d_p(v_i, v_j).
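The per-subregion quantities entering Equations (4)–(8) could be assembled roughly as follows; the boolean-mask representation, the coordinate-wise medians used for M2 and M3_d, the precomputed overlap |v_i ∩ v_j|, and the coefficients σ_Ta = 10.0 and σ_Td = 30.0 (the settings reported in Section 4) are assumptions of this sketch.

```python
import numpy as np

def subregion_features(mask, rgb, depth, flow2d=None, flow3d_depth=None):
    """Per-subregion quantities used in Equations (5)-(8).
    mask: boolean (H, W) subregion mask; rgb: (H, W, 3); depth: (H, W);
    flow2d: (H, W, 2) optical flow; flow3d_depth: (H, W) depth component of scene flow.
    The flow arguments are only needed for the frame-t subregion v_i."""
    ys, xs = np.nonzero(mask)
    feat = {
        "size": int(mask.sum()),                          # |v|
        "centroid": np.array([xs.mean(), ys.mean()]),     # C(v)
        "rgb_mean": rgb[mask].mean(axis=0),               # I(v)
        "depth_mean": float(depth[mask].mean()),          # D(v)
    }
    if flow2d is not None:
        feat["m2"] = np.median(flow2d[mask], axis=0)      # M2(v): median 2D motion
    if flow3d_depth is not None:
        feat["m3_d"] = float(np.median(flow3d_depth[mask]))  # M3_d(v): depth part of 3D motion
    return feat

def edge_weight_T(fi, fj, overlap, sigma_ta=10.0, sigma_td=30.0):
    """Equation (4) assembled from Equations (5)-(8); fi describes the frame-t
    subregion v_i, fj the frame t+1 candidate v_j, overlap = |v_i ∩ v_j|."""
    d_s = (fi["size"] + fj["size"]) / max(overlap, 1)                 # Eq. (5), guarded against zero overlap
    d_p = np.linalg.norm(fi["centroid"] + fi["m2"] - fj["centroid"])  # Eq. (6)
    d_a = np.linalg.norm(fi["rgb_mean"] - fj["rgb_mean"])             # Eq. (7)
    d_d = abs(fi["depth_mean"] + fi["m3_d"] - fj["depth_mean"])       # Eq. (8)
    return d_s * d_p + sigma_ta * d_a + sigma_td * d_d                # Eq. (4)
```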
To obtain M2(v_i) and M3_d(v_i), the 2D motion of each pixel in the frame t is determined from the two successive image frames t and t + 1 (Farnebäck, 2003). Using the 2D motion in the frame t, the median of 2D motion M2(v_i) is computed for each subregion v_i. Besides, the correspondences between the pixels in the frame t and those in the frame t + 1 are determined from the 2D motion in the frame t. By incorporating the pixel correspondences into the depth frames t and t + 1, the 3D motion in the frame t can also be determined. Using the 3D motion in the frame t, the median of 3D motion M3(v_i) and its depth directional component M3_d(v_i) are computed for each subregion v_i.
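As one way to obtain these medians, OpenCV's dense Farnebäck flow and the two registered depth frames could be combined as sketched below; the specific flow parameters, metre-valued depth maps, and the omission of invalid-depth handling are assumptions of this illustration.

```python
import cv2
import numpy as np

def motion_medians(gray_t, gray_t1, depth_t, depth_t1, mask):
    """Median 2D motion M2(v_i) and depth component of 3D motion M3_d(v_i)
    for the subregion given by the boolean mask in frame t.
    gray_t, gray_t1: uint8 grayscale frames; depth_t, depth_t1: float depth maps."""
    # Dense 2D motion (optical flow) from frame t to t+1 (Farnebäck, 2003).
    flow = cv2.calcOpticalFlowFarneback(gray_t, gray_t1, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = gray_t.shape
    ys, xs = np.nonzero(mask)

    # Median 2D motion M2(v_i) over the subregion (coordinate-wise median).
    m2 = np.median(flow[ys, xs], axis=0)

    # Follow the 2D motion to the corresponding pixels in frame t+1,
    # then take the depth change as the depth component of the 3D motion.
    xs1 = np.clip(np.rint(xs + flow[ys, xs, 0]).astype(int), 0, w - 1)
    ys1 = np.clip(np.rint(ys + flow[ys, xs, 1]).astype(int), 0, h - 1)
    depth_motion = depth_t1[ys1, xs1] - depth_t[ys, xs]
    m3_d = np.median(depth_motion)          # M3_d(v_i)
    return m2, m3_d
```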
3.3 Sequence Merging
The proposed method merges subregion sequences
corresponding with the same object into a single ob-
ject region sequence. An example of subregion sequence
merging is shown in Figure 6, where random colors
are assigned to each subregion sequence in (a) and
each individual object region sequence in (b).
To begin with, subregions with large 3D motion features are determined in all frames, and each sequence including such subregions is chosen as a target subregion sequence S_i corresponding to part of a moving object. Our proposed method computes the dissimilarity between two target subregion sequences, and merges them into a single sequence if their dissimilarity is less than a given threshold. By iterating this process for all target subregion sequences until no similar sequences remain unmerged, our method obtains the region sequences of multiple moving objects individually.
Let subregion sequences S_i and S_j consist of subregions s_i(t) ∈ S_i ranging over frames Ts_i ≤ t ≤ Te_i and subregions s_j(t) ∈ S_j ranging over frames Ts_j ≤ t ≤ Te_j, respectively. Suppose that the frames of these two sequences S_i and S_j overlap each other from Ts to Te as shown in Figure 7 (a). Our proposed method defines the dissimilarity between S_i and S_j by
$$DS(S_i, S_j) = \frac{1}{Te - Ts + 1} \sum_{t=Ts}^{Te} ds(s_i(t), s_j(t)), \qquad (9)$$
where ds(s_i(t), s_j(t)) is the dissimilarity, defined from the differences of 3D motion features, between subregions s_i(t) and s_j(t) in the frame t. Although motion features are not always spatially uniform in a nonrigid or articulated object region, adjacent pixels in the same object region are expected to have similar motion features, as shown in Figure 7 (b). Therefore, to improve the accuracy and robustness of sequence merging, the proposed method determines the dissimilarity ds(s_i(t), s_j(t)) from the differences of 3D motion features in adjacent pixels between subregions s_i(t) and s_j(t).
Figure 6: Example of (a) Subregion sequences, and (b) Individual moving object region sequences obtained from (a), at t = T, T + 15, T + 30.
Figure 7: (a) Subregion sequences S_i and S_j, (b) Subregions s_i(t) and s_j(t) in the frame t, and (c) Pixels v_k(t), v_l(t), and their pixel pair (v_k(t), v_l(t)).
To determine adjacent pixels between subregions s_i(t) and s_j(t), for each pixel pair (v_k(t), v_l(t)) made from pixels v_k(t) ∈ s_i(t) and v_l(t) ∈ s_j(t), its inter-pixel 3D distance d_P3(v_k(t), v_l(t)) is computed by
$$d_{P3}(v_k(t), v_l(t)) = \|P3(v_k(t)) - P3(v_l(t))\|, \qquad (10)$$
where P3(v_k(t)) and P3(v_l(t)) are the 3D positions corresponding to v_k(t) and v_l(t), respectively. Among all pixel pairs, the N pixel pairs (v_k(t), v_l(t)) ∈ AP_ij(t) with the shortest inter-pixel 3D distances d_P3(v_k(t), v_l(t)) are chosen as the pairs of adjacent pixels, as shown in Figure 7 (c). For each adjacent pixel pair (v_k(t), v_l(t)) ∈ AP_ij(t), the difference between its pixels v_k(t) and v_l(t) is defined as
$$dv(v_k(t), v_l(t)) = d_{P3}(v_k(t), v_l(t)) + \sigma_M\, d_{M3}(v_k(t), v_l(t)), \qquad (11)$$
where d_M3(v_k(t), v_l(t)) represents the difference in 3D motion features defined as
$$d_{M3}(v_k(t), v_l(t)) = \|M3(v_k(t)) - M3(v_l(t))\|, \qquad (12)$$
and σ_M is a coefficient for d_M3(v_k(t), v_l(t)). The median of dv(v_k(t), v_l(t)) computed over (v_k(t), v_l(t)) ∈ AP_ij(t) is used as the dissimilarity ds(s_i(t), s_j(t)) between subregions s_i(t) and s_j(t) in the frame t.
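A compact sketch of Equations (9)–(12) is given below; subregions are assumed to be represented by per-pixel 3D positions P3 and 3D motion vectors M3 stacked in arrays, the number of adjacent pairs N = 50 is an illustrative choice, and σ_M = 37.5 follows the setting reported in Section 4.

```python
import numpy as np

def ds_frame(p3_i, m3_i, p3_j, m3_j, n_pairs=50, sigma_m=37.5):
    """Equations (10)-(12): dissimilarity ds(s_i(t), s_j(t)) between two subregions
    in one frame. p3_*: (n, 3) 3D pixel positions, m3_*: (n, 3) 3D motion vectors.
    n_pairs (N) is assumed; sigma_m follows the setting reported in Section 4."""
    # Inter-pixel 3D distances d_P3 for all pixel pairs (Eq. 10).
    dists = np.linalg.norm(p3_i[:, None, :] - p3_j[None, :, :], axis=2)
    # Indices of the N closest pairs: the adjacent pixel pairs AP_ij(t).
    flat = np.argsort(dists, axis=None)[:n_pairs]
    ki, lj = np.unravel_index(flat, dists.shape)
    # 3D motion differences d_M3 (Eq. 12) and pairwise differences dv (Eq. 11).
    d_m3 = np.linalg.norm(m3_i[ki] - m3_j[lj], axis=1)
    dv = dists[ki, lj] + sigma_m * d_m3
    # The median of dv over the adjacent pairs is used as ds(s_i(t), s_j(t)).
    return float(np.median(dv))

def sequence_dissimilarity(frame_pairs, **kwargs):
    """Equation (9): DS(S_i, S_j) averaged over the overlapping frames Ts..Te.
    frame_pairs: iterable of (p3_i, m3_i, p3_j, m3_j) tuples, one per frame."""
    values = [ds_frame(*fp, **kwargs) for fp in frame_pairs]
    return float(np.mean(values))
```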
4 EXPERIMENTAL RESULTS
To demonstrate the effectiveness of our proposed
method, we conducted experiments on the region se-
quence extraction of multiple moving objects.
Scenes where a person moves a box were taken by
Kinect (Microsoft, 2015), and three pairs (Scenes 1,
2, and 3) of an image and a depth sequence were
used in the experiments. Both the image and depth
sequences were 640 × 480 pixels in size and 30 fps
in frame rate. Scenes 1, 2, and 3 contained 500,
400, and 400 frames, respectively. In these sequences,
pixel positions on the depth frame were converted to
those on the image frame by using Kinect for Win-
dows SDK (Microsoft, 2013).
The experiments were carried out on a PC (In-
tel Core i7-3770@3.40GHz, 8GB, Windows 7 Pro
x64), and a program was implemented with Visual
C++ 2010. Through preliminary experiments, the pa-
rameters in the proposed method were determined as
σ_S = 3.0, σ_Ta = 10.0, σ_Td = 30.0, and σ_M = 37.5.
4.1 Experiments of Subregion Matching
Firstly, to illustrate the effectiveness of our proposed
method in subregion matching, we carried out exper-
iments on the construction of subregion sequences.
As shown in Figure 8 (a), (b), and (c), subregions
are extracted using the image and the depth sequence
of Scene 1. To these subregions extracted from each
frame, three subregion matching methods are applied,
and their results are shown in Figure 8 (d), (e), and (f), where random colors are assigned to each constructed sequence (moving object sequence only).
Figure 8: Experimental results of constructing subregion sequences (Scene 1) at t = 0, 15, 30, 45: (a) Image sequence, (b) Depth sequence, (c) Extracted subregions, (d) Subregion sequences constructed without depth features, (e) with non-adjusted depth features, and (f) with adjusted depth features (proposed method).
Figure 8 (d) shows the subregion sequences con-
structed by a subregion matching method without
depth feature dissimilarity (with only image feature
dissimilarity), whose approach corresponds to that of
(Couprie et al., 2013). Figure 8 (e) shows the result by
a method with non-adjusted depth feature dissimilar-
ity, which corresponds to the approach of (Abramov
et al., 2012). Figure 8 (f) shows the result by the pro-
posed method, which uses adjusted depth feature dis-
similarity.
As can be seen from Figure 8 (d), some parts
of the background are mistakenly regarded as subre-
gions of the person or box by the subregion matching
method without depth feature dissimilarity. Conse-
quently, subregions of the person’s head and the back-
ground are incorporated into the light blue sequence,
and subregions of the box and the background are in-
corporated into the dark green sequence. Compared to
this, as shown in Figure 8 (e), such inaccurate corre-
spondences between subregions can be reduced by us-
ing depth feature dissimilarity. However, subregions
of the box at t = 0, 15 and t = 30, 45 are assigned different colors. This is because subregions corresponding to the same object are regarded as corresponding to different objects due to the change of their depth features.
In contrast to those results, as shown in Fig-
ure 8 (f), the proposed method constructs three differ-
ent sequences from subregions of the person’s head,
the person’s body, and the box individually. These
experimental results indicate the effectiveness of our
proposed method, which uses the temporal dissimi-
larity of depth features adjusted by each object move-
ment, in subregion matching.
Figure 9: Experimental results of merging subregion sequences (Scene 1) at t = 0, 15, 30, 45: (a) Object region sequences merged using the mode of 3D motion features in each subregion, and (b) using 3D motion features in adjacent parts (proposed method).
4.2 Experiments of Sequence Merging
Secondly, to illustrate the effectiveness of our pro-
posed method in subregion sequence merging, we car-
ried out experiments on the extraction of individual
object region sequences.
From Scenes 1, 2, and 3, subregion sequences
are constructed by the proposed method with adjusted
depth feature dissimilarity as shown in Figures 8 (f),
10 (d), and 11 (d). To these subregion sequences, two
sequence merging methods are applied, and their re-
sults are shown in Figures 9 (a), (b), 10 (e), (f), 11 (e),
and (f), where random colors are assigned to each in-
dividual object region sequence.
Figures 9 (a), 10 (e), and 11 (e) show the region
sequences extracted by a sequence merging method
which computes the mode of 3D motion features in
each subregion and uses the dissimilarity of 3D mo-
tion feature modes between subregions. This ap-
proach corresponds to that of (Trichet and Nevatia, 2013).
Figure 10: Experimental results of merging subregion sequences (Scene 2) at t = 0, 30, 60, 90: (a) Image sequence, (b) Depth sequence, (c) Extracted subregions, (d) Subregion sequences constructed with adjusted depth features, (e) Object region sequences merged using the mode of 3D motion features in each subregion, and (f) using 3D motion features in adjacent parts (proposed method).
As can be seen from those results, although the
region sequences of a person and a box are extracted
separately in all scenes, the person's head and body are
also extracted as different region sequences because
the movement of the head is substantially different
from that of the body.
Figures 9 (b), 10 (f), and 11 (f) show the results
by the proposed method using 3D motion feature dis-
similarity computed only in adjacent parts. Almost
the entire region of the person is correctly extracted as
a single region sequence in all scenes. These exper-
imental results indicate the effectiveness of our pro-
posed method, which uses the spatial dissimilarity of
3D motion features computed only in adjacent parts,
in subregion sequence merging.
Figure 11: Experimental results of merging subregion sequences (Scene 3) at t = 0, 30, 60, 90: (a) Image sequence, (b) Depth sequence, (c) Extracted subregions, (d) Subregion sequences constructed with adjusted depth features, (e) Object region sequences merged using the mode of 3D motion features in each subregion, and (f) using 3D motion features in adjacent parts (proposed method).
5 CONCLUSIONS
In this paper, we have proposed a method for extract-
ing the region sequences of multiple objects using an
image and a depth sequence. The proposed method
extracts subregions from each frame, constructs sub-
region sequences through subregion matching be-
tween successive frames, and merges subregion se-
quences into the region sequences of individual ob-
jects. To effectively make use of depth features and
3D motion features in these processes, our proposed
method employs depth feature similarity adjusted by
each object movement and 3D motion feature similar-
ity computed only in adjacent parts. Through the ex-
periments, we demonstrated the effectiveness of our
proposed method in extracting the region sequences
of multiple moving objects, where the depth varies
with frames, and articulated objects, where the mo-
tion varies with parts.
Currently, our proposed method extracts object re-
gion sequences from a whole input sequence (i.e. it
cannot process every input frame serially), and the
average processing time per frame is more than
two seconds. In future work, we would like to in-
vestigate extending our method not only to improve
the accuracy of object region sequence extraction but
also to process every set of a few input frames or ev-
ery input frame serially in real time. Furthermore,
we plan to conduct quantitative evaluation of the pro-
posed method for various scenes.
ACKNOWLEDGEMENT
This work was supported in part by the Japan Society
for the Promotion of Science (JSPS) under a Grant-
in-Aid for Scientific Research (C) (No.15K00171).
REFERENCES
Abramov, A., Pauwels, K., Papon, J., Wörgötter, F., and Dellen, B. (2012). Depth-supported real-time video segmentation with the Kinect. In Proc. IEEE Workshop Appl. Comput. Vision, pages 457–464.
Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., and
Susstrunk, S. (2012). SLIC superpixels compared to
state-of-the-art superpixel methods. IEEE Trans. Pat-
tern Anal. Machine Intell., 34(11):2274–2282.
Bergamasco, F., Albarelli, A., Torsello, A., Favaro, M., and
Zanuttigh, P. (2012). Pairwise similarities for scene
segmentation combining color and depth data. In
Proc. 21st Int. Conf. Pattern Recognit., pages 3565–
3568.
Çiğla, C. and Alatan, A. A. (2008). Object segmentation in multi-view video via color, depth and motion cues. In Proc. IEEE Int. Conf. Image Process., pages 2724–2727.
Comaniciu, D. and Meer, P. (1999). Mean shift analysis and applications. In Proc. Int. Conf. Comput. Vision, volume 2, pages 1197–1203.
Couprie, C., Farabet, C., LeCun, Y., and Najman, L. (2013).
Causal graph-based video segmentation. In Proc.
IEEE Int. Conf. Image Process., pages 4249–4253.
DeMenthon, D. and Megret, R. (2002). Spatio-temporal
segmentation of video by hierarchical mean shift anal-
ysis. Technical Report TR-4388, Center for Automat.
Res., U. of Md, College Park.
Farnebäck, G. (2003). Two-frame motion estimation based on polynomial expansion. In Proc. Scand. Conf. Image Anal., pages 363–370.
Felzenszwalb, P. F. and Huttenlocher, D. P. (2004). Efficient
graph-based image segmentation. Int. J. Comput. Vi-
sion, 59(2):167–181.
Fernández, J. and Aranda, J. (2000). Image segmentation combining region depth and object features. In Proc. 15th Int. Conf. Pattern Recognit., volume 1, pages 618–621.
Galasso, F., Cipolla, R., and Schiele, B. (2012). Video seg-
mentation with superpixels. In Proc. 11th Asian Conf.
Comput. Vision, volume 1, pages 760–774.
Grundmann, M., Kwatra, V., Han, M., and Essa, I. (2010).
Efficient hierarchical graph-based video segmenta-
tion. In Proc. IEEE Conf. Comput. Vision Pattern
Recognit., pages 2141–2148.
Lezama, J., Alahari, K., Sivic, J., and Laptev, I. (2011).
Track to the future: Spatio-temporal video segmenta-
tion with long-range motion cues. In Proc. IEEE Conf.
Comput. Vision Pattern Recognit., pages 3369–3376.
Microsoft (2013). Kinect for Windows SDK v1.8. http://www.microsoft.com/en-us/download/details.aspx?id=40278. Online; accessed 1 Sep. 2015.
Microsoft (2015). Kinect Windows app development. https://dev.windows.com/en-us/kinect. Online; accessed 1 Sep. 2015.
Trichet, R. and Nevatia, R. (2013). Video segmentation with
spatio-temporal tubes. In Proc. IEEE Int. Conf. Adv.
Video Signal Based Surv., pages 330–335.
Xia, L., Chen, C.-C., and Aggarwal, J. K. (2011). Human
detection using depth information by Kinect. In Proc.
IEEE Conf. Comput. Vision Pattern Recognit. Work-
shops, pages 15–22.
Xu, C. and Corso, J. J. (2012). Evaluation of super-voxel
methods for early video processing. In Proc. IEEE
Conf. Comput. Vision Pattern Recognit., pages 1202–
1209.