GROWING AGGREGATION ALGORITHM FOR DENSE TWO-
FRAME STEREO CORRESPONDENCE
Elisabetta Binaghi, Ignazio Gallo , Chiara Fornasier, Mario Raspanti
Universita' degli Studi dell'Insubria, Varese, Italy
Keywords: stereo, occlusion, disparity space, neural networks.
Abstract: This work aims at defining a new method for matching correspondences in stereoscopic image analysis. The
salient aspects of the method are -an explicit representation of occlusions driving the overall matching
process and the use of neural adaptive technique in disparity computation. In particular, based on the
taxonomy proposed by Scharstein and Szelinsky, the dense stereo matching process has been divided into
three tasks: matching cost computation, aggregation of local evidence and computation of disparity values.
Within the second phase a new strategy has been introduced in an attempt to improve reliability in
computing disparity. An experiment was conducted to evaluate the solutions proposed. The experiment is
based on an analysis of test images including data with a ground truth disparity map.
1 INTRODUCTION
The reconstruction of three-dimensional shape from
two or more images is a well known and intensively
investigated research problem within the Computer
Vision community (Barnard and Fischler 1982;
Barnard and Thompson W 1980; Dhond and
Aggarwal 1989).
Major efforts have been devoted to the stereo
matching sub-task aimed at computing
correspondences in two (or more) images for
obtaining dense depth maps. A substantial amount of
work has been done on stereo matching giving rise
to a variety of novel approaches (Scharstein and
Szelisky, 2002) attempting to improve upon existing
early methods (Hannah, 1989) and satisfy the high
accuracy demand in diversified application domains
such as object recognition, robotics and virtual
reality (McMillan and Bishop 1995).
Despite important achievements, the accuracy of
most innovative stereo techniques may not be
adequate especially in those situations where even
isolated errors in the depth map create visible
undesirable artefacts. The problem originates from
the fact that most stereo algorithms ignore
occlusions analysis or address it in a post processing
stage within a more general smoothing task (Bobik
and Intille 1999).
Occlusions are widespread in stereo imagery and
even when images with small disparity jumps are
processed, they drastically affect the accuracy of the
overall reconstruction process being the major
source of errors.
Recent works on stereo matching stem from the
idea of mimicking the human visual system which
uses occlusions to reason about the spatial
relationships between objects during binocular
stereopsis. Explicit representation of occlusions and
direct processing within occlusion edges
characterizes these approaches (Bobik and Intille
1999).
This paper proposes a novel algorithm for
solving stereo correspondence based on an explicit
representation of occlusions driving the overall
matching process. In particular, based on the
taxonomy proposed by Scharstein and Szelinsky, the
dense stereo matching process has been divided into
three tasks: matching cost computation, aggregation
of local evidence and computation of disparity
values (Scharstein and Szelisky, 2002) . Within the
second phase a new strategy has been introduced in
an attempt to improve reliability in computing
disparity. An experiment was conducted to evaluate
the solution proposed. The experiment is based on
the analysis of test images including data with a
ground truth disparity map and makes use of the
quality metrics proposed by Scharstein and
Szelinsky (Scharstein and Szelisky, 2002).
326
Binaghi E., Gallo I., Fornasier C. and Raspanti M. (2006).
GROWING AGGREGATION ALGORITHM FOR DENSE TWO-FRAME STEREO CORRESPONDENCE.
In Proceedings of the First International Conference on Computer Vision Theory and Applications, pages 326-332
DOI: 10.5220/0001362203260332
Copyright
c
SciTePress
2 REPRESENTATION
This section describes a data structure called
Disparity Space Image, or DSI, already introduced
in previous works (Okutomi and Kanade 1994;
Bobik and Intille 1999). The DSI is an explicit
representation of matching space and plays an
essential role in the development of the overall
matching algorithm which makes use of occlusion
constraints.
The correspondence between pixel
()
rr
yx ,
in a
Reference Image
r
f
and a pixel
()
mm
yx ,
in a
Matching Image
m
f
is defined as
()
[]
()
rrrrrrm
rrr
yxyyxdsxf
yxf
,,,
),(
η
++
=
(1)
where
1±=s
is a sign chosen so that disparities
are always positive;
(
)
r
,y
r
xd
is the disparity function
and
(
)
r
,y
r
x
η
is the Gaussian white noise.
From equation (1) we obtain
()
rrrm
yxdsxx ,+=
(2)
and from equation (2):
()( )
rmrr
xxsyxd =,
(3)
Introducing the epipolar constraint, we also
have:
mr
yy =
(4)
supposing the pixels move from right to left.
Once the disparity space has been specified, the
concept of DSI can be introduced and defined as any
image or function over the disparity space
),,( dyx
rr
. Values of DSI usually represent the cost
of a particular match implied by the particular
(
)
x,yd
considered.
Figure 2 shows a graphic representation of DSI:
each slice indicates a level of disparity varying from
0 to a value
max
d
representing the maximum
disparity for the pair of images considered.
3 GROWING TEMPLATE
ALGORITHM
According to the taxonomy proposed by Scharstein
and Szelinsky, the dense stereo matching process
can be divided into four tasks (Scharstein and
Szelisky, 2002):
1) Matching Cost Computation
2) Aggregation Cost
3) Disparity Computation and Optimization
4) Disparity Refinement
Many dense stereo matching methods have
presented several different solutions to one or more
of these tasks. The most common matching costs
include squared intensity differences (SD) and
absolute intensity differences (AD) (Cox et al., 1996;
Scharstein and Szelisky, 2002).
The actual sequence of steps in the overall
matching procedure depends on the matching
Figure 1: Correspondence between a pixel
()
rr
yx ,
in
reference image
r
f
and a pixel
()
rm
yx ,
in matching
image
m
f
The difference
()
rm
xxd =
is the disparity
value.
d
x
x
r
m
+=
r
x
r
x
r
y
r
f
m
f
x
r
Disparity = d
max
-1
Disparity = 0
y
r
d
Disparity = d
max
-2
Figure 2: Graphic Representation of Disparity
Space Image (DSI).
GROWING AGGREGATION ALGORITHM FOR DENSE TWO-FRAME STEREO CORRESPONDENCE
327
algorithm and in particular, on its local or global
nature.
Our approach, which follows a local strategy,
extends the conventional Aggregation Cost phase
including two novel sub-tasks:
2.1) Growing Raw Cost
2.2) Growing Aggregation Cost
3.1 Matching Cost Computation
Assuming the use of SD as matching function, by
equation (3) the matching cost computed for a pixel
()
rr
yx ,
is defined as:
()()()
[]
2
,,,,
rmmrrrrr
yxfyxfdyxDSI =
(5)
where d is the disparity associated with the pixel
()
rr
yx ,
and
max
0 dd
.
3.2 Aggregation Cost
Local and window-based methods aggregate the
matching cost by summing or averaging over a
support region in the DSI. The support region we
use is a two-dimensional squared window of a fixed
dimension.
In particular, this second step is performed by
summing the calculated matching costs over a
squared window with constant disparity
d
. The
aggregation cost
d
ji
AC
,
is defined as:
()
∑∑
+
=
+
=
=
Wa
am
Wb
bn
d
ji
dnmDSIAC ,,
,
(6)
where
=
2
W
ia
and
=
2
W
jb
.
Considering all the aggregation values obtained
varying the disparity in the range
[]
max
0 dΚ
, we
obtain:
[]
max
,
1
,
0
,
,
,...,
d
nmnmnm
nm
ACACACAC =
(7)
A classic Stereo Matching Algorithm, at this
point, with a Winner Take All (WTA) technique for
example, decides that the disparity is computed by
selecting the minimal aggregated cost in
nm
AC
,
. The
Growing Template Algorithm adds two new steps at
the aggregation cost phase.
We now describe the two parts that characterize
the Growing Template Aggregation Cost step.
3.3 Growing Raw Cost
Unlike conventional techniques that base further
steps of matching algorithm on the minimal
aggregated cost computed in
nm
AC
,
, our approach
bases decisions on all the costs obtained. To this
purpose, we introduce a new quantity
nm
RC
,
defined
as
]
max
,
1
,
0
,
,
,...,,
d
nmnmnm
nm
RCRCRCRC =
(8)
where each element indicates the position in the
sorted list of the element
d
ji
AC
,
.
For example, if we have the vector
[
]
16,1,12
,
=
nm
AC
the corresponding vector of raw
cost is
[
]
3,1,2
,
=
nm
RC
.
At the end of this phase for every pixel of
coordinates
(
)
nm,
in the reference image we have
associated the
nm
RC
,
calculated.
3.4 Growing Aggregation Cost
This sub-task calculates the number of confirmations
in a given support window for a given cost l.
Formally, from equation (8) we obtain the vector:
]
max
,
1
,
0
,
,
,...,,
d
nmnmnm
nm
GAGAGAGA =
(9)
where
(
)
()
×
=
WWnm
d
nm
d
nm
lRCGA
,
,,
(10)
The salient aspect of our strategy is that of
integrating contextual confirmation within the
matching cost aggregation phase. The aggregation
can be performed based on different cost values
varying the l parameter in equation 10.
Figure 3 compares DSI’s slices for fixed y
obtained by means of a matching algorithm which
uses SD matching cost within conventional AC
computation performed with support window W=5
(a) and our algorithm which includes the GA task
performed with the following parameters(c):
growing aggregation window GW=25 and l=1.
VISAPP 2006 - MOTION, TRACKING AND STEREO VISION
328
Both figures 3a and 3c highlight bands (dark
lines and white lines respectively) indicating regions
that match at a given disparity. They are more
visible in figure 3c depicting disparity and occlusion
situations without ambiguity.
3.5 Disparity Computation
The next phase in the matching algorithm consists in
the computation of the disparity map by selecting
the
nm
GA
,
components which satisfy a given
criterion. Adopting a WTA strategy, the disparity
associated with the minimum cost value is selected
at each pixel.
The present work tested an adaptive strategy
based on neural networks for disparity computation
(Rumelhart et al. 1986; Pao, 1989). A Multilayer
Perceptron neural model was adopted to compute the
disparity based on specific local information
extracted from the DSI slice.
The network is trained receiving input data from
the DSI slices. In particular, at each position of a
moving window over the DSI slice, an input pattern
is extracted and presented to the network. A training
example is constituted by a pair of elements (a,b)
where a is the input pattern collecting a set
{}
i
d
and
b corresponding disparity extracted from ground
truth image for each x within the moving window.
(Figure 4).
4 EXPERIMENTS
The experiments illustrated in this section addressed
the following questions:
how did the performance depend upon their main
parameters and upon the neural refinement
stage?
how did the Growing Template Algorithm
compare with other matching approaches?
The overall experimental activity was supported
by tools and test data available within the
implementation framework proposed by Scharstein
& Szelinski in their paper (Scharstein and Szelisky,
2002) and made available on the Web at
www.middlebury.edu/stereo. We included our stereo
correspondence algorithm in this framework, and
applied it to the test data available.
Four stereo image pairs with different types of
content are used to evaluate the performances of the
proposed algorithm (see Figure 5).
Among the quality measures available within the
framework, we adopted the RMS (root mean
squared) error (measured in disparity units) between
the computed disparity map
),( yxd
C
and the ground
truth map
),( yxd
T
2
1
2
)|),(),((|
1
(
= yxdyxd
N
R
TC
(12)
These measures are intended computed over the
whole image and five different kinds of regions in
the whole image:
y textureless regions (TEXTRD): regions where
the squared horizontal intensity gradient
averaged over a square window of a given size
(suggested value 3) is below a given threshold
(suggested value 4.0)
y textured regions (TEXTRLS): regions
complementary to the textureless regions
(c)
x
d
(a)
Figure 3: DSI’s slices obtained from the traditional AC
values for each pixel of line 100 (a), and obtained from
the GA on the same line (c). Slice (b) represents
correct dis
p
arit
y
information for line 100
(b)
Trained
Neural
Network
d
d
1
or winner= maximum value
d
2
or second
x
moving
window
input values
selection
d
x
d
x
computed
disparity
Figure 4 –Disparity computation procedure based on a
trained neural network. The process extracts patterns
from a window positioned over a DSI slice. For each x
we select two values: d
1
and d
2
. The trained neural
network calculates the disparity associated with the
center of the moving window.
GROWING AGGREGATION ALGORITHM FOR DENSE TWO-FRAME STEREO CORRESPONDENCE
329
y occluded regions (OCCL): regions that are
occluded in the matching image
y non occluded regions (NONOCCL): regions
complementary to the occluded regions
y depth discontinuity regions (DISC): pixels
whose neighboring disparity differs by more
than a given threshold (suggested value 2.0),
dilated by a window of a given width (suggested
value 9)
These regions are computed by pre-processing
reference images and ground truth disparity maps
yielding binary segmentation.
4.1 Sensitivity Analysis
We attempted to evaluate the effects of
systematically varying some of the key parameters
of our stereo algorithm to find an optimal setting for
all situations. Experiments were developed using SD
and AD matching costs and windows of size 7, 15,
21, 35 for Growing Aggregation Cost.
Results obtained demonstrate that performances
are not strongly influenced by the type of matching
costs used. A large window can help for occlusion
regions. Inversely, small windows perform better on
discontinuity regions.
Dimensions of 35X35 were selected allotting the
task of optimizing the balance between high
accuracy in occlusion and discontinuity regions to
the neural adaptive stage.
4.2 Performance Evaluation and
Comparison Analysis
At first the evaluation procedure aimed to identify
and evaluate the contribution of neural refinement
within the global matching algorithm. To this
purpose we compared the Growing Aggregation
Algorithm including the neural stage with the same
version including the WTA strategy for disparity
computation. The evaluation was based on the
monochromatic MAP pair of images. Training data
for the neural stage has been selected from the Map
image in a measure of 10% of global pixels.
As shown in Table 1, the algorithm GA presents
a competitive behaviour. However its principal merit
consists in preparing ideal conditions for the
subsequent neural stage as demonstrated by the fact
that the algorithm with neural refinement (GA+N)
strongly prevails in all regions considered.
We compared performances obtained by means
of the Growing Aggregation Algorithm including
neural refinement with those obtained by running
four algorithms implemented within the cited
framework available, selected among those with
better performances (Table 2).
Figure 5: Left image and ground truth disparity
ma
p
s of test set.
RM S
0
2
4
6
8
10
12
14
16
18
7 15 21 35 7 15 21 35
w indow size
ALL
N_OCCL
OCCL
TEXTRD
TEXTRLS
DISC
SD AD
Figure 6: Sensitivity Analysis of Growing
Aggregation Algorithm.
VISAPP 2006 - MOTION, TRACKING AND STEREO VISION
330
Table 1: Results obtained with MAP image. GA=Growing
Aggregation, GA+N=Growing Aggregation+ Neural
Network for disparity computation.
ALL
NON
OCCL
OCCL
TEXT
RD
TEXT
RLS
DISC
GA
3.85 1.31 14.88 1.32 0.40 4.69
GA+N
1.66 1.42 3.82 1.42 0.72 4.86
Figure 6 shows final disparity maps obtained by
processing the four stereo image pairs considered.
Table 2: Results obtained with MAP image. DP= Dynamic
Programming, SSD=Sum of Squared Difference,
SO=Scanline Optimization, BD=Bayesian Diffusion.
ALL
NON
OCCL
OCCL
TEXT
RD
TEXT
RLS
DISC
DP
3.15 2.98 5.21 2.99 1.75 5.86
SSD
3.92 1.66 14.65 1.67 0.44 6.07
SO
4.39 2.02 16.09 2.02 2.57 5.25
BD
4.66 0.93 18.74 0.94 0.43 2.95
5 CONCLUSIONS
Our objective in this study was to investigate the
potentialities of a new method aimed at solving
correspondence problem within a two-frame area
matching approach and producing dense disparity
maps.
The strategy was tested on standard data sets
available on the Web. As seen in this experimental
context the allied use of the growing aggregation
strategy and neural adaptive techniques benefits the
matching in general and in particular in occluded
regions. The use of adapting techniques allow to
process raw data directly extracted from DSI slices
without formalizing explicitly the information useful
for handling occlusions. The trained network
encodes the knowledge about occlusions and
efficiently uses it in generalization.
We consider this study preliminary to further
investigation involving both methodological and
experimental issues.In particular, the present
solutions must be reinforced implementing an
operative strategy for training neural network;
strategies will be integrated in an attempt of
improving generalization in such a way that trained
networks could be reliably applied to different kind
of images never seen during the training stage .
Further experiments dealing with Scanning
Electron Microscopy imagery are planned.
REFERENCES
Barnard, S.T. and Fischler, M.A. 1982.Computational
Stereo. ACM Computing Surveys, 14(4):553-572.
Barnard, T. and Thompson W.B. 1980. Disparity
Analysis of Images. IEEE Trans. PAMI :333-340.
Bishop, C.M.1995. Neural Networks for Pattern
Recognition, Oxford University Press, Oxford.
Bobik, A. F and Intille, S. S.. 1999. Large occlusion
stereo. International Journal on Computer Vision, 33,
181-200.
Figure 6: Final disparity maps and difference
images.
GROWING AGGREGATION ALGORITHM FOR DENSE TWO-FRAME STEREO CORRESPONDENCE
331
Cox, J.I., S.L. Higonani, S.P. Rao, B.M. Maggs. 1996. A
Maximum Likelihoods Stereo Algorithm. Computer
Vision and Image Understanding, 63, 542-567.
Dhond, U.R. and Aggarwal, J.K. 1989. Structure from
Stereo – a review. IEEE Trans. On Systems, Man, and
Cybernetics, 19, 1489-1510.
Hannah, M.J. 1989. A system for digital stereo
image matching. Photogrammetric Engineering
and Remote Sensing, 55, 1765-1770.
McMillan, L. and Bishop, G. 1995. Plenoptic
modelling:An image-based rendering system.
Computer Graphics (SIG-GRAPH’95), 39-46.
Kanade, T.and Okutomi, M. 1994. A Stereo Matching
Algorithm with an Adaptive Window: Theory and
Experiment. IEEE Trans. on PAMI, 16(9), 920-932.
Pao, Y.H..1989. Adaptive Pattern Recognition and
Neural Networks. Addison Wesley, MA.
Rumelhart, H., G.E.Hinton, and R.J.Williams. 1986.
Learning Internal Representation by Error
Propagation, in Rumelhart H., McClelland J.L.(eds.),
Parallel Distributed Processing, 318-362. MIT Press,
Cambridge, MA.
Scharstein, D. and Szeliski R. 2002. A Taxonomy and
Evaluation of Dense Two-Frame Stereo
Correspondence Algorithms. International Journal of
Computer Vision, 47, 7-42.
VISAPP 2006 - MOTION, TRACKING AND STEREO VISION
332