Selective Use of Optimal Image Resolution for Depth from Multiple Motions based on Gradient Scheme

Norio Tagawa and Shoei Koizumi

Graduate School of System Design, Tokyo Metropolitan University, 6-6 Asahigaoka, Hino, Tokyo, Japan

Keywords:

Direct Shape Recovery from Motion, Gradient Method, Random Camera Motions, Fixational Eye Movements.

Abstract:

The gradient-based depth-from-motion method is effective for obtaining a dense depth map. However, a depth map recovered from only two successive images is not very accurate, so additional depth information is often gathered by tracking corresponding image points through an image sequence, for example with a Kalman-filter-like technique. Alternatively, multiple image pairs generated by small random camera rotations around a reference direction can supply rich depth information without any tracking procedure. Within this framework, we propose the selective use of the optimal image resolution to further improve accuracy. The image at the appropriate resolution should have a locally linear intensity pattern, which is the most important assumption of the gradient method commonly used for dense depth recovery in the theory of "shape from motion." The performance of the proposal is examined through numerical evaluations using artificial images.

1 INTRODUCTION

Gradient-based depth-from-motion methods have been studied vigorously as a way to recover a dense depth map (Horn and Schunck, 1981), (Simoncelli, 1999), (Bruhn and Weickert, 2005), (Tagawa et al., 2008), (Brox and Malik, 2011), (Ochs and Brox, 2012). However, a depth map recovered from two successive images is not accurate enough, and hence some methods track corresponding points through an image sequence in order to use multiple viewpoints. Accurate tracking is itself difficult, and various techniques have been studied, for example those based on the Kalman filter (Paramanand and Rajagopalan, 2012) and the particle filter. We have also proposed a method that adopts Bayesian label assignment instead of explicit tracking (Ikeda et al., 2009). Ideally, accurate depth recovery should not require tracking at all.

The accuracy of the gradient method depends strongly on the equation error of the gradient equation. The gradient equation is a first-order approximation of the intensity invariance constraint before and after the relative motion between a camera and an object, and in general the second- and higher-order terms cause the equation error. The amount of error depends on the relation between the size of the image motion, called the optical flow, and the spatial frequency of the image intensity pattern. This means that an appropriate spatial frequency exists at each pixel according to the size of the optical flow there. Therefore, we can select the optimal image resolution, which contains the effective frequency, and use it in the gradient equation. However, if the images contain little variation in spatial frequency, the optimal frequency component cannot always be extracted at each pixel for the specific optical flow determined by the depth at that pixel and the relative camera motion. To avoid this problem, we should analyze many intensity pairs, i.e., many optical flows, for each 3-D point on a target object.

On the other hand, a depth recovery method that uses random camera rotations imitating the fixational eye movements of the human eyeball (Martinez-Conde et al., 2004) has been proposed (Tagawa, 2010). In this method, since the camera is assumed to rotate randomly around the reference direction with a small rotation angle, the gradient method can be applied simultaneously to many image pairs without tracking image points. In the usual framework of the gradient method, the optical flow is first detected from the gradient equation, and the depth map is then recovered from the optical flow. This two-step procedure is not suitable for extending the gradient scheme to multiple image pairs, so (Tagawa, 2010) adopts a direct method in which the depth map is recovered directly, without detecting the optical flow. In this study, we propose the selective use of the optimal image resolution within the framework of the method in (Tagawa, 2010).

Tagawa N. and Koizumi S.
Selective Use of Optimal Image Resolution for Depth from Multiple Motions based on Gradient Scheme.
DOI: 10.5220/0005462500920099
In Proceedings of the 5th International Workshop on Image Mining. Theory and Applications (IMTA-5-2015), pages 92-99
ISBN: 978-989-758-094-9
© 2015 SCITEPRESS (Science and Technology Publications, Lda.)

In the following, Sec. 2 outlines the method in (Tagawa, 2010), Sec. 3 describes the proposed method for selecting the optimal resolution and confirms its effectiveness through numerical evaluations, and Sec. 4 presents the conclusions of this study.

2 OUTLINE OF DEPTH FROM MULTIPLE IMAGE PAIRS

2.1 Camera Motions Imitating Tremor

The camera coordinate system and the camera motion model used in this study are the same as those in (Tagawa, 2010) and are shown in Fig. 1. The motion model imitates tremor, which is the smallest of the fixational eye movements (Martinez-Conde et al., 2004). We use a perspective projection system as our camera-imaging model. A space point $(X, Y, Z)^\top$ on an object is projected to an image point $\vec{x} \equiv (x, y, 1)^\top = (X/Z, Y/Z, 1)^\top$.

By analogy with a human eyeball, we set the camera's rotation center behind the lens center, at a distance $Z_0$ along the optical axis, and we assume that the camera has no explicit translational motion. This rotation, with rotational vector $\vec{r} = (r_X, r_Y, r_Z)^\top$, can also be represented as a rotation about the coordinate origin with the same rotational vector $\vec{r}$. The offset between the origin and the rotation center, however, implicitly causes a translational vector $\vec{u} = (u_X, u_Y, u_Z)^\top$, which is formulated as follows:

\[
\begin{bmatrix} u_X \\ u_Y \\ u_Z \end{bmatrix}
= \begin{bmatrix} r_X \\ r_Y \\ r_Z \end{bmatrix} \times \begin{bmatrix} 0 \\ 0 \\ Z_0 \end{bmatrix}
= Z_0 \begin{bmatrix} r_Y \\ -r_X \\ 0 \end{bmatrix}. \tag{1}
\]

With this system, $Z_0$ can easily be known beforehand, and hence the absolute depth can be recovered, whereas a general camera motion yields only relative depth.
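As a minimal numerical sketch of Eq. 1, the implicit translation can be evaluated as the cross product of the rotation vector with the offset $(0, 0, Z_0)^\top$. The function name `implicit_translation` and the sample values are illustrative assumptions, not taken from the paper.

```python
def implicit_translation(r, z0):
    """Translation u implied by a rotation r about a center offset by z0 (Eq. 1).

    r  : (r_X, r_Y, r_Z), small rotation vector
    z0 : offset of the rotation center along the optical axis (assumed known)
    """
    r_x, r_y, r_z = r
    # Cross product r x (0, 0, z0); the r_Z component contributes nothing.
    return (r_y * z0, -r_x * z0, 0.0)


# Hypothetical values chosen only for illustration.
u = implicit_translation((0.002, -0.001, 0.005), z0=2.0)
print(u)
```

Changing only the third component of `r` leaves the result unchanged, which is the property used in the text to drop $r_Z$.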

From Eq. 1, it can be seen that $r_Z$ causes no translation. Therefore, we set $r_Z = 0$ and define $\vec{r} = (r_X, r_Y, 0)^\top$ as a rotational vector like that of an eyeball. $\vec{r}(t)$ can be treated as a stochastic white Gaussian process, in which $t$ indicates time measured from a reference time. The fluctuation of $\vec{r}(t)$ at each time is assumed to follow a two-dimensional Gaussian distribution with mean 0 and variance $\sigma_r^2$, where $\sigma_r^2$ is assumed to be known:

\[
p(\vec{r}(t) \mid \sigma_r^2) = \frac{1}{(\sqrt{2\pi}\,\sigma_r)^2} \exp\left\{ -\frac{\vec{r}(t)^\top \vec{r}(t)}{2\sigma_r^2} \right\}. \tag{2}
\]

Figure 1: Coordinate system and camera motion model used in this study.

In the above description, we define $\vec{r}$ as a rotational velocity to simplify the theoretical analysis. For small values of the actual rotation angle, Eq. 1 and the equations below approximate finite camera motions.
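The rotation model of Eq. 2 can be simulated with independent zero-mean Gaussian draws for $r_X$ and $r_Y$ and a fixed $r_Z = 0$. The sketch below is only an illustration; the function name `sample_rotations` and the fixed seed are assumptions.

```python
import random


def sample_rotations(num_frames, sigma_r, seed=0):
    """Draw i.i.d. rotation vectors (r_X, r_Y, 0) with r_X, r_Y ~ N(0, sigma_r^2)."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return [
        (rng.gauss(0.0, sigma_r), rng.gauss(0.0, sigma_r), 0.0)
        for _ in range(num_frames)
    ]


# One rotation per frame; the values here are illustrative.
rotations = sample_rotations(num_frames=100, sigma_r=0.006)
```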

2.2 Depth from Multiple Image Pairs based on Gradient Method

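Using the camera motion model described above and the inverse depth $d(x, y) \equiv 1/Z(x, y)$, the optical flow $\vec{v} \equiv [v_x, v_y]^\top$ is formulated as follows:

\[
v_x = x y\, r_X - (1 + x^2)\, r_Y + y\, r_Z - Z_0 r_Y d \equiv v_x^r - r_Y Z_0 d, \tag{3}
\]

\[
v_y = (1 + y^2)\, r_X - x y\, r_Y - x\, r_Z + Z_0 r_X d \equiv v_y^r + r_X Z_0 d. \tag{4}
\]

In the above equations, $d$ is an unknown variable at each pixel, and $\vec{u}$ and $\vec{r}$ are unknown parameters common to all pixels.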
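The flow model of Eqs. 3 and 4 can be written as a small function. This is a sketch under the notation assumed here (rotation components $r_X, r_Y, r_Z$ and rotation-center offset $Z_0$); the name `optical_flow` and the sample values are hypothetical.

```python
def optical_flow(x, y, d, r, z0):
    """Optical flow (v_x, v_y) at image point (x, y) following Eqs. 3 and 4.

    d  : inverse depth 1/Z at the pixel
    r  : rotation vector (r_X, r_Y, r_Z)
    z0 : offset of the rotation center behind the lens center
    """
    r_x, r_y, r_z = r
    # Depth-independent part caused by the rotation itself.
    v_x_rot = x * y * r_x - (1.0 + x * x) * r_y + y * r_z
    v_y_rot = (1.0 + y * y) * r_x - x * y * r_y - x * r_z
    # Depth-dependent part caused by the implicit translation.
    return v_x_rot - r_y * z0 * d, v_y_rot + r_x * z0 * d


# Flow at the image center for an illustrative rotation and inverse depth.
v_center = optical_flow(0.0, 0.0, d=0.1, r=(0.01, 0.02, 0.0), z0=2.0)
```

Only the last term of each component depends on $d$, which is what allows depth to be recovered from the flow.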

At each pixel position $(x, y)$, the gradient equation is formulated with the partial differentials $f_x$, $f_y$ and $f_t$ of the image brightness $f(x, y, t)$ and the optical flow as follows (Horn and Schunck, 1981):

\[
f_t = -f_x v_x - f_y v_y, \tag{5}
\]

where $t$ denotes time. By substituting Eqs. 3 and 4 into Eq. 5, the gradient equation representing a rigid motion constraint can be derived explicitly:

\[
f_t = -(f_x v_x^r + f_y v_y^r) - (-f_x r_Y + f_y r_X) Z_0 d \equiv -f^r - f^u d. \tag{6}
\]
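To illustrate how Eq. 6 ties the observed derivatives to the inverse depth, the sketch below evaluates $f^r$ and $f^u$ for one pixel and, assuming the rotation is known and $f^u \neq 0$, solves Eq. 6 for $d$. This single-pair inversion is only an illustration: the method described in this paper estimates $d$ statistically from many pairs with unknown rotations. The function names are hypothetical.

```python
def rigid_terms(fx, fy, x, y, r, z0):
    """Return (f^r, f^u) of Eq. 6 for one pixel and one rotation."""
    r_x, r_y, r_z = r
    v_x_rot = x * y * r_x - (1.0 + x * x) * r_y + y * r_z
    v_y_rot = (1.0 + y * y) * r_x - x * y * r_y - x * r_z
    f_r = fx * v_x_rot + fy * v_y_rot
    f_u = (-fx * r_y + fy * r_x) * z0
    return f_r, f_u


def inverse_depth_single_pair(ft, fx, fy, x, y, r, z0):
    """Solve f_t = -f^r - f^u d for d (illustration only, rotation assumed known)."""
    f_r, f_u = rigid_terms(fx, fy, x, y, r, z0)
    if f_u == 0.0:
        raise ValueError("f^u is zero: this pixel carries no depth information")
    return -(ft + f_r) / f_u
```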

$M$ is the number of pairs of two successive frames and $N$ is the number of pixels. We assume that $f_t^{(i,j)}$ has a Gaussian random error corresponding to the equation error, and that $f_x^{(i,j)}$ and $f_y^{(i,j)}$ have no error:

\[
p(f_t^{(i,j)} \mid d^{(i)}, \vec{r}^{(j)}, \sigma_o^2) = \frac{1}{\sqrt{2\pi}\,\sigma_o} \exp\left\{ -\frac{\left( f_t^{(i,j)} + f^{r\,(i,j)} + f^{u\,(i,j)} d^{(i)} \right)^2}{2\sigma_o^2} \right\}, \tag{7}
\]

where $i = 1, \cdots, N$ and $j = 1, \cdots, M$, and $\sigma_o^2$ is an unknown variance.
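Under the Gaussian model of Eq. 7, if the rotations were known, maximizing the likelihood over $d^{(i)}$ at a single pixel would reduce to a one-parameter least-squares fit across the $M$ pairs. The sketch below shows that simplified case. It ignores the depth prior and the fact that the rotations are actually unknown, and the function name is hypothetical.

```python
def ml_inverse_depth(ft, f_r, f_u):
    """Least-squares d minimizing sum_j (f_t + f^r + f^u d)^2 at one pixel.

    ft, f_r, f_u : sequences over the image pairs j = 1..M
    """
    num = sum(u * (t + r) for t, r, u in zip(ft, f_r, f_u))
    den = sum(u * u for u in f_u)
    if den == 0.0:
        raise ValueError("all f^u are zero: depth is not observable at this pixel")
    return -num / den
```

Pairs with a larger $|f^u|$ weigh more heavily in this estimate, since they are the ones in which the depth-dependent term is most visible.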

Since multiple frames vibrated by irregular rotations $\{\vec{r}^{(j)}\}$ are used for depth recovery without tracking the corresponding points in the images, the recovered $d^{(i)}$ at each pixel takes an average value over the neighboring region defined by the vibration width in the image. Therefore, $\{d^{(i)}\}$ should be assumed to have local correlation in the image. In this study, to simplify the stochastic modeling of $\{d^{(i)}\}$, we adopt the following equation as the depth model:

\[
p(\mathbf{d} \mid \sigma_d^2) = \frac{1}{(\sqrt{2\pi}\,\sigma_d)^N} \exp\left( -\frac{\mathbf{d}^\top \mathbf{L}\, \mathbf{d}}{2\sigma_d^2} \right), \tag{8}
\]

where $\mathbf{d}$ is an $N$-dimensional vector composed of $\{d^{(i)}\}$ and $\mathbf{L}$ indicates the matrix corresponding to the 2-dimensional Laplacian operator. Assuming this probability density makes the recovered depth map smooth. The use of a prior distribution on the values to be estimated is interpreted as a regularization scheme from the signal processing viewpoint (Poggio et al., 1985). In this study, the variance $\sigma_d^2$ is controlled heuristically in consideration of the smoothness of the recovered depth map. Hereafter, the variance parameters are collectively denoted by $\Theta$.
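The text does not spell out how $\mathbf{L}$ is discretized. As one plausible reading, the sketch below assumes a 4-neighbor graph Laplacian (positive semi-definite, with free boundaries), for which $\mathbf{d}^\top \mathbf{L}\, \mathbf{d}$ equals the sum of squared differences between adjacent pixels. This choice and the function names are assumptions.

```python
def smoothness_energy(d_map):
    """d^T L d for a 4-neighbor graph Laplacian on a 2-D grid (assumed form of L)."""
    rows, cols = len(d_map), len(d_map[0])
    energy = 0.0
    for i in range(rows):
        for j in range(cols):
            if j + 1 < cols:  # horizontal neighbor
                energy += (d_map[i][j + 1] - d_map[i][j]) ** 2
            if i + 1 < rows:  # vertical neighbor
                energy += (d_map[i + 1][j] - d_map[i][j]) ** 2
    return energy


def log_prior(d_map, sigma_d2):
    """Logarithm of the depth prior (Eq. 8) up to an additive constant."""
    return -smoothness_energy(d_map) / (2.0 * sigma_d2)
```

A constant inverse depth map has zero energy, so the prior penalizes only local variation; a smaller $\sigma_d^2$ enforces a smoother map.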

Based on the probabilistic models of $\vec{r}^{(j)}$, $f_t^{(i,j)}$ and $d^{(i)}$ defined above, we can statistically estimate the depth map. By applying the MAP-EM algorithm (Dempster et al., 1977), $\{\mathbf{d}, \Theta\}$ can be estimated as a MAP estimator based on $p(\mathbf{d}, \Theta \mid \{f_t^{(i,j)}\})$, which is formulated by marginalizing the joint probability $p(\{\vec{r}^{(j)}\}, \mathbf{d}, \Theta \mid \{f_t^{(i,j)}\})$ with respect to $\{\vec{r}^{(j)}\}$, and in which the prior of $\Theta$ is formally regarded as a uniform distribution. The concrete formula of $p(\mathbf{d}, \Theta \mid \{f_t^{(i,j)}\})$ is shown in Eq. 15 in the APPENDIX. Additionally, $\{\vec{r}^{(j)}\}$ can be estimated as a MAP estimator based on $p(\{\vec{r}^{(j)}\} \mid \{f_t^{(i,j)}\}, \hat{\Theta}, \hat{\mathbf{d}})$, in which $\hat{\cdot}$ means a MAP estimator. Its concrete formula is also given in the APPENDIX as Eq. 17. From these formulations, direct MAP estimation of $\mathbf{d}$ turns out to be difficult, but the MAP-EM algorithm can solve it stably through an indirect iterative scheme, each iteration of which consists of an E-step and an M-step. In the concrete update procedure for $\mathbf{d}$, we use the One-Step-Late (OSL) technique (Green, 1990) to avoid complicated computation. The details of the algorithm are given in (Tagawa, 2010).

3 ACCURATE RECOVERY BY OUTLIER REDUCTION

3.1 Selection of Optimal Resolution for Gradient Equation

The gradient equation in Eq. 5 is a linear approximation of the intensity invariance constraint before and after the relative camera motion. In general, there are second- and higher-order terms, which act as the equation error contained in the observation of $f_t$ and cause recovery errors in the depth map. Here, as in many conventional methods, $f_t$ is defined as a simple difference between successive images. The amount of these unwanted terms depends on the relation between the spatial frequency of the intensity pattern and the size of the optical flow at each pixel. Therefore, in this study, we try to improve the accuracy of the recovered depth by selecting and using the suitable spatial frequency component for each image pair and discarding the other components as outliers.

After the multi-resolution decomposition shown in Fig. 2, the proposed strategy consists of two steps. The first step detects the spatial frequencies at which a drastically large equation error is observed in the gradient equation. As one index, we use the consistency of the spatial gradient vectors between two successive frames at each pixel, $\nabla f^{(i,j)} \equiv [f_x^{(i,j)}, f_y^{(i,j)}]^\top$ and $\nabla f^{(i,j-1)} \equiv [f_x^{(i,j-1)}, f_y^{(i,j-1)}]^\top$. It should be noted that low-resolution images are likely to generate a smaller equation error than high-resolution images, whereas high-resolution images are needed to recover a high-resolution depth map. Hence, at each pixel, we search from low resolution to high resolution for the resolution at which the directions of the spatial gradients in the successive frames are reversed, using the sign of the inner product $\nabla f^{(i,j)\top} \nabla f^{(i,j-1)}$. The image components whose resolution is lower than the resolution at which the sign of $\nabla f^{(i,j)\top} \nabla f^{(i,j-1)}$ is negative are selected as the candidates for the most appropriate resolution.
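The first step can be sketched for a single pixel and a single image pair as follows. The function name `candidate_levels`, the ordering of levels from low to high resolution, and the treatment of the case where no reversal is found (all levels become candidates) are assumptions made for illustration.

```python
def candidate_levels(grads_curr, grads_prev):
    """Return the resolution levels below the first gradient reversal.

    grads_curr, grads_prev : lists of (f_x, f_y) for the two successive frames,
                             one entry per level, ordered low -> high resolution
    """
    candidates = []
    for level, (g1, g0) in enumerate(zip(grads_curr, grads_prev)):
        inner = g1[0] * g0[0] + g1[1] * g0[1]
        if inner < 0.0:
            # Gradient direction reversed: this and finer levels are rejected.
            break
        candidates.append(level)
    return candidates


# Four illustrative levels; the reversal occurs at level 2.
curr = [(1.0, 0.0), (1.0, 0.2), (0.5, -1.0), (1.0, 1.0)]
prev = [(0.9, 0.1), (0.8, 0.0), (-0.6, 0.2), (1.0, 1.0)]
levels = candidate_levels(curr, prev)
```

An empty result means that even the coarsest level shows a reversal, so the caller has to decide how to treat that observation.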

In the second step, the amount of the nonlinear terms included in the observation of $f_t$ is estimated and is used to detect the most appropriate resolution, which is then used to recover the depth at each pixel. $f_t$ is exactly represented as follows:

\[
f_t = -f_x v_x - f_y v_y - \frac{1}{2}\left\{ f_{xx} v_x^2 + f_{yy} v_y^2 + 2 f_{xy} v_x v_y \right\} + \cdots. \tag{9}
\]

If the first step is performed well, the nonlinear term can be considered small, and in this case the second-order term in Eq. 9 can be estimated at each pixel $i$ as follows:

\[
-\frac{1}{2}\left\{ \left( f_x^{(i,j)} - f_x^{(i,j-1)} \right) v_x^{(i,j)} + \left( f_y^{(i,j)} - f_y^{(i,j-1)} \right) v_y^{(i,j)} \right\}. \tag{10}
\]

Naturally, we can define two measures for estimating the amount of the equation error:

\[
J_1 \equiv \frac{\left| \left( f_x^{(i,j)} - f_x^{(i,j-1)} \right) v_x^{(i,j)} + \left( f_y^{(i,j)} - f_y^{(i,j-1)} \right) v_y^{(i,j)} \right|}{2 \left| f_x^{(i,j)} v_x^{(i,j)} + f_y^{(i,j)} v_y^{(i,j)} \right|}, \tag{11}
\]

\[
J_2 \equiv \frac{\left| \left( f_x^{(i,j)} - f_x^{(i,j-1)} \right) v_x^{(i,j)} + \left( f_y^{(i,j)} - f_y^{(i,j-1)} \right) v_y^{(i,j)} \right|}{2 \sqrt{ f_x^{(i,j)\,2} + f_y^{(i,j)\,2} }}. \tag{12}
\]

$J_1$ measures the nonlinearity as a ratio to the amount of the first-order term, which can be interpreted as a signal-to-noise ratio. The merit of $J_1$, seen from Eq. 11, is that it can be estimated using only the direction of the true optical flow; the amplitude of the optical flow need not be known. $J_2$ measures the nonlinearity in the dimension of the optical flow, and this amount is proportional to the recovery error of $d$.
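A minimal sketch of the two measures is given below. It assumes, following the description in the text, that $J_1$ normalizes the second-order estimate of Eq. 10 by the magnitude of the first-order term and that $J_2$ normalizes it by the magnitude of the spatial gradient. The function names and the handling of a zero denominator are assumptions.

```python
import math


def second_order_estimate(grad_curr, grad_prev, flow):
    """Estimate of the second-order term (Eq. 10) from the change in gradient."""
    dfx = grad_curr[0] - grad_prev[0]
    dfy = grad_curr[1] - grad_prev[1]
    return -0.5 * (dfx * flow[0] + dfy * flow[1])


def measure_j1(grad_curr, grad_prev, flow):
    """Second-order estimate relative to the first-order term (unitless ratio)."""
    first_order = grad_curr[0] * flow[0] + grad_curr[1] * flow[1]
    if first_order == 0.0:
        return math.inf  # assumed convention when the first-order term vanishes
    return abs(second_order_estimate(grad_curr, grad_prev, flow)) / abs(first_order)


def measure_j2(grad_curr, grad_prev, flow):
    """Second-order estimate relative to the gradient magnitude (units of flow)."""
    norm = math.hypot(grad_curr[0], grad_curr[1])
    if norm == 0.0:
        return math.inf  # assumed convention for a flat intensity pattern
    return abs(second_order_estimate(grad_curr, grad_prev, flow)) / norm
```

Scaling the flow by a constant leaves `measure_j1` unchanged and scales `measure_j2` by the same constant, which matches the remarks that $J_1$ needs only the flow direction and that $J_2$ has the dimension of the optical flow.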

Additionally, to estimate the higher-order terms including the second-order term, the following two measures can be used:

\[
J_3 \equiv \frac{\left| f_t^{(i,j)} - \bar{f}_t^{(i,j)} \right|}{\left| f_x^{(i,j)} v_x^{(i,j)} + f_y^{(i,j)} v_y^{(i,j)} \right|}, \tag{13}
\]

\[
J_4 \equiv \frac{\left| f_t^{(i,j)} - \bar{f}_t^{(i,j)} \right|}{\sqrt{ f_x^{(i,j)\,2} + f_y^{(i,j)\,2} }}, \tag{14}
\]

where $\bar{f}_t^{(i,j)}$ is the true value of $f_t$. For the candidate resolutions selected in the first step, the most appropriate resolution for depth recovery is determined by comparing the values of one measure selected from $J_k$ $(k = 1, 2, 3, 4)$. Note that the exact values of these measures cannot be computed, since they include the variables to be determined. Therefore, only their estimates are available.
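The final choice among the candidate resolutions can be sketched as a simple minimum search. The assumption that the smallest estimated measure marks the most appropriate resolution, the return value `None` for an empty candidate list, and the function name are illustrative choices rather than details given in the text.

```python
def select_resolution(candidates, measure_by_level):
    """Pick the candidate level with the smallest estimated error measure.

    candidates       : resolution levels that passed the first step
    measure_by_level : mapping level -> estimated J_k at this pixel and pair
    Returns None when there is no candidate (assumed: observation is discarded).
    """
    if not candidates:
        return None
    return min(candidates, key=lambda level: measure_by_level[level])


# Level 3 has the smallest value but is not a candidate, so level 1 is chosen.
best = select_resolution([0, 1, 2], {0: 0.3, 1: 0.05, 2: 0.2, 3: 0.01})
```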

3.2 Numerical Evaluation

To confirm the effectiveness of the proposed method, and especially to compare the efficiency of $J_k$ $(k = 1, 2, 3, 4)$, we conducted numerical evaluations using artificial images. Figure 3 shows the true inverse depth map used for evaluation. The vertical axis indicates the inverse depth $d$ using the focal length as a unit, and the horizontal axes represent the pixel position in the image plane. The reference image, generated by a computer graphics technique, is shown in Fig. 2. The images viewed with random camera motions are generated using the reference image, the true depth map and random camera rotations sampled as independent Gaussian random variables as defined in Eq. 2. The image size adopted in these evaluations is 256×256 pixels, which corresponds to $-0.5 \le x, y \le 0.5$ measured using the focal length as a unit. All images were decomposed into the four resolution layers shown in Fig. 2. We recovered the depth map from an image set consisting of 100 images, each having a random movement. Image pairs consisting of each image and the reference image were used to compute $f_x$, $f_y$ and $f_t$ in the gradient equation. The most appropriate resolution was determined at each pixel in each image pair by the method explained in the previous section, and was used as an observation for the MAP-EM algorithm. A plane $Z = 9$ was used as the initial value of the depth map. The parameter $\sigma_d^2$, which determines the degree of smoothness of the recovered depth, was fixed at $\sigma_d^2 = 1.0 \times 10^{-4}$, and the accuracy of the recovered depth map was evaluated while varying $\sigma_r$, which determines the amplitude of the random camera rotations.

Figure 2: Example of multi-resolution decomposition.

Figure 3: True depth map used for evaluation.

Figure 4: Recovered depth map with $\sigma_r = 0.006$: appropriate resolution is selected using (a) $J_1$, (b) $J_2$, (c) $J_3$, (d) $J_4$, and (e) result by conventional method.

Figure 5: Recovered depth map with $\sigma_r = 0.008$: appropriate resolution is selected using (a) $J_1$, (b) $J_2$, (c) $J_3$, (d) $J_4$, and (e) result by conventional method.

In this evaluation, we first use the true value of the optical flow to compute $J_k$ $(k = 1, 2, 3, 4)$ in order to confirm the essential effectiveness of the selective use of the spatial frequency. Figure 4 shows the recovered inverse depth map for $\sigma_r = 0.006$. Likewise, Fig. 5 shows the results for $\sigma_r = 0.008$, i.e., for larger motions than in Fig. 4. In the figures, the conventional method (Tagawa, 2010) means that the image intensity is used as is, namely that selective use of the optimal resolution is not performed. Figure 6 shows the relation between the depth recovery error and the amplitude of the random camera rotations. From both the root mean square error (RMSE) and the relative error, we can confirm that the proposed outlier reduction is effective, especially for large camera rotations. For relatively small camera rotations, all of the proposed measures can be used to improve the accuracy of the recovered depth. Among the four measures, $J_1$ shows good performance regardless of the size of the camera motion.
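The two error figures can be computed along the following lines. The RMSE is standard; the exact definition of the relative error is not stated in the text, so the mean of $|\hat{d} - d| / |d|$ used here is an assumption, as are the function names.

```python
import math


def rmse(estimated, true):
    """Root mean square error between estimated and true inverse depths."""
    total = sum((e - t) ** 2 for e, t in zip(estimated, true))
    return math.sqrt(total / len(true))


def mean_relative_error(estimated, true):
    """Mean of |estimate - truth| / |truth| (assumed definition of relative error)."""
    total = sum(abs(e - t) / abs(t) for e, t in zip(estimated, true))
    return total / len(true)
```

Both functions take flattened lists of per-pixel values, so a 2-D depth map has to be unrolled before it is passed in.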

Next, we evaluate the practical case in which the true optical flow is unknown. At each pixel in each image pair, the least-squares solution of the optical flow is derived using the candidate resolutions selected in the first step of the proposed method, and is used to compute $J_k$ $(k = 1, 2, 3, 4)$. The relation between the depth recovery error and the amplitude of the random camera rotations is shown in Fig. 7. As a result, $J_1$ remained effective, which is presumably because $J_1$ depends only on the direction of the optical flow.

Figure 6: Recovery error vs. amplitude of camera rotation evaluated using true values of optical flow: (a) RMSEs, (b) relative errors.

4 CONCLUSIONS

In this study, we examined a method for improving the accuracy of depth recovery based on relative camera motions. We focused on the linear approximation error of the gradient equation and proposed the selective use of the optimal image resolution. We defined four different measures for estimating the approximation error, and confirmed their effectiveness by integrating the proposed selection method into the conventional depth recovery method that uses random camera rotations. In the future, experiments with real images have to be carried out to assess the actual performance of our method.

We are now proceeding with further work to improve the accuracy of the recovered depth, in which we adopt Eq. 10 to represent the second-order term included in the observed $f_t$ for the resolution selected by the proposed measure. With this representation, the gradient equation corresponding to the second-order approximation of the intensity invariance constraint can be formulated as a linear equation in the optical flow, which is easy to handle. Note that this formulation amounts to redefining the spatial gradient of the intensity pattern using the intensities of two successive images.

Figure 7: Recovery error vs. amplitude of camera rotation using estimates of optical flow: (a) RMSEs, (b) relative errors.

Within the framework of random camera rotations, an integral-form method based on motion blur has also been proposed (Tagawa et al., 2012), (Tagawa et al., 2013). This method is effective when the image motions are large relative to the image intensity patterns, and hence it can be used in cases where the differential-form method of this study is not suitable. Unifying the two schemes is also future work.

REFERENCES

Brox, T. and Malik, J. (2011). Large displacement optical flow: descriptor matching in variational motion estimation. IEEE Trans. Pattern Anal. Machine Intell., 33(3):500–513.

Bruhn, A. and Weickert, J. (2005). Lucas/Kanade meets Horn/Schunck: combining local and global optic flow methods. Int. J. Comput. Vision, 61(3):211–231.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data. J. Roy. Statist. Soc. B, 39:1–38.

Green, P. J. (1990). On use of the EM algorithm for penalized likelihood estimation. J. Roy. Statist. Soc. B, 52:443–452.

Horn, B. P. and Schunck, B. (1981). Determining optical flow. Artif. Intell., 17:185–203.

Ikeda, M., Okubo, K., and Tagawa, N. (2009). Implicit tracking of multiple objects based on Bayesian region label assignment. In Proc. VISAPP2009, pages 503–506.

Martinez-Conde, S., Macknik, S. L., and Hubel, D. (2004). The role of fixational eye movements in visual perception. Nature Reviews, 5:229–240.

Ochs, P. and Brox, T. (2012). Higher order motion models and spectral clustering. In Proc. CVPR2012, pages 614–621.

Paramanand, C. and Rajagopalan, A. N. (2012). Depth from motion and optical blur with unscented Kalman filter. IEEE Trans. Image Processing, 21(5):2798–2811.

Poggio, T., Torre, V., and Koch, C. (1985). Computational vision and regularization theory. Nature, 317:314–319.

Simoncelli, E. P. (1999). Bayesian multi-scale differential optical flow. In Handbook of Computer Vision and Applications, pages 397–422. Academic Press.

Tagawa, N. (2010). Depth perception model based on fixational eye movements using Bayesian statistical inference. In Proc. ICPR2010, pages 1662–1665.

Tagawa, N., Iida, Y., and Okubo, K. (2012). Depth perception model exploiting blurring caused by random small camera motions. In Proc. Int. Conf. on Computer Vision, Theory and Applications, pages 329–334.

Tagawa, N., Kawaguchi, J., Naganuma, S., and Okubo, K. (2008). Direct 3-D shape recovery from image sequence based on multi-scale Bayesian network. In Proc. ICPR08, page CD.

Tagawa, N., Koizumi, S., and Okubo, K. (2013). Direct depth recovery from motion blur caused by random camera rotations imitating fixational eye movements. In Proc. Int. Conf. on Computer Vision, Theory and Applications, pages 177–186.

APPENDIX

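The posterior distribution of $\mathbf{d}$ and $\Theta$ is derived using Bayes' theorem and the uniform prior probability $p(\Theta) = \mathrm{Const.}$ as follows:

\[
\begin{aligned}
p(\mathbf{d}, \Theta \mid \{f_t^{(i,j)}\})
&= \frac{p(\mathbf{d}, \Theta, \{f_t^{(i,j)}\})}{p(\{f_t^{(i,j)}\})} \\
&\propto \int \cdots \int p(\{f_t^{(i,j)}\}, \{\vec{r}^{(j)}\}, \mathbf{d}, \Theta)\, d\{\vec{r}^{(j)}\} \\
&= \int \cdots \int p(\{f_t^{(i,j)}\} \mid \{\vec{r}^{(j)}\}, \mathbf{d}, \sigma_o^2)\, p(\{\vec{r}^{(j)}\} \mid \sigma_r^2)\, p(\mathbf{d} \mid \sigma_d^2) \times p(\Theta)\, d\{\vec{r}^{(j)}\} \\
&\propto \int \cdots \int \prod_{i=1}^{N} \prod_{j=1}^{M} p(f_t^{(i,j)} \mid d^{(i)}, \vec{r}^{(j)}, \sigma_o^2) \prod_{j=1}^{M} p(\vec{r}^{(j)} \mid \sigma_r^2) \times p(\mathbf{d} \mid \sigma_d^2)\, d\{\vec{r}^{(j)}\} \\
&= \frac{1}{(2\pi)^{N(M+1)/2+M}\, \sigma_o^{NM} \sigma_r^{2M} \sigma_d^{N}} \int \cdots \int \exp\left\{ -\frac{\sum_{i=1}^{N} \sum_{j=1}^{M} \left( f_t^{(i,j)} + \vec{w}^{(i,j)\top} \vec{r}^{(j)} \right)^2}{2\sigma_o^2} - \sum_{j=1}^{M} \frac{\vec{r}^{(j)\top} \vec{r}^{(j)}}{2\sigma_r^2} \right\} d\{\vec{r}^{(j)}\}\, \exp\left( -\frac{\mathbf{d}^\top \mathbf{L}\, \mathbf{d}}{2\sigma_d^2} \right),
\end{aligned} \tag{15}
\]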

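\[
\vec{w}^{(i,j)} \equiv
\begin{bmatrix}
f_x^{(i,j)} x^{(i)} y^{(i)} + f_y^{(i,j)} \left( 1 + y^{(i)\,2} \right) \\
-f_x^{(i,j)} \left( 1 + x^{(i)\,2} \right) - f_y^{(i,j)} x^{(i)} y^{(i)} \\
f_x^{(i,j)} y^{(i)} - f_y^{(i,j)} x^{(i)}
\end{bmatrix}
+ Z_0 d^{(i)}
\begin{bmatrix}
f_y^{(i,j)} \\
-f_x^{(i,j)} \\
0
\end{bmatrix}. \tag{16}
\]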

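The posterior distribution of $\{\vec{r}^{(j)}\}$ is also derived as follows:

\[
\begin{aligned}
p(\{\vec{r}^{(j)}\} \mid \{f_t^{(i,j)}\}, \Theta, \mathbf{d})
&= \frac{p(\{\vec{r}^{(j)}\}, \{f_t^{(i,j)}\} \mid \Theta, \mathbf{d})}{p(\{f_t^{(i,j)}\} \mid \Theta, \mathbf{d})} \\
&\propto p(\{f_t^{(i,j)}\} \mid \{\vec{r}^{(j)}\}, \mathbf{d}, \sigma_o^2)\, p(\{\vec{r}^{(j)}\} \mid \sigma_r^2) \\
&\propto \frac{1}{(2\pi)^{M} \prod_{j=1}^{M} \sqrt{\det \mathbf{V}_r^{(j)}}} \times \exp\left( -\frac{1}{2} \sum_{j=1}^{M} \left( \vec{r}_m^{(j)} - \vec{r}^{(j)} \right)^\top \mathbf{V}_r^{(j)\,-1} \left( \vec{r}_m^{(j)} - \vec{r}^{(j)} \right) \right),
\end{aligned} \tag{17}
\]

where the mean $\vec{r}_m^{(j)}$ and the covariance $\mathbf{V}_r^{(j)}$ are

\[
\vec{r}_m^{(j)} = -\frac{1}{\sigma_o^2} \mathbf{V}_r^{(j)} \sum_{i=1}^{N} f_t^{(i,j)} \vec{w}^{(i,j)}, \tag{18}
\]

\[
\mathbf{V}_r^{(j)} = \left( \frac{1}{\sigma_o^2} \sum_{i=1}^{N} \vec{w}^{(i,j)} \vec{w}^{(i,j)\top} + \frac{1}{\sigma_r^2} \mathbf{I} \right)^{-1}. \tag{19}
\]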
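For a single image pair, the Gaussian posterior of the rotation follows from completing the square in the product of the observation model (Eq. 7) and the rotation prior (Eq. 2). The sketch below assumes that form and restricts it to the two free components $(r_X, r_Y)$, since $r_Z = 0$; the function name and the plain-tuple representation of the 2×2 matrix are assumptions.

```python
def rotation_posterior(ft, w, sigma_o2, sigma_r2):
    """Posterior mean and covariance of (r_X, r_Y) for one image pair.

    ft       : f_t values over the pixels of the pair
    w        : matching (w_X, w_Y) pairs, the first two components of Eq. 16
    sigma_o2 : observation variance sigma_o^2
    sigma_r2 : rotation variance sigma_r^2
    """
    # Precision matrix: sum_i w w^T / sigma_o^2 + I / sigma_r^2 (symmetric 2x2).
    a = sum(wx * wx for wx, _ in w) / sigma_o2 + 1.0 / sigma_r2
    b = sum(wx * wy for wx, wy in w) / sigma_o2
    c = sum(wy * wy for _, wy in w) / sigma_o2 + 1.0 / sigma_r2
    det = a * c - b * b  # positive, because the prior adds 1/sigma_r^2 to the diagonal
    cov = ((c / det, -b / det), (-b / det, a / det))
    # Linear term: sum_i f_t w / sigma_o^2.
    gx = sum(f * wx for f, (wx, _) in zip(ft, w)) / sigma_o2
    gy = sum(f * wy for f, (_, wy) in zip(ft, w)) / sigma_o2
    mean = (-(cov[0][0] * gx + cov[0][1] * gy), -(cov[1][0] * gx + cov[1][1] * gy))
    return mean, cov
```

With a small observation variance and a broad prior, the mean approaches the least-squares rotation that satisfies $f_t = -\vec{w}^\top \vec{r}$ at every pixel.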

SelectiveUseofOptimalImageResolutionforDepthfromMultipleMotionsbasedonGradientScheme