Real-Time Estimation of Camera Orientation
by Tracking Orthogonal Vanishing Points in Videos
Wael Elloumi, Sylvie Treuillet and Rémy Leconge
Laboratoire Prisme, Polytech'Orléans, 12 rue de Blois, 45067 Orléans Cedex 2, France
Keywords: Vanishing Point Tracking, Camera Orientation, Video Sequences, Manhattan World.
Abstract: In man-made urban environments, vanishing points are pertinent visual cues for navigation tasks. But
estimating the orientation of an embedded camera relies on the ability to find a reliable triplet of orthogonal
vanishing points in real time. Building on previous works, we propose a pipeline that achieves an accurate
estimation of the camera orientation while preserving a short processing time. Our pipeline relies on two
contributions: a novel sampling strategy among finite and infinite vanishing points extracted with a
RANSAC-based line clustering, and a tracking along the video sequence that enforces accuracy and
robustness by extracting the three most pertinent orthogonal directions. Experiments on real images and
video sequences show that the proposed strategy for selecting the triplet of vanishing points is pertinent, as
our algorithm gives better results than the recently published RNS optimal method (Mirzaei and Roumeliotis, 2011),
in particular for the yaw angle, which is essential for navigation tasks.
1 INTRODUCTION
In the context of navigation assistance for blind
people in urban areas, we address the problem of
estimating the pose of an embedded camera. In man-
made urban environments, vanishing lines and points
are pertinent visual cues to estimate the camera
orientation, as many line segments are oriented
along three orthogonal directions aligned with the
global reference frame (Coughlan and Yuille, 1999); (Antone
and Teller, 2000); (Kosecka and Zhang, 2002);
(Martins et al., 2005); (Förstner, 2010); (Kalantari et
al., 2011). Under this so-called Manhattan world
assumption, this approach is an interesting
alternative to structure and motion estimation based
on feature matching, a sensitive problem in
computer vision. The orientation matrix of a
calibrated camera, parameterized with three angles,
may be efficiently computed from three noise-free
orthogonal vanishing points.
Over the last thirty years, a broad literature has
addressed the computation of vanishing points (VP). The
first approaches used the Hough transform and
accumulation methods (Barnard, 1983); (Cantoni et
al., 2001); (Boulanger et al., 2006). The efficiency of
these methods highly depends on the discretization
of the accumulation space, and they are not robust in
the presence of outliers. Furthermore, they do not
consider the orthogonality of the resulting VP. An
exhaustive search method may take the orthogonality
constraint into account (Rother, 2000), but it is
unsuitable for real-time applications.
Although a few authors prefer to work on raw
pixels (Martins et al., 2005); (Denis et al., 2008),
published methods mainly work on straight lines
extracted from the image. Depending on the
mathematical formalization of VP, some variants
exist in the choice of the workspace: the image plane
(Rother, 2000); (Cantoni et al., 2001), the projective plane
(Pflugfelder and Bischof, 2005); (Förstner, 2010);
(Nieto and Salgado, 2011) or the Gaussian sphere
(Barnard, 1983); (Collins and Weiss, 1990);
(Kosecka and Zhang, 2002). Using the Gaussian unit
sphere or the projective plane allows finite and infinite
VP to be treated equally, unlike the image plane. This
representation is well suited to simultaneously
clustering lines that converge at multiple vanishing
points by using a probabilistic Expectation-
Maximization (EM) joint optimization approach
(Coughlan and Yuille, 1999); (Antone and Teller,
2000); (Kosecka and Zhang, 2002); (Nieto and
Salgado, 2011). These approaches address the mis-
classification and optimality issues, but the
initialization and grouping are the determining
factors of their efficiency.
Recently, many authors have adopted robust estimation
based on RANSAC, as the code is fast, easy to
implement, and requires no initialization. These
approaches consider intersections of line segments as
VP hypotheses and then iteratively cluster the
parallel lines consistent with each hypothesis
(Förstner, 2010); (Mirzaei and Roumeliotis, 2011).
A variant based on the J-Linkage algorithm has been used by
(Tardif, 2009). By dismissing the outliers, the
RANSAC-based classifiers are much more robust
than accumulative methods, and give a more precise
position of the VP, which is otherwise limited by the size
of the accumulator cell. They have also been used to
initialize EM estimators so that they converge to the
correct VP. Other optimal solutions rely on analytical
approaches often based on time-consuming algorithms
(Kalantari et al., 2011); (Mirzaei and Roumeliotis, 2011);
(Bazin et al., 2012). In this last paper, it is interesting to note
that, even though they are non-deterministic, the
RANSAC-based approaches achieve results comparable
to exhaustive search in terms of the number of
clustered lines. Hence, RANSAC remains a very good approach
for extracting the VP candidates, in combination with a
judicious strategy for selecting a triplet consistent
with the orthogonality constraint.
Indeed, the estimation of the camera orientation
relies on the ability to find a robust orthogonal triplet
of vanishing points in a real image. Despite
numerous papers dedicated to straight line
clustering for computing adequate vanishing points, this
problem remains an open issue for real-time
applications on video sequences. The camera
orientation is generally estimated in a
single image; few works address tracking along
a video sequence (Martins et al., 2005).
Building on previous works, we propose a
pragmatic solution to achieve an accurate estimation
of the camera orientation while preserving a short
processing time. Our algorithm pipeline relies on
two contributions: a novel sampling strategy among
finite and infinite vanishing points extracted with a
RANSAC-based line clustering, and a tracking along
the video sequence.
The paper is organized as follows. An overview
of the method is proposed in Section 2. Section 3
presents experimental results and Section 4
concludes the paper.
2 PROPOSED PIPELINE
The proposed pipeline is given in Figure 1.
To achieve an accurate estimation of the camera
orientation based on three reliable orthogonal
vanishing points (VP), the pipeline is composed of
four steps. The first one consists of extracting
dominant lines from the detected image edges. The
second one consists of selecting a triplet of
vanishing points by clustering the dominant lines with
RANSAC. At this step, we introduce a clever
strategy to select only three reliable orthogonal VP
that represent the orientation of the camera relative
to the 3D world reference frame. Another contribution is
the vanishing point tracker applied along the
video sequence (step 3) to enforce the robustness of
the camera orientation computation (step 4). The
next sections give some details and justifications
about each block.
Figure 1: Overview of the proposed algorithm.
2.1 Dominant Line Detection
Some pre-processing steps are introduced to improve the
quality and the robustness of the detected edges in
the case of an embedded camera: first, a histogram
equalization harmonizes the distribution of
brightness levels in the image; secondly, a geometric
correction of the lens distortion is applied, assuming that
the camera calibration matrix is known. To find the
dominant lines, we detect edges using the Canny
detector. Then, edge points are projected into
sinusoidal curves in a polar accumulation space by
applying the Hough Transform (HT), where peaks
correspond to the dominant clusters of line
segments. We use the probabilistic version of the HT, as
it is faster than the classic one: only 10% to 20% of
the edges, randomly selected, are enough to obtain
statistically good results. Only the straight lines that
are long enough are retained as input to estimate
multiple VP in an image.
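As an illustration, the following Python/OpenCV sketch reproduces this detection chain under stated assumptions: the paper's implementation is in Visual C++ with OpenCV, and the thresholds, the calibration placeholders `K` and `dist`, and the minimum segment length below are illustrative choices, not the authors' values.

```python
# A minimal sketch of the dominant-line detection step (Section 2.1).
import cv2
import numpy as np

def detect_dominant_lines(gray, K, dist, min_length=40):
    # Histogram equalization harmonizes the brightness distribution.
    gray = cv2.equalizeHist(gray)
    # Geometric correction of the lens distortion (calibration known).
    gray = cv2.undistort(gray, K, dist)
    # Canny edges, then the probabilistic Hough transform, which samples
    # only a fraction of the edge points and is faster than the classic HT.
    edges = cv2.Canny(gray, 50, 150)
    segments = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180,
                               threshold=50, minLineLength=min_length,
                               maxLineGap=5)
    # Keep only segments that are long enough (enforced by minLineLength).
    return [] if segments is None else [tuple(s) for s in segments.reshape(-1, 4)]
```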
2.2 Vanishing Points Candidates
To provide three VP, each aligned with one of the
three main orthogonal directions of the Manhattan
world, the most intuitive method is to detect the
intersections of dominant lines in the image. Under
perspective projection, parallel lines in the 3D scene
intersect in the image plane at a so-called vanishing
point. If the image plane is parallel to one axis of
the 3D world, the vanishing lines intersect very far from
the image center, which is called an infinite vanishing
point, unlike the finite ones whose coordinates may
be determined in the image plane. Working directly in
the image plane is fast because it does not require a
projection into another bounded space such as the Gaussian
sphere. On the other hand, infinite VP need to be
detected separately from the finite ones, but we will
see that we can take advantage of this differentiation
in the good choice of orthogonal VP, with a fast and
robust sampling strategy.
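As background for this finite/infinite distinction, a small illustrative helper is sketched below: in homogeneous coordinates, the line through two points and the intersection of two lines are both cross products, and a near-zero third component of the intersection signals an infinite VP. The function name and the tolerance are ours, not the paper's.

```python
# VP hypothesis from two segments, in homogeneous image coordinates.
import numpy as np

def vp_from_two_segments(s1, s2, eps=1e-6):
    def line_of(seg):
        x1, y1, x2, y2 = seg
        # Homogeneous line through the two endpoints.
        return np.cross([x1, y1, 1.0], [x2, y2, 1.0])
    p = np.cross(line_of(s1), line_of(s2))
    if abs(p[2]) < eps * np.linalg.norm(p[:2]):
        # (Nearly) parallel segments: infinite VP, represented by a direction.
        return None, np.arctan2(p[1], p[0])
    return p[:2] / p[2], None  # finite VP position in the image plane
```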
Recently, numerous authors have adopted RANSAC as a
simple and powerful method to partition
straight lines into clusters of parallel lines by pruning
outliers. The process starts by randomly selecting
two lines to generate a VP hypothesis; then, all lines
consistent with this hypothesis are grouped together
to optimize the VP estimate. Once a dominant VP is
detected, all the associated lines are removed, and
the process is repeated to detect the next dominant
VP. The principal drawback of this sequential search
is that no orthogonality constraint is imposed for
selecting a reliable set of three VP to compute the
camera orientation. Very recent works propose
optimal estimates of three orthogonal VP by an
analytical approach based on a multivariate
polynomial system solver (Mirzaei and Roumeliotis,
2011) or by an optimization approach based on interval
analysis theory (Bazin et al., 2012), but at the
expense of complex, time-consuming algorithms.
In this work, we introduce a clever strategy to
extract a limited number of reliable VP while
enforcing the orthogonality constraint, in
conjunction with RANSAC.
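A minimal sketch of this sequential RANSAC search is given below, reusing the `vp_from_two_segments` helper sketched earlier; `score_fn(vp, seg)` stands for the consensus test of Section 2.3, and the iteration counts are illustrative assumptions.

```python
# Sequential RANSAC: hypothesize a VP from two random segments, keep the
# hypothesis with the largest consensus set, remove its inliers, repeat.
import random

def sequential_ransac(segments, score_fn, n_vps=5, n_iter=200):
    remaining = list(segments)   # segments as (x1, y1, x2, y2) tuples
    detected = []
    for _ in range(n_vps):
        if len(remaining) < 2:
            break
        best_vp, best_inliers = None, []
        for _ in range(n_iter):
            s1, s2 = random.sample(remaining, 2)
            vp = vp_from_two_segments(s1, s2)  # hypothesis (see above)
            inliers = [s for s in remaining if score_fn(vp, s)]
            if len(inliers) > len(best_inliers):
                best_vp, best_inliers = vp, inliers
        if best_vp is None:
            break
        detected.append((best_vp, best_inliers))
        # Remove the supporting lines before searching the next dominant VP.
        remaining = [s for s in remaining if s not in best_inliers]
    return detected
```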
2.3 VP Sampling Strategy
In the context of pedestrian navigation, the main
orthogonal directions in a Manhattan world generally
consist of a vertical one (often associated with an
infinite VP) and two horizontal ones (associated
with finite or infinite VP). So we consider three
possible configurations depending on the
alignment of the image plane with the 3D urban
scene: i) one finite and two infinite VP, ii) two finite
and one infinite VP, iii) three finite VP. The first
two configurations are common, unlike the third.
More details about the computation of the camera
orientation depending on these three configurations
are given in Section 2.5.
For a robust selection of VP, we detect the three
finite candidates and the two infinite ones that maximize
the consensus set. The criteria used in the consensus
score (1) for clustering lines by RANSAC differ
depending on the category. Unlike the finite VP, whose
coordinates may be determined in the image plane, the
infinite VP are generally represented as a direction. For
finite VP, the consensus score is based on the distance
between the candidate straight line and the intersecting
point (2). For infinite VP, it uses the angular distance
between the direction of the candidate straight line and
the direction representing the infinite VP (3).

$$\mathrm{Score}(v) = \sum_{i=1}^{N} f(v, l_i) \qquad (1)$$

$$f(v, l_i) = \begin{cases} 1, & \text{if } d(v, l_i) < \delta_d \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

where $N$ is the number of dominant lines and
$d(v, l_i)$ is the Euclidean distance from the finite VP
candidate $v$ to the line $l_i$. All lines whose distance is
below a fixed threshold $\delta_d$ are considered as
participants ($\delta_d = 4$ pixels in our experiments).

$$f(v, l_i) = \begin{cases} 1, & \text{if } \min\big(\theta(v, l_i),\, \pi - \theta(v, l_i)\big) < \delta_\theta \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$

where $\theta(v, l_i)$ is the angle between the infinite VP
direction from the image center and the line $l_i$ to test
in image space ($\delta_\theta = 4°$ in our experiments).
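A hedged translation of the two consensus tests (2) and (3) into code could read as follows; the thresholds follow the values quoted in the text, while the function names and line representation are ours.

```python
# Consensus tests for finite and infinite VP candidates.
import numpy as np

def supports_finite_vp(vp, seg, d_max=4.0):
    # Eq. (2): the segment votes if its supporting line passes within
    # d_max pixels of the finite VP candidate.
    x1, y1, x2, y2 = seg
    a, b, c = np.cross([x1, y1, 1.0], [x2, y2, 1.0])
    d = abs(a * vp[0] + b * vp[1] + c) / np.hypot(a, b)
    return d < d_max

def supports_infinite_vp(theta_vp, seg, a_max=np.deg2rad(4.0)):
    # Eq. (3): the segment votes if its orientation is within a_max of
    # the infinite VP direction (lines are undirected, hence the folding).
    x1, y1, x2, y2 = seg
    theta = np.arctan2(y2 - y1, x2 - x1)
    diff = abs(theta - theta_vp) % np.pi
    return min(diff, np.pi - diff) < a_max
```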
To avoid redundant VP candidates, we introduce
a supplementary constraint that candidates be far enough from
each other: we impose a minimum angular distance
between the VP directions from the image center
(the threshold is set to 30° for finite VP
and 60° for infinite ones).
By separating finite from infinite VP, the
sampling strategy provides the most significant of
them without giving more importance to one
category or the other (we only enforce having at least one
finite candidate). Furthermore, this strategy is
fast, as we detect only five reliable VP candidates,
against generally many more for previously
published methods.
Among the five candidates selected before, only
three VP whose directions from the optical center
are orthogonal have to be accepted, including at least
one finite VP. We adopt the following heuristic: i)
choose the finite VP with the highest consensus
score, ii) select two other VP (finite or infinite)
based on their orthogonality to the first one,
considering their consensus score as a second
criterion. Finally, we identify the vertical VP and the
two horizontal ones. In our application, we assume
that the camera is kept upright: we identify the
vertical VP as the one whose direction is closest
to the vertical direction from the image center.
The two remaining VP are thus horizontal.
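Under stated assumptions, this selection heuristic can be sketched as follows: directions are unit 3-vectors from the optical center, candidates are (direction, score, is_finite) triples, and the upright-camera convention identifies the vertical VP by its image y-component. The representation and the orthogonality measure are ours.

```python
# Heuristic selection of an orthogonal triplet among five VP candidates.
import numpy as np

def select_orthogonal_triplet(candidates):
    # i) the finite VP with the highest consensus score comes first.
    finite = [c for c in candidates if c[2]]
    first = max(finite, key=lambda c: c[1])
    # ii) rank the others by orthogonality to the first direction
    # (|dot| close to 0), with the consensus score as a tie-breaker.
    others = sorted((c for c in candidates if c is not first),
                    key=lambda c: (abs(np.dot(c[0], first[0])), -c[1]))
    triplet = [first, others[0], others[1]]
    # Camera kept upright: the vertical VP is the one whose direction is
    # closest to the vertical axis of the image.
    vertical = max(triplet, key=lambda c: abs(c[0][1]))
    return triplet, vertical
```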
2.4 Vanishing Point Tracker
Once the whole algorithm described above has been processed for
the first frame of the video sequence, the
VP positions can be tracked from one frame to
another. Indeed, VP positions or directions are only
slightly modified between successive frames of a video
sequence. So we introduce a tracker to
check the consistency between the VP
estimated in frame $t$ and those estimated in
frame $t-1$: for the finite VP, we use the distance between
positions, $d(v_t, v_{t-1})$; for the infinite ones, the angle
between directions, $\theta(v_t, v_{t-1})$. When a VP is not coherent
with its previous position or direction, it is re-
estimated, taking its previous position or
direction into account and using the remaining unclassified lines.
Hence, aberrant VP are discarded and replaced by
new VP that are, at the same time, consistent with
the previous ones and satisfy the orthogonality
constraint. This tracker is efficient since it makes our
algorithm much more stable and robust as will be
shown in the Experiments section. Once the three
most reliable VP are extracted in the image, the
camera orientation is computed frame-by-frame as
described in the next section.
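A minimal sketch of the frame-to-frame consistency test is given below; the thresholds are illustrative assumptions, not values from the paper, and a VP is represented as a (position, direction angle) pair with the unused field set to None.

```python
# Consistency check of the VP tracker between frames t-1 and t.
import numpy as np

def is_consistent(vp_prev, vp_curr, d_max=30.0, a_max=np.deg2rad(10.0)):
    pos_prev, theta_prev = vp_prev
    pos_curr, theta_curr = vp_curr
    if pos_prev is not None and pos_curr is not None:      # finite VP
        dx, dy = np.subtract(pos_curr, pos_prev)
        return np.hypot(dx, dy) < d_max
    if theta_prev is not None and theta_curr is not None:  # infinite VP
        diff = abs(theta_curr - theta_prev) % np.pi
        return min(diff, np.pi - diff) < a_max
    return False  # category changed: treat as aberrant and re-estimate
```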
2.5 Computation of the Camera
Orientation
This part is directly inspired by (Boulanger et al.,
2006) to compute the camera orientation from the
three VP supposed to be orthogonal.
We use the directions of the detected VP, which
correspond to the camera orientation, to compute the
rotation matrix $R = [\mathbf{r}_1\ \mathbf{r}_2\ \mathbf{r}_3]$. The vectors $\mathbf{r}_1$, $\mathbf{r}_2$ and $\mathbf{r}_3$ to
be found represent three orthogonal directions of the
scene: respectively, the first horizontal direction, the
vertical direction and the second horizontal
direction. They need to satisfy the following
orthonormal relations:

$$\mathbf{r}_1 \cdot \mathbf{r}_2 = \mathbf{r}_1 \cdot \mathbf{r}_3 = \mathbf{r}_2 \cdot \mathbf{r}_3 = 0, \qquad \|\mathbf{r}_1\| = \|\mathbf{r}_2\| = \|\mathbf{r}_3\| = 1 \qquad (4)$$
The estimation of these vectors depends on the
VP configurations.
2.5.1 One Finite and Two Infinite VP
This situation is the most frequent one. It occurs
when the image plane is aligned with two axes of the
world coordinate frame. Let $F = (x_F, y_F)$ be the finite VP,
$(x_0, y_0)$ the image center and $f$ the focal length. The direction of $F$ can be expressed
as $(x_F - x_0,\ y_F - y_0,\ f)^T$ (normalized), whereas the directions of
the infinite VP, in image space, are
$(\cos\theta_1,\ \sin\theta_1,\ 0)^T$ and $(\cos\theta_2,\ \sin\theta_2,\ 0)^T$. The vectors of
the rotation matrix are given by the following
system of equations:

$$\begin{cases} \mathbf{r}_1 = \dfrac{(x_F - x_0,\ y_F - y_0,\ f)^T}{\left\|(x_F - x_0,\ y_F - y_0,\ f)\right\|} \\[2mm] \mathbf{r}_2 = (\cos\theta_1,\ \sin\theta_1,\ 0)^T \\[2mm] \mathbf{r}_3 = (\cos\theta_2,\ \sin\theta_2,\ 0)^T \end{cases} \qquad (5)$$
2.5.2 Two Finite and One Infinite VP
This situation happens when the image plane is
aligned with only one of the three axes of the world
coordinate frame. Let $V_1 = (x_1, y_1)$ and $V_2 = (x_2, y_2)$ be the two finite VP,
with directions $(x_1 - x_0,\ y_1 - y_0,\ f)^T$ and
$(x_2 - x_0,\ y_2 - y_0,\ f)^T$ (normalized). Since there are two finite horizontal
VP, we set $V_1$ to the closest VP to the image center.
The vertical vector $\mathbf{r}_2$ is obtained by cross product, as shown
in the system of equations below:

$$\begin{cases} \mathbf{r}_1 = \dfrac{(x_1 - x_0,\ y_1 - y_0,\ f)^T}{\left\|(x_1 - x_0,\ y_1 - y_0,\ f)\right\|} \\[2mm] \mathbf{r}_3 = \dfrac{(x_2 - x_0,\ y_2 - y_0,\ f)^T}{\left\|(x_2 - x_0,\ y_2 - y_0,\ f)\right\|} \\[2mm] \mathbf{r}_2 = \mathbf{r}_3 \times \mathbf{r}_1 \end{cases} \qquad (6)$$
2.5.3 Three Finite VP
This last configuration is the least frequent one. It
occurs when there is no alignment between the
image plane and the world coordinate frame. Let
$V_1$, $V_2$ and $V_3$ be the three finite VP, with directions
$(x_k - x_0,\ y_k - y_0,\ f)^T$ (normalized), $k = 1, 2, 3$. We start by setting $V_2$ to the
VP whose direction is closest to the vertical
direction. We then set $V_1$ to the closest VP to the
image center. In the system of equations (7), we thus
assume that $V_2$ is the vertical VP and $V_1$ is closest to
the image center:

$$\begin{cases} \mathbf{r}_1 = \dfrac{(x_1 - x_0,\ y_1 - y_0,\ f)^T}{\left\|(x_1 - x_0,\ y_1 - y_0,\ f)\right\|} \\[2mm] \mathbf{r}_2 = \dfrac{(x_2 - x_0,\ y_2 - y_0,\ f)^T}{\left\|(x_2 - x_0,\ y_2 - y_0,\ f)\right\|} \\[2mm] \mathbf{r}_3 = \dfrac{(x_3 - x_0,\ y_3 - y_0,\ f)^T}{\left\|(x_3 - x_0,\ y_3 - y_0,\ f)\right\|} \end{cases} \qquad (7)$$
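Putting the three configurations together, a sketch of the rotation computation under the conventions reconstructed above could look like this; the column assignments mirror equations (5)-(7) as reconstructed, and (x0, y0) denotes the image center.

```python
# Rotation matrix R = [r1 r2 r3] from the three VP configurations (Sec. 2.5).
import numpy as np

def unit(v):
    return np.asarray(v, float) / np.linalg.norm(v)

def finite_dir(vp, x0, y0, f):
    # Direction of a finite VP through the optical center, Eqs. (5)-(7).
    return unit([vp[0] - x0, vp[1] - y0, f])

def rotation_one_finite(vp, t1, t2, x0, y0, f):
    # Eq. (5): one finite VP and two infinite directions t1 and t2.
    r1 = finite_dir(vp, x0, y0, f)
    r2 = np.array([np.cos(t1), np.sin(t1), 0.0])
    r3 = np.array([np.cos(t2), np.sin(t2), 0.0])
    return np.column_stack([r1, r2, r3])

def rotation_two_finite(vp1, vp2, x0, y0, f):
    # Eq. (6): vp1 is the finite VP closest to the image center; the
    # vertical column is recovered by a cross product.
    r1 = finite_dir(vp1, x0, y0, f)
    r3 = finite_dir(vp2, x0, y0, f)
    r2 = unit(np.cross(r3, r1))
    return np.column_stack([r1, r2, r3])

def rotation_three_finite(vp1, vp2, vp3, x0, y0, f):
    # Eq. (7): vp2 is the vertical VP, vp1 the closest to the center.
    cols = [finite_dir(v, x0, y0, f) for v in (vp1, vp2, vp3)]
    return np.column_stack(cols)
```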
3 EXPERIMENTAL RESULTS
This section presents the performance evaluation of
the proposed method.
3.1 Accuracy Study
For comparison purposes, we tested our
algorithm against the ground truth provided with the
public York Urban Database (YUD). This database
provides the original images, the camera calibration
parameters, ground-truth line segments, and the three
Euler angles of the camera orientation for
each image (Denis et al., 2008). Figure 2 illustrates
some orthogonal vanishing points and their
associated parallel lines extracted by our algorithm
on images taken from the YUD.
Table 1 presents the angular distance from the
Ground Truth (GT) of the camera orientation
computed with our method. The average and
standard deviation of the angular distance are
computed over a set of fifty images for the three
angles. The three last rows of Table 1 give the
number of times the distance exceeds a fixed value
of 2, 5 and 10 degrees respectively. Our method
produces accurate estimates of the camera
orientation, since the angular distance remains
below 2 degrees for most images. For
comparison purposes, the analytical RNS method
recently published by (Mirzaei and Roumeliotis,
2011), which provides optimal least-squares estimates
of three orthogonal vanishing points, yields an
average angular distance of 0.74, 1.70 and
1.81 degrees for the pitch, yaw and roll angles respectively.
The full results of the RNS method are available in a
technical report provided online by the authors
(http://umn.edu/~faraz/vp).
The RNS method gives the best result for the
pitch angle, but it is interesting to note that our
method is significantly better for the yaw and roll
angles. The yaw is actually essential for a pedestrian
navigation task, since it gives the camera viewing
direction. This may be explained by our
strategy of selecting orthogonal vanishing points
that are distant enough from each other, without
confusion between finite and infinite points.
Table 1: Average and standard deviation of the angular
distance from the GT (in degrees).

Angular distance from GT    pitch   yaw    roll
Average                     1.38    0.75   0.69
Standard deviation          1.57    0.60   0.65
> 2°                        8       3      1
> 5°                        2       0      0
> 10°                       0       0      0
3.2 Tracking the Camera Orientation
To show the efficiency of our algorithm for tracking
the camera orientation, we acquired real video
sequences with an embedded camera. Our
experimental prototype is composed of an AVT
GUPPY F-033C camera equipped with a 3.5 mm lens
and a laptop. As we use a lens with a short focal
length, a geometric distortion correction is applied
before extracting the line segments. The camera was
first calibrated using the calibration toolbox proposed
by Bouguet (http://www.vision.caltech.edu/bouguetj/calib_doc/).
Figure 3 depicts some typical results of
vanishing point extraction for a video sequence. It is
composed of 350 frames (320x240 pixels) acquired
at 25 frames per second in the hallways of our
laboratory. Figure 4 compares the evolution of the
roll, pitch and yaw angles of the camera orientation
along the video sequence when applying our method
with and without the vanishing point tracker (VPT).
The VPT produces smoother and more reliable
estimates of the camera orientation along the
video sequence: since it removes aberrant vanishing
points, keeping only the consistent ones, we obtain a
more accurate camera orientation.
Figure 2: Examples of some triplets of orthogonal vanishing points detected by our algorithm on images from the YUD.
Real-TimeEstimationofCameraOrientationbyTrackingOrthogonalVanishingPointsinVideos
219
Figure 3: Examples of detection and tracking of triplets of orthogonal vanishing points and their associated lines.
Figure 4: Smoothing effect of the VPT on the estimation of the camera’s orientation (pitch, yaw and roll angles).
Figure 5: Evolution of the total number of lines extracted in images and the numbers of lines respectively associated to
horizontal and vertical VP along the video sequence.
VISAPP2013-InternationalConferenceonComputerVisionTheoryandApplications
220
To illustrate the efficiency of the proposed
sampling strategy of vanishing points based on line
clustering with RANSAC, Figure 5 shows the
evolution of the number of vanishing lines extracted
along the video sequence. The figure plots the
total number of vanishing lines together with the
number of lines participating in the three VP (inliers),
split into the subset of lines associated with the two
horizontal VP and the subset associated with
the vertical VP. It is clear that the RANSAC-based
classification of lines removes the outliers.
Our method has been implemented using
Visual C++ and the OpenCV library. The full processing
time for estimating the camera orientation is 16
milliseconds per image of size 320x240 pixels, with
non-optimized code, on a laptop (Intel Core 2 Duo
2.66 GHz, 4096 MB RAM). Therefore, our algorithm is
suitable for real-time applications, such as
navigation assistance for blind pedestrians.
4 CONCLUSIONS
We take advantage of three reliable orthogonal
vanishing points corresponding to the Manhattan
directions to achieve an accurate estimation of the
camera orientation. Our algorithm relies on a novel
sampling strategy among finite and infinite
vanishing points and on a tracking along the video
sequence. The performance of our algorithm is
validated using real static images and video
sequences. Experimental results on real images
show that, although simple, the adopted strategy for
selecting three reliable, distant and orthogonal
vanishing points in conjunction with RANSAC
performs well in practice, since the estimation of the
camera orientation is better than that obtained with
a state-of-the-art analytical method. Furthermore, the
tracker proved to be relevant for dismissing aberrant
vanishing points along the sequence, making a later
refinement or optimization step unnecessary and
preserving a short processing time for real-time
applications. This algorithm is intended to be part of
a localization system providing navigation
assistance for blind people in urban areas.
ACKNOWLEDGEMENTS
This study is supported by HERON Technologies
SAS and the Conseil Général du LOIRET.
REFERENCES
M. Antone and S. Teller, 2000. Automatic recovery of
relative camera rotations for urban scenes. In: Proc. of
IEEE Conf. on Computer Vision and Pattern Recognition
(CVPR), 282-289.
S. T. Barnard, 1983. Interpreting perspective images.
Artificial Intelligence, 21(4), 435-462, Elsevier
Science B.V.
J. C. Bazin, Y. Seo, C. Demonceaux, P. Vasseur, K.
Ikeuchi, I. Kweon and M. Pollefeys, 2012. Globally
optimal line clustering and vanishing points estimation
in a Manhattan world. In: the IEEE Int. Conf. on
Computer Vision and Pattern Recognition (CVPR).
K. Boulanger, K. Bouatouch, and S. Pattanaik, 2006.
ATIP: A tool for 3D navigation inside a single image
with automatic camera calibration. In: EG UK Conf. on
Theory and Practice of Computer Graphics.
V. Cantoni, L. Lombardi, M. Porta and N. Sicard, 2001.
Vanishing Point Detection: Representation Analysis
and New Approaches. In: Proc. of Int. Conf. on Image
Analysis and Processing (ICIAP), 90-94.
R. T Collins and R. S Weiss, 1990. Vanishing point
calculation as statistical inference on the unit sphere.
In: Proceedings of the 3rd Int. Conference on
Computer Vision (ICCV), 400-403.
J. M. Coughlan and A. L. Yuille, 1999. Manhattan World:
Compass direction from a single image by Bayesian
inference. In: Int. Conference on Computer Vision
(ICCV).
P. Denis, J. H. Elder and F. Estrada, 2008. Efficient Edge-
Based Methods for Estimating Manhattan Frames in
Urban Imagery. In: European Conference on
Computer Vision (ECCV), 197-210.
W. Förstner, 2010. Optimal vanishing point detection and
rotation estimation of single images from a legoland
scene. In: Proceedings of the ISPRS Symposium
Commision III PCV. S. 157-163, Part A, Paris.
M. Kalantari, A. Hashemi, F. Jung and J.P. Guédon, 2011.
A New Solution to the Relative Orientation Problem
Using Only 3 Points and the Vertical Direction.
Journal of Mathematical Imaging and Vision archive
Volume 39(3).
J. Kosecka and W. Zhang, 2002. Video Compass. In: Proc.
of the 7th European Conf. on Computer Vision
(ECCV).
A. Martins, P. Aguiar and M. Figueiredo, 2005.
Orientation in Manhattan world: Equiprojective
classes and sequential estimation. In: the IEEE Trans.
on Pattern Analysis and Machine Intelligence, Vol. 27,
822-826.
F. M. Mirzaei and S. I. Roumeliotis, 2011. Optimal
estimation of vanishing points in a Manhattan world.
In: the Proc. of IEEE Int. Conf. on Computer Vision
(ICCV).
M. Nieto and L. Salgado, 2011. Simultaneous estimation
of vanishing points and their converging lines using
the EM algorithm. Pattern Recognition Letters, vol.
32(14), 1691-1700.
Real-TimeEstimationofCameraOrientationbyTrackingOrthogonalVanishingPointsinVideos
221
R. Pflugfelder and H. Bischof, 2005. Online auto-calibration
in man-made worlds. In: Proc. Digital Image
Computing: Techniques and Applications, 519-526.
C. Rother, 2000. A new approach for vanishing point
detection in architectural environments. In: Proc. of
the 11th British Machine Vision Conference (BMVC),
382-391.
J.-P. Tardif, 2009. Non-iterative approach for fast and
accurate vanishing point detection. In: Proc. Int.
Conference on Computer Vision (ICCV), 1250-1257.
VISAPP2013-InternationalConferenceonComputerVisionTheoryandApplications
222