LARGE SCALE LOCALIZATION
For Mobile Outdoor Augmented Reality Applications
I. M. Zendjebil, F. Ababsa, J-Y. Didier and M. Mallem
IBISC Laboratory, University of Evry Val d’Essonnes, Evry, France
Keywords:
3D localization, Outdoor application, Hybrid sensor, Fallback system, Markerless pose estimation, 2D/3D
matching.
Abstract:
In this paper, we present an original localization system for large scale outdoor environments which uses a
markerless vision-based approach to estimate the camera pose. It relies on natural feature points extracted from
images. Since this type of method is sensitive to brightness changes, occlusions and sudden motions, which are likely to occur in outdoor environments, we use two additional sensors to assist the vision process. In our work, we would like to demonstrate the feasibility of an assistance scheme in a large scale outdoor environment. The
intent is to provide a fallback system for the vision in case of failure as well as to reinitialize the vision system
when needed. The complete localization system aims to be autonomous and adaptable to different situations.
We present here an overview of our system, its performance and some results obtained from experiments
performed in an outdoor environment under real conditions.
1 INTRODUCTION
The localization process is crucial for many applications such as augmented reality or robotics. Most systems, mainly those based on video see-through, use vision-based approaches, which estimate the camera pose. However, in outdoor environments, these approaches are sensitive to working conditions such as brightness changes, occlusions and sudden motions. For this reason, such applications increasingly rely on hybrid sensor systems to overcome the drawbacks of using a single type of sensor (i.e. the camera) and to gain in robustness and accuracy.
The idea of combining several kinds of sensors is not new. Indeed, in (Azuma, 1993), following the registration criteria imposed by AR applications, R. Azuma suggests using hybrid sensors in order to improve efficiency. He gives the example of inertial sensors which have infinite range but poor accuracy due to accumulated drift. Using measurements provided by several types of sensors over a short time can correct the drift and improve the efficiency of each sensor used. In parallel, in robotics applications, Viéville et al. (Viéville et al., 1993) proposed to combine vision with an inertial sensor to automatically correct the path of an autonomous mobile robot. This idea was inspired by human behavior. Indeed, humans move through their environment using both the vestibular organ, located in the inner ear, and their eyes. By analogy, the inertial sensor plays the role of the vestibular organ and the camera replaces the eyes.
In our work, we focus on combining several types of sensors in order to continuously provide an accurate estimation of the position and orientation of the camera, which is assisted by a GPS receiver and an inertial sensor. These sensors operate following an assistance strategy in which some sensors serve as a fallback for the others. The system adapts to external conditions by changing its internal state.
The paper is structured as follows: after reviewing related work in Section 2, we give an overview of our proposed system in Section 3. Section 4 presents the assistance strategy. Sections 5 and 6 present the vision-based localization and the error prediction and correction process. Experiments and results are presented in the last section.
2 RELATED WORKS
Most works converge towards coupling vision-based methods with other types of sensors, mainly inertial sensors. We can distinguish between two combination strategies: data fusion and assistance.
The data fusion approach aims at merging all the data provided by the sensors (mostly a camera and an inertial sensor). Such a strategy usually implements a Kalman
filter (You et al., 1999; Ribo et al., 2002; Hol et al.,
2006; Reitmayr and Drummond, 2006; Bleser, 2009;
Ababsa, 2009; Schall et al., 2009) or a particle fil-
ter (Ababsa and Mallem, 2007; Bleser and Strickery,
2008) in order to merge the data. Generally, data from other sensors such as gyroscopes and magnetometers are used to predict the 3D motion of the camera, which is then refined using computer vision techniques. These approaches are interesting because the measurements are estimated by combining all the data from the various sensors with a model that describes the camera's kinematic motion. Some works, such as (Ababsa and Mallem, 2007), propose to use complementary filters to compensate for differences in sampling time and the unavailability of data at certain times.
Other works proposed to use an assistance scheme. The principle is to rely on a main sensor to provide accurate and robust localization and to replace it with other sensors when it fails to provide a consistent measurement. This concept appeared with the work of Borenstein and Feng (Borenstein and Feng, 1996), who combined gyroscopes and odometry in mobile robots. This approach aims to overcome the fact that motion models cannot anticipate every kind of motion. Vision has demonstrated through several works that it is able to provide satisfactory camera pose estimation. However, the problem arises when the sensor is unable to provide a consistent estimate in case of occlusions (partial or total) or sudden motions that may occur with hand-held systems. In these cases, vision needs to be substituted by other sensors. So, an assistance strategy relies on two subsystems: a main subsystem and a fallback subsystem. The main subsystem provides continuous measurements. When it fails, the fallback subsystem takes over until the main subsystem is operational again. We find this principle in the works of Aron et al. (Aron et al., 2007) and Maidi et al. (Maidi et al., 2005).
Following an assistance scheme, our idea is to propose an autonomous system that adapts to the different situations encountered during operation. According to the available data, the system decides to perform a particular type of processing in order to continuously provide an accurate localization estimate. Concretely, as long as vision, defined as the main subsystem, is operational, the localization system trusts the measurements provided by the vision-based methods. The system should also be able to detect vision failures in order to switch to the assistance subsystem. Certainly, the idea of assistance is not new. However, it has only been tested in small indoor environments. Our goal is to study the behavior of such a strategy in large scale environments and to assess its potential outdoors and in mobile situations. We aim at proposing a fallback to vision: a software solution which gives some intelligence to the system so that it can adapt itself according to the available data and the tracking accuracy.
3 SYSTEM OVERVIEW
In our work, we adopt a system combining several types of sensors. Our hardware platform is composed of a tablet-PC connected to three sensors dedicated to the localization task (cf. fig. 1): a GPS receiver worn by the user, and an inertial sensor rigidly attached to a camera. The GPS returns a global position. The inertial sensor estimates 3D orientation, accelerations, angular velocities and 3D magnetic fields. The camera is used for visual feedback and its video stream is exploited to provide the camera pose. The combination of the GPS receiver and the inertial sensor can also provide an estimation of position and orientation. Thus, following an assistance scheme, our system is divided into two subsystems: a main vision subsystem, and a fallback subsystem using GPS and inertial data, called the Aid-Localization (AL) subsystem (Zendjebil et al., 2008). The AL subsystem is not restricted to the fallback functionality; it also takes part in the (re)initialization of the main subsystem.
Figure 1: Hardware platform. (a) Tablet-PC with camera and inertial sensor. (b) GPS receiver.
Several issues must be taken into account to implement this system. Among them, the hybrid sensor must be calibrated in order to define the relationships between the different sensors' local coordinate systems and to express all measurements in a common coordinate system. We use the calibration process described in (Zendjebil et al., 2010). Another problem is to define criteria to detect failures of the vision subsystem. Added to this is the imprecision of the measurements provided by the assistance subsystem compared to vision. Therefore, we need to estimate the resulting errors and correct the data so that they approach the measurements given by the vision subsystem in terms of registration accuracy. This brings us to estimate
the offset between the two measurements. So we incorporate a prediction/correction module into the Aid-Localization subsystem.
Figure 2: System workflow.
Our system is presented in figure 2. Our two subsystems are designed to
interact with each other in order to exploit the data
provided by each subsystem. We now present the details of our assistance strategy.
4 ASSISTANCE STRATEGY
Our system follows an assistance scheme. Thus, we decompose it into four states:
1. init state: the system initializes itself using the
semi-automatic approach described in section 5.1;
2. vision predominance state: where the system uses the vision-based method for localization;
3. AL predominance state: where the system uses
AL subsystem to estimate the localization;
4. reinit state: through this state, the system tries to
reinitialize the vision after its failure.
The system switches from one state to another according to different criteria. To model these transitions, we use the finite state machine formalism, a theoretical model composed of a finite number of states and of transitions between these states. This formalism is mainly used in the theory of computability and formal languages. Using the states presented above, the transitions described in figure 3 allow us to control our system as follows (a minimal code sketch of this state machine is given after the list):
1. Initially, the system is in the init state, where it tries to perform the 2D/3D matching using the semi-automatic approach (cf. section 5.1);
2. Once the initialization is performed and validated,
the vision subsystem starts. Thus, the system
switches from init state to vision predominance
state (transition (1));
3. When the system is in the vision predominance state, it uses the vision-based method described in section 5. Each estimated pose is assessed by the system. If it is validated, it is used for registration. Moreover, this pose is used in the learning phase of the error model of the Gaussian process (cf. section 6);
4. If the pose is not validated, the system switches
from predominance vision to the AL predomi-
nance (transition (2));
5. When the tracking system switches to AL predom-
inance, the camera pose is provided by the AL
subsystem using a prediction/correction scheme;
6. After a few video frames, the system tries to reini-
tialize the vision subsystem. Thus, the system
switches to reinit state (transition (3));
7. In the reinit state, the system uses an automatic procedure to find the 2D/3D matches. To speed up computations, the last pose provided by the AL subsystem is used to define search areas in the current image: the 3D model points are projected with this pose and the matching is restricted to regions around these projections, which are compared to patches composed of SURF features associated with each 3D point (see section 5.3);
8. If the reinitialization step succeeds, the system
switches to the vision predominance state (tran-
sition (4));
9. If the system does not succeed in reinitializing
the vision subsystem, the system returns to init
state in order to use the semi-automatic procedure
(transition (5));
10. The system offers the user the possibility to force a switch to the init state at any time if he considers that the system is not operating properly (transitions (6) and (7)).
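To make these transitions concrete, the following minimal sketch (our illustration, not the ARCS implementation mentioned in section 7) models the four states and the transitions (1)-(7) above. The guard flags passed to step() are assumptions that stand in for the initialization, pose-validation and reinitialization tests described in the next sections.

```python
from enum import Enum, auto

class State(Enum):
    INIT = auto()      # semi-automatic 2D/3D matching (section 5.1)
    VISION = auto()    # vision predominance
    AL = auto()        # Aid-Localization predominance
    REINIT = auto()    # automatic reinitialization (section 5.3)

class LocalizationFSM:
    """Illustrative controller reproducing transitions (1)-(7) of figure 3."""

    def __init__(self, max_al_frames=5):
        self.state = State.INIT
        self.al_frames = 0                  # frames spent under AL predominance
        self.max_al_frames = max_al_frames  # frames to wait before reinitializing

    def step(self, init_ok=False, pose_valid=False, reinit_ok=False,
             user_reset=False):
        # Transitions (6) and (7): the user can force a reset at any time.
        if user_reset:
            self.state = State.INIT
        elif self.state is State.INIT and init_ok:
            self.state = State.VISION                   # transition (1)
        elif self.state is State.VISION and not pose_valid:
            self.state, self.al_frames = State.AL, 0    # transition (2)
        elif self.state is State.AL:
            self.al_frames += 1
            if self.al_frames >= self.max_al_frames:
                self.state = State.REINIT               # transition (3)
        elif self.state is State.REINIT:
            # transition (4) on success, transition (5) back to init otherwise
            self.state = State.VISION if reinit_ok else State.INIT
        return self.state
```

In the actual system, each switch also triggers a reconfiguration of the active processing components, which is handled by the ARCS framework described in section 7.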
Figure 3: Localization system: state machine.
Figure 4: Point-based method: data flow.
5 VISION-BASED
LOCALIZATION
The vision-based methods use the video stream to es-
timate the camera pose which is the relationship that
maps the world coordinate system $R_W$ onto the camera coordinate system $R_C$. The image is then obtained by the perspective projection model. Let $M_i = (X_i, Y_i, Z_i)^T$, $i = 1..n$, $n \geq 3$, be a set of points defined in $R_W$, whose coordinates in $R_C$, $M_i^c = (X_i^c, Y_i^c, Z_i^c)^T$, are given by:

$$M_i^c = R_{CW} M_i + t_{CW} \qquad (1)$$
where $R_{CW}$ is the rotation matrix, with rows $r_1^T, r_2^T, r_3^T$, and $t_{CW} = (t_x, t_y, t_z)^T$ is the translation vector. If $m_i = (u_i, v_i)$ is the projection of the point $M_i$ on the normalized image plane, the relationship between $M_i$ and $m_i$ is given by:

$$m_i = \frac{1}{r_3^T M_i + t_z} \, (R_{CW} M_i + t_{CW}) \qquad (2)$$
This equation is known as the collinearity equation. Finally, the pose estimation is viewed as the minimization of the error between the 2D image points $m_i$ and the projections of their corresponding 3D points $M_i$, which form a set of 2D/3D matches. The re-projection error is as follows:

$$E(R_{CW}, t_{CW}) = \sum_i \left\| m_i - \frac{R_{CW} M_i + t_{CW}}{r_3^T M_i + t_z} \right\|^2 \qquad (3)$$
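As a concrete illustration of this criterion, the sketch below (our own, with assumed inputs: M holds the matched 3D points and m their normalized 2D observations) evaluates the residuals of equation (3) and refines a pose with a generic non-linear least-squares solver. It is a stand-in for, not the implementation of, the minimization algorithm discussed next.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def reprojection_residuals(params, M, m):
    """params = [rx, ry, rz, tx, ty, tz]: axis-angle rotation and translation.
    M: (n, 3) 3D points in R_W; m: (n, 2) normalized 2D image points."""
    R = Rotation.from_rotvec(params[:3]).as_matrix()
    t = params[3:]
    Mc = M @ R.T + t                  # equation (1): points expressed in R_C
    proj = Mc[:, :2] / Mc[:, 2:3]     # equation (2): perspective division
    return (proj - m).ravel()         # stacked residuals of equation (3)

def estimate_pose(M, m, params0=None):
    """Minimize the re-projection error E(R_CW, t_CW) of equation (3)."""
    if params0 is None:
        params0 = np.zeros(6)
        params0[5] = 10.0             # assume the scene starts in front of the camera
    res = least_squares(reprojection_residuals, params0, args=(M, m))
    R_cw = Rotation.from_rotvec(res.x[:3]).as_matrix()
    return R_cw, res.x[3:]
```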
There are several algorithms to minimize the criterion of equation 3. We chose the orthogonal iteration (OI) algorithm (Lu et al., 2000) for its accuracy and fast global convergence. Thus, to estimate the camera pose from points, we must find the 2D/3D point matches. So, we propose a 2D/3D point matching method. The idea is to match 3D points of the model that are considered relevant and can be easily identified in the images, such as corners. However, this matching step occurs only during an initialization step, and the matches must then be maintained along the video stream in order to keep estimating the pose. This initialization phase is therefore followed by a 2D visual tracking. Indeed, the visual tracking allows us to find, in each new image of the video stream, the positions of the 2D points originally identified as projections of 3D points. By finding the position of these points in each image, we indirectly obtain the 2D/3D matching: knowing the 2D/3D matches at time t and tracking the points from image t to image t+1, we deduce the 2D/3D matches at t+1. Figure 4 illustrates the data flow of the method described above. We now describe the initialization method that we propose.
5.1 Semi-automatic Approach
The initialization process is very important. It is the process that matches the visible 3D points with their 2D projections in the initial view. A bad matching affects the 3D localization estimation. In order to avoid a fully manual point matching done by the user, we propose a semi-automatic matching procedure. The approach consists in rendering a wireframe model that is manually registered by the user over the real view by moving the camera around. Once the registration is validated, the second step consists in identifying the corresponding 2D points. The process detects the corners close to the projections of the 3D points using the Harris detector (Harris, 1993). With each 3D point we associate a SURF descriptor (Bay et al., 2008). Next, the process matches each 3D point of the model to the 2D point whose descriptor has the shortest distance to the descriptor computed off-line for that point. Once the 2D/3D matching is obtained, the second phase is to maintain it throughout the video stream by visually tracking the obtained 2D points using the Kanade-Lucas-Tomasi
(KLT) tracker (Lucas and Kanade, 1981). This method has the advantage of operating in real time.
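A minimal OpenCV sketch of this initialization and tracking step is given below. It is only an illustration under several assumptions: projected_pts holds the projections of the 3D model points with the user-validated pose, model_desc their SURF descriptors computed off-line, and cv2.xfeatures2d.SURF_create requires the non-free opencv-contrib build.

```python
import numpy as np
import cv2

def init_2d3d_matches(gray, projected_pts, model_desc, radius=15):
    """Associate each projected 3D model point with the nearby Harris corner
    whose SURF descriptor is closest to the off-line descriptor of that point."""
    corners = cv2.goodFeaturesToTrack(gray, maxCorners=500, qualityLevel=0.01,
                                      minDistance=5, useHarrisDetector=True)
    if corners is None:
        return {}
    corners = corners.reshape(-1, 2)
    surf = cv2.xfeatures2d.SURF_create()       # non-free module (opencv-contrib)
    kps = [cv2.KeyPoint(float(x), float(y), 16) for x, y in corners]
    kps, desc = surf.compute(gray, kps)
    matches = {}
    for i, (p, d_ref) in enumerate(zip(projected_pts, model_desc)):
        # keep only the corners close to the projection of the i-th 3D point
        near = [j for j, kp in enumerate(kps)
                if np.linalg.norm(np.array(kp.pt) - p) < radius]
        if near:
            j = min(near, key=lambda j: np.linalg.norm(desc[j] - d_ref))
            matches[i] = kps[j].pt             # 3D point index -> 2D position
    return matches

def track_points(prev_gray, gray, prev_pts):
    """Maintain the 2D/3D matches from frame t to t+1 with the KLT tracker."""
    pts = np.float32(prev_pts).reshape(-1, 1, 2)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
    return nxt.reshape(-1, 2), status.ravel().astype(bool)
```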
5.2 Failure Tests
The pose estimated by vision can be wrong. So, we need to handle errors in order to switch to the Aid-Localization subsystem. The errors are due to several parameters affecting the visual tracking, mainly occlusions, sudden motions and changes of brightness. Therefore, we define some criteria for judging the validity of the estimated pose. If one of these criteria is not satisfied, the pose is rejected and the system switches to the Aid-Localization subsystem.
5.2.1 Number of Tracked Points
The number of 2D/3D matched points affects the accuracy of the minimization of equation 3. Indeed, the more 2D/3D matched points we have, the more accurate the estimated pose is. We empirically defined a minimum number of matches. Below this threshold, it is considered impossible to estimate the pose with vision. Theoretically, we need 3 points to estimate the camera pose, but in practice, with 10 points well distributed in the scene, we obtain a good estimation.
5.2.2 Projection Error
The number of matched points alone is not sufficient, so we also use the projection error. This error represents the mean squared difference between the projections of the 3D points using the estimated pose and the 2D image points. If the error is large (greater than a threshold), the pose is considered wrong. The reprojection threshold is defined in the range of 25 to 100 pixels².
5.2.3 Confidence Intervals
In addition to the above criteria, the data provided by
the Aid-localization subsystem can be used as an in-
dicator of validity for the camera poses obtained by
the vision subsystem. Indeed, these data can be used
to define confidence intervals, for judging whether
the camera pose is consistent or not. Thus, from
each position obtained from GPS and transformed
with the calibration parameters, we can define an el-
lipse whose center is determined by this position and
whose axes are defined by 3σ (the standard devia-
tion of the offset obtained between GPS and cam-
era position) or empirically. During the validation
step, the position obtained with the camera is checked
against the obtained confidence interval. If this posi-
tion is defined in this interval, it is considered valid
otherwise it is rejected. Regarding orientation, each orientation given by the camera is compared to the orientation given by the inertial sensor. The system estimates the difference between the two rotations, $R = R_{CW}^T \, f(R_{GI})$. Computing the trace of this difference, we can deduce the angle $\theta$ between the two rotations as $\theta = \arccos\left(\frac{\mathrm{Trace}(R) - 1}{2}\right)$. If both rotations are identical, the result should be equal to the identity matrix, whose trace is equal to 3 (i.e. $\theta = 0$). Thus, the validity test consists in estimating the trace of the difference of the two rotations. Then, if this trace is above a defined threshold, the obtained orientation is considered valid, otherwise it is rejected. In practice, we choose a threshold equal to 2.9, which corresponds to an angle of about 18°.
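The three tests above can be gathered into a single validation routine. The sketch below is our own illustration: the thresholds reuse the values quoted in the text (10 points, a reprojection error of 25 to 100 pixels², a trace threshold of 2.9), the confidence ellipse is approximated by an axis-aligned 3σ box, and R_gi_in_cam stands for f(R_GI), the inertial orientation brought into the camera frame by the calibration.

```python
import numpy as np

def pose_is_valid(n_points, reproj_error, cam_pos, gps_pos, sigma,
                  R_cw, R_gi_in_cam,
                  min_points=10, max_reproj=100.0, trace_threshold=2.9):
    """Failure tests of section 5.2 (illustrative thresholds from the text)."""
    # 5.2.1: enough tracked 2D/3D matches
    if n_points < min_points:
        return False
    # 5.2.2: mean squared re-projection error below the threshold (pixels^2)
    if reproj_error > max_reproj:
        return False
    # 5.2.3a: camera position inside the confidence region around the GPS
    # position (axis-aligned 3-sigma box used here instead of the ellipse)
    offset = np.abs(np.asarray(cam_pos) - np.asarray(gps_pos))
    if np.any(offset > 3.0 * np.asarray(sigma)):
        return False
    # 5.2.3b: angle between camera and inertial orientations,
    # R = R_CW^T f(R_GI); identical rotations give Trace(R) = 3 (theta = 0)
    R = R_cw.T @ R_gi_in_cam
    if np.trace(R) < trace_threshold:   # trace 2.9 corresponds to about 18 degrees
        return False
    return True
```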
5.3 Automatic Initialization
Unlike the semi-automatic approach, the automatic approach does not need any intervention from the user. This approach is used to reinitialize the vision subsystem. The idea consists in using patches. But instead of associating with each 3D point an image area centered around its projection, we use descriptors extracted around each 2D projection of a 3D model point: we define an area centered around this projection and apply a feature point detector within it. We use the SURF detector, which is characterized by its robustness and its invariance to rotation and scale changes. The SURF points defined around the projections of the 3D points allow us to recover indirectly the matching of the 3D points. To find the correspondents of the projections of the 3D points, we choose to estimate the relationship between two images, i.e. the homography that maps a point $m_i$ defined in image $i$ to the point $m_j$ in image $j$. This homography is calculated from a set of matches, obtained in our case from SURF matching. It allows us to find the corresponding 3D points by transferring their 2D projections from image $i$ to image $j$. In this way, if we know the 2D/3D matches at time $i$, we can find them at time $j$. To make the matching robust and eliminate outliers, we use the RANSAC algorithm (Fischler and Bolles, 1981).
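A possible OpenCV sketch of this reinitialization is given below. It assumes a reference image for which the 2D projections of the 3D model points are known (ref_2d_points), uses SURF matching pruned by a ratio test, estimates the homography with RANSAC and then transfers the stored projections into the current image. As before, SURF requires the non-free opencv-contrib build.

```python
import numpy as np
import cv2

def reinitialize_matches(ref_gray, cur_gray, ref_2d_points, ratio=0.75):
    """Transfer the known 2D/3D matches from a reference image (time i)
    to the current image (time j) through a RANSAC-estimated homography."""
    surf = cv2.xfeatures2d.SURF_create()
    kp1, des1 = surf.detectAndCompute(ref_gray, None)
    kp2, des2 = surf.detectAndCompute(cur_gray, None)
    if des1 is None or des2 is None:
        return None

    # SURF matching with Lowe's ratio test to discard ambiguous matches
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    raw = matcher.knnMatch(des1, des2, k=2)
    good = [m for m, n in raw if m.distance < ratio * n.distance]
    if len(good) < 4:
        return None                       # not enough matches for a homography

    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)   # outliers removed
    if H is None:
        return None

    # Projections of the 3D points transferred to the current image: this
    # recovers the 2D/3D matching at time j from the one known at time i.
    pts = np.float32(ref_2d_points).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(pts, H).reshape(-1, 2)
```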
6 ERROR PREDICTION
AND CORRECTION
The estimation of the produced error is important in
our localization process. Indeed, it allows quantify-
ing the quality of measurements in order to improve
the 3D localization estimation provided by the AL
subsystem. This error represents the offset between the camera pose and the position and orientation deduced from the GPS and the inertial sensor. When the vision fails, we need to predict this error. So, we model the error as a regression problem with a Gaussian process (Williams, 1997). The idea of using a Gaussian process to predict the error was proposed in the work of Reitmayr and Drummond (Reitmayr and Drummond, 2007), who used it to predict the GPS error in order to reinitialize visual tracking. During visual tracking, we record the offset between the AL subsystem and the vision subsystem; this constitutes an online training step. When the visual tracking fails, the Gaussian process predicts the offset made by the GPS. This offset, represented by the predicted mean error, is used to correct the estimation of the 3D localization.
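The sketch below illustrates this prediction/correction step with scikit-learn's Gaussian process regressor. The paper does not specify the implementation, the kernel or the regression inputs, so those choices (an RBF plus white-noise kernel, GPS positions as inputs, position offsets as outputs) are assumptions made for illustration only.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

class OffsetPredictor:
    """Online learning of the offset between the AL and vision subsystems."""

    def __init__(self):
        kernel = RBF(length_scale=10.0) + WhiteKernel(noise_level=0.1)  # assumed kernel
        self.gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
        self.X, self.y = [], []

    def record(self, gps_position, vision_position):
        """While vision is valid, store the offset between the two subsystems."""
        gps_position = np.asarray(gps_position, dtype=float)
        self.X.append(gps_position)
        self.y.append(np.asarray(vision_position, dtype=float) - gps_position)

    def fit(self):
        self.gp.fit(np.asarray(self.X), np.asarray(self.y))

    def correct(self, gps_position):
        """When vision fails, add the predicted mean offset to the GPS position."""
        gps_position = np.asarray(gps_position, dtype=float)
        offset = self.gp.predict(gps_position.reshape(1, -1))[0]
        return gps_position + offset
```

A similar scheme could be applied to the orientation offset between the inertial sensor and the camera; only the representation of the regression targets would change.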
7 EXPERIMENTS AND RESULTS
Our system is developed using ARCS (Augmented Reality System Component) (Didier et al., 2006), a component-based programming system. ARCS allows rapid prototyping of AR applications and facilitates the interfacing of multiple heterogeneous technologies. On the one hand, ARCS uses a classical component programming paradigm specially designed to meet the constraints imposed by AR applications (especially the real-time constraint). On the other hand, ARCS is based on a finite state machine which allows switching from one state to another, resulting in a reconfiguration of the organisation of our components. This feature facilitates the implementation of our hybrid system.
The experiments were performed using a uEye UI-2220RE USB industrial camera with an 8 mm focal length lens. The camera captures video frames at a resolution of 768x576 pixels. Our tests are performed at 10 fps. The attached inertial sensor is an XSens MTi, which contains gyroscopes, accelerometers and magnetometers and provides 3D orientation data. The GPS receiver is a Trimble Pro XT, which has sub-meter accuracy. The system runs on a handheld Dell Latitude XT tablet-PC (Core 2 Duo U7700, 1.33 GHz). For all our experiments, we have a 3D model of the building tracked in the scene. This model is built from data acquired using telemetry and from building blueprints. The model contains primarily a set of relevant 3D points of the building.
We evaluated our localization system using real data acquired outdoors under real conditions. The cam-
era was calibrated off-line using the Faugeras-Toscani
algorithm (Faugeras and Toscani, 1987) to compute
intrinsic parameters. The hybrid sensor was cali-
brated using a set of reference data (GPS positions
and images for GPS/Camera calibration and inertial
sensor orientations and images for Inertial/Camera
calibration). The experiments conducted are intended
to demonstrate how the system operates in different
situations, mainly:
1. The occlusion of tracked points: this may be
caused by objects or by the camera motion;
2. The brightness variations;
3. Sudden and rapid motions of the camera worn by the user.
The system is worn by an unconstrained user moving in an outdoor environment. In parallel, the system estimates the position and orientation of the camera. In order to visualise the obtained results, we register a wireframe model representing the environment. We adopt a color code to indicate which of the two subsystems is operational. Thus, if registration is obtained with data provided by the vision subsystem, the model is shown in red. Otherwise, if the poses are calculated with the AL subsystem, the model is shown in magenta.
7.1 Occlusion Case
When the vision subsystem was in operation, we obscured some of the tracked points used in the pose estimation. Figure 5 shows an example of the obtained results. We can observe in 5-(b-c) that the system uses the AL subsystem to align the wireframe model on the real image. The localization system detects that there are not enough points to estimate the pose using vision. Thus, the system switches to the AL subsystem, which provides the poses necessary for registration. Meanwhile, the localization system tries to reinitialize the vision. When it succeeds in obtaining a sufficient number of matched points, the vision subsystem resumes its role, as can be seen in Figure 5-d. We conducted these tests several times, and each time the system adapted to the situation. Concerning the registration when the system uses the AL subsystem, we can see that the wireframe model is registered properly on the real view. Admittedly, this registration is not as accurate as the one given by vision, but it is sufficient. In addition, the projections of the 3D points are in the neighbourhood of their correspondents, which helps the reinitialization step.
7.2 Case of Change in Brightness
The variations in brightness directly affect the visual tracking and can generate false matches.
Figure 5: Semi-occlusion case: obtained results.
Figure 6: Changes in brightness are handled properly.
Figure 6 shows an example of brightness variation in which we can see that the image becomes darker (see fig. 6-b and 6-c). When the brightness varies, the visual tracking fails and the AL subsystem replaces it. We can observe that the reinitialization approach successfully found the 2D/3D matches despite the difference in brightness. This is due to the use of SURF descriptors, which have the advantage of being invariant to changes in brightness.
7.2.1 Sudden Motion Case
In mobile situations, the user's motions are not always smooth, uniform and slow. Indeed, they may be abrupt and thus create blurred images. In this case, the visual tracking fails. In the case of fast motion, the image displacement can be large, so the visual tracking cannot find any matches or tends to produce mismatches.
Figure 7: Sudden motion case: obtained results.
Figure 7-b shows an example of a blurred image due to a rapid motion of the camera caused by user mobility. The presence of blur is detected through an insufficient number of points or a high projection error. The AL subsystem becomes functional until the vision subsystem reinitializes (see fig. 7-d). However, we found that in some cases the registration obtained with the AL subsystem data is not good enough. This is because, in the presence of sudden motion, the failure of the visual tracking is sometimes not detected quickly enough, which influences the measurements used for the correction.
7.3 System in Mobility Situation
The results obtained when the whole system is functional are given below. The initialization process al-
lows us to have the matching of the 3D visible points
from the 3D model with their projections in the first
view. From this 2D/3D matching, the set of 2D points
are tracked from one frame to another. For each
frame, we register the wireframe model using the po-
sitions and orientations obtained with our hybrid lo-
calization system.
In figure 8, the green color projection is obtained
from the positions and orientations provided by the
vision subsystem. Visually, the model is registered to
the real view. In magenta, the projected model is ob-
tained with the positions and orientations provided by
the Aid-Localization subsystem. We observe on fig-
ure 8 that when vision fails, the localization system
switches to the Aid-localization subsystem to provide
localization. The localization is corrected with the
predicted error which contributes to improve the es-
timation (Figure 8). The obtained results are quite satisfactory regarding our needs (i.e. correct registration).
Figure 8: Registration of the 3D model using the poses obtained with our hybrid system (frames #0686, #1053, #1054, #1055).
Figure 9: Registration of the 3D model using the Aid-Localization subsystem: occlusion case (frames #1236, #1239, #1245, #1269).
In figure 9, we can observe that during the occlusion of the tracked points, the Aid-Localization subsystem is able to provide an estimation of the position and orientation. Therefore, even under total occlusion, our system can provide a rough estimation of the localization.
7.4 Performance of the System
To assess the accuracy of the inertial sensor, we com-
pared the orientations produced from the sensor data
to those computed by the vision pose estimation al-
gorithm. We recorded a video with several orienta-
tions in an outdoor environment. The two sensors are also time-stamped.
Figure 10: Angle errors: camera orientation vs. inertial sensor orientation.
Figure 11: Registration using the vision subsystem (red line) vs. the AL subsystem without correction (blue line).
Figure 10 shows the error between
the two orientations. With a mean error of about (θx = 0.27°, θy = 0.41°, θz = 0.24°) and standard deviations of (0.28, 0.43, 0.25), we obtained good results. These errors, which are acceptable (around 5° in some extreme cases), can be caught and corrected with the error prediction.
Regarding the registration, we present in figure 11 a comparison between the results obtained with the vision subsystem and the AL subsystem. The projected model shown with a red line is obtained with the poses estimated using vision; we obtain a mean reprojection error of around 1.67 pixels with a standard deviation of about 2.37. With the AL subsystem, we obtain the model shown in blue. From
the poses provided by the AL subsystem, the system
gives a mean reprojection error equal to 47 pixels with
a standard deviation of 60. Note that this result is ob-
tained without correction. However, in our various
tests, we found out that external factors can affect the
inertial measurements, particularly in defining its local level reference frame $R_G$, whose x axis points towards the local magnetic north. This causes errors in the orientation estimation. To correct this, we propose to continuously re-estimate the rotation between $R_G$ and the world reference frame $R_W$. The resulting drift can be observed in the registration of the wireframe model on the real image in figure 12. Figure 12.a presents the registration using the orientation provided by the inertial sensor without correcting the rotation: the wireframe is clearly misaligned due to the wrong orientation. In contrast, in figure 12.b, where the rotation $R_{GW}$ is corrected online, the wireframe is registered correctly over the real view.
Figure 12: Registration using the inertial sensor's orientation and the GPS position: (a) without correction, (b) with correction.
8 CONCLUSIONS
AND FUTURE WORK
In this paper, we have presented a localization system
combining three sensors (camera, GPS and inertial
sensor) dedicated to large scale outdoor environ-
ments. The proposed system operates under an assistance scheme defining two subsystems. The main subsystem is vision, which continuously provides localization measurements using a markerless approach. The fallback subsystem, called Aid-Localization, is composed of the GPS receiver and the inertial sensor. Its main role is to replace the vision subsystem when the latter cannot provide correct localization measurements. Indeed, according to external conditions, the system changes its internal state to adapt itself and to provide a localization measurement under any circumstance. These state changes trigger switches from one subsystem to another according to the available data and the localization accuracy.
Various issues were addressed. In addition to the calibration approaches, we were primarily interested in how to handle the different switches and in proposing appropriate approaches. The vision subsystem uses a point-based pose estimation approach that relies on natural 2D points extracted from images and matched to a 3D model describing the 3D structure of the environment. We have proposed two efficient initialization approaches (a semi-automatic one and an automatic one) which allow matching 2D image points with 3D points. The automatic approach reinitializes the vision subsystem by using descriptor patches. The method is efficient, robust and accurate even when the points of view are very different (large motion and/or brightness variations). To improve the accuracy of the AL subsystem, we use a Gaussian process to predict and correct the error introduced by the GPS and the inertial sensor, in order to approach the registration accuracy of the vision subsystem.
We can conclude that the obtained results are quite satisfactory with respect to the purpose of an AR system (i.e. correct registration), with a reasonably good accuracy. Tested in an outdoor environment, our system adapts to the environmental conditions. For example, as shown in the results, in the case of total occlusion the AL subsystem takes over the 3D localization estimation until vision becomes operational again.
However, improvements must be made to the vision-based method. Indeed, other vision-based methods, such as edge-based methods, could be used to improve the accuracy of the vision-based pose estimation. In addition, to give the user more mobility, the system could include a SLAM (Simultaneous Localization and Mapping) approach in order to reconstruct unmodelled parts of the environment. This would allow the 3D model to be enriched online and the user to move within these parts.
ACKNOWLEDGEMENTS
This work is supported by the RAXENV project
funded by the French National Research Agency (ANR).
REFERENCES
Ababsa, F. (2009). Advanced 3d localization by fusing
measurements from gps, inertial and vision sensors.
In Systems, Man and Cybernetics, 2009. SMC 2009.
IEEE International Conference on, pages 871 –875.
Ababsa, F. and Mallem, M. (2007). Hybrid 3d camera
pose estimation using particle filter sensor fusion. In
Advanced Robotics, the International Journal of the
Robotics Society of Japan (RSJ), pages 165–181.
Aron, M., Simon, G., and Berger, M. (2007). Use of inertial
sensors to support video tracking: Research articles.
Comput. Animat. Virtual Worlds, 18(1):57–68.
Azuma, R. (1993). Tracking requirements for augmented
reality. Commun. ACM, 36(7):50–51.
Bay, H., Ess, A., Tuytelaars, T., and Gool, L. V. (2008). Surf:
Speeded up robust features. Computer Vision and Im-
age Understanding (CVIU), 110(3):346–359.
Bleser, G. (2009). Towards Visual-Inertial SLAM for Mo-
bile Augmented Reality. PhD thesis, Technical Uni-
versity Kaiserslautern.
Bleser, G. and Strickery, D. (2008). Using the marginalised
particle filter for real-time visual-inertial sensor fu-
sion. Mixed and Augmented Reality, IEEE / ACM In-
ternational Symposium on, 0:3–12.
Borenstein, J. and Feng, L. (1996). Gyrodometry: A new method for combining data from gyros and odometry in mobile robots. In Proceedings of the 1996 IEEE International Conference on Robotics and Automation, pages 423–428.
Didier, J., Otmane, S., and Mallem, M. (2006). A compo-
nent model for augmented/mixed reality applications
with reconfigurable data-flow. In 8th International
Conference on Virtual Reality (VRIC 2006), pages
243–252, Laval (France).
Faugeras, O. and Toscani, G. (1987). Camera calibration
for 3d computer vision. In Proc. Int’l Workshop In-
dustrial Applications of Machine Vision and Machine
Intelligence, pages 240–247.
Fischler, M. A. and Bolles, R. C. (1981). Random sample
consensus: a paradigm for model fitting with appli-
cations to image analysis and automated cartography.
Commun. ACM, 24(6):381–395.
Harris, C. (1993). Tracking with rigid models. Active vi-
sion, pages 59–73.
Hol, J., Schon, T., Gustafsson, F., and Slycke, P. (2006).
Sensor fusion for augmented reality. In Information
Fusion, 2006 9th International Conference on, pages
1–6, Florence. IEEE.
Lu, C.-P., Hager, G. D., and Mjolsness, E. (2000). Fast
and globally convergent pose estimation from video
images. IEEE Trans. Pattern Anal. Mach. Intell.,
22(6):610–622.
Lucas, B. and Kanade, T. (1981). An iterative image regis-
tration technique with an application to stereo vision.
In IJCAI81, pages 674–679.
Maidi, M., Ababsa, F., and Mallem, M. (2005). Vision-
inertial system calibration for tracking in augmented
reality. In 2nd International Conference on Informat-
ics in Control, Automation and Robotics, pages 156–
162.
Reitmayr, G. and Drummond, T. (2006). Going out: Robust
model-based tracking for outdoor augmented reality.
In IEEE ISMAR, Santa Barbara, California, USA.
Reitmayr, G. and Drummond, T. (2007). Initialisation for
visual tracking in urban environments. In IEEE IS-
MAR, Nara, Japan.
Ribo, M., Lang, P., Ganster, H., Brandner, M., Stock, C.,
and Pinz, A. (2002). Hybrid tracking for outdoor aug-
mented reality applications. IEEE Comput. Graph.
Appl., 22(6):54–63.
Schall, G., Wagner, D., Reitmayr, G., Taichmann, E.,
Wieser, M., Schmalstieg, D., and Wellenhof, B. H.
(2009). Global pose estimation using multi-sensor fu-
sion for outdoor augmented reality. In In Proceedings
of IEEE Int. Symposium on Mixed and Augmented Re-
ality 2009, Orlando, Florida, USA.
Viéville, T., Romann, F., Hotz, B., Mathieu, H., Buffa, M.,
Robert, L., Facao, P., Faugeras, O., and Audren, J.
(1993). Autonomous navigation of a mobile robot
using inertial and visual cues. In Proceedings of the
IEEE International Conference on Intelligent Robots
and Systems.
Williams, C. (1997). Prediction with gaussian processes:
From linear regression to linear prediction and be-
yond. Technical report, Neural Computing Research
Group.
You, S., Neumann, U., and Azuma, R. (1999). Orien-
tation tracking for outdoor augmented reality regis-
tration. IEEE Computer Graphics and Applications,
19(6):36–42.
Zendjebil, I., Ababsa, F., Didier, J.-Y., and Mallem, M.
(2010). A gps-imu-camera modelization and calibra-
tion for 3d localization dedicated to outdoor mobile
applications. In International Conference on Control, Automation and Systems.
Zendjebil, I. M., Ababsa, F., Didier, J.-Y., and Mallem, M.
(2008). On the hybrid aid-localization for outdoor
augmented reality applications. In VRST ’08: Pro-
ceedings of the 2008 ACM symposium on Virtual re-
ality software and technology, pages 249–250, New
York, NY, USA. ACM.