Absolute Localization using Visual Data for Autonomous Vehicles
Safa Ouerghi¹, Rémi Boutteau², Pierre Merriaux², Nicolas Ragot², Xavier Savatier² and Pascal Vasseur³
¹ Sup'Com Tunis, Ariana, Tunisia
² Research Institute for Embedded Systems (IRSEEM-ESIGELEC), Saint Etienne du Rouvray, France
³ LITIS Lab, University of Rouen, Saint Etienne du Rouvray, France
Keywords: Absolute Localization, Monocular Vision, Structure-From-Motion, Autonomous Vehicle, Vehicle Environment Perception.
Abstract:
In this paper, we propose an algorithm for estimating the absolute pose of a vehicle using visual data. Our
method works in two steps: first we construct a visual map of geolocalized landmarks, then we localize the
vehicle using this map. The main advantages of our method are that the localization of the vehicle is absolute
and that it requires only a monocular camera and a low-cost GPS. We firstly outline our method, then we
present our experimental results on real images using a reference database: the KITTI Vision Benchmark
Suite.
1 INTRODUCTION
In recent years, vision sensors have become ever more
common in vehicles and are employed in many Advanced
Driver Assistance Systems (ADAS): pedestrian detec-
tion (Dollar et al., 2012), bird’s eye view for vehi-
cle surrounding monitoring (Liu et al., 2008), backup
cameras and lane departure warnings (Kozak et al.,
2006). ADAS were initially limited to luxury vehi-
cles but have now become available on high-volume
models. The proliferation of vision sensors in vehicles
has driven costs down, and several different cameras
can now be embedded. Besides the
fact that cameras are inexpensive sensors, their main
advantage is their versatility due to the huge quantity
of information they provide. A single camera can in-
deed perform several functions depending on the im-
plemented algorithms.
Through research on ADAS in the past few years,
vehicles have now reached level 2 of automation as
defined by SAE International’s On-Road Automated
Vehicle Standards Committee (SAE, 2015). This
level refers to partial automation, which means that
the vehicle can execute steering, acceleration and de-
celeration but the driver has to monitor the driving
environment. Researchers and car manufacturers are
now seeking to reach level 5 of automation, which
is the level of full automation. From levels 3 to 5,
the vehicle must be able to localize itself accurately
and to collect information from its surrounding envi-
ronment. These localization and perception tasks can
be solved quite easily by adding new sensors such as
lidars (Light Detection and Ranging) since they pro-
vide a 3D structure of the scene with a high level of
accuracy and at a high rate. However, the constraints
of the automotive industry are not compatible with
such sensors. Costs are high and robustness is not
optimal due to the lidar’s mechanical parts. That is
why vision-based localization is of great interest and
the scientific community is working actively on this
topic.
In this paper, we propose to use a monocular cam-
era to localize the vehicle on an absolute map, which
is the first step towards vehicle automation. First,
a map is built using additional sensors, in particular
an IMU/RTK GPS system. This step can be carried
out by a specific vehicle and the map-building can be
done off-line. Then, we propose an on-line process
to localize the vehicle using only a monocular camera
and a low-cost GPS.
In Section 2, we set out the state of the art with
regard to vehicle localization using visual data. In
Section 3, we present our method to build the map,
and then to localize the vehicle within this map. Sec-
tion 4 is dedicated to the experimental results obtained
for the map-building and for the localization stages.
Lastly, in Section 5, we present a conclusion and iden-
tify several directions for future work.
2 RELATED WORK
Vehicle localization is a fundamental requirement in
robotics and intelligent transportation systems that
has been extensively tackled in the two main fields
dealing with robot navigation, namely Simultane-
ous Localization and Mapping (SLAM) (Durrant-
Whyte and Bailey, 2006) (Bailey and Durrant-Whyte,
2006) (Dissanayake et al., 2001) (Munguia and Grau,
2007) and real-time Structure From Motion or Visual
Odometry (Nister et al., 2006) (Maimone et al., 2007)
(Comport et al., 2010).
SLAM approaches are rooted in local methods: they
operate in unknown environments, constructing a 3D
model and estimating the camera pose relative to it.
One of the most successful approaches
to date is PTAM (Parallel Tracking and Mapping)
(Klein and Murray, 2007). PTAM builds keyframes-
based maps from data acquired by a monocular cam-
era and uses tracking to estimate camera pose rela-
tive to the map. This is achieved by two threads run-
ning in parallel: one thread tracks the camera posi-
tion relative to the existing map, and a second map-
ping thread integrates keyframes in the map and re-
fines the map by performing a global bundle adjust-
ment. Though PTAM provides good performance in
localization and map building, it was originally de-
signed for augmented reality applications in small
workspaces and is therefore not prepared to cope with
large-scale mapping involving loop closures and big
maps. To extend the range of map coverage, Paral-
lel Tracking And Multiple Mapping (PTAMM) (Cas-
tle and Murray, 2009) was proposed, allowing multi-
ple maps with automatic switching. However, the maps
are limited in size and are maintained independently
in local coordinate frames, which makes them better
suited to robot navigation than to vehicles
and transportation systems.
An alternative to SLAM algorithms is the
place recognition approach, which has been adopted in
SLAM to detect loop closures (Ho and Newman,
2007) (Mei et al., 2011) (Cummins and Newman,
2008). The most successful approach to date is FAB-
MAP (Cummins and Newman, 2010), a probabilis-
tic approach to image matching based on a ”visual
bag-of-words” model. FAB-MAP performs localiza-
tion on trajectories up to 1000km in length. De-
spite the impressive results of FAB-MAP, it has only
been demonstrated on trajectories with one loop closure
for each location (Cummins and Newman, 2010).
It could consequently fail when revisiting the same
location several times, thereby reducing the recall
performance over time. An attempt to improve the
aforementioned performance is the CAT-SLAM sys-
tem (Maddern et al., 2012) performing the fusion of
the robot local movement with appearance informa-
tion using a particle filter.
In addition, all of the previous localization and mapping
methods fail when tested in dynamic environments,
in environments with too many or very few salient
features, or where there are partial or total occlusions,
which justifies falling back on a GPS in these vulnerable
situations. On the other hand, absolute localization
is needed by several appli-
cations used in transportation systems, especially for
future ADAS systems based on car to car communi-
cation and assuming that all positions are known in
a global and therefore exchangeable reference frame
(Rockl et al., 2008). This motivated us to design a
robust, low-cost absolute localization system aimed at
vehicles and transportation systems.
3 METHODOLOGY
The general idea of our algorithm is to localize a cam-
era given a set of previously captured and geolocal-
ized images as described in Figure 1. Our method
is composed of two main steps: the map-building
of the geolocalized landmarks using geolocalized im-
ages, and then the localization within this map. In this
section, we describe the two stages.
Figure 1: General idea of our method: given a set of geolo-
calized images, find the pose of a camera.
3.1 Map-building
Map-building is the key step to obtaining an accurate
localization, so this step should be done carefully. If
the map is not accurate, or does not contain suffi-
cient robust features, localization will be very diffi-
cult, even impossible. To build the map, an instru-
mented vehicle has to be driven manually. This ve-
hicle contains a very accurate localization system. In
the KITTI dataset (Geiger et al., 2012) (Geiger et al.,
2013), this system is composed of an IMU and an
RTK-GPS. This system provides the vehicle’s pose
with accuracy within a few centimeters and high fre-
quency (100 Hz). In our experiments, we only need
the IMU/GPS measurements and the images from a
monocular grayscale camera.
Once the acquisition has been done, we need to
process the data to build a compact and representative
map. Indeed, a sequence of a few kilometers gener-
ates several gigabytes of images, so we need to cre-
ate a more compact database if we want it to be em-
bedded in the vehicle. Instead of keeping all the im-
ages, the first step is to select only a few key images.
This is achieved by detecting and matching keypoints
between consecutive frames. Let the first image be
considered as a keyframe. We detect and match the
keypoints with the following frames. If the number
of matches is too high, it means that the two images
are very close visually, so the second frame is not a
keyframe. We consequently define a threshold (500 in
our experiments): when the number of matched keypoints
falls below it, the current frame is selected as a new
keyframe. The selection of keyframes has two advantages:
first, it compresses the database by keeping only a few
images; second, it ensures that
there is motion between two frames. In the case of
a stopped vehicle, if all the images are used, there
is no distance between two images and the triangu-
lation step is consequently impossible. Another way
to build the map could be to use spatial discretiza-
tion, for example by taking a keyframe every 3 me-
ters. However, in the case of a rotation, we need more
keyframes than in the case of a translation since the
images vary quickly. To reject false matches, the fun-
damental matrix is computed using a RANSAC (Ran-
dom Sample Consensus) scheme. The computation
of the fundamental matrix is done using the Eight-
Point algorithm (Hartley, 1997). The Five-Point al-
gorithm (Nister, 2004) should provide the essential
matrix with better accuracy than the Eight-Point al-
gorithm but is considerably slower. In fact, what is
important at this stage is not accuracy, but the rejec-
tion of false matches.
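To illustrate this keyframe-selection loop, a minimal Python/OpenCV sketch is given below. Only the match-count threshold of 500 comes from our experiments; the detector choice, the use of the opencv-contrib FREAK module and all function names are illustrative assumptions, and the fundamental-matrix outlier rejection is shown separately in the triangulation sketch further on.

```python
import cv2

MATCH_THRESHOLD = 500  # below this many matches, the current frame becomes a new keyframe

def select_keyframes(image_paths):
    """Keep only the frames that differ sufficiently from the last keyframe."""
    detector = cv2.FastFeatureDetector_create()           # illustrative detector choice
    extractor = cv2.xfeatures2d.FREAK_create()            # FREAK binary descriptor (opencv-contrib)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)  # Hamming distance for binary descriptors

    keyframes = []
    ref_desc = None
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        kp = detector.detect(img)
        kp, desc = extractor.compute(img, kp)

        if ref_desc is None:
            keyframes.append(path)  # the first image is always a keyframe
            ref_desc = desc
            continue

        matches = matcher.match(ref_desc, desc)
        if len(matches) < MATCH_THRESHOLD:
            # Too few matches: the view has changed enough, keep this frame.
            keyframes.append(path)
            ref_desc = desc
    return keyframes
```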
Our method relies on the use of keypoints so each
image is compressed into a set of keypoints and their
associated descriptors, as illustrated in Figure 2. Con-
sequently, for each keyframe, keypoints are matched
with the following keyframe. To do so, we use a bi-
nary descriptor, FREAK (Fast Retina Keypoint) (Alahi
et al., 2012). It is advantageous because it is fast
to compute and requires little memory. More-
over, the computation of similarities between descrip-
tors is very fast because the Hamming distance can
be used instead of the Euclidean distance. These key-
points are then triangulated using the poses provided
by the IMU/GPS system. This is possible thanks to
the high accuracy of this system, especially compared
to the motion between two keyframes. This method
has many advantages. 3D points are geolocalized
and there is none of the drift that affects visual odom-
etry systems. In addition, computation time is much
shorter than that of SFM approaches that use bundle
adjustment with loop-closure for example. For each
keyframe, we store the keypoints, their descriptors
and the computed 3D coordinates of the correspond-
ing points. All this data represents only a few hundred
kilobytes, which is considerably smaller than the size
of an image in memory.
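The triangulation step can be sketched as follows, assuming that the IMU/RTK-GPS pose of each keyframe has already been converted into a 3x4 world-to-camera matrix; the function and variable names are illustrative, not the exact implementation.

```python
import cv2
import numpy as np

def triangulate_between_keyframes(K, pose_a, pose_b, pts_a, pts_b):
    """Triangulate matched 2D keypoints between two geolocalized keyframes.

    K        : 3x3 camera calibration matrix
    pose_a/b : 3x4 world-to-camera [R|t] matrices derived from the IMU/RTK-GPS poses
    pts_a/b  : Nx2 arrays of matched keypoint coordinates in the two keyframes
    Returns an Mx3 array of geolocalized 3D points in the world frame (inliers only).
    """
    pts_a = np.asarray(pts_a, dtype=np.float64)
    pts_b = np.asarray(pts_b, dtype=np.float64)

    # Reject false matches with a RANSAC estimation of the fundamental matrix.
    _, mask = cv2.findFundamentalMat(pts_a, pts_b, cv2.FM_RANSAC, 1.0, 0.99)
    inliers = mask.ravel() == 1
    pts_a, pts_b = pts_a[inliers], pts_b[inliers]

    # Projection matrices P = K [R|t] of the two keyframes.
    P_a, P_b = K @ pose_a, K @ pose_b

    # Linear triangulation; the result is homogeneous (4xM).
    X_h = cv2.triangulatePoints(P_a, P_b, pts_a.T, pts_b.T)
    return (X_h[:3] / X_h[3]).T
```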
Figure 2: Structure of the built map.
3.2 Localization
The mapping is now achieved and we have a database
which contains, for each keyframe, the absolute posi-
tion of the keyframe, the keypoints, their descriptors
and the corresponding 3D points. The localization
can now take place. In the localization process, we do
not need the costly IMU/RTK-GPS system. The lo-
calization can be done using only a monocular camera
and a low-cost GPS. The aim of the localization is, for
each image captured by the camera, to find the pose
of the camera in an on-line process (unlike the
map-building, which can be done off-line). To local-
ize the vehicle, we need to find the nearest keyframe
in the database. This is achieved using the low-cost
GPS to have an initial guess of the position. Once the
nearest keyframe is identified, the keypoints are de-
tected in the current frame and then matched against
it. The outliers are rejected using the fundamental
matrix estimation with a RANSAC scheme. The 2D
points of the current frame are now linked to those of
the keyframe and consequently to their corresponding
3D points.
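A minimal sketch of this retrieval and matching step is given below; the data layout (arrays of keyframe positions, keypoints, descriptors and 3D points) and all names are illustrative assumptions rather than the exact implementation.

```python
import cv2
import numpy as np

def find_nearest_keyframe(gps_position, keyframe_positions):
    """Return the index of the keyframe whose stored absolute position is
    closest to the (noisy) low-cost GPS estimate, both in the same world frame."""
    distances = np.linalg.norm(keyframe_positions - gps_position, axis=1)
    return int(np.argmin(distances))

def build_2d3d_correspondences(frame_pts, frame_desc, kf_pts, kf_desc, kf_points3d):
    """Match the current frame against the selected keyframe, reject outliers
    with a RANSAC fundamental-matrix fit, and return the 2D points of the frame
    together with the corresponding 3D points stored in the map."""
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(frame_desc, kf_desc)

    pts_frame = np.float32([frame_pts[m.queryIdx] for m in matches])
    pts_kf = np.float32([kf_pts[m.trainIdx] for m in matches])
    pts_3d = np.float32([kf_points3d[m.trainIdx] for m in matches])

    # Reject false matches with a RANSAC estimation of the fundamental matrix.
    _, mask = cv2.findFundamentalMat(pts_frame, pts_kf, cv2.FM_RANSAC, 1.0, 0.99)
    inliers = mask.ravel() == 1
    return pts_frame[inliers], pts_3d[inliers]
```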
The localization is achieved by solving a
Perspective-n-Point problem. The PnP problem, also
known as pose estimation, was first introduced by
Fischler and Bolles (Fischler and Bolles, 1981). It
consists in estimating the relative pose of a camera
and an object knowing the position of n features in
the object coordinate system and their projections in
the image. Given a set of n correspondences between 3D points $M_i = \begin{bmatrix} X_i & Y_i & Z_i & 1 \end{bmatrix}^T$ expressed in a world reference frame and their 2D projections $m_i = \begin{bmatrix} u_i & v_i & 1 \end{bmatrix}^T$, we look for the transformation $[R, t]$ from the world coordinate system to the camera coordinate system given by Equation 1. To do so, we minimize the reprojection error $E$ defined in Equation 2, where $P$ is the projection function which depends on the camera's pose and on the 3D coordinates of the points.

$$s \begin{bmatrix} u_i \\ v_i \\ 1 \end{bmatrix} = K \left[ R \,|\, t \right] \begin{bmatrix} X_i \\ Y_i \\ Z_i \\ 1 \end{bmatrix} \qquad (1)$$

$$E = \sum_{i=1}^{n} \left\| P(R, t, M_i) - m_i \right\|^2 \qquad (2)$$
The minimum number of correspondences to ob-
tain a unique solution is n = 4. However, a larger
set of points is required to obtain a more robust solu-
tion, using a RANSAC scheme (Fischler and Bolles,
1981). This is necessary because some 3D points may be
incorrect (due to bad matches for example), and a lin-
ear estimation will provide an erroneous estimation of
the relative pose if no outlier rejection is carried out.
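Assuming the 2D-3D correspondences from the previous step are available, the pose can be recovered as sketched below. OpenCV's RANSAC PnP solver followed by Levenberg-Marquardt refinement is one possible implementation of this scheme, not necessarily the one used in our experiments, and solvePnPRefineLM requires a recent OpenCV version.

```python
import cv2
import numpy as np

def estimate_absolute_pose(pts_3d, pts_2d, K):
    """Estimate the camera pose from 2D-3D correspondences with a RANSAC PnP
    solver, then refine it on the inliers by minimizing the reprojection
    error of Equation 2."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.float32(pts_3d), np.float32(pts_2d), K, distCoeffs=None,
        reprojectionError=3.0, confidence=0.99)
    if not ok or inliers is None:
        return None

    idx = inliers.ravel()
    # Levenberg-Marquardt refinement on the inlier set (OpenCV >= 4.1).
    rvec, tvec = cv2.solvePnPRefineLM(
        np.float32(pts_3d)[idx], np.float32(pts_2d)[idx], K, None, rvec, tvec)

    R, _ = cv2.Rodrigues(rvec)        # rotation matrix from the Rodrigues vector
    camera_position = -R.T @ tvec     # camera center expressed in the world (map) frame
    return R, tvec, camera_position
```

Since the 3D points of the map are geolocalized, the recovered camera center is directly expressed in the absolute map frame, which is what makes the estimated pose absolute.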
4 EXPERIMENTAL RESULTS
We have evaluated our approach on the KITTI Vision
Benchmark Suite. The KITTI database provides the
images from the camera, their poses provided by the
IMU/RTK GPS system, and the calibration matrix of
the camera. We have built our map of geolocalized
landmarks as described in Section 3 from the images
of the left grayscale camera mounted on the vehicle.
This camera is a Point Grey Flea 2 camera with a res-
olution of 1384x1032 (1.4MP) working at 10 frames
per second. The images are rectified and cropped in
the database so the final resolution is 1241x376 and
the distance between two images is around 80cm.
4.1 Map building
Table 1 summarizes the results obtained for the
map-building process. As we can see, the size of
the database we obtained is forty times smaller than
that of the original KITTI database. The number
of keyframes is four times smaller than the number
of images of the KITTI database, which leads to a
keyframe every 3m instead of 80cm.
Table 1: Comparison between the KITTI database and the map built.

                                           KITTI database    Map built
  Number of poses                          4540              1361
  Mean distance between two poses (cm)     80                320
  Size (MB)                                1126              26
The rejection of outliers can be done using the
Eight-Point or the Five-Point algorithms as discussed
in Section 3. To justify our choice of using the
Eight-Point algorithm, the computation time and the
proportion of inliers are listed in Table 2. As we can
see, the Five-Point algorithm is five times slower
than the Eight-Point but the proportion of inliers is
approximately the same for the two algorithms. As
we are only interested in correctly matched points,
and not in the estimated fundamental or essential
matrix, the Eight-Point algorithm is sufficient. Figure
3 shows a matching example between two keyframes
after outlier rejection.
Table 2: Comparison between 8-Point and 5-Point algorithms.

                           Five-Point algorithm    Eight-Point algorithm
  Computation time         52 ms                   11 ms
  Proportion of inliers    59%                     55%
4.2 Localization
Our algorithm was evaluated on a sequence of the
KITTI database. As there is no low-cost GPS in this
dataset, we find the nearest keyframe using the ground
truth value of the vehicle’s pose and adding a random
noise of ±5 meters.

Figure 3: An example of matching (in green) between two keyframes after outlier rejection.

Figures 4, 5 and 6 show the
estimated trajectory compared with the ground truth
from the IMU/RTK GPS. The mean error is around
17 centimeters, which is sufficient to control a ve-
hicle autonomously. As shown in Figure 6, there
are sometimes discontinuities between two estimated
poses, which could be eliminated using a Kalman Fil-
ter (Kalman, 1960). The reprojection error of the 3D
points into the image could serve as the measure of the
estimation confidence.
Figure 4: The estimated trajectory (red line) and the ground truth trajectory (blue line).

Figure 5: Zoom on a portion of the trajectory. The estimated trajectory is in red and the ground truth trajectory in blue.

Figure 6: Zoom on a portion of the trajectory. The estimated trajectory is in red and the ground truth trajectory in blue.

Figure 7: Evolution of the localization error with respect to the distance between two keyframes.
Table 3 and Figure 7 represent the error of the esti-
mated trajectories when we decrease the threshold on
the number of matches to determine whether an image
is a keyframe or not. As we can see, the error grows
slowly, and when there is only one image every 10
meters, the number of matches becomes insufficient
to correctly estimate the camera’s pose.
Table 3: Evolution of the mean error with respect to the distance between keyframes.

  Mean distance between two keyframes (m)       3.2     6.1     9.6
  Mean error of the estimated trajectory (m)    0.17    0.23    0.36
5 CONCLUSION AND FUTURE
WORK
In this paper, we have presented a new method for the
absolute localization of a vehicle using visual data.
First, a selection of keyframes is made to reduce the
complexity and the size of the database. Then, the
keypoints of these keyframes are matched and trian-
gulated to obtain 3D points. As our images are geolo-
calized, we obtain a map of geolocalized 3D points,
their associated keypoints and descriptors. The map-
building needs to be done only once, so this step
can be carried out off-line. The localization is then
achieved on-line, using the previously built map. For
each frame from the camera mounted on the vehicle,
the keypoints are detected, their descriptors computed
and matched with the nearest keyframe. This provides
the points in the image and their associated 3D points,
so the pose of the camera can be found using a PnP ap-
proach. Our method has been evaluated on the KITTI
dataset and gives precise results for the localization of
the vehicle.
Our future work will focus on the problem of
robustness, to improve the results when changes occur
between the acquisition of the map and the localization
step. These changes can be due to lighting changes,
seasonal changes and/or the appearance or disappearance
of objects (cars, pedestrians, etc.). This could be
achieved, for example, by fusing multiple features.
ACKNOWLEDGEMENTS
This work was supported by the VIATIC project. VI-
ATIC has been funded by the French Armaments
Procurement Agency (DGA) and managed by the
French National Research Agency (ANR) under the
ASTRID-MATURATION 2014 call.
REFERENCES
Alahi, A., Ortiz, R., and Vandergheynst, P. (2012). FREAK:
Fast retina keypoint. In IEEE Conference on Com-
puter Vision and Pattern Recognition (CVPR), Provi-
dence, Rhode Island, USA.
Bailey, T. and Durrant-Whyte, H. (2006). Simultaneous lo-
calization and mapping (slam): Part ii. Robotics and
Automation Magazine, 13(3):108–117.
Castle, R. and Murray, D. (2009). Object recognition
and localization while tracking and mapping. In
IEEE/ACM International Symposium on Mixed and
Augmented Reality (ISMAR), pages 179–180, Or-
lando, USA.
Comport, A., Malis, E., and Rives, P. (2010). Real-time
quadrifocal visual odometry. International Journal of
Robotics Research (IJRR), 29(2-3):245–266.
Cummins, M. and Newman, P. (2008). Fab-map: proba-
bilistic localization and mapping in the space of ap-
pearance. International Journal of Robotics Research
(IJRR), 27(6):647–665.
Cummins, M. and Newman, P. (2010). Highly scalable
appearance-only slam fab-map 2.0. International
Journal of Robotics Research (IJRR).
Dissanayake, M., Newman, P., Clark, S., Durrant-Whyte,
H., and Csorba, M. (2001). A solution to the simul-
taneous localization and map building (slam) prob-
lem. IEEE Transactions on Robotics and Automation,
17:229–241.
Dollar, P., Wojek, C., Schiele, B., and Perona, P. (2012).
Pedestrian detection: An evaluation of the state of the
art. IEEE Transactions on Pattern Analysis and Ma-
chine Intelligence (PAMI), 34(4):743 – 761.
Durrant-Whyte, H. and Bailey, T. (2006). Simultaneous lo-
calization and mapping: Part i. Robotics and Automa-
tion Magazine, 13(2):99–110. ISSN: 1070-9932.
Fischler, M. and Bolles, R. (1981). Random sample consen-
sus : A paradigm for model fitting with applications to
image analysis and automated cartography. In Com-
munications of the ACM, volume 24, pages 381–395.
Geiger, A., Lenz, P., Stiller, C., and Urtasun, R. (2013).
Vision meets robotics: The kitti dataset. International
Journal of Robotics Research (IJRR).
Geiger, A., Lenz, P., and Urtasun, R. (2012). Are we ready
for autonomous driving ? the kitti vision benchmark
suite. In IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), Providence, Rhode Is-
land, USA.
Hartley, R. (1997). In defense of the eight-point algorithm.
IEEE Transactions on Pattern Analysis and Machine
Intelligence (PAMI), 19(6):580–593.
Ho, K. and Newman, P. (2007). Detecting loop closure with
scene sequences. International Journal of Computer
Vision (IJCV), 74(3):261–286.
Kalman, R. (1960). A new approach to linear filtering
and prediction problems. Transactions of the ASME
- Journal of Basic Engineering, 82(Series D):35–45.
Klein, G. and Murray, D. (2007). Parallel tracking and map-
ping for small ar workspaces. In IEEE/ACM Interna-
tional Symposium on Mixed and Augmented Reality
(ISMAR), Nara, Japan.
Kozak, K., Pohl, J., Birk, W., Greenberg, J., Artz, B., Blom-
mer, M., Cathey, L., and Curry, R. (2006). Evalua-
tion of lane departure warnings for drowsy drivers. In
Human Factors and ergonomics society, pages 2400–
2404.
Liu, Y., Lin, K., and Chen, Y. (2008). Birds-eye view vision
system for vehicle surrounding monitoring. Robot Vi-
sion, 4931:207–218.
Maddern, W., Milford, M., and Wyeth, G. (2012). Cat-slam:
probabilistic localisation and mapping using a contin-
uous appearance-based trajectory. International Jour-
nal of Robotics Research (IJRR), 31(4):429–451.
Maimone, M., Cheng, Y., and Matthies, L. (2007). Two
years of visual odometry on the mars exploration
rovers. Journal of Field Robotics (JFR), Special issue
on Space Robotics, 24:169–186.
Mei, C., Sibley, G., Cummins, M., Newman, P., and Reid,
I. (2011). Rslam: A system for large-scale mapping
in constant-time using stereo. International Journal of
Computer Vision (IJCV), 94(2):198–214.
Munguia, R. and Grau, A. (2007). Monocular slam for vi-
sual odometry. In IEEE International Symposium on
Intelligent Signal Processing, pages 1–6.
Nister, D. (2004). An efficient solution to the five-point
relative pose problem. IEEE Transactions on Pattern
Analysis and Machine Intelligence (PAMI), 26:756–
770.
Nister, D., Naroditsky, O., and Bergen, J. (2006). Visual
odometry for ground vehicle applications. Journal of
Field Robotics (JFR), 23.
Rockl, M., Gacnik, J., and Schomerus, J. (2008). Integra-
tion of car-2-car communication as a virtual sensor
in automotive sensor fusion for advanced driver as-
sistance systems. In FISITA 2008 World Automotive
Congress.
SAE (2015). SAE International. http://www.sae.org/.