Towards Fast and Automatic Map Initialization for Monocular SLAM Systems

Blake Troutman (https://orcid.org/0000-0002-3809-3180) and Mihran Tuceryan (https://orcid.org/0000-0003-3828-6123)

Department of Computer and Information Science, Indiana University - Purdue University Indianapolis, Indianapolis, U.S.A.
Keywords:
Simultaneous Localization and Mapping (SLAM), Structure From Motion, Map Initialization, Model
Selection, Monocular Vision, Stereo Vision.
Abstract:
Simultaneous localization and mapping (SLAM) is a widely adopted approach for estimating the pose of a
sensor with 6 degrees of freedom. SLAM works by using sensor measurements to initialize and build a virtual
map of the environment, while simultaneously matching succeeding sensor measurements to entries in the
map to perform robust pose estimation of the sensor on each measurement cycle. Markerless, single-camera
systems that utilize SLAM usually involve initializing the map by applying one of a few structure-from-motion
approaches to two frames taken by the system at different points in time. However, knowing when the feature
matches between two frames will yield enough disparity, parallax, and/or structure for a good initialization
to take place remains an open problem. To make this determination, we train a number of logistic regression
models on summarized correspondence data for 927 stereo image pairs. Our results show that these models
classify with significantly higher precision than the current state-of-the-art approach in addition to remaining
computationally inexpensive.
1 INTRODUCTION
Simultaneous localization and mapping (SLAM) has
become a popular approach for estimating the pose
of a sensor with respect to its environment in real
time, while simultaneously building and improving
the map. The SLAM process begins by using mea-
surements from a sensor to initialize a virtual map
of the observed environment. After the map is ini-
tialized, SLAM systems use the virtual map in cor-
respondence with the sensor measurements to repeat-
edly estimate the pose of the sensor at regular inter-
vals. This pose can either be used in conjunction with sensor measurements to continually increase the detail of the virtual map, or it can be forwarded to an external system that needs to know the sensor's current pose. This process of pose estimation is necessary for both virtual reality (VR) and augmented reality (AR) systems, in which a user moves a device (often a head-mounted display or a viewfinder) and expects to see a virtual world or virtual objects transformed appropriately, in real time, based on the user's motion. The camera pose is also useful in robotics,
as it can be used as a method to localize a robot in its
environment.
This paper addresses the problem of automatically
determining the best pair of image frames to be used
in initializing the visual SLAM process. The means
of map initialization depends on the SLAM system’s
sensor configuration and whether or not markers are
used in the environment. SLAM systems that uti-
lize depth-sensing cameras or calibrated stereo cam-
era systems can initialize their virtual maps in a
markerless environment relatively quickly and easily
compared to markerless monocular (single-camera)
SLAM systems. Markerless monocular SLAM sys-
tems typically initialize their maps by sampling two
image frames at different points in time and matching
salient image features between the two images to de-
velop a set of point correspondences. After this, these
correspondences can be used to estimate the funda-
mental matrix, the essential matrix, or a homography
between the camera poses of these two images (Hart-
ley and Zisserman, 2004). These matrices can be used
to extract the camera pose of the second frame with
respect to the first, and then the correspondences can
be used with this pose estimate to triangulate the first
set of 3D map points.
Many of the steps of this monocular map initial-
ization process are well known; however, fast auto-
matic selection of the image pairs to use for the ini-
tialization process still remains an open problem. If
the image pairs do not have a sufficient translation
component in the camera movement or if the corre-
spondences are not sufficiently spread across the im-
ages, then the pose estimate is likely to be inaccurate
and the map initialization will cause the SLAM sys-
tem to immediately fail. This work presents a fast
approach for automatically determining if an image
pair is likely to result in an accurate reconstruction
by extracting summary statistics from the collection of correspondences and feeding them to trained logistic regression models.
1.1 Contribution
Given the problem of selecting appropriate im-
age pairs from a video sequence for visual SLAM
map initialization, previous works are too slow for
resource-limited platforms, lack robustness, only
work for non-planar scenes, or struggle in their
decision-making when observing pure rotations. This
work contributes a solution that attempts to address
all of these issues. Namely, this work contributes an
algorithm for selecting good image pairs from a video
sequence for visual SLAM map initialization that:
- is fast enough to be practical for real time use on resource-limited platforms,
- provides decisions for both planar and non-planar scenes,
- accurately rejects image pairs that demonstrate pure rotational movement,
- and yields higher precision than the current state-of-the-art approach.
2 RELATED WORK
There are a number of visual SLAM systems for real
time use that have been developed over the past sev-
eral years. MonoSLAM (Davison et al., 2007) is the
first successful application of purely visual monocu-
lar SLAM, but its map initialization is aided by the
use of a known fiducial marker. During initialization,
markers can be used to act as a pre-existing map with
known 3D point locations. This enables the system to
accurately estimate the camera pose each frame be-
fore the system has sufficient baseline length to trian-
gulate new points into the map. Markers also enable
the system to register an accurate scale for their map
points during initialization, as demonstrated in (Xiao
et al., 2017). The use of markers is further expanded
upon in systems such as (Korah et al., 2011), (Maidi
et al., 2011), (Ufkes and Fiala, 2013), (Kobayashi
et al., 2013), and (Arth et al., 2015) which utilize fidu-
cials for pose estimation throughout the entire runtime
of the system, rather than just using them for map ini-
tialization.
Though the use of markers often makes camera
pose estimation more reliable and straightforward, it
also makes pose estimation systems less versatile, as
the systems will break down and have no reference
points to measure if the fiducial markers fall out of
frame. (Berenguer et al., 2017) proposes a SLAM
system that heavily utilizes global image descriptors
to perform localization, which frees the system from
any use of fiducial markers. However, the global im-
age descriptors used in the aforementioned work de-
pend on omnidirectional imaging and can take more
than half a second to perform localization, mak-
ing them unsuitable for systems that require rapid,
real time tracking (such as in AR/VR systems or in
autonomous automobiles). Faster and more robust
pose estimation systems exclude the need for markers
by tracking somewhat arbitrary patches of the image
frame, without any predefined knowledge of what the
patches look like or where they should be located in
3D space (Klein and Murray, 2007), (Klein and Mur-
ray, 2009), (Sun et al., 2015), (Mur-Artal et al., 2015),
(Fujimoto et al., 2016), and (Qin et al., 2018).
Without knowledge of the 3D locations of the
tracked points, structure-from-motion (SfM) algo-
rithms are often used to aid in triangulating virtual 3D
locations for these points. For example, the PTAM
system (Klein and Murray, 2007), (Klein and Mur-
ray, 2009) is a real time visual SLAM system that
initializes its map by having the user manually se-
lect two frames (from different points in time), match-
ing FAST features (Rosten and Drummond, 2006)
between the two frames, estimating the pose differ-
ence between the frames with the 5-point algorithm
(Stewenius et al., 2006) or with a homography de-
composition (Faugeras and Lustman, 1988), and fi-
nally using the estimated pose to triangulate the point
matches into a virtual 3D map. (Sun et al., 2015)
adopts the PTAM framework for its map initialization
as well. The clear drawback of this approach is that
it is not automatic; it requires the user to select the
frames that should be used for initialization. (Huang
et al., 2017) presents an approach for map initializa-
tion that can initialize the map in a single frame with-
out user intervention, but it requires that the system
is operating inside of a typical indoor room and can
only initialize map points that coincide with the walls
of the room.
ORB-SLAM (Mur-Artal et al., 2015) also utilizes
SfM techniques to initialize its map, but it addition-
ally implements an automated approach for select-
ing the frames to use for initialization while main-
taining comparable versatility to PTAM. ORB-SLAM
achieves this automatic frame selection by estimat-
ing both a homography and a fundamental matrix for
the current frame against the initial frame using the
normalized DLT algorithm and the 8-point algorithm,
respectively, both of which are explained in (Hart-
ley and Zisserman, 2004). It then uses a heuristic
to assess the success of each model estimate; if the
heuristic indicates that neither model estimate is ac-
curate, then the system repeats this process on subse-
quent frames until at least one of the models appears
to provide an accurate pose estimate. This approach,
however, is computationally expensive and unsuitable
for resource-limited platforms such as mobile phones.
There are modified versions of ORB-SLAM, some of
which aim to revise the map initialization process (Fu-
jimoto et al., 2016), but stray from monocular SLAM
by requiring RGB-D data.
Given the high computational cost of ORB-
SLAM’s approach to automatic frame selection, it
would be desirable for a SLAM system to include
some algorithm that can select an appropriate pair
of frames for its SfM-based map initialization much
more quickly. There are relatively few works that
have presented such an algorithm. A work by Tomono
(Tomono, 2005), which predates many of the visual
SLAM systems in the literature, presents an approach
for detecting degeneracy in a potential fundamental
matrix estimate before a system would need to per-
form the 8-point algorithm. However, this approach applies only to fundamental matrix estimates; it does not distinguish between degeneracy caused by insufficient translation and degeneracy caused by planar scenes (which could still be useful for a homography estimate); and its threshold parameter depends on the number of correspondences, making the approach less robust for real use. VINS-Mono
(Qin et al., 2018), a visual-inertial SLAM system, uti-
lizes a simpler frame selection approach in which the
system chooses to attempt initialization with a pair of
frames if it is able to track more than 30 features be-
tween them with each correspondence yielding a dis-
parity of at least 20 pixels. A later SLAM system
(Butt et al., 2020) implements a similar approach to
that used in VINS-Mono, but utilizes the median and
standard deviation of correspondence pixel disparities
to additionally select whether a homography estimate
is appropriate or if a fundamental matrix estimate is
appropriate. However, both of these systems still depend on an evaluation of the estimated model accuracy to reject erroneous initial estimates, and they also give incorrect decisions under pure rotational movements.
3 METHODS
3.1 Structure-from-Motion (SfM)
Approaches
The goal of stereo SfM methods is both to derive the transformation between two camera poses and to project image data back into 3-space using the derived pose transformation. The cameras are modeled using
a pinhole camera model,
x = K[R|t]X (1)
where x is the homogeneous 2D projection of the homogeneous 3D point X, R is the 3×3 rotation matrix of the viewing camera, t is the column vector of the viewing camera's translation, and K is the camera intrinsics matrix, defined by
K = \begin{bmatrix} f & s & c_x \\ 0 & f & c_y \\ 0 & 0 & 1 \end{bmatrix} \quad (2)

where f is the camera's focal length, s is the skew (zero in our implementation), and c_x and c_y are the image center coordinates. It is assumed that the
first camera is at the origin with no rotation, [I|0],
so [R|t] represents the extrinsic camera parameters of
the second viewing camera, i.e., the transformation of
the pose from the first camera to the second camera.
These extrinsic parameters are derived by estimating
the fundamental matrix, the essential matrix, or a ho-
mography between the two views.
Given two views of the same scene from differ-
ent camera viewpoints, epipolar geometry can be used
to determine the transformation from the first cam-
era’s pose to the second camera’s pose. Epipolar ge-
ometry refers to the property of stereo image projec-
tion in which 3D points observed in one view can
be constrained to unique epipolar planes in 3-space,
each of which intersects with the baseline of the cam-
eras. These planes can then be projected into the other
view as epipolar lines. Each of these lines intersects
with the corresponding projection of the respective
3D point that was observed in the first view. The fun-
damental matrix of two views is a 3×3 matrix which
encodes this constraint for all points projected in the
two views. Specifically, this constraint takes the form
of
x'^\top F x = 0 \quad (3)

where F is the 3×3 fundamental matrix of the two views, and x and x' are homogeneous 3×1 column vectors of the projections of a 3D point in the first view and second view, respectively.
If the intrinsic camera parameters are known for
both views, then the fundamental matrix can be trans-
formed into the corresponding essential matrix with
E = K'^\top F K \quad (4)

where K and K' are the camera intrinsic matrices for the first view and second view, respectively.
The extrinsic camera parameters [R|t] of the sec-
ond view are extracted from the essential matrix by
using the singular value decomposition of E and a
single correspondence pair to test for cheirality (the
constraint that dictates that points viewed by a cam-
era must fall in front of the camera). This process is
explained in detail in (Hartley and Zisserman, 2004).
The fundamental matrix is solved with eight or
more image point matches by re-formulating the con-
straint in Equation 3 into a set of linear equations.
This approach is known as the 8-point algorithm. The
essential matrix is solved using a more complex ap-
proach known as the 5-point algorithm, described in
(Nistér, 2003).
Additionally, the transformation from one pose
to another can be described with a homography, H,
which is a 3×3 matrix that represents the projective
transformation from one image point to its projection
in another image. Namely, a homography is a matrix
for a pair of views such that
Hx = x' \quad (5)

where x and x' are homogeneous 3×1 vectors of corresponding image points in the first view and second view,
respectively. A homography between two views is es-
timated with four or more image matches using the di-
rect linear transformation method, explained in (Hart-
ley and Zisserman, 2004). The extrinsic camera pa-
rameters are then extracted from the homography us-
ing the approach described in (Faugeras and Lustman,
1988).
Visual SLAM systems often estimate either the
fundamental matrix, the essential matrix, or a homog-
raphy between two frames of the video feed to initial-
ize their maps. The 8-point algorithm can accurately
estimate the fundamental matrix in image pairs where
there is complex structure in the corresponding fea-
ture points; however, the 8-point algorithm struggles
when the projected points are coplanar in 3-space. In
contrast, homographies can be estimated with copla-
nar points to yield highly accurate poses, but their ac-
curacy degrades if the points are not coplanar. The
5-point algorithm that estimates the essential matrix
can be accurate for both coplanar points and non-
coplanar points, but it often yields poses that are less
precise than those that are achieved with the 8-point
algorithm or 4-point DLT algorithm. To utilize these
SfM approaches, we use the corresponding OpenCV
(Bradski, 2000) implementations for each algorithm.
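To make this pipeline concrete, the sketch below shows how each estimate can be obtained with OpenCV, assuming matched points are stored as Nx2 float32 arrays pts0 and pts1 and the intrinsics matrix K is known. The solver flags and RANSAC usage are illustrative assumptions, not the exact configuration used in our experiments.

```python
import cv2
import numpy as np

def estimate_two_view_models(pts0, pts1, K):
    """Estimate F, E, and H for a pair of views and recover [R|t] from E."""
    # Fundamental matrix via the 8-point algorithm.
    F, _ = cv2.findFundamentalMat(pts0, pts1, cv2.FM_8POINT)

    # Essential matrix via the 5-point algorithm (inside RANSAC).
    E, _ = cv2.findEssentialMat(pts0, pts1, K, method=cv2.RANSAC)

    # Homography via the 4-point DLT (inside RANSAC).
    H, _ = cv2.findHomography(pts0, pts1, cv2.RANSAC)

    # Extract [R|t] of the second camera from E; recoverPose applies
    # the cheirality test to choose among the four decompositions.
    _, R, t, _ = cv2.recoverPose(E, pts0, pts1, K)
    return F, E, H, R, t

def triangulate(pts0, pts1, K, R, t):
    """Triangulate initial map points with camera 1 fixed at [I|0]."""
    P0 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])  # first camera
    P1 = K @ np.hstack([R, t])                         # second camera
    X_h = cv2.triangulatePoints(P0, P1, pts0.T, pts1.T)
    return (X_h[:3] / X_h[3]).T  # de-homogenize to Nx3 points
```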
3.2 Model Configuration
This approach utilizes four logistic regression models
for decision-making. Each model takes in correspon-
dence data as input and answers a different question
regarding the data’s likelihood of resulting in a good
map initialization. Specifically, the models aim to an-
swer the following questions:
- Model F: Would estimating the fundamental matrix (via the 8-point algorithm) be a good approach for initializing the SLAM map with this correspondence data?
- Model E: Would estimating the essential matrix (via the 5-point algorithm) be a good approach for initializing the SLAM map with this correspondence data?
- Model H: Would estimating a homography (via the 4-point DLT algorithm) be a good approach for initializing the SLAM map with this correspondence data?
- Model R: Is this correspondence data the result of a pure rotation (no translation in camera movement)?
Each model provides an output of 1 if the answer to its respective question is "yes" and 0 if the answer is "no." In practice, a SLAM system could uti-
lize these models for map initialization in real time
by continuously evaluating each frame against a ref-
erence frame taken at a previous point in time. For
example, the first frame that is registered upon the
system’s activation could be considered the reference
frame, and then, each frame afterward could be paired
with this reference frame and the pair could be tested
against these models. When at least one of the models
predicts that the two frames in question are sufficient
for SfM estimation, the corresponding SfM approach
is used to initialize the map with the related corre-
spondences.
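A minimal sketch of such a loop follows, assuming trained classifiers with scikit-learn-style predict() methods and hypothetical helpers match_features and compute_features that produce correspondences and the 23-feature vector described below.

```python
def try_initialize(reference_frame, frame_stream, models,
                   match_features, compute_features):
    """Evaluate each incoming frame against a fixed reference frame and
    report the first SfM approach predicted to succeed.

    models: dict of trained classifiers ('F', 'E', 'H') exposing
    predict(). match_features and compute_features are assumed
    helpers, not part of the original system.
    """
    for frame in frame_stream:
        corrs = match_features(reference_frame, frame)
        x = compute_features(corrs).reshape(1, -1)
        for name in ('F', 'E', 'H'):   # one model per SfM approach
            if models[name].predict(x)[0] == 1:
                return name, corrs     # initialize with this approach
    return None, None                  # no suitable pair found yet
```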
Additionally, all four models take the same 23 fea-
tures as inputs. The features consist of data pertaining
to the correspondences between the image pairs be-
ing evaluated. A correspondence is defined as a 2D-to-2D keypoint match between two images and consists of two image points, defined as p_0 = (x_0, y_0) and p_1 = (x_1, y_1), where p_0 is a point in the first image and p_1 is its matching location in the second image. We define the pixel disparity of a correspondence as the Euclidean distance between its p_0 and p_1 values.
Specifically, the features used include the number of correspondences; the mean and standard deviation of the pixel disparities; the minimum and maximum values of each component of the correspondences (x_0, x_1, y_0, and y_1); and the range of each component of the correspondences. Additionally, we include eight features, each of which represents the proportion of correspondences that approximately yield a specific angle. We calculate a correspondence's angle, θ, as
\theta = \arctan\left(\frac{y_1 - y_0}{x_1 - x_0}\right) \quad (6)

where (x_0, y_0) is the point in the first image and (x_1, y_1) is the corresponding point in the second image. The angles included in these eight features simply cover 8 evenly distributed angles between 0° and 360°.
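The full feature vector could be assembled as in the sketch below. The grouping of the eight angle features into 45° bins and the use of arctan2 to resolve the full 0°–360° range are our reading of the text, not a specification from the original implementation.

```python
import numpy as np

def correspondence_features(p0, p1):
    """Build the 23-feature vector from correspondences.

    p0, p1: Nx2 arrays of matched points in the first and second image.
    """
    d = np.linalg.norm(p1 - p0, axis=1)               # pixel disparities
    comps = [p0[:, 0], p1[:, 0], p0[:, 1], p1[:, 1]]  # x0, x1, y0, y1

    feats = [len(p0), d.mean(), d.std()]
    feats += [f(c) for c in comps for f in (np.min, np.max)]  # 8 min/max
    feats += [np.ptp(c) for c in comps]                       # 4 ranges

    # Proportion of correspondences near each of 8 evenly spaced angles.
    theta = np.degrees(np.arctan2(p1[:, 1] - p0[:, 1],
                                  p1[:, 0] - p0[:, 0])) % 360.0
    bins = np.round(theta / 45.0).astype(int) % 8
    feats += [np.mean(bins == k) for k in range(8)]
    return np.array(feats)  # 1 + 2 + 8 + 4 + 8 = 23 features
```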
3.3 Model Training
Models are trained on data that is derived from
the fr3/structure_texture_far RGB-D dataset from the
Technical University of Munich (Sturm et al., 2012).
This dataset consists of an RGB video of textured pa-
pers on the ground with a poster sitting upright behind
the papers, bent in a zig-zag fashion to produce struc-
ture. Throughout the video, the camera demonstrates
both rotational and translational movements, mostly
moving to the left and rotating to keep the objects in
frame. For model training, the video is split up into
batches of 90-frame sequences. For each batch, ev-
ery frame is paired with the first frame in the batch
to form a stereo image pair. We use the KLT optical
flow algorithm (Shi and Tomasi, 1994), (Lucas and
Kanade, 1981) to track approximately 150 to 300 features in each frame. This optical flow approach allows
the resulting correspondences to be free of significant
outliers. After collecting these correspondences on
each stereo image pair, we estimate the pose of the
second camera with respect to the first using the 8-
point algorithm, 5-point algorithm, and 4-point DLT
algorithm. This provides us with three different pose
estimates, each of which is used to judge the success
of the respective approach on the stereo image pair
by using the ground truth pose data provided in the
dataset.
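A sketch of this correspondence-generation step is shown below, assuming OpenCV's Shi-Tomasi detector and pyramidal Lucas-Kanade tracker; the parameter values are illustrative defaults rather than the settings used to build our training set.

```python
import cv2

def track_pair(img0, img1, max_feats=300):
    """Generate a correspondence set for a stereo image pair via KLT.

    Corners detected in the first frame are tracked into the second
    frame with pyramidal Lucas-Kanade optical flow.
    """
    g0 = cv2.cvtColor(img0, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
    p0 = cv2.goodFeaturesToTrack(g0, maxCorners=max_feats,
                                 qualityLevel=0.01, minDistance=7)
    p1, status, _ = cv2.calcOpticalFlowPyrLK(g0, g1, p0, None)
    ok = status.ravel() == 1  # keep only successfully tracked points
    return p0.reshape(-1, 2)[ok], p1.reshape(-1, 2)[ok]
```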
We find that two error metrics are good descriptors
of successful pose estimates. These metrics include
the normalized translational chordal distance and the
median scaled reconstruction error. The normalized
translational chordal distance, E_t, is defined by

E_t = \| t_g - t_e \|_F \quad (7)

where t_g and t_e are the normalized translation vectors of the ground truth pose and the estimated pose, respectively, and E_t is the Frobenius norm of their difference. The median scaled reconstruction error, E_R, is defined by
E_R = \frac{\sum_{i=1}^{n} \| P_i^g - s P_i^e \|_F}{n} \quad (8)

where P^g is a 3D ground truth point, P^e is the corresponding 3D point that was triangulated from the estimated pose, and n is the number of correspondences. Additionally, this metric includes a scale factor, s, which is defined by

s = \frac{\| P_0^g - P_1^g \|_F}{\| P_0^e - P_1^e \|_F} \quad (9)
where s can be described as the ratio of the dis-
tance between the first two points of the ground truth
to the distance between the corresponding triangu-
lated points. This scale factor allows us to correct for
scale ambiguity in the map generation.
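The two metrics might be computed as in the following sketch; Eq. (8) is implemented exactly as printed (a mean of the scaled per-point errors), and the inputs are assumed to be already-matched ground truth and triangulated points.

```python
import numpy as np

def pose_errors(t_gt, t_est, P_gt, P_est):
    """Error metrics of Eqs. (7)-(9) for one stereo image pair.

    t_gt, t_est: translation 3-vectors of the ground truth and
    estimated poses. P_gt, P_est: Nx3 arrays of ground truth points
    and the corresponding triangulated points, in matching order.
    """
    # Eq. (7): chordal distance between the normalized translations.
    E_t = np.linalg.norm(t_gt / np.linalg.norm(t_gt)
                         - t_est / np.linalg.norm(t_est))

    # Eq. (9): scale factor from the first two point pairs.
    s = (np.linalg.norm(P_gt[0] - P_gt[1])
         / np.linalg.norm(P_est[0] - P_est[1]))

    # Eq. (8): mean of the scaled per-point reconstruction errors.
    E_R = np.mean(np.linalg.norm(P_gt - s * P_est, axis=1))
    return E_t, E_R
```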
These error metrics are used to aid in generating
labels for the training data. As the baseline between
the cameras increases in these tests, both of these er-
ror metrics show a rapidly decreasing error followed
by a plateau at some baseline length on most of the
batches, as shown in Figure 1. Using this trend,
thresholds for each error metric are selected empir-
ically. We determine that a pose estimate can be considered suitable for SLAM initialization if its normalized translational chordal distance is less than 0.8 and its median scaled reconstruction error is less than 1. This results in the following labelling criteria:
\text{label}(I) = \begin{cases} 1 & E_t < 0.8 \text{ and } E_R < 1 \\ 0 & \text{otherwise} \end{cases} \quad (10)

where I is a set of correspondences for a stereo image pair.
For Model R, the classifier that predicts when cor-
respondences are the result of a pure rotation, we sim-
ply use the ground truth data to determine the baseline
length and consider the label to be 0 when the baseline
length exceeds 0.05 and 1 otherwise.
In total, the training data consists of the correspon-
dence data from 927 different stereo image pairs.
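Putting the pieces together, labelling and model training could look like the sketch below. The scikit-learn LogisticRegression defaults are an assumption; the paper does not specify the solver or regularization used.

```python
from sklearn.linear_model import LogisticRegression

def sfm_label(E_t, E_R):
    """Labelling criterion of Eq. (10)."""
    return 1 if (E_t < 0.8 and E_R < 1.0) else 0

def rotation_label(baseline_length):
    """Model R label: 1 for a (near-)pure rotation, 0 otherwise."""
    return 1 if baseline_length <= 0.05 else 0

def train_models(X, labels):
    """Fit one binary classifier per question.

    X: (n_pairs, 23) feature matrix; labels: dict mapping 'F', 'E',
    'H', and 'R' to 0/1 label arrays for the 927 training pairs.
    """
    return {name: LogisticRegression(max_iter=1000).fit(X, y)
            for name, y in labels.items()}
```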
(a) Median Scaled Reconstruction Error. (b) Normalized Translational Chordal Distance.
Figure 1: Error metrics for pose estimates of SfM approaches on a typical batch of training data generated from fr3/structure_texture_far. (a) is the median scaled reconstruction error as the baseline increases throughout the batch. (b) is the normalized translational chordal distance as the baseline increases throughout the batch. As the ground truth baseline increases, both error metrics show a rapid decrease followed approximately by a plateau on each SfM approach. Correspondence sets are considered sufficient for SLAM initialization with a given SfM approach if both errors fall on the plateaus. Note that the median scaled reconstruction error in (a) is unstable at low baselines and may have an error upwards of 10^3 (which is cut off here to show the more general pattern).
4 EXPERIMENTAL RESULTS
To evaluate model performance, we construct a test
set from the fr3/structure_texture_near dataset, also provided by (Sturm et al., 2012). This dataset is similar to the fr3/structure_texture_far dataset, as it is a
video of the same scene; however, it is a much more
challenging dataset for SfM as the camera is closer to
the objects. This causes keypoints to move off-frame
more quickly and also leads to a far less homogeneous
distribution of keypoints in many parts of the video.
The test set is derived in the same fashion as the train-
ing set, where the video is split into several batches
and each frame is paired with the first frame of its re-
spective batch. Due to keypoints moving off-frame
more quickly in this set, we only use batches of 30
frames instead of 90. The resulting test set consists of
correspondence data for 1,062 different stereo image
pairs.
Each classifier is able to run inference on the en-
tire test set in approximately 1 millisecond, which
demonstrates the high speed of this approach and
its real time viability, even for resource-limited plat-
forms. The accuracy, precision, recall, and F1 score for each of the classifiers are shown in Table 2.
For comparison, we also evaluate the current state-of-the-art approach, which classifies by evaluating only the mean/median and standard deviation of pixel disparities in the correspondences. In this comparison, we utilize the thresholds
used in (Butt et al., 2020), where the correspondences
are considered suitable for homography estimation if
the median disparity is above 50 pixels and are con-
sidered suitable for fundamental matrix estimation if,
additionally, the standard deviation of the disparities
is above 15 pixels. The results for the state-of-the-art
approach are shown in Table 1.
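For reference, this baseline classifier reduces to a few lines; the 50-pixel and 15-pixel thresholds are those quoted above from (Butt et al., 2020).

```python
import numpy as np

def disparity_heuristic(p0, p1):
    """State-of-the-art baseline decision for a correspondence set.

    Returns (homography_ok, fundamental_ok): median disparity above
    50 px suggests a homography estimate, and, additionally, a
    disparity standard deviation above 15 px suggests a fundamental
    matrix estimate.
    """
    d = np.linalg.norm(p1 - p0, axis=1)  # pixel disparities
    homography_ok = np.median(d) > 50.0
    fundamental_ok = homography_ok and d.std() > 15.0
    return homography_ok, fundamental_ok
```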
The precision of the classifiers is the most im-
portant metric here, as false positives would lead to
a SLAM system attempting to initialize under poor conditions, which may cause the system to waste a large amount of computation and/or initialize with a corrupted map. The logistic regression models
for the fundamental matrix approach and homography
approach both yield higher precision than the current
state-of-the-art classification algorithm. Specifically,
the precision for the fundamental matrix approach in-
creased from 0.0811 in the state-of-the-art approach
to 0.3639 in the logistic regression model, and the pre-
cision for the homography approach increased from
0.2852 in the state-of-the-art approach to 0.3725 in
the logistic regression model. The logistic regres-
sion model for the homography approach has a signif-
icantly worse recall than the state-of-the-art approach
(dropping from 0.4601 to 0.1166), though this is less
crucial as it indicates a higher proportion of false neg-
atives rather than false positives.
Additionally, we evaluate the models’ perfor-
mance after filtering the samples with the pure rota-
tion classifier (Model R). When a sample is detected
to be the result of a pure rotation, its label is automat-
ically set to 0, i.e., unsuitable for map initialization.
Otherwise, the sample is passed on to the other mod-
els for prediction. The results of this approach are
displayed in Table 3.
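This early-rejection cascade can be expressed as a small wrapper around the trained models, as in the sketch below; models is assumed to be a dict of classifiers keyed by model name.

```python
def cascaded_predict(models, x):
    """Apply Model R as a preliminary gate before Models F, E, and H.

    x: a (1, 23) feature vector. If Model R predicts a pure rotation,
    every SfM decision is forced to 0 (unsuitable for initialization);
    otherwise each model predicts normally.
    """
    if models['R'].predict(x)[0] == 1:  # pure rotation detected
        return {'F': 0, 'E': 0, 'H': 0}
    return {name: int(models[name].predict(x)[0])
            for name in ('F', 'E', 'H')}
```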
Using the rotation classifier as a preliminary
model to immediately reject samples that do not yield
sufficient baseline increases the precision of Model F
from 0.3639 to 0.4370 and increases the precision of
Model E from 0.4290 to 0.5041. However, it decreases the precision of Model H from 0.3725 to 0.3600. In practice, this approach of using the rotation classifier for early rejection does not need to be used for all models; it can be applied to only
Model F and Model E if those are the only models
that are improved by this alteration.
5 CONCLUSIONS AND FUTURE
WORK
In this paper, we have presented an approach for
quickly predicting the success of SfM techniques
on an arbitrary set of image point correspondences
through the use of logistic regression models. The
models used in this paper yield significantly higher
precision than the current state-of-the-art approach
without having to perform the computationally-
expensive SfM techniques before making their deter-
minations. The model features used are also very
quick to calculate, making this approach viable for
real time performance on resource-limited platforms.
This work may further be validated by implement-
ing it in a real time visual SLAM system and evalu-
ating the system’s stability. Additionally, the models
may be improved by labelling the training data us-
ing a more robust evaluation criterion. Rather than
using empirical thresholds for the normalized transla-
tional chordal distance and median scaled reconstruc-
tion error, perhaps ORB-SLAM’s (Mur-Artal et al.,
2015) heuristic approach could provide a more ac-
curate scheme for labelling the training/testing data.
The models’ performance may also be improved if
they are expanded into neural networks and trained
on more scenes.
Table 1: Performance of the current state-of-the-art classification approach on the fr3/structure_texture_near test data.

Classification       Accuracy  Precision  Recall  F1 Score
Fundamental Matrix   0.7589    0.0811     0.0133  0.0229
Homography           0.7401    0.2852     0.4601  0.3521
Table 2: Performance of each logistic regression model on the fr3/structure_texture_near test data.

Model  Accuracy  Precision  Recall  F1 Score
F      0.6959    0.3639     0.5822  0.4479
E      0.7326    0.4290     0.6453  0.5154
H      0.8343    0.3725     0.1166  0.1776
R      0.8380    0.6091     0.9302  0.7362
Table 3: Model performance on the fr3/structure_texture_near test data after using Model R as a preliminary classifier to aid in classification.

Model  Accuracy  Precision  Recall  F1 Score
F      0.7598    0.4370     0.4622  0.4492
E      0.7815    0.5041     0.5256  0.5146
H      0.8399    0.3600     0.0552  0.0957
REFERENCES
Arth, C., Pirchheim, C., Ventura, J., Schmalstieg, D.,
and Lepetit, V. (2015). Instant Outdoor Localiza-
tion and SLAM Initialization from 2.5D Maps. IEEE
Transactions on Visualization and Computer Graph-
ics, 21(11):1309–1318.
Berenguer, Y., Payá, L., Peidró, A., and Reinoso, O. (2017). SLAM algorithm by using global appearance of omnidirectional images. In Proceedings of the 14th International Conference on Informatics in Control, Automation and Robotics - Volume 2: ICINCO, pages 382–388. INSTICC, SciTePress.
Bradski, G. (2000). The OpenCV Library. Dr. Dobb’s Jour-
nal of Software Tools.
Butt, M. M., Zhang, H., Qiu, X. C., and Ge, B. (2020).
Monocular SLAM Initialization Using Epipolar and
Homography Model. 2020 5th International Confer-
ence on Control and Robotics Engineering, ICCRE
2020, pages 177–182.
Davison, A. J., Reid, I. D., Molton, N. D., and Stasse,
O. (2007). MonoSLAM: Real-Time Single Camera
SLAM. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 29(6):1052–1067.
Faugeras, O. and Lustman, F. (1988). Motion and Struc-
ture From Motion in a Piecewise Planar Environment.
International Journal of Pattern Recognition and Ar-
tificial Intelligence, 02.
Fujimoto, S., Hu, Z., Chapuis, R., and Aufrère, R. (2016). ORB SLAM map initialization improvement using depth.
2016 IEEE International Conference on Image Pro-
cessing (ICIP), pages 261–265.
Hartley, R. and Zisserman, A. (2004). Multiple View Geom-
etry in Computer Vision. Cambridge University Press,
2 edition.
Huang, J., Liu, R., Zhang, J., and Chen, S. (2017). Fast
initialization method for monocular slam based on in-
door model. In 2017 IEEE International Conference
on Robotics and Biomimetics (ROBIO), pages 2360–
2365.
Klein, G. and Murray, D. (2007). Parallel Tracking and
Mapping for Small AR Workspaces. 2007 6th IEEE
and ACM International Symposium on Mixed and
Augmented Reality, pages 225–234.
Klein, G. and Murray, D. (2009). Parallel tracking and map-
ping on a camera phone. 2009 8th IEEE International
Symposium on Mixed and Augmented Reality, pages
83–86.
Kobayashi, T., Kato, H., and Yanagihara, H. (2013). Novel
keypoint registration for fast and robust pose detec-
tion on mobile phones. Proceedings - 2nd IAPR
Asian Conference on Pattern Recognition, ACPR
2013, pages 266–271.
Korah, T., Wither, J., Tsai, Y. T., and Azuma, R. (2011).
Mobile augmented reality at the hollywood walk of
fame. 2011 IEEE Virtual Reality Conference, pages
183–186.
Lucas, B. D. and Kanade, T. (1981). An iterative image
registration technique with an application to stereo vi-
sion. In Proceedings of the 7th International Joint
Conference on Artificial Intelligence - Volume 2, IJ-
CAI’81, page 674–679, San Francisco, CA, USA.
Morgan Kaufmann Publishers Inc.
Maidi, M., Preda, M., and Le, V. H. (2011). Markerless
tracking for mobile augmented reality. 2011 IEEE In-
ternational Conference on Signal and Image Process-
ing Applications, ICSIPA 2011, pages 301–306.
Mur-Artal, R., Montiel, J. M., and Tardos, J. D. (2015).
ORB-SLAM: A Versatile and Accurate Monocular
SLAM System. IEEE Transactions on Robotics,
31(5):1147–1163.
Nistér, D. (2003). An Efficient Solution to the Five-Point
Relative Pose Problem. 2003 IEEE Computer Society
Conference on Computer Vision and Pattern Recogni-
tion, 2003. Proceedings., 2:II–195.
Qin, T., Li, P., and Shen, S. (2018). VINS-Mono: A Ro-
bust and Versatile Monocular Visual-Inertial State Es-
timator. IEEE Transactions on Robotics, 34(4):1004–
1020.
Rosten, E. and Drummond, T. (2006). Machine learning
for high-speed corner detection. Computer Vision – ECCV 2006, pages 430–443.
Shi, J. and Tomasi, C. (1994). Good Features to Track. 1994
Proceedings of IEEE Conference on Computer Vision
and Pattern Recognition.
Stewénius, H., Engels, C., and Nistér, D. (2006). Re-
cent Developments on Direct Relative Orientation. IS-
PRS Journal of Photogrammetry and Remote Sensing,
60:284–294.
Sturm, J., Engelhard, N., Endres, F., Burgard, W., and Cre-
mers, D. (2012). A benchmark for the evaluation of
rgb-d slam systems. In Proc. of the International Con-
ference on Intelligent Robot Systems (IROS).
Sun, L., Du, J., and Qin, W. (2015). Research on combi-
nation positioning based on natural features and gyro-
scopes for AR on mobile phones. Proceedings - 2015
International Conference on Virtual Reality and Visu-
alization, ICVRV 2015, pages 301–307.
Tomono, M. (2005). 3-D localization and mapping using
a single camera based on structure-from-motion with
automatic baseline selection. Proceedings - IEEE In-
ternational Conference on Robotics and Automation,
pages 3342–3347.
Ufkes, A. and Fiala, M. (2013). A markerless augmented
reality system for mobile devices. Proceedings - 2013
International Conference on Computer and Robot Vi-
sion, CRV 2013, pages 226–233.
Xiao, Z., Wang, X., Wang, J., and Wu, Z. (2017). Monocu-
lar ORB SLAM based on initialization by marker pose
estimation. 2017 IEEE International Conference on
Information and Automation, ICIA 2017, (July):678–
682.