Online Indexing Structure for Big Image Data used for 3D
Reconstruction
Konstantinos Makantasis¹, Yannis Katsaros², Anastasios Doulamis³ and Matthaios Bimpas³
¹ Technical University of Crete, Chania, Greece
² EXUS Software Ltd., London, U.K.
³ National Technical University of Athens, Athens, Greece
Keywords:
Feature Matching and Indexing, 3D Reconstruction, Image and Video Retrieval, Image-based Modeling.
Abstract:
One of the main characteristics of the Internet era is the free and online availability of extremely large collections
of images. Although the proliferation of millions of shared photos provides a unique opportunity for cultural
heritage e-documentation, the main difficulty is that Internet image datasets are unstructured. For this reason,
this paper describes a new image indexing scheme with application in 3D reconstruction. The presented
approach is able, on the one hand, to index images in a fast and accurate way and, on the other, to select
from an image dataset the most appropriate images for 3D reconstruction, thereby reducing reconstruction
computational time while keeping the same reconstruction performance.
1 INTRODUCTION
The Internet era is characterized by extremely large collections of images available over the web
that depict not only contemporary events but also historic incidents and cultural heritage assets.
These data are captured by individual users and are usually located on distributed and heterogeneous
databases. Although the proliferation of millions of shared photographs provides a unique opportunity
for cultural heritage e-documentation, which includes retrieval, filtering, indexing and finally
exploitation of visual information, there are limited technological tools and research methods
that meet this purpose.
The main difficulty in using Internet image collections lies in the fact that the stored image
content is unstructured. Simple text-based queries are inefficient for handling unstructured visual
content, since the textual descriptions of images may be quite different from what the images actually
depict. On the one hand, human-centric textual annotation of images is an arduous and inconsistent
task, due to the complexity of visual content and the subjective perception of humans in interpreting
it. On the other hand, auto-generated geo-location tags suffer from low precision, since
geo-information does not describe what is actually depicted.
Our research exploits unstructured Internet image collections stored on distributed multimedia
platforms to obtain e-documentation of cultural heritage objects through 3D reconstruction. The main
difficulty in this direction is that the retrieved dataset contains several outliers, that is,
images whose visual content is quite dissimilar to the requested cultural heritage object. The
existence of outliers deteriorates the performance and exponentially increases the computational
time of 3D reconstruction. While there exist 3D reconstruction algorithms (Wu et al., 2011;
Wu et al., 2012) that are robust against noisy data, their computational complexity increases
significantly with the size of the input, making direct implementation practically impossible in
a cost-effective manner. To make things worse, the volumes of image data stored over distributed
Web repositories are extremely large and varying, imposing high computational challenges on any
meta-algorithm that exploits these data for real-time application scenarios.
To address this difficulty, in this paper we propose an incremental structure scheme able to index
online, through the calculation of the visual distance, each new incoming image datum with respect
to the already indexed image volumes in a fast and accurate way. In this way, we are able to organize
retrieved image data online in a computationally efficient manner. The proposed online indexing
structure allows for an efficient implementation of meta-algorithms that can incrementally process
big and varying image volumes. In this paper, a content-based filtering approach is presented that is
suitable for selecting appropriate, geometrically varying images for 3D reconstruction purposes.
In particular, our approach exploits the online indexing mechanism to appropriately organize
new incoming image data and then adopts geometric properties in a multi-dimensional image manifold
(maximizing the geometric volume of image points) to select those data that optimize the 3D
reconstruction operation.
Makantasis, K., Katsaros, Y., Doulamis, A. and Bimpas, M.
Online Indexing Structure for Big Image Data used for 3D Reconstruction.
DOI: 10.5220/0005852207050714
In Proceedings of the 11th Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2016) - Volume 4: VISAPP, pages 705-714
ISBN: 978-989-758-175-5
Copyright © 2016 by SCITEPRESS Science and Technology Publications, Lda. All rights reserved
1.1 Previous Works
Content Based Image Retrieval (CBIR) tools are based on a visual matching process in order to
retrieve images from large repositories. They use image filtering and clustering algorithms to
organize images into groups of similar visual properties, thereby discarding noisy information.
A CBIR scheme requires the user to provide a query image to the system. The query image acts as a
reference image, whose visual information is encoded. Then, the system responds by retrieving from
a database those images that present high visual similarity with respect to the reference one.
In this direction, Murthy et al. (Murthy et al., 2010) propose a two-stage image retrieval
procedure based on the color properties of a reference image. Starting from an initial image set,
most of the images are filtered by applying hierarchical clustering. Then, k-means is applied on the
filtered data to obtain better results. However, the efficiency of this approach inherently depends
on camera properties and on the environmental conditions of the scene at the time the photo was
taken. Chum et al. (Chum et al., 2007) present a system whose objective is to retrieve all instances
of a query object in a large image database. In addition to visual similarities, the authors employ
a vocabulary tree for indexing and query expansion. Similarly, the system presented by Philbin et al.
(Philbin et al., 2007) enables the user to select an object of interest within a reference image and
then returns a ranked list of images that contain the selected object. Kekre et al. (Kekre et al.,
2011) develop image signatures based on image color properties. Signatures are used to create
clusters, which are represented by codebooks stored in a database. Each new query image is compared
against the existing codebooks in order to estimate the most relevant visual match. The main
drawback of the aforementioned approaches is that they require a reference image or an object of
interest to carry out the retrieval process. In contrast, our method eliminates outliers and
organizes the retrieved results in an unsupervised framework.
Simon et al. in (Simon et al., 2007) focus on vi-
sual clustering implemented through an optimization
approach that selects a number of canonical scene
views for constructing a scene summary. However,
the authors assume that an image set that represents
the scene is pre-constructed. In contrast, our approach
is responsible for creating this set.
Besides visual information, the description of "digital born" media is enhanced by textual
information, such as automatically generated geo-tags and camera EXIF data (Yiakoumettis et al.,
2014; Doulamis et al., 2012). The works of (Papadopoulos et al., 2010; Arampatzis et al., 2011;
Kalantidis et al., 2011) exploit geo-tagging and annotation to improve retrieval performance.
Particularly, the work of (Papadopoulos et al., 2010) describes an image analysis algorithm that
automates the detection of landmarks from large multimedia databases in order to improve the
content-consumption experience. The idea of geo-clustering is also exploited by the work of (Zheng
et al., 2009) for retrieving landmark images. This approach combines geo-information with
hierarchical agglomerative clustering to obtain dense geographic clusters. Because the retrieved
set contains many image outliers, visual clustering is performed to eliminate noisy images.
Agarwal et al. (Agarwal et al., 2011) use geo-tagged images and assume multiple different views of
the same object in each of these datasets. Then, they create a vocabulary tree for indexing and
query expansion to cluster similar images together. Although the aforementioned approaches are
useful for CBIR applications, where the aim is to extract similar images upon a query, they present
many shortcomings when applied to 3D reconstruction scenarios.
1.2 Our Contribution
Initially, the online indexing structure is constructed with the aim of scaling to large image
volumes. For this reason, a pre-defined number of landmark images are selected to represent the
image data points as well as possible. Particularly, for every image, local descriptors are
extracted to encode its visual content. In this paper, the ORB (Rublee et al., 2011) descriptor
is utilized. Then, an image graph is constructed, whose vertices correspond to the images and whose
edges correspond to pairwise image similarity matching. The cMDS algorithm (Cox and Cox, 2008) is
adopted to relate the pairwise similarity of the images to Euclidean distances. Therefore, we are
able to represent an image as a point in a multi-dimensional manifold. In this multi-dimensional
manifold, the image landmarks guarantee that the distance of a new incoming image with respect to
the already indexed ones can be computed both efficiently, in a constant number of operations, and
effectively.
Figure 1: Example of two images that were retrieved by using the textual query "Porta Nigra" and their projection on a 2D
manifold. Their coordinates were computed by using the distance between them, which was established by local descriptor
pair-wise similarity matching. Image A that depicts the monument is positioned in a high density area, while Image B, which
is an outlier, is positioned in a low density area.
RGB-SpectralImaging 2016 - Special Session on RBG and Spectral Imaging for Civil/Survey Engineering, Cultural, Environmental,
Industrial Applications
A textual query and/or geo-information are used to find a subspace of the initially indexed image
data that shares the same textual and geo-location information. Then, the position of images on the
manifold is a clear indicator of how close the visual content of two images is, see Fig.1. The
distribution of the retrieved images on the manifold is expected to form i) a compact hyperspace on
which images depicting the same object are located and ii) low density areas containing image
outliers. In order to develop a robust indexing structure, image outliers must be eliminated. To
this end, the density property of the space can be exploited through the application of a
density-based clustering algorithm such as SOS (Janssens et al., 2012). We choose SOS because it
computes the probability that a data point is an outlier. Outlier probabilities are preferable to
unbounded outlier scores and to hard classification of data, because they allow an appropriate and
rational threshold to be selected for outlier detection.
Having separated the images of the compact subspace from the image outliers, the next step is to
incrementally extract a set of images that are most suitable for 3D reconstruction. A 3D
reconstruction engine exploits different geometric perspectives of an object. For this reason,
images presenting similar geometric views of the object to be reconstructed can be considered
redundant information. The incremental set creation enables us to feed the 3D reconstruction engine
with the minimum required number of appropriate geometric views of an object, so as to achieve a
targeted precise reconstruction at a given scale. The selection technique is based on the fact that
the volume contained by a simplex formed by the most representative images is larger than the
simplex volume formed by any other combination of images (Winter, 1999).
The rest of the paper is organized as follows: Sec-
tion 2 presents how images are modeled as points on a
multi-dimensional manifold. Section 3 focuses on the
indexing structure and Section 4 describes the repre-
sentatives selection technique. Section 5 presents the
experimental framework and Section 6 concludes this
work.
2 IMAGES AS MULTI-DIMENSIONAL MANIFOLD POINTS
This section presents our approach to encoding the visual information of an initially retrieved
image dataset. We assume that $N$ images, $I^{(1)}, I^{(2)}, \dots, I^{(N)}$, are retrieved from web
multimedia repositories using geo-location information and textual metadata. Initially, through the
adoption of local visual descriptors, we represent the content of the images and then we formulate
the similarity/distance between pairs of images. Finally, by exploiting the cMDS algorithm, we
relate the space of distances to the space of Gram matrices, which are used to compute image
coordinates on a multi-dimensional manifold over which each image is represented.
2.1 Geometric Invariant Visual Content Modeling
In this paper, we choose to use the ORB descriptor (Rublee et al., 2011) for encoding the visual
content of images. Our choice is justified by the fact that, on the one hand, ORB performs better
than SURF (Bay et al., 2006) and, on the other, it performs as well as SIFT (Lowe, 2004), while being
almost two orders of magnitude faster. ORB builds on the FAST keypoint detector (Rosten and
Drummond, 2006) and the BRIEF descriptor (Calonder et al., 2010) and addresses their limitations by
adding an accurate orientation component to FAST and by incorporating a method for decorrelating
BRIEF features under a rotation invariant framework.
Figure 2: (a) Images projected in a 2D manifold. Their coordinates were computed by using their pair-wise distances. (b)
Inliers were selected as landmarks and defined a new 2D subspace. (c) New samples (red triangles and green circles) are
indexed/projected according to landmarks. Green circles correspond to new samples denoted as inliers, while red triangles
correspond to new samples denoted as outliers. New samples that fall into the region of influence of the centroid or of a
landmark are denoted as inliers. In the first case the indexing structure remains as it is, while in the second it is updated.
To be more specific, for each image pixel $p_c$ that has been denoted by the FAST detector as a
corner pixel, a bit-string is built from a set of $n$ binary tests
$T = \{\tau_1, \tau_2, \dots, \tau_n\}$, where $n$ is a predefined scalar parameter of the algorithm.
The $n$ binary tests take place in an image patch $l(p_c)$ around pixel $p_c$ as follows:
$$\tau_i(l(p_c); q, r) = \begin{cases} 1 & \text{if } I(r) > I(q) \\ 0 & \text{if } I(r) \le I(q) \end{cases} \tag{1}$$
In Eq.(1), variables $q$ and $r$ stand for two pixels within the patch $l(p_c)$, while $I(q)$ and
$I(r)$ correspond to the image intensities at pixels $q$ and $r$ respectively. Based on the outcome
of the $n$ binary tests, a feature that describes the patch $l(p_c)$ of image $I$ is constructed as:
$$f_n^{(I)}(l(p_c)) = \sum_{i=1}^{n} 2^{i-1} \tau_i(l(p_c); q, r). \tag{2}$$
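As a minimal sketch of Eq.(1) and Eq.(2), the snippet below evaluates the binary tests on a small patch and packs the resulting bits into a single integer. The function names `binary_tests` and `feature_value`, and the representation of each test as a pair of (row, column) positions, are illustrative assumptions and not part of ORB's actual implementation.

```python
import numpy as np

def binary_tests(patch, pairs):
    """Evaluate the n binary tests of Eq.(1) on an intensity patch.

    patch : 2D array of intensities.
    pairs : integer array of shape (n, 2, 2); pairs[i, 0] is the (row, col)
            of pixel q and pairs[i, 1] the (row, col) of pixel r for test i
            (hypothetical layout chosen for this sketch).
    """
    q = patch[pairs[:, 0, 0], pairs[:, 0, 1]]
    r = patch[pairs[:, 1, 0], pairs[:, 1, 1]]
    return (r > q).astype(np.uint8)  # 1 if I(r) > I(q), else 0

def feature_value(bits):
    """Pack the test outcomes as in Eq.(2): sum over i of 2^(i-1) * tau_i."""
    return sum(int(b) << i for i, b in enumerate(bits))

patch = np.array([[1, 5],
                  [3, 2]])
pairs = np.array([[[0, 0], [0, 1]],   # q = 1, r = 5
                  [[0, 1], [1, 0]],   # q = 5, r = 3
                  [[1, 1], [1, 0]]])  # q = 2, r = 3
bits = binary_tests(patch, pairs)
value = feature_value(bits)
```

In practice the bit-string is usually kept as a binary vector rather than an integer, which is the form used by the Hamming-distance matching later in the text.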
By utilizing the intensity centroid corner orientation measure (Rosin, 1999), the orientation angle
$\theta(l(p_c))$ of the patch $l(p_c)$ can be computed as:
$$\theta(l(p_c)) = \arctan\big(m_{01}(l(p_c)),\, m_{10}(l(p_c))\big), \tag{3}$$
where $m_{01}(l(p_c))$ and $m_{10}(l(p_c))$ stand for the raw moments of the patch $l(p_c)$.
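The orientation of Eq.(3) can be sketched as follows. This is an illustration under the assumption that the moments are taken with the patch centre as origin and that the two-argument arctangent corresponds to `numpy.arctan2`; the function name `patch_orientation` is hypothetical.

```python
import numpy as np

def patch_orientation(patch):
    """Intensity-centroid orientation of a patch, following Eq.(3).

    Assumption of this sketch: pixel coordinates are measured from the
    patch centre, so m10 and m01 are first-order moments about the centre.
    """
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xs = xs - (w - 1) / 2.0
    ys = ys - (h - 1) / 2.0
    m10 = np.sum(xs * patch)  # first-order moment along x (columns)
    m01 = np.sum(ys * patch)  # first-order moment along y (rows)
    return np.arctan2(m01, m10)

# A patch whose only bright pixel lies to the right of the centre
right = np.zeros((5, 5))
right[2, 4] = 1.0
theta_right = patch_orientation(right)

# A patch whose only bright pixel lies below the centre (larger row index)
below = np.zeros((5, 5))
below[4, 2] = 1.0
theta_below = patch_orientation(below)
```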
The projection of the feature vector $f_n^{(I)}(l(p_c))$ of Eq.(2) onto the angle $\theta(l(p_c))$
results in a rotation invariant binary representation vector, $\varphi_n^{(I)}(l(p_c))$, of patch
$l(p_c)$. Then, the visual content of an image $I$ is represented by a matrix
$\Phi^{(I)} \in \{0, 1\}^{K \times n}$:
$$\Phi^{(I)} = \big[\varphi_n^{(I)}(l(p_1)) \; \cdots \; \varphi_n^{(I)}(l(p_K))\big]^T, \tag{4}$$
where $K$ is a predefined scalar parameter of the ORB descriptor algorithm and stands for the number
of detected keypoints in an image.
2.2 Formation of Image Graphs
For estimating the visual similarity between two images, A and B, their corresponding points have
to be computed. Correspondences can be estimated by performing nearest-neighbor keypoint matching
between every pair of images. Because ORB keypoints are described by a binary pattern, multi-probe
LSH (Lv et al., 2007) is used with the Hamming distance, $D_H$.
Let us denote as $k_i^{(A)}$ the $i$-th keypoint of image A, which is described by the vector
$\varphi_n^{(A)}(l(p_i))$. Then, the most relevant keypoint $k_{j_i}^{(B)}$ of image B with respect
to $k_i^{(A)}$ is obtained by the following relation:
$$j_i = \arg\min_{j=1,2,\dots,K} D_H\big(\varphi_n^{(A)}(l(p_i)),\, \varphi_n^{(B)}(l(p_j))\big). \tag{5}$$
Having detected all corresponding points between two images A and B, we can form a set
$$M^{(A \to B)} = \{(k_i^{(A)}, k_{j_i}^{(B)}) \mid i = 1, 2, \dots, K\} \tag{6}$$
that contains all keypoints $k_i^{(A)}$, $i = 1, 2, \dots, K$, along with their corresponding points
$k_{j_i}^{(B)}$.
For every pair of images in the dataset, a two-way matching is performed. The set of final matches,
$$M^{(A,B)} = M^{(A \to B)} \cap M^{(B \to A)}, \tag{7}$$
between images A and B is defined as the intersection of the sets $M^{(A \to B)}$ and
$M^{(B \to A)}$.
Two-way matching compensates for inconsistencies caused by the fact that the nearest neighbor of an
extracted keypoint in image A may be different from the nearest neighbor of the corresponding
keypoint in image B.
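The two-way matching described above can be illustrated with a brute-force sketch. Note that the paper uses multi-probe LSH for the nearest-neighbor search; the exhaustive Hamming-distance search below is a simpler stand-in chosen only for clarity, and the function names are hypothetical.

```python
import numpy as np

def hamming_matrix(phi_a, phi_b):
    """Pairwise Hamming distances between rows of two 0/1 descriptor matrices."""
    return (phi_a[:, None, :] != phi_b[None, :, :]).sum(axis=2)

def two_way_matches(phi_a, phi_b):
    """Keep pairs (i, j) where j is the nearest neighbour of i in B
    and i is the nearest neighbour of j in A (cross-check)."""
    dist = hamming_matrix(phi_a, phi_b)
    a_to_b = dist.argmin(axis=1)  # best match in B for each keypoint of A
    b_to_a = dist.argmin(axis=0)  # best match in A for each keypoint of B
    return {(i, int(j)) for i, j in enumerate(a_to_b) if b_to_a[j] == i}

phi_a = np.array([[0, 0, 0, 0],
                  [1, 1, 1, 1],
                  [1, 1, 0, 0]])
phi_b = np.array([[1, 1, 1, 1],
                  [0, 0, 0, 0],
                  [0, 1, 1, 0]])
matches = two_way_matches(phi_a, phi_b)
```

Here the third keypoint of A has no mutual nearest neighbour in B, so it is dropped from the final match set.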
Using the $M^{(A,B)}$ set, we define a visual similarity metric between images A and B as:
$$s_{i=A,\, j=B} = \frac{|M^{(A,B)}|}{K}, \tag{8}$$
where $|M^{(A,B)}|$ refers to the cardinality of the $M^{(A,B)}$ set.
The output of the aforementioned process for $N$ images is an $N \times N$ symmetric matrix $S$ with
elements $s_{ij} \in [0,1]$, $i, j = 1, 2, \dots, N$. Variable $s_{ij}$ takes a value close to zero
for two quite dissimilar images and close to one when two images are similar. We denote as $D$ the
element-wise negative logarithm of matrix $S$, so that similar images receive a distance close to
zero and quite dissimilar images a very high value:
$$D = [d_{ij}] = -\log(S). \tag{9}$$
$D$ is an $N \times N$ symmetric matrix with non-negative elements and zeros on the main diagonal.
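A small sketch of Eq.(9) is given below. The clipping constant `eps`, which keeps the logarithm finite when two images share no matches, is an assumption of this sketch and is not specified in the text; the function name is also hypothetical.

```python
import numpy as np

def similarity_to_distance(S, eps=1e-6):
    """Element-wise D = -log(S), as in Eq.(9).

    eps is an assumed lower bound on similarities, so that image pairs
    with no matches (s_ij = 0) receive a large but finite distance.
    """
    S = np.clip(np.asarray(S, dtype=float), eps, 1.0)
    D = -np.log(S)
    np.fill_diagonal(D, 0.0)  # an image is at distance zero from itself
    return D

S = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.25],
              [0.0, 0.25, 1.0]])
D = similarity_to_distance(S)
```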
2.3 Image Graph Projection onto Multi-dimensional Manifold
Let us denote as $x^{(i)} \in \mathbb{R}^{\mu}$ the coordinates of the $i$-th image in the
$\mu$-dimensional space. The space is defined such that the distance between two points of the
space, represented by the coordinates $x^{(i)}$ and $x^{(j)}$, equals their respective image
distance, $d_{ij} = -\log(s_{ij})$, defined in Eq.(9). The coordinates of all $N$ images in the
dataset can be compactly represented by a matrix $X$:
$$X = \big[x^{(1)} \; x^{(2)} \; \cdots \; x^{(N)}\big]^T \in \mathbb{R}^{N \times \mu}. \tag{10}$$
If we define the Gram matrix $B = X X^T$ of the image coordinates, then the cMDS algorithm can be
used to establish a connection between the space of distances and the Gram matrix $B$, based on the
following theorem (the proof can be found in (Cayton, 2006)).
Theorem 1. A non-negative symmetric matrix $D \in \mathbb{R}^{N \times N}$, with zeros on the
diagonal, is a Euclidean distance matrix if and only if $B \triangleq -\frac{1}{2} H D H$, where
$H \triangleq I - \frac{1}{N} \mathbf{1}\mathbf{1}^T$, is positive semidefinite. Furthermore, this
$B$ will be the Gram matrix for a mean-centered configuration with interpoint distances given by $D$.
In cases where the dissimilarity matrix $D$ is not Euclidean, the matrix $B$ as described by the
above theorem will not be positive semidefinite, and thus will not be a Gram matrix. To handle such
cases, the cMDS algorithm projects the matrix $B$ onto the cone of positive semidefinite matrices
by setting its negative eigenvalues to zero. In order to get matrix $X$, the matrix $B$ is
spectrally decomposed into $B = U V U^T$ and then $X = U V^{1/2}$. If we denote as $q_i$ and
$\lambda_i$, for $i = 1, 2, \dots, N$, the eigenvectors and eigenvalues of $B$, then matrix $U$ is a
square $N \times N$ matrix whose $i$-th column is the eigenvector $q_i$ of $B$ and $V = [v_{ii}]$ is
the diagonal matrix whose elements $v_{ii}$ are the corresponding eigenvalues, i.e.
$v_{ii} = \lambda_i$. Finally, the dimension $\mu$ of the multi-dimensional space is equal to the
number of non-zero eigenvalues of matrix $B$.
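The cMDS steps above (double centering, eigendecomposition, clipping of negative eigenvalues) can be sketched with NumPy as follows. The function name `cmds` and the relative tolerance used to decide which eigenvalues count as non-zero are assumptions of this sketch. The check below uses squared Euclidean distances, which is the setting in which the double-centering identity recovers a configuration exactly; whether the distances of Eq.(9) are squared beforehand is not stated in the text.

```python
import numpy as np

def cmds(D, tol=1e-9):
    """Classical MDS: embed N points from an N x N dissimilarity matrix D.

    Returns X of shape (N, mu), where mu is the number of eigenvalues
    of B that are numerically positive (tol is an assumed threshold).
    """
    N = D.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N      # centering matrix
    B = -0.5 * H @ D @ H                     # candidate Gram matrix
    B = (B + B.T) / 2.0                      # enforce exact symmetry
    lam, U = np.linalg.eigh(B)               # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]
    lam, U = lam[order], U[:, order]
    keep = lam > tol * max(lam[0], 1.0)      # drop zero and negative eigenvalues
    return U[:, keep] * np.sqrt(lam[keep])   # X = U V^(1/2)

rng = np.random.default_rng(0)
points = rng.normal(size=(6, 2))
D2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
X = cmds(D2)
D2_embedded = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
```

The recovered coordinates match the original points only up to rotation, reflection and translation, which is why the check compares pairwise distances rather than coordinates.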
3 THE ONLINE IMAGE INDEXING STRUCTURE
The number of available images stored on Internet multimedia repositories is continuously
increasing. For this reason, the proposed method focuses on creating an indexing structure capable
of processing online newly retrieved images other than those included in the initial dataset.
However, in order to develop a robust indexing structure, we must eliminate image outliers and form
a set that contains only the visually similar images.
By using the representation of images as points in a $\mu$-dimensional space, we can intuitively
note that outliers must reside in areas of low spatial density, whereas visually similar images must
form areas of high spatial density. Exploiting the density property, or in other words the affinity
between image points, the $\mu$-dimensional manifold must be partitioned into two disjoint
subspaces, $\mathcal{C}$ and $\bar{\mathcal{C}}$, such that all visually similar images belong to
$\mathcal{C}$ and all outliers to $\bar{\mathcal{C}}$.
3.1 Affinity-based Partitioning
An affinity-based approach for selecting outliers is the
SOS algorithm (Janssens et al., 2012). This algorithm
employs the concept of affinity to quantify the rela-
tionship from one image point to another. Based on
this relationship an image point is denoted as outlier
when all other points have insufficient affinity with it.
By using the distance $d_{ij}$, defined in Eq.(9), between image points $x^{(i)}$ and $x^{(j)}$, the
affinity between these points can be defined as:
$$\alpha_{ij} = \begin{cases} \exp\!\big(-d_{ij}^2 / 2\sigma_i^2\big) & \text{if } i \neq j \\ 0 & \text{if } i = j \end{cases} \tag{11}$$
where $\sigma_i^2$ is a scalar variance associated with image point $x^{(i)}$. As shown by Eq.(11),
an image point has no affinity with itself, and the affinity that the point $x^{(i)}$ has with point
$x^{(j)}$ is proportional to the probability density at $x^{(j)}$ under a Gaussian distribution
$\mathcal{N}(x^{(i)}, \sigma_i^2)$. For determining the variance $\sigma_i^2$ for each image point,
SOS uses an adaptive approach. Concretely, it employs the perplexity parameter $h$, which is used to
set the variances adaptively in such a way that each point has $h$ effective neighbors (Hinton and
Roweis, 2002). Note that $h$ is the only parameter that the SOS algorithm requires to be
pre-defined.
Unlike the distance matrix $D$, the affinity matrix $A = [\alpha_{ij}]$ is not symmetric. By using
the affinity distribution $\alpha_i = [\alpha_{i1} \; \alpha_{i2} \; \dots \; \alpha_{iN}]$ for the
point $x^{(i)}$, a discrete probability distribution $b_i$, which gives the probability that point
$x^{(i)}$ chooses each of the other points as its neighbor, is defined as
$$b_i = [b_{i1} \; b_{i2} \; \dots \; b_{iN}] \quad \text{where} \quad b_{ij} = \frac{\alpha_{ij}}{\sum_{k=1}^{N} \alpha_{ik}}. \tag{12}$$
The probability distribution $b_i$ corresponds to the normalized affinity $\alpha_i$.
After the estimation of the probability distribution $b_i$, the probability that the image point
$x^{(i)}$ is an outlier can be estimated by the following theorem (the proof can be found in
(Janssens et al., 2012)).
Theorem 2. If $\alpha_{ij}$ is the affinity that data point $x^{(i)}$ has with data point $x^{(j)}$
and $b_{ij}$ is the normalized affinity between these two points, then the probability that data
point $x^{(i)}$ belongs to the outlier class, $\bar{\mathcal{C}}$, is given by:
$$p(x^{(i)} \in \bar{\mathcal{C}}) = \prod_{j \neq i} (1 - b_{ji}). \tag{13}$$
The above theorem states that the probability that an image point $x^{(i)}$ belongs to the outlier
class, $\bar{\mathcal{C}}$, is the probability that this point is never chosen as a neighbor by
the other image points.
For $N$ images, the output of the SOS algorithm can be compactly represented by a vector
$\rho \in \mathbb{R}^N$:
$$\rho = \big[p(x^{(1)} \in \bar{\mathcal{C}}) \; \dots \; p(x^{(N)} \in \bar{\mathcal{C}})\big]^T. \tag{14}$$
Using Eq.(14), the set $\mathcal{Q}$ that will contain the coordinates of the inlier images can be
defined as
$$\mathcal{Q} = \{x^{(i)} \mid \rho_i < \theta\} \quad \text{for } i = 1, 2, \dots, N. \tag{15}$$
In Eq.(15), $\rho_i$ stands for the $i$-th element of $\rho$ and $\theta$ is a probability threshold
used to discriminate image outliers from inliers.
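Eqs.(11) to (15) can be put together in a short sketch. The binary search that tunes each variance until the neighbor distribution reaches the requested perplexity is one common way to realise the adaptive step and is an assumption here, as are the function name, the tolerance and the iteration cap; the perplexity must be smaller than the number of other points.

```python
import numpy as np

def sos_outlier_probabilities(D, perplexity, tol=1e-5, max_iter=100):
    """Outlier probability of every point, following Eqs.(11)-(13).

    D is an N x N distance matrix. For each point i, beta_i = 1 / (2 sigma_i^2)
    is tuned by bisection so that the entropy of b_i equals log(perplexity).
    """
    N = D.shape[0]
    B = np.zeros((N, N))
    target = np.log(perplexity)
    others = ~np.eye(N, dtype=bool)
    for i in range(N):
        d2 = D[i, others[i]] ** 2
        d2 = d2 - d2.min()            # shift for stability; b_i is unchanged
        lo, hi, beta = 0.0, np.inf, 1.0
        for _ in range(max_iter):
            a = np.exp(-d2 * beta)    # affinities of Eq.(11)
            b = a / a.sum()           # normalised affinities of Eq.(12)
            entropy = -np.sum(b * np.log(b + 1e-300))
            if abs(entropy - target) < tol:
                break
            if entropy > target:      # too many effective neighbours
                lo = beta
                beta = beta * 2.0 if np.isinf(hi) else (lo + hi) / 2.0
            else:                     # too few effective neighbours
                hi = beta
                beta = (lo + hi) / 2.0
        B[i, others[i]] = b
    # Eq.(13): probability that point i is never chosen as a neighbour
    return np.prod(1.0 - B, axis=0)

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(scale=0.1, size=(20, 2)), [[10.0, 10.0]]])
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
rho = sos_outlier_probabilities(D, perplexity=5.0)
theta = 0.9                           # illustrative threshold for Eq.(15)
inliers = np.where(rho < theta)[0]
```

In this toy example the isolated point at (10, 10) is essentially never chosen as a neighbor by the clustered points, so its outlier probability is close to one.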
3.2 Indexing Structure Initialization
Let us define the set $\mathcal{L} = \{x^{(i)} \mid x^{(i)} \in \mathcal{Q}\}$, which contains the
coordinates of the visually similar images in the multi-dimensional space. The image points
$x^{(i)} \in \mathcal{L}$ act as landmarks that determine whether a new image $\hat{I}$ must be
denoted as inlier or outlier. The elements of $\mathcal{L}$ define a space with a centroid, $c$,
whose coordinates are $x^{(c)}$. Regions of influence are defined around the centroid and each one
of the landmarks. The region of influence of the centroid, $R_c$, is defined as
$$R_c(x^{(c)}, r_c) = \{x \mid (x - x^{(c)})^T (x - x^{(c)}) \le r_c\}, \tag{16}$$
where $r_c = \max\{\|x^{(c)} - x^{(i)}\|_2 \mid x^{(i)} \in \mathcal{L}\}$. The region of influence
of a landmark $x^{(i)}$ is defined in a similar way:
$$R_i(x^{(i)}, r_i) = \{x \mid (x - x^{(i)})^T (x - x^{(i)}) \le r_i\}. \tag{17}$$
In this case $r_i$ is defined as
$$r_i = \min\{\|x^{(i)} - x^{(j)}\|_2 \mid x^{(i)}, x^{(j)} \in \mathcal{L} \text{ and } i \neq j\}. \tag{18}$$
Regions of influence are used, as described in the next subsection, for classifying newly retrieved
images as inliers or outliers.
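The radii of Eqs.(16) to (18) can be computed as in the following sketch. Taking the centroid as the arithmetic mean of the landmarks is an assumption of this illustration, since the text does not state how $x^{(c)}$ is obtained; the function name is hypothetical.

```python
import numpy as np

def regions_of_influence(landmarks):
    """Return (centroid, r_c, r) for a landmark array of shape (|L|, mu).

    r_c : largest centroid-to-landmark distance (radius used in Eq.(16)).
    r   : for each landmark, distance to its nearest other landmark (Eq.(18)).
    Assumption: the centroid is the mean of the landmark coordinates.
    """
    centroid = landmarks.mean(axis=0)
    r_c = np.linalg.norm(landmarks - centroid, axis=1).max()
    pairwise = np.linalg.norm(landmarks[:, None, :] - landmarks[None, :, :], axis=2)
    np.fill_diagonal(pairwise, np.inf)   # ignore the distance of a landmark to itself
    r = pairwise.min(axis=1)
    return centroid, r_c, r

landmarks = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
centroid, r_c, r = regions_of_influence(landmarks)
```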
3.3 Online Image Indexing
Let us assume that a new image, $\hat{I}$, is retrieved. We define the set $\mathcal{Q}_I$ as:
$$\mathcal{Q}_I = \{I^{(i)} \mid x^{(i)} \in \mathcal{Q}\}. \tag{19}$$
The distances between $\hat{I}$ and each one of the images $I^{(i)} \in \mathcal{Q}_I$ are computed
by the method described in Section 2.
In order to index the new image $\hat{I}$, it has to be projected onto the multi-dimensional
geometric space defined by the images belonging to $\mathcal{Q}_I$. Let $\hat{x}^{(\hat{I})}$ be the
coordinates of image $\hat{I}$ after its projection onto the multi-dimensional space. The objective
of assigning coordinates to image $\hat{I}$ is to minimize the distance distortion given by the
following relation:
$$e(I^{(i)}, \hat{I}) = \big|\, d(I^{(i)}, \hat{I}) - \|\hat{x}^{(\hat{I})} - x^{(i)}\|_2 \,\big|. \tag{20}$$
Here $d(I^{(i)}, \hat{I})$ is the distance between images $I^{(i)}$ and $\hat{I}$ computed by
Eq.(9) and $\|\cdot\|_2$ refers to the $L_2$-norm of a vector. Eq.(20) measures the distance
distortion by the absolute error.
The problem of assigning coordinates to image $\hat{I}$ can be seen as a typical optimization
problem, where the following objective function is minimized:
$$\arg\min_{\hat{x}^{(\hat{I})}} \sqrt{\sum_{i=1}^{L} e(I^{(i)}, \hat{I})^2}. \tag{21}$$
For estimating the optimal coordinates $\hat{x}^{(\hat{I})}$ we used the downhill simplex method.
The time for projecting a new image onto a $\mu$-dimensional space is determined by the downhill
simplex method. In general, downhill simplex with an objective function $g$ takes
$O(mD \times f(g))$ time, where $f(g)$ is the cost of evaluating $g$, $D$ is the number of
dimensions and $m$ the number of iterations. In our case, we have $D = \mu$ and
$f(g) = L \cdot \mu$, where $L$ stands for the cardinality of $\mathcal{Q}_I$. The second equation
holds because we need to calculate the distances between image $\hat{I}$ and each one of the images
$I^{(i)} \in \mathcal{Q}_I$ in a $\mu$-dimensional space. In all, the time complexity for indexing
a new image is $O(mL\mu^2)$.
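The projection step of Eqs.(20) and (21) can be sketched with SciPy's Nelder-Mead implementation of the downhill simplex method. Starting the search from the mean of the indexed points, the tolerance values and the function name `project_new_image` are assumptions of this sketch, not choices documented in the text.

```python
import numpy as np
from scipy.optimize import minimize

def project_new_image(indexed_points, d_new):
    """Find coordinates for a new image by minimising Eq.(21).

    indexed_points : array (L, mu) with coordinates of already indexed images.
    d_new          : array (L,) with distances from the new image to each of them.
    """
    def objective(x):
        # Eq.(20): absolute distortion per indexed image, combined as in Eq.(21)
        err = d_new - np.linalg.norm(indexed_points - x, axis=1)
        return np.sqrt(np.sum(err ** 2))

    x0 = indexed_points.mean(axis=0)   # assumed starting point
    result = minimize(objective, x0, method="Nelder-Mead",
                      options={"xatol": 1e-8, "fatol": 1e-8,
                               "maxiter": 5000, "maxfev": 5000})
    return result.x, result.fun

rng = np.random.default_rng(0)
indexed = rng.normal(size=(10, 3))
x_true = np.array([0.3, -0.2, 0.5])
d = np.linalg.norm(indexed - x_true, axis=1)   # noise-free distances for the demo
x_hat, distortion = project_new_image(indexed, d)
```

Each objective evaluation touches all L indexed points in mu dimensions, which is where the L·µ cost per evaluation in the complexity analysis comes from.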
Having defined the regions of influence for the centroid and each one of the landmarks (Subsection
3.2), a new image $\hat{I}$, with coordinates $\hat{x}^{(\hat{I})}$, is denoted as inlier only if
$\hat{x}^{(\hat{I})} \in R_c$ or $\hat{x}^{(\hat{I})} \in R_i$ for some
$i = 1, 2, \dots, |\mathcal{L}|$, where $|\mathcal{L}|$ stands for the cardinality of set
$\mathcal{L}$.
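If $\hat{x}^{(\hat{I})} \in R_c$, the set $\mathcal{L}$ remains as it is, while the sets
$\mathcal{Q}$ and $\mathcal{Q}_I$ are updated according to the following relation:
$$\mathcal{Q} := \mathcal{Q} \cup \hat{x}^{(\hat{I})} \quad \text{and} \quad \mathcal{Q}_I := \mathcal{Q}_I \cup \hat{I}. \tag{22}$$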
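If $\hat{x}^{(\hat{I})} \in R_i$ for some $i = 1, 2, \dots, |\mathcal{L}|$ and
$\hat{x}^{(\hat{I})} \notin R_c$, the sets $\mathcal{Q}$ and $\mathcal{Q}_I$ are updated according
to Eq.(22), but in this case the set $\mathcal{L}$ is also updated as:
$$\mathcal{L} := \mathcal{L} \cup \hat{x}^{(\hat{I})} \setminus \min\{\|x^{(i)} - x^{(c)}\|_2 \mid x^{(i)} \in \mathcal{L}\}, \tag{23}$$
that is, the new point is added and the landmark closest to the centroid is removed. This
adaptation takes into consideration the visual content of new images, while at the same time
keeping the number of landmarks constant.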
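The classification and update rules above can be sketched as follows. The sketch compares Euclidean distances against the radii, uses the mean of the landmarks as centroid, and replaces the landmark closest to the centroid when Eq.(23) applies; these are interpretations of the text rather than documented details, and the function name is hypothetical.

```python
import numpy as np

def index_new_point(x_new, landmarks):
    """Classify a projected point and, if needed, update the landmark set.

    Returns (label, landmarks), where label is "inlier" or "outlier".
    """
    centroid = landmarks.mean(axis=0)                       # assumed centroid
    to_centroid = np.linalg.norm(landmarks - centroid, axis=1)
    r_c = to_centroid.max()
    pairwise = np.linalg.norm(landmarks[:, None, :] - landmarks[None, :, :], axis=2)
    np.fill_diagonal(pairwise, np.inf)
    r = pairwise.min(axis=1)                                # radii of Eq.(18)

    if np.linalg.norm(x_new - centroid) <= r_c:
        return "inlier", landmarks                          # landmark set unchanged
    if np.any(np.linalg.norm(landmarks - x_new, axis=1) <= r):
        # Eq.(23): add the new point, drop the landmark closest to the centroid
        drop = int(np.argmin(to_centroid))
        updated = np.vstack([np.delete(landmarks, drop, axis=0), x_new])
        return "inlier", updated
    return "outlier", landmarks

L0 = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0], [2.0, 2.5]])
label_a, L_a = index_new_point(np.array([2.0, 2.0]), L0)    # inside R_c
label_b, L_b = index_new_point(np.array([5.0, 5.0]), L0)    # inside a landmark region only
label_c, L_c = index_new_point(np.array([20.0, 20.0]), L0)  # far from everything
```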
4 REPRESENTATIVE OBJECT GEOMETRIC PERSPECTIVES
After the creation of $\mathcal{Q}$ and $\mathcal{Q}_I$, we need to select the most representative
images corresponding to different geometric perspectives of the cultural heritage object under 3D
reconstruction. The representative images are fed as input to a 3D reconstruction algorithm to
improve computational time while keeping the same reconstruction accuracy.
4.1 Representatives Selection through Simplex Volume Expansion
We assume that the $\mu$-dimensional volume formed by a simplex with vertices specified by the
points of the most representative images should be larger than that formed by any other combination
of image points. Let us denote as $\nu^{(i)}$ the $i$-th representative image point, as $\beta$ the
number of representative images to be generated, as
$\mathcal{Q}_R = \{\nu^{(1)}, \nu^{(2)}, \dots, \nu^{(\beta)}\} \subseteq \mathcal{Q}$ the set that
contains the points of the representative images and as $w^{(j)}$ the row vector that equals
$\nu^{(j)} - \nu^{(1)}$ for $j = 2, 3, \dots, \beta$. Then the volume, $V(\mathcal{Q}_R)$, of the
simplex whose vertices are the points $\nu^{(i)}$ for $i = 1, 2, \dots, \beta$ can be computed as:
$$V(\mathcal{Q}_R) = \frac{|\det(W W^T)|^{1/2}}{(\beta - 1)!}, \tag{24}$$
where $W$ is a $(\beta - 1) \times \mu$ matrix whose rows are the row vectors $w^{(j)}$.
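Eq.(24) translates almost directly into NumPy. The sketch below, with the hypothetical function name `simplex_volume`, checks the formula on a right triangle and on a unit tetrahedron, whose volumes are known in closed form.

```python
import numpy as np
from math import factorial

def simplex_volume(vertices):
    """Volume of the simplex spanned by the rows of `vertices`, as in Eq.(24).

    vertices : array of shape (beta, mu); the first row plays the role of nu^(1).
    """
    W = vertices[1:] - vertices[0]           # rows are w^(j) = nu^(j) - nu^(1)
    gram = W @ W.T
    return np.sqrt(abs(np.linalg.det(gram))) / factorial(len(vertices) - 1)

triangle = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
tetrahedron = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                        [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
area = simplex_volume(triangle)
volume = simplex_volume(tetrahedron)
```

Because the formula uses the Gram matrix of the edge vectors, it also works when the number of vertices is smaller than the dimension of the space plus one, for example a triangle embedded in a high-dimensional manifold.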
For estimating the most representative images, the set $\mathcal{Q}_R$ is initially constructed by
randomly selecting $\beta$ images from set $\mathcal{Q}$, and the volume of the simplex formed by
the elements of $\mathcal{Q}_R$ is calculated. Then, an iterative approach is adopted to test every
image in the set $\mathcal{Q}$ as a candidate representative. To be more specific, each one of the
image points of $\mathcal{Q}_R$ is replaced, one at a time, with an image point $\hat{\nu}$ from
$\mathcal{Q}$ that is being tested as a candidate representative. Then, the algorithm evaluates
whether replacing any of the elements of $\mathcal{Q}_R$ with the image point being tested results
in a larger simplex volume. If this is true, say for the point $\nu^{(j)} \in \mathcal{Q}_R$, then
the point $\nu^{(j)}$ is replaced by the image point $\hat{\nu}$, and the process is repeated until
every image from the set $\mathcal{Q}$ has been evaluated.
For making the selection method scalable to large datasets, we follow an incremental approach. Let
us assume that $\beta$ representatives are known. Then, the problem of selecting $\beta + 1$
representatives can be reduced to finding $\beta + 1$ representatives given $\beta$ of them. This
way, only the volumes of the simplices formed by the elements of the sets
$\mathcal{Q}_R \cup x^{(i)}$ for $x^{(i)} \in \mathcal{Q}$ need to be evaluated.
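The incremental variant can be sketched as a greedy loop that, given the current representatives, adds the point that yields the largest simplex volume. Seeding the set with the two mutually farthest points is an assumption made for this illustration, since the text does not specify how the first representatives are obtained; the function names are hypothetical.

```python
import numpy as np
from math import factorial

def simplex_volume(vertices):
    """Simplex volume from Eq.(24)."""
    W = vertices[1:] - vertices[0]
    return np.sqrt(abs(np.linalg.det(W @ W.T))) / factorial(len(vertices) - 1)

def select_representatives(Q, beta):
    """Greedily grow a set of beta representative indices from the rows of Q."""
    pairwise = np.linalg.norm(Q[:, None, :] - Q[None, :, :], axis=2)
    i, j = np.unravel_index(pairwise.argmax(), pairwise.shape)
    chosen = [int(i), int(j)]                      # assumed seed: farthest pair
    while len(chosen) < beta:
        rest = [k for k in range(len(Q)) if k not in chosen]
        volumes = [simplex_volume(Q[chosen + [k]]) for k in rest]
        chosen.append(rest[int(np.argmax(volumes))])
    return chosen

# Four corners of a square plus three interior points
Q = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0],
              [2.0, 2.0], [1.0, 1.0], [3.0, 2.0]])
representatives = select_representatives(Q, beta=3)
best_volume = simplex_volume(Q[representatives])
```

Each step evaluates one volume per remaining point, instead of re-testing every replacement in the full set, which is the saving the incremental formulation aims at.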
5 EXPERIMENTAL RESULTS
In the framework of this research, we have collected from Internet image repositories images
depicting different cultural heritage monuments, such as Porta Nigra in Germany, Parthenon in Athens
and Descobrimentos in Lisboa. All these images have been gathered with respect to their textual
annotation and geo-information, regardless of the actual type of content they depict. Thus, for each
cultural heritage category, a large number of image outliers are encountered.
The presented approach was evaluated with regard to indexing, in terms of accuracy and time
complexity, as well as with regard to 3D reconstruction accuracy after the selection of the most
representative images. The algorithm was developed in Python and executed on a conventional i5 CPU
laptop.
Figure 3: (a) Right predictions; (b) New image indexing; (c) Projection error; (d) New image projection.
Diagram (a) shows the ratio of right denotations of new images as inliers or outliers in regard to the number of
landmarks, while diagram (b) presents the time required to classify a new image. Diagram (c) shows the projection error
when assigning coordinates to new images in regard to the number of dimensions of the space onto which the images are
projected. The time required to project a new image onto the multi-dimensional space is presented in (d).
5.1 Indexing Evaluation
In order to evaluate the indexing mechanism, we created an indexing structure using a varying number of landmarks. Then, we manually selected one hundred outlier images and one hundred inlier images, which were fed to the indexing mechanism in order to be classified. Two different versions of the algorithm were tested: one using a fixed indexing structure and one using an adaptive indexing structure. In the first case, the indexing structure remains fixed, while in the latter the adaptation mechanism is enabled and the set of landmarks is updated in order to include the visual information of new images.
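The evaluation protocol can be sketched as follows. This is only an illustration: the `classify` and `adapt` callables stand in for the indexing mechanism and its adaptation step, whose interfaces are assumptions made here, and the toy classifier is a placeholder rather than the actual inlier/outlier rule.

```python
from typing import Callable, Iterable, Optional, Tuple, TypeVar

Image = TypeVar("Image")


def right_prediction_ratio(
    samples: Iterable[Tuple[Image, bool]],
    classify: Callable[[Image], bool],
    adapt: Optional[Callable[[Image, bool], None]] = None,
) -> float:
    """Ratio of test images correctly denoted as inlier (True) or outlier (False).

    If `adapt` is given (adaptive version), it is called after each
    classification so that the indexing structure can be updated;
    if it is None, the structure stays fixed.
    """
    correct = 0
    total = 0
    for image, is_inlier in samples:
        predicted_inlier = classify(image)
        correct += int(predicted_inlier == is_inlier)
        total += 1
        if adapt is not None:
            adapt(image, predicted_inlier)
    if total == 0:
        raise ValueError("no test images supplied")
    return correct / total


def toy_classifier(value: float) -> bool:
    """Placeholder rule: a scalar 'image' is an inlier if it lies close to zero."""
    return abs(value) < 1.0


# Four labelled toy samples: (image, ground-truth inlier flag).
toy_samples = [(0.2, True), (0.5, True), (3.0, False), (0.9, False)]
toy_ratio = right_prediction_ratio(toy_samples, toy_classifier)
```

In the experiments the test set consists of one hundred inliers and one hundred outliers, and the ratio is computed once for every number of landmarks.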
Diagram (a) of Fig.(3) presents the ratio of right denotations of new images as inliers or outliers in regard to the number of landmarks, while diagram (b) of the same figure shows the time required to classify a new image. The version that uses the adaptive indexing structure outperforms the one that uses the fixed structure, because it exploits the visual information of new images. However, it requires more time to classify a new image, as it needs extra time to adapt the indexing structure.
Diagrams (c) and (d) of Fig.(3) present the projection error when assigning coordinates to new images and the time required to project a new image, in regard to the number of dimensions of the space onto which the images are projected. The parameter n on the x-axis refers to the number of dimensions of the space. In this case, parameter n was set to 100, and the number of landmarks used by the indexing structure was set to the same value. As shown in diagram (c), the projection error decreases steadily as the number of space dimensions increases. In diagram (d), the time required to project a new image onto the multi-dimensional space increases as the number of dimensions grows. This is in line with the time complexity analysis presented in subsection 3.3.
5.2 Representatives Selection Evaluation
To evaluate our representatives selection approach, we relied on an expert's assessment to select, from the set Q that contains the visually similar images, the n images most appropriate for 3D reconstruction, i.e. images that correspond to different views of the object under reconstruction.
The set of visually similar images contained N elements, and n = N/5 of them were selected as the most representative ones, forming the set $Q_r$. Then, we asked our representatives selection algorithm to extract n/5, 2n/5, 3n/5, 4n/5 and n images from the set Q. The sets of extracted images are denoted as $\hat{Q}_i$, where $i \in \{n/5, 2n/5, 3n/5, 4n/5, n\}$. In this framework, reconstruction accuracy is defined as $A = |Q_r \cap \hat{Q}_i| / |Q_r|$, where $|\cdot|$ represents the cardinality of a set. Since $|Q_r| = n$ and $\hat{Q}_i$ contains only i images, it follows from this definition that, for the cases of n/5, 2n/5, 3n/5, 4n/5 and n extracted images, the maximum reconstruction accuracy that can be obtained is 20%, 40%, 60%, 80% and 100% respectively.
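The accuracy measure amounts to a set intersection, as the following minimal sketch shows. The function name and the image identifiers are hypothetical and serve only to illustrate the definition and its upper bound.

```python
from typing import Hashable, Iterable


def reconstruction_accuracy(
    expert_set: Iterable[Hashable], extracted_set: Iterable[Hashable]
) -> float:
    """A = |Q_r ∩ Q̂_i| / |Q_r| for image identifiers."""
    expert = set(expert_set)
    if not expert:
        raise ValueError("the expert-selected set Q_r must not be empty")
    return len(expert & set(extracted_set)) / len(expert)


# Toy example with n = 10 expert-selected images (identifiers 0..9).
expert_ids = range(10)

# 2n/5 = 4 images are extracted, 3 of which were also chosen by the expert.
extracted_ids = [0, 1, 2, 42]
accuracy = reconstruction_accuracy(expert_ids, extracted_ids)

# Even a perfect extraction of 4 images cannot exceed 4/10 = 40%.
upper_bound = reconstruction_accuracy(expert_ids, [0, 1, 2, 3])
```

In the toy example the accuracy is 3/10, below the 40% bound that applies when only 2n/5 images are extracted.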
Furthermore, we compared our representatives selection algorithm with two well-known algorithms: K-Means and spectral clustering using normalized cut and min cut. We asked the K-Means and spectral clustering algorithms to partition the set Q into n/5, 2n/5, 3n/5, 4n/5 and n clusters. Then, from each cluster we selected as representative the image that belongs to Q and is closer to the cluster centroid than the rest of the images of the same cluster.
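The selection of one representative per cluster for the baseline methods can be sketched as follows. The sketch assumes that cluster labels have already been produced by a clustering algorithm (K-Means or spectral clustering) and that the centroid is the mean of the cluster members under the Euclidean distance. The function name and the toy data are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence

Point = Sequence[float]


def centroid_representatives(
    points: Sequence[Point], labels: Sequence[Hashable]
) -> Dict[Hashable, int]:
    """For each cluster, return the index of the member closest to the centroid."""
    clusters: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)

    representatives: Dict[Hashable, int] = {}
    for label, members in clusters.items():
        dim = len(points[members[0]])
        # Centroid as the coordinate-wise mean of the cluster members.
        centroid = [
            sum(points[i][d] for i in members) / len(members) for d in range(dim)
        ]
        representatives[label] = min(
            members, key=lambda i: math.dist(points[i], centroid)
        )
    return representatives


# Toy example: six images in a 2D space, already partitioned into two clusters.
toy_points = [(0, 0), (1, 0), (5, 0), (10, 10), (12, 10), (20, 10)]
toy_labels = [0, 0, 0, 1, 1, 1]
toy_representatives = centroid_representatives(toy_points, toy_labels)
```

Because the representative is always an actual member of the cluster, the selected image belongs to Q even when the centroid itself does not coincide with any image.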
Evaluation results are shown in Fig.(4). As the number of clusters gets larger, the performance of K-Means and spectral clustering increases. This is explained by the fact that, as the number of clusters increases, each one of them contains fewer elements and thus the probability of selecting a true representative increases. However, our approach outperforms both algorithms in all cases. Fig.(5) shows reconstruction results for "Porta Nigra" by selecting n/5, 2n/5, 3n/5, 4n/5 and n images using our representatives selection approach.

RGB-SpectralImaging 2016 - Special Session on RBG and Spectral Imaging for Civil/Survey Engineering, Cultural, Environmental, Industrial Applications

Figure 4: Reconstruction accuracy in regard to the number of selected representatives.

Figure 5: (a) - (e) show reconstruction results for "Porta Nigra" by selecting n/5, 2n/5, 3n/5, 4n/5 and n images using our representatives selection approach. (f) shows the reconstruction when all images selected by an expert were used.
6 CONCLUSIONS
This paper presents an image indexing approach with application to 3D reconstruction, which is capable of indexing new images in a fast and accurate way.
Given a set of images, local descriptors are used to encode the images' visual content, which is then used for estimating a similarity metric between images. This results in the construction of a similarity matrix. Using this similarity matrix, images are represented as points in a multi-dimensional space. Exploiting the images' coordinates, the indexing structure is initialized by eliminating outliers and forming a set of visually similar images. Then, based on the indexing structure, each newly retrieved image can be denoted online as an inlier or an outlier. Furthermore, an accurate algorithm is described for selecting the most appropriate images for 3D reconstruction, i.e. images that depict different views of the same object.
ACKNOWLEDGEMENTS
The research leading to these results has been supported by the Marie Curie IAPP project 4D-CH-World: Four Dimensional Cultural Heritage World, grant agreement number 324523.
REFERENCES
Agarwal, S., Furukawa, Y., Snavely, N., Simon, I., Curless,
B., Seitz, S. M., and Szeliski, R. (2011). Building
rome in a day. Commun. ACM, 54(10):105112.
Arampatzis, A., Zagoris, K., and Chatzichristofis, S. A.
(2011). Dynamic two-stage image retrieval from large
multimodal databases. In Advances in Information
Retrieval, Lecture Notes in Computer Science, pages
326–337. Springer Berlin Heidelberg.
Bay, H., Tuytelaars, T., and Gool, L. V. (2006). SURF:
speeded up robust features. In Computer Vision
ECCV 2006, number 3951 in Lecture Notes in Com-
puter Science, pages 404–417. Springer Berlin Hei-
delberg.
Calonder, M., Lepetit, V., Strecha, C., and Fua, P. (2010).
BRIEF: binary robust independent elementary fea-
tures. In Computer Vision ECCV 2010, number 6314
in Lecture Notes in Computer Science, pages 778–
792. Springer Berlin Heidelberg.
Cayton, L. (2006). Algorithms for manifold learning. Tech-
nical Report CS2008-0923, University of California,
San Diego, Tech.
Chum, O., Philbin, J., Sivic, J., Isard, M., and Zisserman,
A. (2007). Total recall: Automatic query expansion
with a generative feature model for object retrieval.
In IEEE 11th Intern. Conf. on Comp. Vision. ICCV,
pages 1–8.
Cox, M. A. A. and Cox, T. F. (2008). Multidimensional
scaling. In Handbook of Data Vis., Springer Hand-
books Comp.Statistics, pages 315–347. Springer.
Doulamis, N., Yiakoumettis, C., and Miaoulis, G. (2012).
On-line spectral learning in exploring 3d large scale
geo-referred scenes. In Progress in Cultural Heritage
Preservation, pages 109–118. Springer.
Hinton, G. E. and Roweis, S. T. (2002). Stochastic neighbor
embedding. In Becker, S., Thrun, S., and Obermayer,
K., editors, Advances in NIPS 15, pages 833–840.
Janssens, J., Huszar, F., Postma, E., and van den Herik, J.
(2012). Stochastic outlier selection. Technical Report
TiCC TR 2012-001, Tilburg University, Netherlands.
Kalantidis, Y., Tolias, G., Avrithis, Y., Phinikettos, M., Spy-
rou, E., Mylonas, P., and Kollias, S. (2011). VIRaL:
visual image retrieval and localization. Multimedia
Tools and Applications, 51(2):555–592.
Kekre, D. H. B., Sarode, T. K., Thepade, S. D., and Vaishali,
V. (2011). Improved texture feature based image re-
trieval using kekres fast codebook generation algo-
rithm. In Thinkquest, pages 143–149. Springer India.
Lowe, D. G. (2004). Distinctive image features from scale-
invariant keypoints. International Journal of Com-
puter Vision, 60(2):91–110.
Lv, Q., Josephson, W., Wang, Z., Charikar, M., and Li, K.
(2007). Multi-probe LSH: efficient indexing for high-
dimensional similarity search. In P33rd Inter. Conf on
VLDB, VLDB ’07, page 950961, Vienna, Austria.
Murthy, V. S. V. S., Kumar, S., and Rao, P. S. (2010). Con-
tent based image retrieval using hierarchical and k-
means clustering techniques. Intern. Journal of En-
gineering Science and Technology, 2(3).
Papadopoulos, S., Zigkolis, C., Kompatsiaris, Y., and
Vakali, A. (2010). Cluster-based landmark and event
detection on tagged photo collections. IEEE Multime-
dia.
Philbin, J., Chum, O., Isard, M., Sivic, J., and Zisserman, A.
(2007). Object retrieval with large vocabularies and
fast spatial matching. In IEEE Conf. on Comp. Vision
and Pattern Recognition. CVPR, pages 1–8.
Rosin, P. (1999). Measuring corner properties. Computer
Vision and Image Understanding, pages 291–307.
Rosten, E. and Drummond, T. (2006). Machine learning
for high-speed corner detection. In Computer Vision
ECCV 2006, number 3951 in Lecture Notes in Com-
puter Science, pages 430–443. Springer Berlin Hei-
delberg.
Rublee, E., Rabaud, V., Konolige, K., and Bradski, G.
(2011). ORB: an efficient alternative to SIFT or
SURF. In 2011 IEEE Intern. Conf. on Comp. Vision
(ICCV), pages 2564–2571.
Simon, I., Snavely, N., and Seitz, S. M. (2007). Scene sum-
marization for online image collections. In IEEE 11th
Intern. Conf. on Comp. Vision. ICCV, pages 1–8.
Winter, M. E. (1999). N-FINDR: an algorithm for fast au-
tonomous spectral end-member determination in hy-
perspectral data. volume 3753, pages 266–275.
Wu, C., Agarwal, S., Curless, B., and Seitz, S. (2011). Mul-
ticore bundle adjustment. In IEEE Conf. on Comp.
Vision and Pattern Recognition (CVPR), pages 3057–
3064.
Wu, C., Agarwal, S., Curless, B., and Seitz, S. (2012).
Schematic surface reconstruction. In IEEE Conf. on
Comp. Vision and Pattern Recognition (CVPR), pages
1498–1505.
Yiakoumettis, C., Doulamis, N., Miaoulis, G., and Ghaz-
anfarpour, D. (2014). Active learning of users prefer-
ences estimation towards a personalized 3d navigation
of geo-referenced scenes. GeoInformatica, 18(1):27–
62.
Zheng, Y.-T., Zhao, M., Song, Y., Adam, H., Buddemeier,
U., Bissacco, A., Brucher, F., Chua, T.-S., and Neven,
H. (2009). Tour the world: Building a web-scale land-
mark recognition engine. In IEEE Conf. on Comp.
Vision and Pattern Recognition. CVPR, pages 1085–
1092.