Detection of Low-textured Objects
Christopher Bulla and Andreas Weissenburger
Institut für Nachrichtentechnik, RWTH Aachen University, Aachen, Germany
Keywords:
Object Detection, Local Features, SIFT.
Abstract:
In this paper, we present a descriptor architecture, SIFText, that combines texture, shape and color information
in one descriptor. The respective descriptor parts are weighted according to the underlying image content; thus,
we are able to detect and locate low-textured objects in images without performance losses for textured objects.
We furthermore present a matching strategy, beyond the frequently used nearest neighbor matching, that has
been especially designed for the proposed descriptor. Experiments on synthetically generated images show the
improvement of our descriptor in comparison to the standard SIFT descriptor. We show that we are able to
detect more features in non-textured regions, which facilitates an accurate detection of non-textured objects.
We further show that the performance of our descriptor is comparable to the performance of the SIFT descriptor
for textured objects.
1 INTRODUCTION
A problem often occurring in computer vision tasks is
the detection or recognition of identical objects within
different images. In recent years, invariant local fea-
tures (Lowe, D.G., 2004; Bay et al., 2006) have been
introduced to solve this problem. A local feature is an
n-dimensional vector that contains a unique descrip-
tion of an image patch. An image typically contains
several local features and the entirety of all local im-
age features can be regarded as the description of the
image. If two or more images contain identical objects,
a subset of their features will be identical or similar.
This can be used to detect common objects across
different images (Bulla and Hosten, 2013).
Many local feature approaches (e.g. SIFT, SURF)
are designed to detect meaningful image patches and
describe their gradient distribution. Therefore, they
are well suited to describe textured objects or im-
ages (cf. figure 1(a)). When they are used to describe
low-textured objects (cf. figure 1(b)) two problems oc-
cur: Firstly, low-textured objects only have high gradi-
ents at their border. Therefore most of the local image
patches are detected at the border and not on the object
itself. Secondly, when the patches at the border are de-
scribed, a large amount of the information covered by
the feature characterizes the background. Since low-
textured objects usually have a unique gradient-less
surface, the part of the feature that covers the object
doesn’t contribute much to the description.
In the literature several approaches have been in-
troduced to describe low-textured objects. The Bunch
Of Lines Descriptor (Tombari et al., 2013), for exam-
ple, approximates the contour of an object by polygons
and describes these polygons. The Boundary Struc-
ture Segmentation approach (Toshev et al., 2012) uses
the chordiogram, a multidimensional histogram, to
describe an image segmented into superpixels, and the
Distinct Multi-Colored Region Descriptors approach
(Naik and Murthy, 2007) describes the objects by re-
gions with different colors. All approaches are indeed
able to describe low-textured objects, but have in com-
mon that they lower the ability to describe textured
objects.
In this paper we present a combination of three de-
scriptors in order to describe low-textured, as well as
textured objects. We thereby combine texture, shape
and color information into a single descriptor and
weight each descriptor component based on the im-
age content. In textured regions e.g. we give a higher
weight to the texture descriptor, while in non-textured
regions the shape descriptor gets a higher weight. We
furthermore use a matching strategy, apart from the well-
known nearest neighbor matching, which allows an
accurate object localization.
The rest of the paper is organized as follows. In
chapter 2 we present our descriptor architecture. The
matching strategy will be presented in chapter 3. These
are evaluated in chapter 4. Finally, conclusions will be
drawn in chapter 5.
2 DESCRIPTOR ARCHITECTURE
Our descriptor, SIFText, has been designed to increase
the performance of SIFT in low-textured regions,
while maintaining its performance in textured regions.
We therefore combine texture, shape and color infor-
mation into a single descriptor and weight each de-
scriptor part by a locally calculated value, derived
from the texture of the image. As texture descriptor
we use SIFT, the shape will be described with ART
coefficients and the color descriptor is a combination
of Hue- and Opponent-Angle histograms. The indi-
vidual descriptor parts, as well as their weighting, will be
described in the following subsections.
2.1 SIFT
The Scale Invariant Feature Transform (SIFT) was
published in 2004 by Lowe (Lowe, D.G., 2004) and
describes the content of an image patch by a histogram
of gradient orientations.
The selection of feature points is done in a scale
invariant way. In a first step the image I(x, y) is con-
volved with variable-scale Gaussians G(x, y, σ), and
the Difference-of-Gaussians (DoG) D(x, y, σ) is calcu-
lated:

D(x, y, σ) = I(x, y) ∗ [G(x, y, kσ) − G(x, y, σ)]
           = L(x, y, kσ) − L(x, y, σ) .     (1)
Possible feature positions are given by the extrema
of these DoG images across all scales and image reso-
lutions. A refinement step selects the most distinctive
feature points and interpolates their exact position.
To make the descriptor invariant to object orienta-
tion, one or more dominant orientations are calculated
for each feature point. The dominant orientations are
given by the peaks of a weighted orientation histogram,
which is formed from the gradient orientations Θ(x, y)
of sample points within a region around the point:

Θ(x, y) = tan⁻¹( (L(x, y+1) − L(x, y−1)) / (L(x+1, y) − L(x−1, y)) ) .     (2)

Figure 1: Gradients of (a) a textured image and (b) a low-textured image.
The weighting factors are given by the gradient magni-
tude m(x, y):

m(x, y) = √( (L(x+1, y) − L(x−1, y))² + (L(x, y+1) − L(x, y−1))² ) .     (3)
To build the descriptor, the region around the feature is
divided into several sub-regions and, again, a weighted
histogram of gradient orientations is calculated for
each sub-region. The descriptor is given by the con-
catenation of all sub-histograms.
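To make the construction of equation (1) concrete, the following Python sketch (an illustration, not the authors' implementation; the base scale σ_0 = 1.6, the factor k = 2^(1/3), the number of scales and the contrast threshold are assumptions) builds a small stack of difference-of-Gaussian images with SciPy and keeps the local extrema as candidate feature positions:

import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def dog_candidates(image, sigma0=1.6, k=2 ** (1.0 / 3.0), num_scales=5):
    """Return (x, y, scale index) of DoG extrema as candidate feature points."""
    image = image.astype(np.float32)
    # L(x, y, sigma) for a range of scales, then D = L(k*sigma) - L(sigma) (eq. 1).
    sigmas = [sigma0 * k ** i for i in range(num_scales)]
    L = np.stack([gaussian_filter(image, s) for s in sigmas])
    D = L[1:] - L[:-1]
    # A point is a candidate if it is an extremum within its 3x3x3 neighbourhood.
    is_max = (D == maximum_filter(D, size=3))
    is_min = (D == minimum_filter(D, size=3))
    contrast = np.abs(D) > 0.01 * np.abs(D).max()   # crude contrast threshold
    s, y, x = np.nonzero((is_max | is_min) & contrast)
    return list(zip(x, y, s))

if __name__ == "__main__":
    img = np.random.rand(64, 64)                    # stand-in for a grey-level image
    print(len(dog_candidates(img)), "candidate points")

In the full SIFT pipeline these candidates would additionally be filtered by their edge response and refined to sub-pixel accuracy, as described above.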
2.2 Shape
Low-textured objects can be well described by their
shape. We use Maximally Stable Extremal Regions
(MSER) (Matas et al., 2004) to detect significant
shapes in an image and describe them with a set of
Angular Radial Transform (ART) (Bober, 2001) coef-
ficients.
Maximally Stable Extremal Regions are connected
regions that are characterized by an approximately
uniform intensity surrounded by a contrasting back-
ground. Furthermore, they are stable across a range
of thresholds on the intensity function. In order to
avoid segmentation errors on the edges of the MSER
regions (e.g. due to blurring artifacts in case of large
scale changes), we detect the MSER regions, similar
to SIFT, on multiple octaves (cf. (Forssen and Lowe,
D.G., 2007)). A morphological closing filter is applied
to the detected region, subsequent to the detection, to
minimize noise within the region.
Figure 3 shows the detected MSER regions for one
image and visualizes the difference between a MSER
region and a feature that has been detected on the
textured area of the image.
To calculate a descriptor from the maximally stable
regions we use ART, which describes the shape of a
region with the coefficients F_{n,m}:

F_{n,m} = ∫₀^{2π} ∫₀¹ V_{n,m}(ρ, Θ) I(ρ, Θ) ρ dρ dΘ .     (4)
In equation 4, I(ρ, Θ) is an image in polar coordinates
(e.g. a MSER region) and V_{n,m} an ART basis function:

V_{n,m}(ρ, Θ) = (1 / 2π) e^{jmΘ} R_n(ρ)     (5)

R_n(ρ) = { 1,            if n = 0
           2 cos(πnρ),   if n ≠ 0 .     (6)
VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications
266
Figure 2: ART basis functions (real part), for n = 0, 1, 2 and m = 0, …, 12.
Figure 2 exemplarily depicts the real parts of the first
36 ART basis functions.
The coefficients are scale and translation invariant.
A rotation causes a shift in the coefficients' phase. To
achieve rotation invariance we only use the coefficients'
real part.
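A minimal NumPy sketch of equations (4)-(6): the integral is evaluated as a discrete sum over the pixels of a (binary) region mapped onto the unit disc, and only the real parts are kept for rotation invariance. The grid mapping and the choice of n < 3, m < 12 (36 coefficients, as in figure 2) are assumptions:

import numpy as np

def art_descriptor(region, n_max=3, m_max=12):
    """Real parts of the ART coefficients F_{n,m} of a binary region (eq. 4-6)."""
    h, w = region.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Map pixel coordinates to the unit disc (rho in [0, 1], theta in [-pi, pi)).
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    rho = np.hypot((ys - cy) / (h / 2.0), (xs - cx) / (w / 2.0))
    theta = np.arctan2(ys - cy, xs - cx)
    inside = rho <= 1.0
    coeffs = []
    for n in range(n_max):
        # Radial part R_n(rho): 1 for n = 0, 2*cos(pi*n*rho) otherwise (eq. 6).
        radial = np.ones_like(rho) if n == 0 else 2.0 * np.cos(np.pi * n * rho)
        for m in range(m_max):
            basis = (1.0 / (2.0 * np.pi)) * np.exp(1j * m * theta) * radial  # eq. 5
            # Discrete version of eq. 4: integrate V_{n,m} * I over the unit disc.
            f_nm = np.sum(basis[inside] * region[inside] * rho[inside])
            coeffs.append(f_nm.real)
    return np.asarray(coeffs)

if __name__ == "__main__":
    disc = (np.hypot(*np.mgrid[-16:16, -16:16]) < 10).astype(float)  # toy MSER region
    print(art_descriptor(disc).shape)   # 36 coefficients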
2.3 Color
The combination of SIFT and color has been exten-
sively studied in the literature (Van De Weijer and
Schmid, C., 2006). We use a combination of Hue-
and Opponent-Angle histograms in our descriptor ar-
chitecture. The Hue descriptor is suited for the descrip-
tion of highly saturated images, while the Opponent-
Angle descriptor is better suited for the description of
weakly saturated images.
Hue is the angle between the differences of the RGB
color channels and can be computed as:

Hue = atan( √3 (G − B) / (2R − G − B) ) .     (7)
The Opponent-Angles are computed from the partial
derivatives in x- and y-direction of the Opponent-Colors:

O_1 = (1/√2) (R − G) ,   O_2 = (1/√6) (R + G − 2B) .     (8)
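Equations (7) and (8) translate directly into a few array operations; the arctan2 form of the hue angle and the normalisation constants 1/√2 and 1/√6 are the usual opponent-colour conventions and are assumed here:

import numpy as np

def hue_and_opponent(rgb):
    """Per-pixel hue angle (eq. 7) and opponent colours O1, O2 (eq. 8)."""
    R, G, B = [rgb[..., i].astype(np.float32) for i in range(3)]
    # Hue: angle between the colour-channel differences (eq. 7).
    hue = np.arctan2(np.sqrt(3.0) * (G - B), 2.0 * R - G - B)
    # Opponent colours (eq. 8); their spatial derivatives feed the
    # Opponent-Angle histogram.
    O1 = (R - G) / np.sqrt(2.0)
    O2 = (R + G - 2.0 * B) / np.sqrt(6.0)
    return hue, O1, O2

if __name__ == "__main__":
    rgb = np.random.rand(8, 8, 3)
    hue, O1, O2 = hue_and_opponent(rgb)
    print(hue.shape, O1.mean(), O2.mean())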
2.4 Descriptor Weighting and Fusion
In order to combine the individual descriptors, it is nec-
essary to normalize them. Therefore, each descriptor
will be normalized to unit length and the final descrip-
tor is a weighted combination of the three normalized
descriptors. The weighting factors steer the influence
of each individual descriptor to the total descriptor.
We choose the weighting factor for each feature point
adaptively, based on the image content. In the left part
of figure 3 a feature on a textured region is depicted.
Here, we have distinctive gradient information, but no
maximal stable region. Therefore the SIFT descriptor
for this feature point should have more influence on
the total descriptor than the shape descriptor. The op-
posite holds for the feature point in the right part of
figure 3.
Figure 3: Difference between MSER regions and gradient-based regions (rows: image, MSER regions, gradient magnitude). Left: keypoint in textured area with almost no shape information. Right: MSER region with almost no gradient information.
2.4.1 Weighting Factors for SIFT and SHAPE
The weighting factors for textured and non-textured
regions will be defined through the Gray Level Co-
occurrence Matrix (GLCM). The GLCM C^δ_{i,j} is the
distribution of co-occurring intensity values (i, j) at a
given offset δ = (δ_x, δ_y) in an image of size (I_m, I_n):

C^δ_{i,j} = Σ_{p=1}^{I_n} Σ_{q=1}^{I_m} { 1, if I(p, q) = i ∧ I(p + δ_x, q + δ_y) = j
                                          0, else .     (9)
The weighting factor α can, with the number of gray
levels N_G, be computed as:

α = Σ_{i=0}^{N_G} (C^{1,1}_{i,i})² / ( Σ_{i=0}^{N_G} Σ_{j=0}^{N_G} (C^{1,1}_{i,j})² ) .     (10)
For a homogeneous region, α will be equal to one,
while for a checkerboard pattern α will converge to
zero. The behaviour of α for some synthetically
generated texture patterns is depicted in figure 4. To
precisely steer the impact of α on the individual de-
scriptor parts, we map α to a sigmoid. As a result of
this mapping we get the weighting factor a:

a = 1 / (1 + e^{−bα + σ_0}) .     (11)

The parameter b in equation 11 can be used to create
either linear or threshold-like behaviour of the sigmoid,
DetectionofLow-texturedObjects
267
while σ_0 can be used to prefer one descriptor part.

Figure 4: GLCM example.
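The following sketch illustrates equations (9)-(11) for a grey-level patch: it accumulates the co-occurrence matrix for the offset δ = (1, 1), forms α from the energy of its diagonal, and maps α through the sigmoid. The number of grey levels, the default values of b and σ_0, and the sign convention in the exponent are assumptions:

import numpy as np

def texture_weight(patch, num_levels=16, b=10.0, sigma0=5.0):
    """Weighting factor a (eq. 11) from the GLCM-based measure alpha (eq. 9, 10)."""
    # Quantise the patch to num_levels grey levels.
    q = np.floor(patch.astype(np.float32) / patch.max() * (num_levels - 1)).astype(int)
    # GLCM for offset delta = (1, 1): count co-occurring grey levels (eq. 9).
    glcm = np.zeros((num_levels, num_levels))
    np.add.at(glcm, (q[:-1, :-1].ravel(), q[1:, 1:].ravel()), 1)
    # alpha: energy of the diagonal relative to the total energy (eq. 10).
    alpha = np.sum(np.diag(glcm) ** 2) / np.sum(glcm ** 2)
    # Sigmoid mapping (eq. 11); b controls the steepness, sigma0 the preference.
    a = 1.0 / (1.0 + np.exp(-b * alpha + sigma0))
    return alpha, a

if __name__ == "__main__":
    flat = np.full((32, 32), 0.5)         # homogeneous patch
    noisy = np.random.rand(32, 32)        # textured patch
    print(texture_weight(flat))    # alpha near 1 -> a close to 1 (shape weighted)
    print(texture_weight(noisy))   # alpha near 0 -> a close to 0 (SIFT weighted)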
The normalized shape and SIFT descriptors f′_SHAPE
and f′_SIFT are thus given by:

f′_SHAPE = a · f_SHAPE / Σ_{j=0}^{N_SHAPE} |f_{SHAPE,j}|     (12)

f′_SIFT = (1 − a) · f_SIFT / Σ_{j=0}^{N_SIFT} |f_{SIFT,j}| ,     (13)

with N_SHAPE, N_SIFT denoting the number of elements
in the shape and texture descriptor.
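A short sketch of the weighting in equations (12) and (13): each part is normalised to unit L1 length and multiplied by a or 1 − a; the descriptor lengths used here are only examples:

import numpy as np

def weight_descriptor_parts(f_sift, f_shape, a):
    """Normalise each part to unit L1 length and weight it (eq. 12, 13)."""
    f_shape_n = a * f_shape / np.sum(np.abs(f_shape))
    f_sift_n = (1.0 - a) * f_sift / np.sum(np.abs(f_sift))
    return f_sift_n, f_shape_n

if __name__ == "__main__":
    f_sift = np.random.rand(128)    # SIFT part (128 bins)
    f_shape = np.random.rand(36)    # ART part (36 real coefficients)
    s, sh = weight_descriptor_parts(f_sift, f_shape, a=0.8)
    print(np.sum(np.abs(s)), np.sum(np.abs(sh)))   # 0.2 and 0.8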
2.4.2 Weighting Factor for COLOR Descriptor
The color descriptor itself has, due to its construc-
tion, an internal weighting, which reacts adaptively to
the saturation of the image region. Nevertheless, for
features where the shape as well as the texture descrip-
tor cover significant information (e.g. for a = 0.5),
we lower the influence of the color descriptor with a
weighting factor c, which is a function of a:

c(a) = 1 / (1 + e^{−|bα + σ_0|}) .     (14)
The run of c(a) for different values of b is depicted in
figure 5.

Figure 5: Color descriptor weighting factor c(a), for different values of b.
2.4.3 Descriptor Fusion
The combination of different types of descriptors could
be done either before (early fusion) or after (late fu-
sion) the matching stage. Depending on the fusion
method, the descriptor weighting has to be done before
or after the matching. Figure 6 depicts schematically
the early and late fusion. Since the quantisation noise
in the early fusion is in general higher than in the late
fusion, we decided to use the latter.

Figure 6: Descriptor fusion: (a) early fusion, (b) late fusion.

The distance d
d
between two feature descriptors
f
q
and
f
t
is thus given
as:
d =
q
a
N
SIFT
i=0
(f
q,SIFT
i
f
t,SIFT
i
)
2
+
q
(1 a)
N
SHAPE
j=0
(f
q,SHAPE
j
f
t,SHAPE
j
)
2
+
q
c
N
COLOR
k=0
(f
q,COLOR
k
f
t,COLOR
k
)
2
, (15)
with
N
SIFT
, N
SHAPE
and
N
COLOR
denoting the length
of the descriptors.
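Equation (15) is a direct sum of three weighted Euclidean terms; a sketch, assuming the three normalised partial descriptors are stored separately per feature:

import numpy as np

def fused_distance(fq, ft, a, c):
    """Weighted late-fusion distance between two features (eq. 15).

    fq, ft: dicts with the keys 'sift', 'shape' and 'color' holding the
    normalised partial descriptors of the query and train feature.
    """
    d_sift = np.sqrt(a * np.sum((fq["sift"] - ft["sift"]) ** 2))
    d_shape = np.sqrt((1.0 - a) * np.sum((fq["shape"] - ft["shape"]) ** 2))
    d_color = np.sqrt(c * np.sum((fq["color"] - ft["color"]) ** 2))
    return d_sift + d_shape + d_color

if __name__ == "__main__":
    def random_feature():
        return {"sift": np.random.rand(128),
                "shape": np.random.rand(36),
                "color": np.random.rand(36)}
    print(fused_distance(random_feature(), random_feature(), a=0.3, c=0.6))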
3 DESCRIPTOR MATCHING
Usually, feature based object detection algorithms per-
form a nearest neighbor matching followed by an out-
lier rejection algorithm (e.g. RANdom SAmple Con-
sensus (RANSAC) (Fischler and Bolles, 1981)) to de-
termine corresponding features. The outlier rejection
assumes that all correct correspondences conform to
an affine or perspective transformation
T
. While this
assumption is useful for images that don't have repet-
itive structures, it is not practical for images with low-
textured objects, which often have repetitive or similar
structures (cf. figure 3). These similar structures could
lead to correspondences that will be removed in the
outlier rejection stage, because they don’t conform to
the calculated transformation even if they are correct.
Similar structures could furthermore lead to ambigui-
ties in the matching process.¹

¹ For example, the left ear of the toy in figure 3 will have
approximately the same descriptor as the right ear. If the
right ear is matched to its left counterpart, the matching is
on the one hand correct, because an ear has been matched to
another ear; on the other hand, it is incorrect, because it is
the wrong ear.
VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications
268
In the following we present a matching algorithm
that combines n-nearest neighbor matching, transfor-
mation clustering and classification to distinguish be-
tween features on the object and features on the back-
ground in order to calculate an optimal affine trans-
formation matrix that relates the features of the two
images.
Let F^t and F^q define two sets of features (train
and query) with:

f^t_j ∈ F^t ,  1 ≤ j ≤ T = ||F^t||   and
f^q_i ∈ F^q ,  1 ≤ i ≤ Q = ||F^q|| .
Let d(f^t_j, f^q_i) further denote the distance between
f^t_j and f^q_i. The set M_{f^q_i} of nearest neighbors for
f^q_i is given by:

M_{f^q_i} = { f^t_{j_1}, …, f^t_{j_K} } ,  with

j_1 = argmin{ d(f^t_1, f^q_i), …, d(f^t_T, f^q_i) }
j_2 = argmin{ d(f^t_1, f^q_i), …, d(f^t_T, f^q_i) } \ { d(f^t_{j_1}, f^q_i) }
…
j_K = argmin{ d(f^t_1, f^q_i), …, d(f^t_T, f^q_i) } \ { d(f^t_{j_1}, f^q_i), …, d(f^t_{j_{K−1}}, f^q_i) }     (16)
and the set of matches M^q for the entire query image
by:

M^q = { M_{f^q_1}, …, M_{f^q_Q} } .     (17)
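A brute-force sketch of equations (16) and (17): for every query descriptor the indices of the K train descriptors with the smallest distance are kept. Plain Euclidean vectors are used here for brevity instead of the fused distance of equation (15):

import numpy as np

def knn_match_sets(query, train, K=4):
    """For each query descriptor return the indices of its K nearest
    train descriptors (eq. 16); the list over all queries is M^q (eq. 17)."""
    matches = []
    for fq in query:
        dists = np.linalg.norm(train - fq, axis=1)   # d(f^t_j, f^q_i) for all j
        matches.append(np.argsort(dists)[:K])        # j_1, ..., j_K
    return matches

if __name__ == "__main__":
    query = np.random.rand(10, 128)
    train = np.random.rand(50, 128)
    M_q = knn_match_sets(query, train, K=4)
    print(len(M_q), M_q[0])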
We use a Delaunay triangulation on the positions p^q_i of
the query features F^q to determine adjacent features.
Each triangle connects three query features and is uti-
lized to calculate an affine transformation T that maps
the query features into the coordinate system of their
corresponding features:

T = ( a  b  t_x
      c  d  t_y
      0  0  1 ) .     (18)
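The sketch below illustrates the triangulation step and the estimation of one affine matrix T (eq. 18) from a triangle of query positions and a corresponding triple of train positions; the least-squares fit and the synthetic ground-truth transform are assumptions, any exact three-point solver would do:

import numpy as np
from scipy.spatial import Delaunay

def affine_from_triangle(p_query, p_train):
    """Fit T (eq. 18) so that T @ [x_q, y_q, 1]^T ~= [x_t, y_t, 1]^T."""
    A = np.hstack([p_query, np.ones((3, 1))])          # 3x3: [x y 1] per point
    # Solve A @ X = p_train for the first two rows of T.
    X, *_ = np.linalg.lstsq(A, p_train, rcond=None)
    T = np.vstack([X.T, [0.0, 0.0, 1.0]])
    return T

if __name__ == "__main__":
    pts_q = np.random.rand(20, 2) * 100                # query keypoint positions
    tri = Delaunay(pts_q)                              # adjacency of query features
    # Pretend each query point has one matched train position (a known affine).
    T_true = np.array([[1.1, 0.1, 5.0], [-0.1, 0.9, -3.0], [0.0, 0.0, 1.0]])
    pts_t = (np.hstack([pts_q, np.ones((20, 1))]) @ T_true.T)[:, :2]
    first = tri.simplices[0]                           # indices of one triangle
    print(affine_from_triangle(pts_q[first], pts_t[first]).round(3))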
Since a query feature is matched to K test features,
we get 2K transformations for each triangle. The
transformations will be used to derive a feature vector
w = (s, φ, u, t, v_t) (Ljubisa and Sladjana, S., 2006), with
the parameters:

s = sgn(a) √(a² + c²)          – scaling
φ = arctan(c / a)              – rotation
u = (ab + cd) / (a² + c²)      – shear
t = (ad − bc) / (a² + c²)      – compression
v_t = (t_x, t_y)^T             – translation
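The feature vector w follows from T by the formulas above; a literal transcription (arctan2 is used instead of arctan for numerical robustness, which is an assumption about the intended quadrant handling):

import numpy as np

def transform_parameters(T):
    """Extract (s, phi, u, t, tx, ty) from an affine matrix T (eq. 18)."""
    a, b, tx = T[0]
    c, d, ty = T[1]
    s = np.sign(a) * np.sqrt(a * a + c * c)        # scaling
    phi = np.arctan2(c, a)                         # rotation
    u = (a * b + c * d) / (a * a + c * c)          # shear
    t = (a * d - b * c) / (a * a + c * c)          # compression
    return np.array([s, phi, u, t, tx, ty])

if __name__ == "__main__":
    angle, scale = np.deg2rad(20.0), 1.5
    T = np.array([[scale * np.cos(angle), -scale * np.sin(angle), 4.0],
                  [scale * np.sin(angle),  scale * np.cos(angle), -2.0],
                  [0.0, 0.0, 1.0]])
    # Expected: s = 1.5, phi ~ 0.349 rad (20 deg), u = 0, t = 1, translation (4, -2).
    print(transform_parameters(T).round(3))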
Figure 7: Assignment of each feature to a cluster. Keypoints are associated to cluster 1, cluster 2, or both; triangles are associated to cluster 1 or cluster 2.
Since the K-nearest neighbor matching introduces
many false correspondences, it is reasonable to remove
the correspondences that are obviously wrong. This
can be done, e.g., by evaluating the matching distance
or by utilizing proximity relationships.
To determine those features that conform to a sim-
ilar transformation, we normalize each component in
w and use a spectral clustering approach to divide the
feature space into two clusters C_1 and C_2.
Afterwards, we use a backprojection algorithm and
assign each feature to at least one of the clusters (since
a feature can belong to more than one triangle, it may
be assigned to more than one cluster; cf. figure 7). Let
M_i be the set of features, and T_i the set of transfor-
mation matrices, that belong to C_i. The average back-
projection error e_i for cluster i over all transformation
matrices T_{i,j} ∈ T_i is then defined as:

e_i = (1 / (||T_i|| · ||M_i||)) Σ_{j ∈ T_i} Σ_{p_k ∈ M_i} (p^q_k − T_{i,j} p^t_k) ,     (19)
with p^t_k being the point corresponding to p^q_k accord-
ing to equation 16. The cluster c_obj representing the
object is then defined as the cluster that minimizes the
backprojection error:

c_obj = argmin{ e_1, e_2 } .     (20)

Within the cluster that represents the object, the opti-
mal transformation matrix is then defined as:

T_opt = argmin_{j ∈ T_obj} (1 / ||M_obj||) Σ_{p_k ∈ M_obj} (p^q_k − T_{obj,j} p^t_k) .     (21)
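A compact sketch of equations (19)-(21): each cluster's transformations are scored by their average backprojection error over the cluster's features, the cluster with the smaller error is taken as the object, and within it the transformation with the smallest mean error is returned. Reducing the residual vector of equation (19) to a scalar via its Euclidean norm is an assumption:

import numpy as np

def backprojection_error(transforms, pts_q, pts_t):
    """Mean residual of each transformation over the matched points (eq. 19/21)."""
    pq = np.asarray(pts_q, dtype=float)
    pt_h = np.hstack([pts_t, np.ones((len(pts_t), 1))])          # homogeneous p^t_k
    errors = []
    for T in transforms:
        residual = pq - (pt_h @ T.T)[:, :2]                      # p^q_k - T p^t_k
        errors.append(np.mean(np.linalg.norm(residual, axis=1)))
    return np.asarray(errors)

def select_object_transform(clusters):
    """clusters: list of dicts {'T': [3x3 arrays], 'pq': Nx2, 'pt': Nx2}."""
    per_cluster = [backprojection_error(c["T"], c["pq"], c["pt"]) for c in clusters]
    e = [err.mean() for err in per_cluster]                      # e_i (eq. 19)
    obj = int(np.argmin(e))                                      # c_obj (eq. 20)
    T_opt = clusters[obj]["T"][int(np.argmin(per_cluster[obj]))] # eq. 21
    return obj, T_opt

if __name__ == "__main__":
    T = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, -1.0], [0.0, 0.0, 1.0]])
    pt = np.random.rand(8, 2) * 50
    pq = (np.hstack([pt, np.ones((8, 1))]) @ T.T)[:, :2]
    good = {"T": [T], "pq": pq, "pt": pt}
    bad = {"T": [np.eye(3)], "pq": pq, "pt": pt}
    print(select_object_transform([bad, good]))     # picks cluster index 1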
4 EVALUATION
To evaluate the performance of the architecture, a
benchmark has been implemented, which automati-
cally generates synthetic test images and ground-truth
data. Using these, the performance of the descriptor
by itself, as well as the performance of the descrip-
tor used in conjunction with the matching algorithm
DetectionofLow-texturedObjects
269
are evaluated on a textured and a non-textured data set.
Compared to a test that uses real images, this ap-
proach is able to generate a statistically significant
number of samples.

Figure 8: Illustration of the test image generation process.
4.1 Benchmark
The benchmark uses the Amsterdam Library of Object
Images (ALOI) (Geusebroek et al., 2005). The ALOI
contains images and corresponding masks of a multi-
tude of objects, which were obtained by rotating each
object in the z-plane by an angle γ and taking pictures
in 5° increments. Hereinafter, the process of generating
the test images shall be described briefly:
Two angles γ_1 and γ_2 are randomly chosen, based
on a threshold γ_th with:

|γ_1 − γ_2| < γ_th .     (22)
By utilizing the images I_1(γ_1) and I_2(γ_2), a change in
perspective is simulated accurately. Afterwards, for
each image I_i ∈ {I_1, I_2}, an angle β_i, which represents
a rotation in the image plane, a translation vector t_i,
and a scaling factor s_i are randomly chosen. Further-
more, a rotation matrix R_i(β_i) and an affine transfor-
mation matrix T_i(β_i, s_i, t_i) are calculated.
I_i is rotated using R_i, scaled by s_i and inserted at the
location t_i in an output image Î_i. This process is re-
peated for the object masks, yielding the transformed
masks I^M_1 and I^M_2. Finally, a randomly chosen back-
ground is added to Î_i. This process is illustrated in
figure 8.
While it is not possible to predict the exact posi-
tion of the features, since the projection of the features
into the image plane is based on the geometry of each
object, which is rotated by γ, the approximated trans-
formation matrices T_{1,2} and T_{2,1} are available. These
describe the transformation of the object in the image
plane:

T_{1,2} = T_1^{−1} T_2     (23)
T_{2,1} = T_2^{−1} T_1 .     (24)
This approximation holds if γ_th is fairly small. Since a
large value of γ_th would lead to occlusions, which can-
not be matched correctly by any approach, this restric-
tion is acceptable.
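A small sketch of the ground-truth composition in equations (23) and (24), assuming T_i is a similarity transform built from the random in-plane rotation β_i, scaling s_i and translation t_i:

import numpy as np

def similarity_matrix(beta, s, t):
    """Affine matrix T_i(beta_i, s_i, t_i): in-plane rotation, scaling, translation."""
    c, si = np.cos(beta), np.sin(beta)
    return np.array([[s * c, -s * si, t[0]],
                     [s * si,  s * c, t[1]],
                     [0.0, 0.0, 1.0]])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T1 = similarity_matrix(rng.uniform(0, 2 * np.pi), rng.uniform(0.5, 1.5),
                           rng.uniform(-50, 50, 2))
    T2 = similarity_matrix(rng.uniform(0, 2 * np.pi), rng.uniform(0.5, 1.5),
                           rng.uniform(-50, 50, 2))
    T_12 = np.linalg.inv(T1) @ T2      # eq. 23
    T_21 = np.linalg.inv(T2) @ T1      # eq. 24
    print(np.allclose(T_12 @ T_21, np.eye(3)))   # the two ground truths are inverse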
4.2 Descriptor Evaluation
To test solely the performance of the descriptors, fea-
tures are extracted for the images Î_1 and Î_2. The
resulting sets of features F_1 and F_2 are matched
using a brute-force matcher, which employs the L_2
metric, yielding a set of matches M_{1,2}. This process
is repeated by switching the query and train feature
sets, providing a second set of matches M_{2,1}. These
matches are of the form m_k = (m, n), m_k ∈ M_{i,j},
where m and n describe the indices of the matched
features f_m ∈ F_i and f_n ∈ F_j, respectively.
Based on these results, two metrics are calculated.
n^obj_i denotes the number of features that describe the
object and were extracted in Î_i. For each feature f_n ∈ F_i,
which is located at p_n, the value of the binary mask
I^M_i at p_n is evaluated, yielding b_n:

b_n = { 1, if I^M_i(p_n) = 1
        0, else .     (25)

Therefore:

n^obj_i = Σ_{n=1}^{||F_i||} b_n .     (26)
n^valid_{i,j} represents the number of correct object corre-
spondences. For each match m_k = (m, n) ∈ M_{i,j}, c_k is
calculated:

c_k = { 1, if I^M_i(p_m) = 1 and I^M_j(p_n) = 1
        0, else .     (27)

Therefore:

n^valid_{i,j} = Σ_{k=1}^{||M_{i,j}||} c_k .     (28)
By conducting this experiment, one is able to verify
two important aims of this project. The first one is
to increase the number of features detected in a non-
textured object, which is represented by n^obj_i. The
second aim is the ability to distinguish these features
from those which belong to the background. This is
expressed by the quotient of correct correspondences
and all features on the object:

c_{i,j} = (n^valid_{i,j} + n^valid_{j,i}) / (n^obj_i + n^obj_j) .     (29)
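The metrics of equations (25)-(29) reduce to a few mask look-ups; a sketch with Boolean masks, (x, y) keypoint positions and rounding to the nearest pixel (the rounding is an assumption):

import numpy as np

def on_object(mask, positions):
    """b_n per feature: 1 if the binary mask is set at its position (eq. 25)."""
    xs, ys = np.round(positions).astype(int).T
    return mask[ys, xs]

def coverage(mask_i, mask_j, pos_i, pos_j, matches_ij, matches_ji):
    """c_{i,j}: valid correspondences relative to all object features (eq. 26-29)."""
    n_obj = on_object(mask_i, pos_i).sum() + on_object(mask_j, pos_j).sum()
    def n_valid(mask_a, mask_b, pos_a, pos_b, matches):
        # c_k = 1 iff both matched features lie on the object (eq. 27, 28).
        m, n = np.asarray(matches).T
        return np.sum(on_object(mask_a, pos_a[m]) & on_object(mask_b, pos_b[n]))
    return (n_valid(mask_i, mask_j, pos_i, pos_j, matches_ij)
            + n_valid(mask_j, mask_i, pos_j, pos_i, matches_ji)) / n_obj

if __name__ == "__main__":
    mask = np.zeros((64, 64), dtype=bool); mask[16:48, 16:48] = True
    pos = np.random.rand(30, 2) * 63
    matches = [(k, k) for k in range(30)]          # identity matching as a toy case
    print(coverage(mask, mask, pos, pos, matches, matches))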
The evaluation described in this section was per-
formed on a number of textured and non-textured ob-
jects, utilizing the SIFT descriptor, the combination of
VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications
270
MSER and ART as a descriptor (form), and the archi-
tecture which is proposed in this paper. The sums of
n^obj_i and n^valid_{i,j} and the mean of c_{i,j}, which were ob-
tained during this experiment, are shown in table 1.

Table 1: Sum of n^obj_i, n^valid_{i,j} and mean of c_{i,j}, c̄, for the
textured and non-textured object set.

                      textured                 non-textured
                SIFT    form   SIFText    SIFT    form   SIFText
n^obj           5093    6312   12054      3720     248    5992
n^valid         3973    4622   10044      2450    1943    4844
c̄ [%]             78      73      83        66      78      81
The number of features detected by SIFText is
significantly higher than the number of features ex-
tracted by each partial descriptor. Furthermore, the
number of correct correspondences, obtained by em-
ploying SIFText, exceeds the sum of those generated
using the other descriptors. This increase is based on
the weighted combination of the descriptor parts and
would be lost if a simple thresholding operation was
used instead. Moreover, the values of c calculated on
the textured and non-textured data set, while using
SIFText, are almost identical. This implies that the
descriptor’s performance is similar for both datasets;
hence it can be utilized effectively for the description
of textured and non-textured objects.
4.3 Matching Algorithm Evaluation
Considering the aforementioned ambiguity (cf. sec-
tion 3), a second test was proposed, which is able to
determine the accuracy of the calculation of an ob-
ject's position and orientation in an image. According
to the procedure described in sections 4.1 and 4.2, two
test images are created. For each test image Î_i a set of
features F_i is extracted. Based on two sets of features
F_i and F_j, an algorithm akin to the algorithm proposed
in chapter 3 is applied, yielding a transformation ma-
trix T̂_{i,j}.
By using equation 25, a subset of features
F^obj_i = { f_n ∈ F_i : b_n = 1 } is obtained for each image
Î_i, consisting of the features which are located at the
object's location in said image. For each set of features
F^obj_i the corresponding oriented bounding box B_i is
calculated and transformed by T̂_{i,j} and T_{j,i}, yielding
B̂_i and B̃_i. By intersecting the polygons defined by B̂_i
and B̃_i, the area of the intersection A_tp is obtained:

A_tp = || B̃_i ∩ B̂_i || .     (30)
This area represents the portion of the object which is
correctly transformed. The areas A_fp and A_fn, in which
only one polygon exists and which are thus falsely
matched, are obtained using the areas Ã_i and Â_i of B̃_i
and B̂_i, respectively:

A_fp = Ã_i − A_tp     (31)
A_fn = Â_i − A_tp .     (32)

Figure 9: Geometrical interpretation of A_tp, A_fp, A_tn and A_fn, based on B_i and B̃_i.
The portion of the image that is correctly labeled as
background, A_tn, is calculated according to equation 33,
using the area of the whole test image, A_image:

A_tn = A_image − A_tp − A_fp − A_fn .     (33)
A geometrical representation of these calculations
is given by figure 9.
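Given the two projected bounding boxes as polygons, equations (30)-(33) amount to one polygon intersection and three area differences. A sketch using shapely (the use of shapely and the axis-aligned toy boxes are assumptions; any polygon-clipping routine would work), ending with the row-normalised entries of the confusion-matrix scheme described next:

import numpy as np
from shapely.geometry import Polygon

def area_metrics(B_hat, B_tilde, image_area):
    """A_tp, A_fp, A_fn, A_tn from the two projected bounding boxes (eq. 30-33)."""
    A_tp = B_tilde.intersection(B_hat).area        # eq. 30
    A_fp = B_tilde.area - A_tp                     # eq. 31
    A_fn = B_hat.area - A_tp                       # eq. 32
    A_tn = image_area - A_tp - A_fp - A_fn         # eq. 33
    return A_tp, A_fp, A_fn, A_tn

if __name__ == "__main__":
    B_hat = Polygon([(10, 10), (60, 10), (60, 50), (10, 50)])     # ground-truth box
    B_tilde = Polygon([(20, 15), (70, 15), (70, 55), (20, 55)])   # estimated box
    A_tp, A_fp, A_fn, A_tn = area_metrics(B_hat, B_tilde, image_area=100 * 100)
    # Row-normalised confusion-matrix entries as in the evaluation scheme:
    print(A_tp / (A_tp + A_fp), A_tn / (A_tn + A_fn))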
Based on the values of A_tp, A_fp, A_tn and A_fn, con-
fusion matrices were created for the textured and non-
textured datasets, utilizing a variety of combinations of
descriptor and matching algorithms, according to the
following scheme:

                         object                    background
object        A_tp / (A_tp + A_fp)       A_fp / (A_tp + A_fp)
background    A_fn / (A_tn + A_fn)       A_tn / (A_tn + A_fn)
Values on the principal diagonal approaching 1 indi-
cate a nearly perfect result, while values close to 0
suggest that the object is detected at the wrong loca-
tion. The resulting confusion matrices are presented
in figures 10 and 11.
In figure 10, the performance of SIFT and SIFText
is similar for small values of K if the experiment
is conducted on the textured dataset. Increasing K
yields no significant rise in the precision if SIFT is
used, while utilizing SIFText significantly increases
the value. The afore-mentioned ambiguities are solved
by considering a higher number of neighbours, and the
performance is greatly increased. If the non-textured
dataset is used, SIFText clearly outperforms SIFT re-
gardless of K. In figure 11, very low precision values
are present. Since the background takes up the major-
ity of the image, a correct transformation matrix cannot
be deduced from the preponderance of the features.
RANSAC is not suited to be employed in this kind of
experiment. Nevertheless, this experiment shows that
SIFText outperforms SIFT on both datasets.
DetectionofLow-texturedObjects
271
Figure 10: Confusion matrices generated using the algorithm proposed in chapter 3, for various values of the parameter K (K = 2, 4, 6), on the textured and non-textured datasets. As descriptor algorithms, SIFT and SIFText were used.
Figure 11: Confusion matrices generated using RANSAC to calculate the homography, on the textured and non-textured datasets. As descriptor algorithms, SIFT and SIFText were used. The color coding is the same as in figure 10.
5 CONCLUSIONS
In this paper we have implemented and tested a de-
scriptor architecture, which was designed to detect
non-textured objects as well as textured objects, by
combining edge, shape and color description tech-
niques and weighting them by a locally calculated
value, derived from the texture of the image. Further-
more, a matching algorithm was implemented that
was designed to mitigate the weak points of the de-
scriptor. In table 1 we have shown that for each object
the number of detected features was increased signifi-
cantly compared to SIFT, while the number of correct
correspondences was increased as well. This applies
to both the textured and the non-textured dataset. We
have further shown that the precision of detecting
an object's position and orientation in two images is
increased compared to RANSAC if the matching al-
gorithm developed here is used (cf. figures
10 and 11). But even in the case that RANSAC is used,
the descriptor surpasses SIFT.
By choosing the weighting factor locally and
adapting the descriptor accordingly, features that were
matched incorrectly by the partial descriptors are as-
signed correctly. This architecture is therefore well
suited to detect textured and non-textured objects, as
well as objects that possess both textured and non-
textured parts.
REFERENCES
Bay, H., Tuytelaars, T., and van Gool, L. (2006). SURF:
Speeded Up Robust Features. Computer Vision ECCV
2006, 3951:404–417.
Bober, M. (2001). MPEG-7 visual shape descriptors. Circuits
and Systems for Video Technology, IEEE Transactions
on, 11(6):716–719.
Bulla, C. and Hosten, P. (2013). Detection of false feature
correspondences in feature based object detection sys-
tems. In Proceedings of International Conference on
Image and Vision Computing New Zealand, Wellington,
New Zealand.
Fischler, M. A. and Bolles, R. C. (1981). Random sample
consensus: A paradigm for model fitting with appli-
cations to image analysis and automated cartography.
Commun. ACM, 24(6):381–395.
Forssen, P.-E. and Lowe, D.G. (2007). Shape descriptors
for maximally stable extremal regions. In Proceedings
of IEEE 11th International Conference on Computer
Vision, pages 1–8.
Geusebroek, J.-M., Burghouts, G.J., and Smeulders, A.W.M.
(2005). The Amsterdam library of object images. Inter-
national Journal of Computer Vision, 61(1):103–112.
Ljubisa, M. and Sladjana, S. (2006). Orthonormal decompo-
sition of fractal interpolating functions. Series Mathe-
matics and Informatics, 21:1–11.
Lowe, D.G. (2004). Distinctive Image Features from Scale-
Invariant Keypoints. International Journal of Com-
puter Vision, 60(2):91–110.
Matas, J., Chum, O., Urban, M., and Pajdla, T. (2004).
Robust wide-baseline stereo from maximally stable
extremal regions. Image and Vision Computing,
22(10):761 – 767.
Naik, S. and Murthy, C. A. (2007). Distinct multicolored
region descriptors for object recognition. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence,
29(7):1291–1296.
Tombari, F., Franchi, A., and Di Stefano, L. (2013). Bold
features to detect texture-less objects. In Proceedings
VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications
272
of IEEE International Conference on Computer Vision,
pages 1265–1272.
Toshev, A., Taskar, B., and Daniilidis, K. (2012). Shape-
based object detection via boundary structure segmen-
tation. International Journal of Computer Vision,
99(2):123–146.
Van De Weijer, J. and Schmid, C. (2006). Coloring local
feature extraction. In Proceedings of the 9th European
Conference on Computer Vision - Volume Part II, pages
334–348.
DetectionofLow-texturedObjects
273