Unsupervised Segmentation Evaluation for Image Annotation
Annette Morales-González¹, Edel García-Reyes¹ and Luis Enrique Sucar²
¹ Advanced Technologies Application Center, Rpto. Siboney, Playa, La Habana, Cuba
² Instituto Nacional de Astrofísica, Óptica y Electrónica, Puebla, Mexico
Keywords: Unsupervised Segmentation Evaluation, Automatic Image Annotation, Irregular Graph Pyramid.
Abstract: Unsupervised segmentation evaluation measures are usually validated against human-generated ground-truth. Nevertheless, with the recent growth of image classification methods that use hierarchical segmentation-based representations, it would be desirable to assess the ability of unsupervised segmentation evaluation to select the most suitable levels for recognition tasks. Another problem is that unsupervised segmentation evaluation measures use only low-level features, which makes it difficult to evaluate how well an object is outlined. In this paper we propose four semantic measures that, combined with other state-of-the-art measures, improve the evaluation results. We also validate each unsupervised measure against ground truth derived from an image annotation algorithm, showing that measures that try to emulate human behaviour are not necessarily what an automatic recognition algorithm needs. We employed the Stanford Background Dataset to validate an image annotation algorithm that includes segmentation evaluation as its starting point, and the proposed combination of unsupervised measures showed the best annotation accuracy results.
1 INTRODUCTION
There is a growing tendency to use segmentation-based image representations for object detection and recognition. For instance, the winners of the ILSVRC2013 competition (ImageNet Large Scale Visual Recognition Challenge 2013) (Russakovsky et al., 2014) employed image segmentation in their object detection pipelines. Many other automatic image annotation approaches start from a superpixel representation in order to avoid the complexity of annotation at pixel level (Huang et al., 2011; van de Sande et al., 2011; Arbelaez et al., 2012; Morales-González and García-Reyes, 2013; Zhang and Xie, 2013). In these cases, segmentation algorithms are used to produce an initial partition of the image pixels, but there is always the question of what a good level of segmentation is to start the annotation process for a particular image. If the annotation algorithm uses a hierarchy of image partitions, it is even more important to assess the relevance of each partition to the recognition problem, in order to reduce noise (partitions that are too over-segmented) and to avoid losing information (partitions that are too under-segmented). Nevertheless, the unsupervised segmentation evaluation community has not focused on this particular problem yet.
Unsupervised segmentation evaluation is usually addressed without taking into account the ultimate goal of the task in which the segmentation result is to be used. Most approaches compare the results of automatic segmentation against human-annotated ground-truth, or against other automatic segmentation results (Zhang et al., 2008; Csurka et al., 2013). Nevertheless, if segmentation is an initial step in a higher-level task, such as image annotation and object recognition, the ground-truth provided by humans may not be adequate for obtaining good results in the recognition process. Human perception studies (Olson, 2001) suggest that in natural vision the visual input is divided into primitive objects, instead of well-defined objects. Therefore, the final output of human perception (an object as a whole) might not match the initial perception cues employed to identify objects or understand scenes. The same applies to automatic recognition, where the final human segmentation (objects as wholes) might not be what a machine needs to perform the classification process. Therefore, evaluating segmentation results in terms of what is most similar to a human segmentation can be misleading when deciding which measure should be used. In this case, it would be desirable for the suitability of a segmentation algorithm to be measured by the success of the end application.
Furthermore, segmentation evaluation is performed using low-level cues of the images, such as edges, color and texture uniformity, inter-region
disparities, etc. (Zhang et al., 2008; Dogra et al., 2012; Khan and Bhuiyan, 2014), but in reality it is very difficult to automatically assess whether an image is well segmented when there is no knowledge of the semantic entities represented in it.
In this work, we aim at combining unsupervised segmentation evaluation with an automatic image annotation algorithm, in order to find the best image partitions from which to start the recognition process. The annotation algorithm is based on a segmentation hierarchy, where annotation and segmentation are refined over several iterations, each taking advantage of the other's information to improve the final recognition result. We propose two contributions: (1) to incorporate semantic information into the unsupervised segmentation evaluation process and (2) to measure the accuracy of the segmentation evaluation in terms of recognition accuracy, which is our ultimate goal. The results obtained on the Stanford Background Dataset (Gould et al., 2009) support these two contributions, showing that the semantic features play an important role in the evaluation process and that the behaviour of the segmentation evaluation measures is different when their results are used as input to an image annotation task.
The remainder of this paper includes an analysis of previous work related to unsupervised segmentation evaluation and hierarchical image classification in Section 2. Section 3 describes the hierarchical annotation approach chosen to test the output of segmentation evaluation algorithms. The segmentation evaluation measures used as features in a classification approach to determine good and bad segmentations are presented in Section 4, as well as new measures that take into account semantic information of the classes previously predicted in the image. Section 5 shows the experimental results along with several analyses of the presented measures, which lead to the final conclusions of the paper.
2 RELATED WORK
As explained in the Introduction, segmentation-based image classification and unsupervised segmentation evaluation are two largely disconnected research fields, even though they can be related in interesting ways in order to improve the results of the former. In this Section, we analyse several relevant works in both fields to illustrate the necessity of combining them and redirecting the final goal of segmentation evaluation.
2.1 Hierarchical Image Classification Approaches
Many methods for image annotation work at pixel level, but they face several drawbacks, such as the complexity of classifying every pixel of an image and the limited amount of information that a single pixel and its neighborhood may contain (Russell et al., 2014).
Superpixel-based methods have appeared recently, starting with an initial segmentation of the image based on low-level cues and then performing the annotation over these segments. The advantages over the pixel-based representation are the possibility of computing better region-based features and the fact that classification can be performed over a reduced set of entities. Nevertheless, this representation comes with the problem of finding a good segmentation of the image on which the annotation process can obtain good results.
Hierarchical models can be used to incorporate local and global image cues. Hierarchical segmentations, in general, are formed by a stack of image partitions (or levels) where each one is built by merging regions from the level below. Therefore, lower levels are over-segmented and higher levels are under-segmented. In (Arbelaez et al., 2012) and (Zhang and Xie, 2013), a hierarchy of segmentations is used for image annotation but no level selection is performed, i.e. they use the entire hierarchy for generating candidate localizations of objects, producing more than 1300 candidates per image. The proposal of (van de Sande et al., 2011) does not select levels either; it starts from the over-segmentation given by a segmentation algorithm and uses the whole hierarchy built on top of it for object detection. The work presented in (Zankl et al., 2012) also deals with the labeling of a segmentation hierarchy, but in this case they use human input to improve the final image annotation. No automatic segmentation level selection is performed, although the interaction with humans in the labeling process makes it pointless to some extent. As can be seen, all these methods work with a hierarchy of segmentations, but none of them evaluates the suitability of the segmentation levels employed. Including levels that are too over-segmented or under-segmented in the recognition process may incorporate noisy partitions and will certainly increase the overall processing cost.
The problem of finding relevant levels in a hierarchy for performing object recognition was addressed by (Morales-González and García-Reyes, 2013), by selecting the levels that best preserve the edges present in an edge mask of the image (computed using the Canny edge detector), thus inheriting the problems of automatic edge detection and assuming that this edge mask preserves the actual object boundaries.
2.2 Unsupervised Segmentation Evaluation Methods
In (Zhang et al., 2008) a thorough comparison and analysis of unsupervised segmentation evaluation methods is presented. They performed four different experiments, of which the closest to our goals is the second one. In this experiment they compared two image partitions segmented with the same segmentation method (with different parameters), and the evaluation measures had to decide which partition is the best. The measures with the best performance in this test were Q, Zeb, and $F_{RC}$, in that order. Q measures the average squared color error of the segments, using penalization terms to decrease the bias towards both over-segmentation and under-segmentation. Zeb uses the internal and external contrast of the regions, measured in the neighborhood of each pixel, to perform the evaluation, while $F_{RC}$ takes into account intra-region homogeneity and inter-region disparity.
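As an illustration of this family of measures, the following is a minimal sketch of a Q-style evaluation, assuming the common formulation of Q reviewed in (Zhang et al., 2008) (originally due to Borsotti et al.); constants and the logarithm base vary across formulations, so this is an approximation rather than the exact measure used here:

```python
import numpy as np

def q_measure(image, labels):
    """Q-style score: image is an HxWx3 float array, labels an HxW
    integer array of region ids. Lower values mean better segmentations."""
    n_pixels = labels.size
    region_ids, areas = np.unique(labels, return_counts=True)
    r = len(region_ids)
    # R(A_i): how many regions share the same area A_i
    area_values, area_freqs = np.unique(areas, return_counts=True)
    area_count = dict(zip(area_values.tolist(), area_freqs.tolist()))
    total = 0.0
    for rid, a in zip(region_ids, areas):
        colors = image[labels == rid]                     # region pixels, shape (A_i, 3)
        e2 = ((colors - colors.mean(axis=0)) ** 2).sum()  # squared color error e_i^2
        total += e2 / (1.0 + np.log(a)) + (area_count[int(a)] / a) ** 2
    return np.sqrt(r) * total / (1000.0 * n_pixels)
```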
The MSET evaluation method, also reviewed in (Zhang et al., 2008), proposed to combine a limited set of measures into a classifier, which outperformed the individual results of these measures in the second experiment. The set of measures employed was composed of E, F, Q, and $V_{CP}$. E uses region entropy to measure intra-region uniformity and a layout entropy to indicate which pixels belong to which regions. F employs the average color error of the regions, similar to the Q measure mentioned before. $V_{CP}$ uses intra-object measures (e.g. shape regularity, spatial uniformity, etc.) and inter-object measures (such as contrast), and each object is weighted by how much attention it received from a human evaluator.
More measures have been introduced recently. In (Morales-González and García-Reyes, 2013), two measures $B_G$ and $B_B$ were proposed and combined as a weighted sum in order to select the best levels in a hierarchy of segmentations. They are based only on the edges of each partition and how well they match the edges in an edge mask (computed using the Canny edge detector) of the original image. In (Khan and Bhuiyan, 2014) they propose the weighted self-entropy for evaluating region homogeneity and the weighted mutual entropy for evaluating region disparity.
In the work presented by (Song et al., 2010), a method for filtering the levels of a segmentation hierarchy was proposed. Levels in the hierarchy are Region Adjacency Graphs (RAGs), and the complexity of the graph at each level is computed using Laplacian graph energy, in order to keep those levels whose complexity is smaller than that of either neighboring level.
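A minimal sketch of this filtering idea follows, assuming networkx RAGs and the standard definition of Laplacian graph energy, LE(G) = Σᵢ |λᵢ − 2m/n|; this is our reading of the method, not the authors' implementation:

```python
import numpy as np
import networkx as nx

def laplacian_graph_energy(rag: nx.Graph) -> float:
    """LE(G) = sum_i |lambda_i - 2m/n| over the eigenvalues of L = D - A."""
    n, m = rag.number_of_nodes(), rag.number_of_edges()
    if n == 0:
        return 0.0
    eigenvalues = nx.laplacian_spectrum(rag)
    return float(np.abs(eigenvalues - 2.0 * m / n).sum())

def filter_levels(rags):
    """Keep indices of levels whose energy is lower than both neighbors'."""
    energy = [laplacian_graph_energy(g) for g in rags]
    return [i for i in range(1, len(energy) - 1)
            if energy[i] < energy[i - 1] and energy[i] < energy[i + 1]]
```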
All the aforementioned measures work with low-level features (color, texture, edges), but in practice it is very hard to accurately assess the outlining of objects or entities in an image using this information alone. Some kind of higher semantic knowledge should be included in order to know how well the segmentation was performed. Besides, all these measures have been evaluated against human-generated ground truth. The differences between previous approaches and ours are that we propose to include segmentation evaluation in an image annotation process in such a way that the segmentation evaluation can make use of semantic features coming from predicted classes, and that the segmentation evaluation measures are evaluated according to their suitability for predicting good levels for the recognition process instead of how well they fit a human-generated ground truth.
3 HIERARCHICAL IMAGE ANNOTATION
We use as our base annotation system the approach proposed by (Morales-González et al., 2013), named HMRF-PyrSeg. It uses as hierarchical representation the irregular graph pyramids proposed by (Haxhimusa and Kropatsch, 2004). An irregular graph pyramid is a stack of successively reduced graphs, where each level is a RAG, i.e. vertices represent regions and the adjacency between them is represented by edges. At the base level, each pixel is a vertex and the edges encode the 4-connectivity among pixels. Using a series of edge contractions and eliminations, each level is reduced based on the regions' internal and external contrast. The result is a segmentation hierarchy that can be traversed top-down and bottom-up and preserves the topological distribution of the regions throughout all its levels.
Using this representation, the HMRF-PyrSeg algorithm works in the following way (Morales-González et al., 2013); a schematic sketch follows the list:

- The whole graph pyramid is built using low-level cues to segment the image.
- Starting from a predefined, still over-segmented level, every vertex is labeled with a class, using a base classifier (BC) that has been trained previously on low-level features of the regions. This BC must have a probabilistic output (e.g. Random Forest, SVM, Naïve Bayes classifier), which will be used as unary potentials to improve the current BC labeling through a Markov Random Field approach (see next step).
- The whole labeling of this level is improved by means of a Hierarchical Markov Random Field (HMRF), by imposing local constraints among neighboring vertices (in the image plane) and parent/child vertices (in the hierarchy structure).
- A new criterion for edge contraction is used to create a new level of segmentation. This time, the class assigned to each vertex by the HMRF, the probabilities given to it by the BC, and the distribution of edges in each partition are combined in order to select the vertices that should be joined in the new level. In this way, semantic information coming from the annotation process is combined with low-level information to build more meaningful segmentation levels.
- Once the new level is created, the whole annotation process with the BC and the HMRF is performed again, and this is repeated over several iterations, trying to find better image segmentations that ultimately yield a better recognition result.
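The iteration just described can be summarized as follows; this is a schematic sketch, and every helper in it (build_pyramid, region_features, hmrf_refine, contract_level) is a hypothetical placeholder, not the authors' API:

```python
# Schematic sketch of the HMRF-PyrSeg loop; all helpers are hypothetical.
def hmrf_pyrseg(image, base_classifier, start_level, n_iterations):
    pyramid = build_pyramid(image)          # low-level irregular graph pyramid
    level = pyramid[start_level]            # predefined, still over-segmented
    labels = None
    for _ in range(n_iterations):
        feats = level.region_features()                # low-level region features
        probs = base_classifier.predict_proba(feats)   # BC probabilistic output
        labels = hmrf_refine(level, probs)             # HMRF over neighbors/parents
        # merge vertices using HMRF classes, BC probabilities and edge distribution
        level = contract_level(level, labels, probs)
    return labels, level
```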
Although this approach showed an improvement with respect to other proposals that do not combine the results of annotation and segmentation, problems still arise from the selection of the first level at which to begin the whole process. If a level that is too over-segmented is selected, it can introduce noise in the classification process, and this noise will be propagated through all the levels due to the hierarchical information used to classify each level. On the other hand, if an under-segmented level is selected, the boundaries of the objects in the image will be lost and the classification result will suffer as well. In (Morales-González et al., 2013) a fixed level was used to start this process for all images, disregarding the nature of each individual image and exhibiting the aforementioned problem. An example of this can be seen in Figure 1.
Although the overall result is better, for particular images the results are quite bad. This issue can be addressed by selecting, for each image, the most appropriate level at which to begin the annotation process. Nevertheless, in the reviewed literature on segmentation evaluation, all works compared the results of the measures with human evaluation, which, in this case, does not necessarily coincide with what is best for an automatic recognition process.
4 SEGMENTATION EVALUATION IN A HIERARCHY
Since there are many evaluation methods, which measure different aspects of the image partition and can be combined in many different ways (Zhang et al., 2008), we chose to use several unsupervised evaluation measures and combine their output values with a classifier. Therefore, a segmentation evaluation classifier (SEC) will be the one that finds out which are the most relevant aspects to be measured and how they should be combined.
4.1 Training Information
In order to obtain training information for our SEC, we use the training set employed for the image annotation process. This training set contains images and their respective irregular pyramid representations (i.e. a hierarchy of segmentations per image). In the HMRF-PyrSeg algorithm, after performing the initial classification of all the regions using the BC, we can know which levels, for each image, obtained the best accuracy results when compared with the image annotation ground-truth. Since our ultimate goal is to improve the annotation results, it is natural that the creation of new levels in the HMRF-PyrSeg approach should start from the levels that obtained the best accuracy results with the BC. That is why we decided, for each image, to label the n levels that obtained the best accuracy with the BC as "good levels", and the rest as "bad levels". We train a binary classifier with these two labels. We compute all the unsupervised measures for each partition of each hierarchy and provide these features with their corresponding labels to train the SEC.
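As a concrete illustration, the labeling scheme above could be assembled as in the following minimal sketch, which assumes that per-level BC accuracies and measure vectors are already computed; variable names are illustrative, not from the original implementation:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def build_sec_training_set(hierarchies, n_good=3):
    """hierarchies: per image, a list of levels, each a dict with keys
    'bc_accuracy' (float) and 'measures' (vector of Table 1 features)."""
    X, y = [], []
    for levels in hierarchies:
        acc = [lv["bc_accuracy"] for lv in levels]
        good = set(np.argsort(acc)[-n_good:])     # the n levels with best BC accuracy
        for i, lv in enumerate(levels):
            X.append(lv["measures"])              # unsupervised measures (Table 1)
            y.append(1 if i in good else 0)       # "good level" vs "bad level"
    return np.array(X), np.array(y)

# Hypothetical usage:
# X, y = build_sec_training_set(training_hierarchies)
# sec = RandomForestClassifier(n_estimators=100).fit(X, y)
```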
4.2 Segmentation Features
We chose several evaluators employed in the literature that measure low-level information of each partition to serve as features characterizing each image segmentation. They can be seen in Table 1. Since in the HMRF-PyrSeg algorithm we can count on the first classification of each region by the BC, we propose to use semantic information related to this classification in order to add some higher-level information to the evaluation process. These semantic features are numbers 5, 6, 7 and 8 in Table 1.
With the BC, it is possible to have an initial prediction of each region's class. Using this information, our proposals $H^c_r$ (Equation 1) and $H^c_{mr}$ (Equation 2) are the same as $H_r$ and $H_{mr}$ respectively (referred to as $H_G(i)$ and $H_G(jk)$ in (Khan and Bhuiyan, 2014)), but taking into account the whole class area instead of the segmented region.
UnsupervisedSegmentationEvaluationforImageAnnotation
151
Figure 1: Example image segmentation/annotation results using HMRF-PyrSeg (columns: original image, ground-truth, level 10, level 14, level 19). Level 10 of the pyramid was fixed as the starting level for both images. As can be seen, this was a good choice for the first row, where the final result of segmentation/annotation is adequate, but for the second row, level 10 had already lost many meaningful edges.
Table 1: Segmentation evaluation measures used as features for the SEC.

No. | Alias | Description
1 | $N_r$ | Number of regions
2 | $S_r$ | Average size of regions
3 | $H_r$ | Average region's self entropy
4 | $H_{mr}$ | Average inter-region's mutual entropy
5 | $H^c_r$ | Average class's self entropy
6 | $H^c_{mr}$ | Average inter-class's mutual entropy
7 | $NP_p$ | Number of pixels with high probability values
8 | $NP_r$ | Number of regions with high probability values
9 | Zeb | Intra-region and inter-region contrast
10 | Q | Squared color error
11 | E | Entropy of regions and layout entropy
12 | F | Squared color error
13 | $B_G$ | Measure of good edges against Canny edge mask
14 | $B_B$ | Measure of wrong edges against Canny edge mask
15 | $P_r$ | Average perimeter of regions
16 | $B_n$ | Number of edge pixels against number of Canny edge pixels
In these equations, $G$ is defined as a feature that describes the pixels (e.g. pixel intensity) and $G^{(g)}_{c_i}$ is the set of all possible values of feature $G$ in the area where class $c_i$ was annotated. $N_{c_i}(t)$ is the number of pixels with value $t$ in the $c_i$ class region and $M_{c_i}$ is the total number of pixels in this region. Similarly, Equation 2 uses the same information but for pairwise class analysis, changing the class region being analyzed according to the subscripts in each case. Subscripts $c_i, c_j$ indicate that the region is the union of all the pixels from classes $c_i$ and $c_j$ annotated in the image.
$$H^c_r(c_i) = -\sum_{t \in G^{(g)}_{c_i}} \frac{N_{c_i}(t)}{M_{c_i}} \log \frac{N_{c_i}(t)}{M_{c_i}} \quad (1)$$

$$H^c_{mr}(c_i, c_j) = -\sum_{t \in G^{(g)}_{c_i,c_j}} \frac{N_{c_i,c_j}(t)}{M_{c_i,c_j}} \log \frac{N_{c_i,c_j}(t)}{M_{c_i,c_j}} \quad (2)$$
This means that the entropy is computed over the whole area where the class was detected, disregarding the individual regions that compose that area. In this case, we are measuring the degree of homogeneity in the detected class area ($H^c_r$) and the disparity between two different detected classes ($H^c_{mr}$).
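A minimal sketch of these two features follows, assuming a per-pixel feature map (e.g. integer intensities, so that np.unique yields the value histogram directly) and the class map predicted by the BC:

```python
import numpy as np

def class_self_entropy(feature, classes, c_i):
    """H^c_r: entropy of feature values over the whole area labeled c_i."""
    values = feature[classes == c_i]
    if values.size == 0:
        return 0.0
    _, counts = np.unique(values, return_counts=True)
    p = counts / values.size                  # N_ci(t) / M_ci
    return float(-(p * np.log(p)).sum())

def class_mutual_entropy(feature, classes, c_i, c_j):
    """H^c_mr: same entropy over the union of the areas of c_i and c_j."""
    mask = (classes == c_i) | (classes == c_j)
    values = feature[mask]
    if values.size == 0:
        return 0.0
    _, counts = np.unique(values, return_counts=True)
    p = counts / values.size
    return float(-(p * np.log(p)).sum())
```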
The other semantic features related to classification that we propose to use are $NP_p$ and $NP_r$. They employ the probability output of the base classifier. For $NP_p$ we measure the number of pixels that obtained high probability values, normalized by the total number of pixels in the image. $NP_r$ does the same, but counts regions with high probability values instead of pixels, normalized by the number of regions.
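These two features could be computed as in the sketch below; the probability threshold that counts as "high" is an assumption of this sketch, not a value from the paper:

```python
import numpy as np

def np_p(pixel_probs, threshold=0.8):
    """Fraction of pixels whose best-class BC probability is high."""
    pixel_probs = np.asarray(pixel_probs)
    return float((pixel_probs >= threshold).sum()) / pixel_probs.size

def np_r(region_probs, threshold=0.8):
    """Fraction of regions whose best-class BC probability is high."""
    region_probs = np.asarray(region_probs)
    return float((region_probs >= threshold).sum()) / region_probs.size
```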
The output of the measures presented in Table 1 can be concatenated into a vector to perform the classification of each image segmentation into "good level" or "bad level". It is desirable to employ a classifier with probabilistic output, making it possible to rank the scores assigned to the segmentation levels of an image in order to select the best ones.
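With a probabilistic SEC such as a Random Forest, the ranking step might look like this sketch (assuming `sec` was trained as in Section 4.1 and `level_features` stacks the Table 1 vectors of one image's levels):

```python
import numpy as np

def rank_levels(sec, level_features):
    """Return level indices ordered from most to least likely 'good level'."""
    scores = sec.predict_proba(level_features)[:, 1]  # probability of "good level"
    return list(np.argsort(scores)[::-1])
```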
5 EXPERIMENTS
The experiments were performed on the Stanford Background Dataset (Gould et al., 2009), designed for testing methods developed for geometric and semantic scene understanding.
VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications
152
Figure 2: Example images taken from the Stanford Back-
ground Dataset. First column shows the original images
while second column shows their respective annotation
ground-truth, where each color represents one of the 8 se-
mantic labels present in this dataset.
It contains 715 images, which are split into two subsets of 542 and 143 images for training and testing respectively. These subsets are randomly generated and the results are averaged. The 8 semantic labels, annotated at pixel level, are sky, tree, road, grass, water, building, mountain, and foreground object. Two example images from this dataset are shown in Figure 2 (left), with their respective annotation ground truth to the right.
Our objective in these experiments is to find out the influence of choosing each segmentation evaluation measure to select the starting levels of the annotation process. Therefore, the ground truth for evaluating the performance of the measures is given by the accuracy obtained in the annotation process. Although the evaluation results will certainly depend on the ground-truth selected for annotation, this is a common weak point of all annotation/segmentation tasks that are based on a subjective, human-created ground-truth.
The irregular pyramids built for these images usually have around 20 levels; therefore, in order to avoid severe over-segmentation and under-segmentation, we decided to remove some of the lower and higher levels. In the present case we analyze levels 6 to 16 of each pyramid.
In order to select the best combination of measures to use in the SEC, we employed a wrapper feature selection approach (Yang et al., 2013), evaluating different feature subsets with a predictive model. The advantages of using such an approach have been stated in (Yang et al., 2013). We exhaustively inspected all possible combinations, skipping only combinations of 2 and 3 features. Since there are few features, the combinatorial explosion is not too high and this can be done in a few hours. In this case, the ground truth is the accuracy obtained with the base classifier for the test images.
Table 2: Results of the best measure combinations. The first column shows the features employed in each combination, according to the numbering in Table 1. The second column shows the accuracy of selecting the single level with the highest BC accuracy. The third column shows the accuracy of selecting a level among the three with the highest BC accuracy.

Feature combination | 1 level (%) | 3 levels (%)
1, 2, 6, 7, 9, 10, 14 | 23.78 | 68.53
1, 6, 7, 8, 14, 15, 16 | 22.38 | 68.53
1, 6, 7, 9, 11, 14, 15 | 20.28 | 68.53
1, 6, 7, 8, 15, 16 | 20.28 | 67.83
1, 4, 6, 9, 10, 12, 16 | 18.18 | 65.73
We will consider a good level selection if the evaluation measure chooses as best level one of the three levels with the highest BC accuracy. If the selected level is not one of those three, the selection is considered wrong. Since several combinations achieved the same level selection accuracy, we also measured the accuracy of selecting the single level with the best BC accuracy. This can be seen in Table 2. The last row of the table shows the best combination that does not use features 7 and 8 from Table 1, since these probabilities may not be available in many approaches. These results were obtained with a Random Forest as the SEC.
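The exhaustive wrapper search can be sketched as below, assuming a scoring function that trains the SEC on a feature subset and returns the resulting level-selection accuracy:

```python
from itertools import combinations

def wrapper_search(feature_ids, score_subset):
    """Exhaustive wrapper selection; subsets of 2 and 3 features are
    skipped, as in the text."""
    best_subset, best_score = None, -1.0
    for k in range(1, len(feature_ids) + 1):
        if k in (2, 3):
            continue
        for subset in combinations(feature_ids, k):
            score = score_subset(subset)   # e.g. train SEC, measure accuracy
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score
```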
It is important to notice that in the first four rows of Table 2, features 1, 6 and 7 are always present, which might indicate that their contribution to the combination is very important. Features 6 and 7 are two of the semantic features proposed in Section 4.2. The difference between the results of the first four rows and the fifth row (which does not use the probabilities of the classifier) also points to the importance of using these semantic features in segmentation evaluation.
Once we had the best feature combination for the SEC, we proceeded to evaluate the performance of each evaluation measure in selecting the best segmentation level for each image. The results of this experiment can be seen in Table 3; the accuracies shown in the second column correspond to selecting as best level one of the three levels with the highest BC accuracy. The measures selected for the experiments were $H_r$ and $H_{mr}$ (Khan and Bhuiyan, 2014); Zeb, Q, E and F, reviewed in (Zhang et al., 2008); B, which is the combination of $B_G$ and $B_B$ as presented in (Morales-González and García-Reyes, 2013); and the SEC combination proposed in this work.
According to these results, it can be seen that Q and Zeb, which were the measures with the best results in Experiment 2 of (Zhang et al., 2008), were not the best for this task.
UnsupervisedSegmentationEvaluationforImageAnnotation
153
Table 3: Results of the level selection accuracy of each measure.

Evaluation measure | Level selection accuracy (%)
$H_r$ | 43.36
$H_{mr}$ | 13.29
Zeb | 39.16
Q | 39.16
E | 44.06
F | 44.76
B | 50.35
SEC | 68.53
In (Zhang et al., 2008) the ground truth was obtained from human evaluators, while in the present task the ground truth is derived from the output of an automatic annotation algorithm. Therefore, a human-generated ground truth may not be the right assessment of what a computational algorithm needs. It is important to notice the large improvement displayed by the SEC combination, which outperformed the best individual result by 18% in accuracy. This is an indicator of the benefits provided by the combination and by the use of semantic features.
Additional information regarding these measures can be seen in Table 4. The second column shows the average time to evaluate all the segmentation levels of one image (currently 11 levels, from level 6 to 16 of each pyramid). Another interesting piece of information is how biased towards under- or over-segmentation each measure is with respect to the ground-truth correct levels. We computed the difference between the best levels selected by each measure and the ground-truth levels, and computed the mean and standard deviation of this difference. These values are shown in columns 3 and 4 respectively. A negative mean value indicates that the corresponding measure tends towards over-segmentation w.r.t. the ground-truth correct levels. Conversely, a positive value indicates that the measure tends to select under-segmented levels. Values closer to zero correspond to measures with outputs closer to the ground-truth. In this sense, it can be seen that most measures, except for $H_{mr}$, tend to choose levels more over-segmented (to different degrees) than the ground-truth. $H_{mr}$ has a strong bias towards under-segmentation, while $H_r$ and F have the strongest biases towards over-segmentation. The SEC combination displays the mean value closest to zero, with a slight bias towards over-segmentation, and the lowest standard deviation among all the measures.
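For clarity, the bias statistics of Table 4 amount to the following computation (a sketch; variable names are illustrative):

```python
import numpy as np

def level_bias(selected, ground_truth):
    """Mean and stdev of (selected level - ground-truth level) per image.
    A negative mean indicates a bias towards over-segmented levels."""
    diff = np.asarray(selected) - np.asarray(ground_truth)
    return float(diff.mean()), float(diff.std())
```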
Regarding the computational cost of computing each individual measure, it can be seen in Table 4 that Zeb is the most time-consuming measure, followed by B and $H_{mr}$. The time shown for the SEC combination corresponds to the second combination presented in Table 2; the reason is that the first combination employs the Zeb measure, which greatly increases the computation time (to 10.58 seconds).
Table 4: Additional information of each measure.

Evaluation measure | Time (s) | Mean | Stdev
$H_r$ | 0.702 | -2.34 | 3.38
$H_{mr}$ | 2.271 | 4.85 | 2.74
Zeb | 6.296 | -0.69 | 4.73
Q | 0.214 | -1.73 | 4.12
E | 0.206 | -0.89 | 3.44
F | 0.215 | -2.55 | 2.84
B | 3.952 | -1.86 | 2.72
SEC | 4.505 | -0.67 | 2.43
Table 5: Accuracy of the annotation process when choosing as starting levels the ones selected by each measure.

Evaluation measure | Annotation accuracy (%)
Base Classifier | 73.0
Fixed Level | 75.2
$H_r$ | 72.34
$H_{mr}$ | 63.78
Zeb | 71.41
Q | 69.75
E | 71.63
F | 71.91
B | 74.72
SEC | 76.86
It is important to notice that the SEC combination employs other measures that also contribute to the total time. Nevertheless, since many of these individual measures work with common information, they can be computed together, reducing the total time with respect to the sum of their individual times. Also, the cost and accuracy information can be used to find an appropriate trade-off between these two aspects in specific applications.
Using as starting levels the ones selected by each measure, we ran the whole annotation algorithm; the final annotation accuracy in each case can be seen in Table 5. The first two rows show the annotation accuracy of the base classifier and that of the hierarchical annotation process using a fixed starting level (level 10 in this case).
In this experiment it can be seen that, in most cases, selecting the starting levels with the segmentation evaluation measures deteriorates the final annotation accuracy with respect to the base classifier accuracy and to using a fixed level. The only measures that improved the BC results were B and the SEC combination, while the fixed-level approach was only outperformed by the SEC combination. There is a significant improvement of the SEC combination over the second-best evaluation measure (B), of 2.14%.
VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications
154
6 CONCLUSIONS
In this paper we addressed two usually unrelated research fields: unsupervised segmentation evaluation and automatic image annotation. Our proposal of including semantic measures in the segmentation evaluation process, and of combining several individual evaluators, displayed better results than the most relevant measures found in the literature. We also showed that the measures that evaluate a segmentation most similarly to humans are not the best for selecting partition levels for automatic recognition tasks. Therefore, in our opinion, more effort should be devoted to developing segmentation evaluation measures that work better for automatic image annotation, instead of focusing on the best segmentation output for humans.
The final results of the annotation process showed that selecting "good" levels at the beginning provides better annotation accuracy. As future work, we plan to include saliency maps in the segmentation evaluation process, trying to find partitions that preserve distinctive objects or parts.
ACKNOWLEDGEMENTS
This work was supported in part by CONACYT
project 215546.
REFERENCES
Arbelaez, P., Hariharan, B., Gu, C., Gupta, S., Bourdev,
L. D., and Malik, J. (2012). Semantic segmentation
using regions and parts. In CVPR, pages 3378–3385.
IEEE.
Csurka, G., Larlus, D., and Perronnin, F. (2013). What is a
good evaluation measure for semantic segmentation?
In 24th British Machine Vision Conference (BMVC),
University of Bristol, United Kingdom.
Dogra, D. P., Majumdar, A. K., and Sural, S. (2012). Eval-
uation of segmentation techniques using region area
and boundary matching information. J. Vis. Commun. Image Represent., 23(1):150–160.
Gould, S., Fulton, R., and Koller, D. (2009). Decomposing
a scene into geometric and semantically consistent re-
gions. In ICCV, pages 1–8. IEEE.
Haxhimusa, Y. and Kropatsch, W. G. (2004). Segmenta-
tion graph hierarchies. In Proceedings of Joint In-
ternational Workshops on Structural, Syntactic, and
Statistical Pattern Recognition S+SSPR 2004, volume
LNCS 3138, pages 343–351. Springer, Berlin Heidel-
berg, New York.
Huang, Q., Han, M., Wu, B., and Ioffe, S. (2011). A hierarchical conditional random field model for labeling and segmenting images of street scenes. In CVPR, pages 1953–1960. IEEE.
Khan, J. F. and Bhuiyan, S. M. (2014). Weighted entropy for segmentation evaluation. Optics and Laser Technology, 57:236–242.
Morales-González, A. and García-Reyes, E. B. (2013).
Simple object recognition based on spatial relations
and visual features represented using irregular pyra-
mids. Multimedia Tools Appl., 63(3):875–897.
Morales-González, A., García-Reyes, E. B., and Sucar, L. E.
(2013). Improving image segmentation for boosting
image annotation with irregular pyramids. In CIARP
(1), volume 8258 of LNCS, pages 399–406. Springer.
Olson, C. R. (2001). Object-based vision and attention in
primates. Current Opinion in Neurobiology, 11:171–179.
Russakovsky, O., Deng, J., Krause, J., Berg, A., and Li, F.
(2014). Results of ILSVRC2013. http://www.image-net.org/challenges/LSVRC/2013/results.php.
Russell, C., Ladicky, L., Kohli, P., and Torr, P. H. S. (2014).
Associative hierarchical random fields. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence,
36(6):1–1.
Song, Y.-Z., Arbelaez, P., Hall, P. M., Li, C., and Balikai,
A. (2010). Finding semantic structures in image hierarchies using Laplacian graph energy. In ECCV (4),
volume 6314 of LNCS, pages 694–707. Springer.
van de Sande, K. E. A., Uijlings, J. R. R., Gevers, T., and
Smeulders, A. W. M. (2011). Segmentation as selec-
tive search for object recognition. In Proceedings of
ICCV ’11, pages 1879–1886. IEEE Computer Society.
Yang, P., Liu, W., Zhou, B. B., Chawla, S., and Zomaya,
A. Y. (2013). Ensemble-based wrapper methods for
feature selection and class imbalance learning. In
PAKDD (1), volume 7818 of LNCS, pages 544–555.
Springer.
Zankl, G., Haxhimusa, Y., and Ion, A. (2012). Interac-
tive labeling of image segmentation hierarchies. In
DAGM/OAGM Symposium, volume 7476 of LNCS,
pages 11–20. Springer.
Zhang, H., Fritts, J. E., and Goldman, S. A. (2008). Image
segmentation evaluation: A survey of unsupervised
methods. Comput. Vis. Image Underst., 110(2):260–
280.
Zhang, S. and Xie, M. (2013). Beyond sliding windows:
Object detection based on hierarchical segmentation
model. In International Conference on Communica-
tions, Circuits and Systems (ICCCAS), pages 263–266. IEEE.
UnsupervisedSegmentationEvaluationforImageAnnotation
155