CASCADE OF MULTI-LEVEL MULTI-INSTANCE CLASSIFIERS FOR IMAGE ANNOTATION

Cam-Tu Nguyen (1), Ha Vu Le (2) and Takeshi Tokuyama (1)
(1) Graduate School of Information Sciences, Tohoku University, Sendai, Japan
(2) VNU University of Engineering and Technology, Hanoi, Vietnam
Keywords: Image annotation, Cascade algorithm, Multi-level feature extraction.
Abstract: This paper introduces a new scheme for automatic image annotation based on cascading multi-level multi-instance classifiers (CMLMI). The proposed scheme employs a hierarchy for visual feature extraction, in which the feature set includes features extracted from the whole image at the coarsest level and from overlapping sub-regions at finer levels. Multi-instance learning (MIL) is used to learn the "weak classifiers" for these levels in a cascade manner. The underlying idea is that the coarse levels are suitable for background labels such as "forest" and "city", while finer levels bring useful information about foreground objects like "tiger" and "car". The cascade manner allows this scheme to incorporate "important" negative samples during the learning process, hence reducing the "weakly labeling" problem by excluding ambiguous background labels associated with the negative samples. Experiments show that CMLMI achieves significant improvements over baseline methods as well as existing MIL-based methods.
1 INTRODUCTION
In just a few years, online photo-sharing websites (Flickr, Picasa Web, Photobucket, etc.), which host hundreds of millions of pictures, have become an integral part of the Internet. As a result, the need for tagging images and multimedia data with semantic labels becomes increasingly important in order to make the Web better organized and accessible. On the other hand, the enormous number of photos taken every day makes manual annotation extremely time-consuming and expensive. Automatic image annotation therefore receives significant interest in image retrieval and multimedia mining.
Although image classification and object recognition also assign metadata to images, image annotation differs from classification and recognition in ways that define its typical challenges. In general, the number of labels (classes/objects) is usually larger in image annotation than in classification and recognition. Because of the dominating number of negative examples, both the one-vs-one and one-vs-all schemes of multi-class supervised learning do not scale well for image annotation. Unlike object recognition, image annotation is "weakly labeling" (Carneiro et al., 2007); that is, a label is assigned to an image without indication of the region corresponding to that label. Moreover, the scalability requirement prevents researchers from investigating feature extraction for every label in image annotation, whereas this can be done for the limited number of objects in object recognition. On the other hand, the variety of visual representations of objects suggests that we should not depend on one feature extraction method to work well with a large number of labels (Akbas and Vural, 2007; Makadia et al., 2010).

Figure 1: Level 1: the whole image; Level 2: 2x2 grid + 1 subregion in the center; Level 3: 4x4 grid + 5 overlapping subregions (blue border rectangles).
Motivated by the aforementioned issues, we propose a new learning method, a cascade of multi-level multi-instance classifiers (CMLMI), for image annotation. The idea behind our approach is that coarser levels provide better descriptions of background and common concepts such as "forest, building, moun-
tain", while finer levels bring useful information about specific objects such as "tiger, car, bear". Given an object, the cascade method ensures that we first detect the object's related scene, then focus on the "likely" scenes to further recognize the object in that context. Formally, cascading means that learning classifiers at finer levels (e.g., level 3) depends on classifiers at coarser levels (e.g., levels 1 and 2), i.e., learning from coarse to fine. By so doing, when learning classifiers for specific objects at finer levels, we can ignore (negative) samples of non-related scenes, thus reducing training time. Since the negative examples are then images of the same scene without the considered object, there is more chance to separate the object from the background. For instance, since a "tiger" usually appears in a forest, negative examples of forest backgrounds, which do not contain a "tiger", help recognize "negative" regions (forest regions) in the positive examples of "tiger". As a result, this improves the selection of regions corresponding to "tiger" and reduces the ambiguity of "weakly labeling".
Specifically, our proposal consists of two main parts: 1) multi-level feature extraction; and 2) a cascade of multi-instance classifiers over multiple levels. Multi-level means we divide images into different levels of granularity, from the coarsest one (the whole image) to increasingly fine subregions (Figure 1). Several feature extraction algorithms are performed at each level; each algorithm produces a set of feature vectors corresponding to subregions of the image. Given a label, a cascade of multi-level multi-instance classifiers is then built across levels, from the cheapest (coarsest) features to the most expensive (finest) features. Here, features extracted from the whole image (level 1) are called global features.
In the literature, cascades of classifiers were successfully used to design fast object detectors (Viola and Jones, 2001), while multi-level features were applied to image classification (Lazebnik et al., 2009) and object recognition (Torralba et al., 2010). To the best of our knowledge, however, this is one of the first attempts to adopt a hierarchy of multi-level feature extraction that groups features according to acquisition cost so as to develop a cascade learning algorithm for image annotation. In comparison with previous cascading algorithms, we take the "weakly labeling" problem into account by using MIL, which makes the cascading algorithm suitable for image annotation. In addition, our approach is more robust than previous MIL methods because multi-level feature extraction allows us to cope with the variety in visual representation among labels. The advantages are thus threefold: 1) reduced training time thanks to the cascade learning algorithm; 2) relaxed ambiguity of the "weakly labeling" problem of image annotation; and 3) strong classifiers that are robust to multiple resolutions.
The rest of this paper is organized in six sections. Section 2 summarizes typical approaches to image annotation and related tasks. Multi-level feature extraction and multi-instance learning are presented in Section 3 and Section 4. Our proposed method for image annotation is given in detail in Section 5. Experimental results are given in Section 6. Finally, Section 7 concludes with the important remarks of our work.
2 RELATED WORKS

Image annotation and related tasks (object recognition, image classification and retrieval) have been active topics for more than a decade and have led to several noticeable methods. In the following, we present an overview of typical approaches, which are categorized into 1) classification-based methods; and 2) joint distribution-based methods.
2.1 Classification-based Approach

Early efforts in the area formalized image annotation as a binary classification task, for example classifying images into "indoor" or "outdoor" (Szummer and Picard, 1998). In object recognition, Viola and Jones (Viola and Jones, 2001) proposed a method for face detection (face/non-face classification) using AdaBoost, which quickly drops non-face windows in images and thus results in fast face detectors.
For image retrieval, the two-class formalization is not enough to meet search requirements. Kennedy and Chang (Kennedy and Chang, 2007) used a reranking method to combine binary classifiers. Akbas and Vural (Akbas and Vural, 2007) fused binary classifiers by learning a new meta classifier from category-membership vectors generated from the binary classifiers. Nguyen et al. (Nguyen et al., 2010) proposed a feature-word-topic model in which one individual classifier is learned for each label based on visual features. By modeling topics of words, the authors then refine the results from binary classifiers to obtain topic-oriented annotation for later image retrieval.

In order to apply the classification approach to image annotation, we need to take the "weakly labeling" problem into account. Typically, this can be done by adopting multi-instance learning (MIL) instead of single-instance learning. Andrews et al. (Andrews et al., 2002) adapted the single-instance learning version
of the Support Vector Machine (SVM) to multi-instance learning versions, namely MI-SVM and mi-SVM, and applied them to image annotation with 3 classes (tiger, fox, elephant). In another attempt, Yang et al. (Yang et al., 2006) introduced the Asymmetric SVM (ASVM), which poses different loss functions for the two types of error (false positives and false negatives) in annotation. ASVM has been applied to the 70 most common labels of Corel5K, a common benchmark for image annotation, and has shown comparative results. Supervised multiclass labeling (SML) (Carneiro et al., 2007) also follows the idea of MIL but does not consider negative examples in learning binary classifiers. Given a label, SML relies on hierarchical Gaussian mixtures to train a binary classifier using only positive examples. Since only global features are used in SML, it is not clear whether SML works well for specific objects, although on average it showed state-of-the-art performance on Corel5K. All in all, current MIL-based image annotation systems do not exploit the benefit of combining global and region-based features.
2.2 Joint Distribution-based Approach

Statistical generative models introduce a set of latent variables to define a joint distribution between visual features and labels for image annotation. Jeon et al. (Jeon et al., 2003) proposed the Cross-Media Relevance Model (CMRM) for image annotation. This work relies on normalized cuts to segment images into regions, then clusters visual descriptors of segments to build blobs. CMRM uses training images as latent variables to estimate the joint distribution between blobs and words. The Continuous Relevance Model (CRM) (Lavrenko et al., 2003) is another relevance model, but it differs from CMRM in that it directly models the joint distribution between words and continuous visual features using non-parametric kernel density estimation. As a result, it is not as sensitive to quantization errors as CMRM. These methods (CMRM, CRM) are also referred to as keyword propagation methods since they transfer keywords of the nearest neighbors (in the training dataset) to a given new image. The drawback of these methods is that the annotation time depends linearly on the size of the training set; they thus have limited scalability (Carneiro et al., 2007).
Following this approach, topic model-based methods (Blei and Jordan, 2003; Monay and Gatica-Perez, 2007) use hidden topics (concepts/aspects) rather than training images as latent variables. These methods also rely on either quantized features (Monay and Gatica-Perez, 2007) or continuous variables (Blei and Jordan, 2003). The main advantages of the topic model-based approach lie in two points: 1) better scalability in comparison with propagation methods; and 2) the ability to encode scene settings (via topics) into image annotation.

Whether topic-based or propagation-based, the disadvantage of the joint distribution-based approach is its lack of direct modeling between visual features and labels, which makes it difficult to optimize annotation (Carneiro et al., 2007). In order to study the impact of feature extraction on different types of labels, it is more appropriate to follow the multiple-instance learning methods mentioned in the section on the classification-based approach.
3 MULTI-LEVEL FEATURE EXTRACTION

As stated previously, our method consists of two main parts: 1) multi-level feature extraction; and 2) a cascade of multi-instance classifiers over levels. This section reviews noticeable methods to extract visual descriptors for image annotation, classification and retrieval as a foundation for our multi-level feature extraction described later. We distinguish three types of visual descriptors: global features, region-based features, and hybrids.
Global Feature Extraction: the image is not divided into subregions. As a result, we obtain only one feature vector (one instance) per image. Many low-level features, such as color histograms, texture, or edge histograms, can be extracted from the whole image and concatenated (Deselaers et al., 2008; Makadia et al., 2010; Douze et al., 2009; Jégou et al., 2010; Akbas and Vural, 2007). Bag-of-features representations (Hofmann, 1999; Deselaers et al., 2008), obtained by quantizing features at interest points, can also be classified into this category because the image is not divided into smaller regions and has only one histogram feature vector. A recent baseline in image annotation (Makadia et al., 2010) also relied on global feature extraction. However, the authors did not concatenate feature vectors but combined similarities from different feature types to measure the similarity between images for K-nearest-neighbor based image annotation.
Local Feature Extraction: an image is divided into smaller regions using image segmentation (Barnard et al., 2003; Duygulu et al., 2002; Jeon et al., 2003) or grid-based division. A feature vector is then extracted from each subregion (Feng et al., 2004). As a result, an image has several feature vectors, one per subregion. Since image segmentation is still a difficult task, many current works avoid this
task and divide images into grids instead (Feng et al., 2004; Jeon et al., 2004). A previous study (Feng et al., 2004) has shown that grid-based division can obtain better results than segmentation on the Corel5K benchmark.
Hybrid Method: the spatial pyramid method (Lazebnik et al., 2009) can be considered a hybrid of local and global representations. Informally, an image is divided into a sequence of increasingly fine grids, and the weighted histograms of all cells (of the grids) are concatenated into one vector. This method has been applied to scene classification and image classification with little ambiguity, tasks which do not suffer from "weakly labeling" as image annotation does. Although our approach also divides images into grids of different granularity (levels) and extracts features at each level, the difference is that we do not concatenate the feature vectors from different levels but exploit the hierarchy to group feature sets according to acquisition cost. As a result, we are able to develop a cascade algorithm for image annotation.
4 MULTI-INSTANCE LEARNING WITH SUPPORT VECTOR MACHINES

Multi-instance learning is essential to our proposal. This section begins with standard supervised learning with the Support Vector Machine (SVM), which is single-instance learning, and then presents one extension that turns the SVM into a multi-instance SVM.
In standard supervised learning, it is often the case that we are given a training set of labeled instances (samples) $D = \{(x_i, y_i) \mid i = 1,\dots,N;\; x_i \in \mathbb{R}^d;\; y_i \in Y = \{+1, -1\}\}$ and the objective is to learn a classifier, i.e., a function from instances to labels: $h : \mathbb{R}^d \to Y$. This class of supervised learning belongs to single-instance learning, where the Support Vector Machine (SVM) (Schölkopf et al., 1999) is one of the most successful methods.
Multiple Instance Learning (MIL) generalizes single-instance learning to cope with ambiguity in the training dataset. Instead of receiving a set of labeled instances, we are given a set of negative/positive bags, each containing many instances. A negative bag contains only negative instances, while a positive bag has at least one positive instance, but we do not know which one it is. This formalization of MIL naturally fits the "weakly labeling" in image annotation, where a positive bag (w.r.t. a label) corresponds to an image annotated with that label. There are several methods for MIL. For simplicity, we will discuss one
simple formalization that applies the SVM to MIL, namely MI-SVM (Andrews et al., 2002).

Figure 2: Support Vector Machines: (a) single-instance learning, showing the hyperplanes $\{x \mid (w \cdot x) + b = +1\}$, $\{x \mid (w \cdot x) + b = 0\}$ and $\{x \mid (w \cdot x) + b = -1\}$ separating instances with $y_i = +1$ from those with $y_i = -1$; (b) multiple-instance learning: positive and negative bags are denoted by circles and triangles, respectively.
4.1 Support Vector Machines
In Support Vector Machines (Schölkopf et al., 1999), a class of hyperplanes that separate negative and positive patterns (Figure 2) is considered. For the separable case, a hyperplane represented by a pair $(a, b)$ ($a \in \mathbb{R}^d$ and $b \in \mathbb{R}$) satisfies:

$$a \cdot x_i + b \geq +1 \quad \text{if } y_i = +1$$
$$a \cdot x_i + b \leq -1 \quad \text{if } y_i = -1$$

The corresponding decision function becomes $f(x) = \operatorname{sgn}(a \cdot x + b)$. Among the hyperplanes that are able to separate positive and negative patterns, the optimal hyperplane is the one with maximum margin, which is most likely to have minimum test error (Schölkopf et al., 1999). It has been proved that the margin of a hyperplane is inversely proportional to $\|a\|$. In practice, a separating hyperplane may not exist, i.e., the data is non-separable; in this case, positive slack variables $\xi_i$ are introduced to allow misclassified examples. The optimization turns into:

$$\text{minimize: } \frac{1}{2}\|a\|^2 + C \sum_{i=1}^{N} \xi_i$$
$$\text{subject to: } y_i (a \cdot x_i + b) \geq 1 - \xi_i, \quad i = 1, \dots, N$$

where C is a constant determining the trade-off. SVMs can also carry out nonlinear classification by using kernel functions that embed the data into a higher-dimensional space where the nonlinear pattern becomes linear.
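To make the single-instance case concrete, here is a minimal sketch of training such a soft-margin SVM with an RBF kernel; we use scikit-learn's SVC purely for illustration, as the paper does not prescribe an implementation:

```python
import numpy as np
from sklearn.svm import SVC

# Toy single-instance training set: x_i in R^2, y_i in {+1, -1}.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = np.array([-1, -1, +1, +1])

# C controls the trade-off between margin width and the slack penalty;
# the RBF kernel implicitly embeds the data into a higher-dimensional
# space where a nonlinear pattern can appear linear.
clf = SVC(kernel="rbf", C=1.0).fit(X, y)
print(clf.predict([[0.1, 0.05], [0.95, 1.0]]))  # expected: [-1  1]
```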
4.2 Multiple Instance Support Vector Machines
Let $D_w = \{(X_i, Y_i) \mid i = 1,\dots,N;\; X_i = \{x_j\};\; Y_i \in \{+1, -1\}\}$ be a set of images (bags) with/without word w, where a bag $X_i$ of instances $x_j$ is positive ($Y_i = 1$) if at least one instance $x_j \in X_i$ has a positive label $y_j$ (the subregion in the image corresponds to word w). As shown in Figure 2b, positive bags are denoted by circles and negative bags
are marked as triangles. The relationship between instance labels and bag labels can be expressed compactly as $Y_i = \max_{j = 1,\dots,|X_i|} y_j$.
MI-SVM (Andrews et al., 2002) extends the notion of the margin from an individual instance to a set of instances (Figure 2b). The functional margin of a bag with respect to a hyperplane is defined in (Andrews et al., 2002) as follows:

$$Y_i \max_{x_j \in X_i} (a \cdot x_j + b)$$

The prediction then has the form $Y_i = \operatorname{sgn} \max_{x_j \in X_i} (a \cdot x_j + b)$. The margin of a positive bag is the margin of its most positive instance, while the margin of a negative bag is defined by its "least negative" instance. Keeping this definition of bag margin in mind, the Multiple Instance SVM (MI-SVM) is defined as follows:
$$\text{minimize: } \frac{1}{2}\|a\|^2 + C \sum_{i=1}^{N} \xi_i$$
$$\text{subject to: } Y_i \max_{x_j \in X_i} (a \cdot x_j + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \dots, N$$
By introducing selector variables $s_i$, which denote the instance selected as the positive "witness" of a positive bag $X_i$, Andrews et al. derived an optimization heuristic. The general scheme alternates two steps: 1) for given selector variables, train an SVM based on the selected positive instances and all negative ones; 2) based on the currently trained SVM, update the selector variables. The process finishes when the selector variables no longer change.
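The sketch below illustrates this alternating heuristic. It assumes scikit-learn's SVC as the inner single-instance SVM and NumPy arrays for bags; the function name, the initial choice of witness, and the iteration cap are our own assumptions rather than details from (Andrews et al., 2002):

```python
import numpy as np
from sklearn.svm import SVC

def mi_svm(bags, bag_labels, C=1.0, max_iter=20):
    """MI-SVM optimization heuristic (sketch).

    bags: list of (n_i, d) arrays of instances; bag_labels: +1/-1 per bag.
    """
    # Step 0: initialize selectors; here we arbitrarily pick the first
    # instance of each positive bag as its positive "witness".
    selectors = {i: 0 for i, y in enumerate(bag_labels) if y == +1}
    clf = None
    for _ in range(max_iter):
        # Step 1: train an SVM on the selected positive instances
        # and on all instances of the negative bags.
        X, y = [], []
        for i, bag in enumerate(bags):
            if bag_labels[i] == +1:
                X.append(bag[selectors[i]])
                y.append(+1)
            else:
                X.extend(bag)
                y.extend([-1] * len(bag))
        clf = SVC(kernel="rbf", C=C).fit(np.vstack(X), y)
        # Step 2: re-select the most positive instance of each positive bag.
        new_selectors = {i: int(np.argmax(clf.decision_function(bags[i])))
                         for i in selectors}
        if new_selectors == selectors:  # stop when selectors are stable
            return clf
        selectors = new_selectors
    return clf
```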
5 CASCADE OF MULTI-LEVEL MULTI-INSTANCE CLASSIFIERS

5.1 Notation and Learning Algorithm
Let $D = \{(I_1, \mathbf{w}_1), \dots, (I_N, \mathbf{w}_N)\}$ be a training dataset, in which $\mathbf{w}_n$ is the set of words associated with image $I_n$ and sampled from a vocabulary $V = \{w_1, w_2, \dots, w_{|V|}\}$. The objective is to learn a mapping function from the visual space to the word space so that we can index and rank new images for text-based retrieval. The two main components of our proposal are described as follows:
Extracting Multi-level Features: we divide each image into T levels of granularity and then perform M feature extraction algorithms $F_m$, as in Figure 3 (a minimal sketch of this division is given after these two components). Here, we can choose any suitable feature extraction method, such as color, texture, shape description, gist, etc., for $F_m$. Let $\mathcal{M}(l)$ ($l = 1, \dots, T$) be the set of indexes of the feature extractions at level l, e.g., $\mathcal{M}(1) = \{1, 2\}$ and $\mathcal{M}(2) = \{3, 4, 5\}$ (Figure 3). From this notation, we have $\sum_{l=1}^{T} |\mathcal{M}(l)| = M$. We can also infer that all the feature extraction algorithms at levels coarser than level l are indexed from 1 to $\min\{\mathcal{M}(l)\} - 1$.

Figure 3: An image is divided into different levels of granularity. For each level, we perform one or more feature extraction methods $F_1, \dots, F_M$, each producing a feature space $X_1, \dots, X_M$; in total we obtain M feature extraction methods.
Cascade of Multi-instance Classifiers over Levels: given a label w, $D_w = \{B^+, B^-\}$ denotes a training dataset where $B^+$ ($B^-$) is the set of images with (without) w. Let Y be the vector of corresponding classes of the images in $D_w$, i.e., $Y_n = 1$ if $I_n \in B^+$ and $Y_n = -1$ otherwise. Let score be the output (confidence) vector generated by the classifiers, where $score_n > 0$ (respectively $score_n < 0$) gives the confidence of assigning (not assigning) w to $I_n \in D_w$. We denote by $h_m$ the weak classifier that maps from the feature space $X_m$ of feature extraction algorithm $F_m$ to $\{-1, 1\}$. The confidence score posed by $h_m$ on an image I is denoted by $h_m(F_m(I))$; that is, we apply $h_m$ to the feature vectors obtained by $F_m$ on I. Based on these notations, CMLMI is presented in Algorithm 1. Note that multi-instance learning turns into single-instance learning at the coarsest level, where a global feature vector is in use.
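To make the grid division of Figure 1 concrete, the following is a minimal sketch for the first two levels; the function name and the NumPy representation are our own illustration:

```python
import numpy as np

def subregions(img, level):
    """Divide an image into overlapping subregions, following Figure 1.

    img: an H x W x 3 array. Level 1 keeps the whole image (one global
    instance); level 2 yields a 2x2 grid plus one centered, overlapping
    cell of the same size.
    """
    H, W = img.shape[:2]
    if level == 1:
        return [img]
    h, w = H // 2, W // 2
    cells = [img[r * h:(r + 1) * h, c * w:(c + 1) * w]
             for r in range(2) for c in range(2)]
    center = img[H // 4:H // 4 + h, W // 4:W // 4 + w]  # overlaps all 4 cells
    return cells + [center]
```

Each subregion is then fed to the feature extractors $F_m$ of its level, so an image yields one instance at level 1 and five instances at level 2.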
For global feature extraction at level l = 1, an image has one instance (one feature vector), and the problem turns into normal supervised learning; we apply the SVM in this case. At finer levels (l > 1), an image has a set of instances, one per subregion. Due to weak labeling, we do not know which instance best represents the given label. The multiple-instance version of the SVM (MI-SVM; see Section 4) is used to address this ambiguity.
We update the scores of images in $D_w$ at level l using the following recursion:

$$score = H_l = \gamma_l H_{l-1} + \sum_{m \in \mathcal{M}(l)} \alpha_m h_m + c_l$$
Algorithm 1: A Cascade of Multi-Level Multi-Instance Classifiers.

Input: A set $D_w = \{B^+, B^-\}$ of positive and negative examples for word w.
Output: A strong classifier $H_w$ for w.

1   Initialize $score_n = 0$, $\theta_i = 1/|B^-|$, $c = 0$, and $\alpha_m = 0$ for $n = 1,\dots,|B|$; $i = 1,\dots,|B^-|$; and $m = 1,\dots,M$.
2   // Learn weak classifiers over T levels
3   for $l \leftarrow 1$ to T do
4       if $l == 1$ then
5           Learn classifiers $h_m$ using SVM from $D_w$ for all $m \in \mathcal{M}(l)$
6       if $l > 1$ then
7           Sample a smaller set $SB^-$ from $B^-$ according to $\theta$
8           Learn classifiers $h_m$ using MI-SVM from $SD_w = \{B^+, SB^-\}$ for all $m \in \mathcal{M}(l)$
9       end
10      // Update score for all images in $D_w$
11      Set $score_n = \gamma_l \, score_n + \sum_{m \in \mathcal{M}(l)} \alpha_m h_m(F_m(I_n)) + c_l$ for $n = 1,\dots,|D_w|$
12      Find coefficients $\gamma_l > 0$, $\alpha_m$, and $c_l$ to minimize $\|score - Y\|^2$
13      // Update coefficients of classifiers in previous levels
14      for $m' = 1$ to $\min\{\mathcal{M}(l)\} - 1$ do
15          $\alpha_{m'} = \alpha_{m'} \gamma_l$
16      end
17      Update the overall threshold $c = c \, \gamma_l + c_l$
18      Sort score in descending order, and let $r_j$ be the ranking position of $I_j \in B^-$ in the sorted score
19      Update $\theta_j \leftarrow \theta_j \cdot 1/r_j$ for all $j = 1,\dots,|B^-|$ and normalize $\theta$ so that $\sum_j \theta_j = 1$
20  end
21  Final strong classifier:

$$H_w = \frac{\sum_{m=1}^{M} \alpha_m h_m + c}{\sum_{m=1}^{M} \alpha_m + c}$$
Since we have the constraint $\gamma_l > 0$, the ranking of images is based on the previous ranking ($H_{l-1}$) but modified by the additional classifiers of the current level (the second term). The constant term $c_l$ is used as the threshold for level l. We then find coefficients for the classifiers of level l using linear regression, that is, by minimizing the squared error $\|H - Y\|^2$ (lines 11 and 12 in Algorithm 1). Here, the scores for images in $D_w$ accumulated from level 1 to level l-1 are stored in score.
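This regression step can be sketched as a bounded linear least-squares problem that enforces $\gamma_l > 0$; we use scipy.optimize.lsq_linear for illustration, and all names below are our own:

```python
import numpy as np
from scipy.optimize import lsq_linear

def fit_level_coefficients(prev_score, weak_outputs, Y, eps=1e-6):
    """Fit (gamma_l, alpha_m..., c_l) for one cascade level (sketch).

    prev_score: (N,) accumulated scores H_{l-1};
    weak_outputs: (N, K) outputs h_m(F_m(I_n)) of the level's K weak
    classifiers; Y: (N,) ground-truth labels in {-1, +1}.
    """
    N, K = weak_outputs.shape
    # Design matrix [H_{l-1} | h_1 ... h_K | 1]; unknowns [gamma, alphas, c].
    A = np.column_stack([prev_score, weak_outputs, np.ones(N)])
    lower = np.r_[eps, np.full(K + 1, -np.inf)]  # bound gamma_l > 0 only
    res = lsq_linear(A, Y, bounds=(lower, np.full(K + 2, np.inf)))
    gamma_l, alphas, c_l = res.x[0], res.x[1:-1], res.x[-1]
    return gamma_l, alphas, c_l
```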
Unlike previous boosting methods, the sampling distribution $\theta$ on $B^-$ is updated based on the ranking positions of the negative samples in the sorted score rather than on the scores themselves (lines 18 and 19). As a result, a negative example at a higher rank is weighted more than negative examples at lower ranks. From the experiments, we see that this ranking-based scheme is better than a score-based one for unbalanced training sets.
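A sketch of this ranking-based update of $\theta$ (lines 18 and 19 of Algorithm 1); the variable names are ours:

```python
import numpy as np

def update_sampling_weights(theta, scores, neg_idx):
    """Reweight negative bags by their rank in the sorted scores (sketch).

    theta: (|B-|,) current sampling distribution; scores: (|D_w|,) scores
    of all images in D_w; neg_idx: indices of the negative images.
    """
    # Rank 1 = highest score, i.e., a negative most mistaken for a positive.
    ranks = np.empty(len(scores), dtype=int)
    ranks[np.argsort(-scores)] = np.arange(1, len(scores) + 1)
    theta = theta / ranks[neg_idx]     # theta_j <- theta_j * 1/r_j
    return theta / theta.sum()         # normalize to a distribution
```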
5.2 Detailed Analysis

This section presents a theoretical analysis of our algorithm, which focuses on the benefit of CMLMI in training time and shows that our algorithm is suitable for image annotation.

Based on the cascading scheme, it is obvious that our method requires less training time than learning all individual classifiers independently. The training time of MI-SVM depends on $|B^+| + NR \cdot |B^-|$, where NR is the number of subregions per image. Since NR is larger at finer levels, the domination of negative instances over positive ones becomes even more serious there. Training MI-SVM in the cascade with $SB^-$ (line 7 in Algorithm 1) is therefore more efficient than training an independent one with $D_w$.
Besides its advantage in training time, our method is suitable for image annotation and able to reduce the ambiguity of weak labeling. When the coarse levels are in charge of detecting the related context of the given label, the finer levels are able to focus on sample images of similar scenes to separate the object from the background, thus reducing the ambiguity caused by weak labeling. Figure 4 demonstrates this idea. Here, circles again denote positive bags, in which we know positive instances exist but do not know which ones they are, and triangles denote negative bags, for which we have a guarantee that all instances are negative. The negative bag selected here is the one with instances close to some instances of one positive bag (the red circle).
Figure 4: Negative bags that share common negative instances with positive bags reduce ambiguity. Here the stars denote instances of unknown class (either positive (+) or negative (-)).
The common/similar instances correspond to subregions of the shared/similar background of the two bags. Since we know that all instances of the negative bag are negative, we can conclude that the instances of the red circle that are close to, or even included in, the negative bag are negative. Together with the similarity among positive bags, which contain the same object, this information helps to obtain a better hyperplane to separate negative and positive instances. To the best of our knowledge, this is one of the first attempts to exploit the similarity between negative bags and positive bags to reduce ambiguity in MIL. Most previous approaches in MIL only made use of the similarity among positive bags to deal with the ambiguity. For example, (Carneiro et al., 2007) uses only positive bags to generalize a dominating distribution over positive bags. (Maron and Lozano-Pérez, 1998) finds regions in the instance space with instances from many different positive bags that are far away from instances of negative bags. In (Yang et al., 2006; Andrews et al., 2002), negative bags are sampled randomly only to cope with the domination of negative examples over positive examples, without giving notice to negative bags that share backgrounds with positive bags. Recently, (Deselaers and Ferrari, 2010) also followed the idea that a significant portion of positive instances will result in a reasonable classifier performing better than chance. However, we observe that some negative instances, namely those corresponding to common backgrounds, also amount to a significant portion. This problem becomes more serious as more and more labels are taken into consideration, as in image annotation.
6 EXPERIMENTS

6.1 Corel5K Dataset

The Corel5K benchmark is derived from the Corel image database and is commonly used for image annotation (Duygulu et al., 2002). It contains 5,000 images and was pre-divided into a training set of 4,000 images, a validation set of 500 images, and a test set of 500 images. Each image is labeled with 1 to 5 captions from a vocabulary of 374 distinct words.
6.2 Evaluation

Given a testing dataset, we can measure the effectiveness of the algorithm. Regarding a label w, the typical measures for retrieval are precision $P_w$ and recall $R_w$:

$$P_w = \frac{\text{Number of images correctly annotated with } w}{\text{Number of images annotated with } w}$$

$$R_w = \frac{\text{Number of images correctly annotated with } w}{\text{Number of images manually annotated with } w}$$

We calculate P and R as the means of $P_w$ and $R_w$ over all labels. To balance the trade-off between P and R, $F_1 = 2PR/(P + R)$ is usually used as another evaluation measure. In order to measure retrieval performance, we also calculate the average precision (AP) for a label w as follows:

$$AP_w = \frac{\sum_{r=1}^{N} P(r) \times rel(r)}{\text{Number of images manually annotated with } w}$$

where r is a rank, N is the number of retrieved images, $rel(r)$ is a binary function indicating whether the image at rank r is manually annotated with w, and $P(r)$ is the precision at rank r. Note that the denominator of AP is independent of N. Finally, mAP is obtained by averaging the APs over all labels of the testing dataset.
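For reference, a small sketch of computing $AP_w$ for one label from ranked scores and binary per-image relevance; the implementation details are our own:

```python
import numpy as np

def average_precision(scores, relevant):
    """AP for one label w (sketch).

    scores: (N,) confidence of annotating each test image with w;
    relevant: (N,) binary array, 1 if the image is manually annotated
    with w. The denominator is the total number of relevant images,
    so AP is independent of the number N of retrieved images.
    """
    order = np.argsort(-scores)                # rank images by score
    rel = np.asarray(relevant)[order]
    precision_at_r = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    return (precision_at_r * rel).sum() / max(rel.sum(), 1)
```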
Table 1: Feature extractions & classifiers.

Level 1   F1: "gist" of scene    SVM-Gist
Level 1   F2: color histogram    SVM-color
Level 2   F3: color histogram    MISVM-color
Level 2   F4: Gabor texture      MISVM-texture
6.3 Experimental Settings

For the experiments, we performed a cascade of 4 classifiers over 2 levels. We worked with only 2 levels because the images of Corel5K are all small. Moreover, we would like to focus on this basic case to analyze the impact of global features on reducing the weak labeling problem. At the first level, global features were extracted from the whole image; we exploited Gist (Oliva and Torralba, 2001) and a color histogram in RGB color space with 16 channels. For each region at the second level, we also performed color histogram extraction, but with 8 channels,
Table 2: CMLMI vs. various MIL methods.

(a) Comparison with other standalone MIL methods. Results of ASVM-MIL and mi-SVM are reported in (Yang et al., 2006).

Method          P     R     F1    mAP
ASVM-MIL        0.31  0.39  0.35  -
mi-SVM          0.28  0.35  0.31  -
MISVM-Color     0.13  0.55  0.21  0.19
MISVM-Texture   0.07  0.36  0.13  0.08
CMLMI           0.30  0.52  0.38  0.35

(b) Comparison with standalone SVMs with global features.

Method          P     R     F1    mAP
SVM-Color       0.20  0.39  0.27  0.19
SVM-Gist        0.27  0.47  0.34  0.28
CMLMI           0.30  0.52  0.38  0.35
and texture extraction using a Gabor filter as in (Makadia et al., 2010). A summary of the feature extraction methods and their relationship to the levels is given in Table 1. The numbers of dimensions of the corresponding feature spaces of algorithms $F_1$, $F_2$, $F_3$, and $F_4$ are 960, 4096, 192, and 512, respectively.
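As an illustration of the color feature, a joint RGB histogram with 16 bins per channel has $16^3 = 4096$ dimensions, which matches the dimensionality reported for $F_2$; the following sketch rests on that assumption about the original settings:

```python
import numpy as np

def rgb_joint_histogram(region, bins=16):
    """Joint RGB histogram over a (sub)region (sketch).

    bins = 16 per channel gives a 16**3 = 4096-dimensional vector
    (our assumption for F_2); 8-bit RGB input is assumed.
    """
    pixels = region.reshape(-1, 3).astype(float)
    hist, _ = np.histogramdd(pixels, bins=(bins,) * 3,
                             range=[(0, 256)] * 3)
    hist = hist.ravel()
    return hist / max(hist.sum(), 1.0)  # L1-normalize
```

The 192 dimensions reported for $F_3$ suggest that the level-2 color histogram is laid out differently (e.g., per channel), so this sketch should be read as one plausible reading of the settings.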
We name the classifiers trained on the feature spaces of $F_1$, $F_2$, $F_3$, and $F_4$ SVM-Gist, SVM-color, MISVM-color, and MISVM-texture, respectively. Conventionally, CMLMI indicates the strong classifier $H_w$ learned according to Algorithm 1, in which the classifiers of level 2 (MISVM-color and MISVM-texture) are dependent on the classifiers of level 1 (SVM-Gist and SVM-color). In the following, we refer to, for example, MISVM-color (or standalone MISVM-color) to indicate an independent classifier trained on $D_w$, and to MISVM-color of CMLMI to mean the MISVM-color learned in the cascade according to Algorithm 1. In other words, MISVM-color of CMLMI is the classifier trained on $SD_w$, sampled based on the results of level 1 (SVM-Gist and SVM-color of CMLMI).
6.4 Experimental Results on the 70 Most Common Labels

Like (Yang et al., 2006), we selected the 70 most common labels from the Corel5K dataset for the experiments. The reason is that labels with a small number of positive samples (for example, 5-10 positive samples) are not sufficient to train an efficient classifier.

Table 2(a) shows that CMLMI outperforms the other MIL methods. As observable from the table, we obtain improvements of 17.35% in the $F_1$ measure and 16.14% in mAP compared to MISVM-color. Compared to MISVM-texture, CMLMI significantly increases the $F_1$ measure by 25.64% and mAP by 26.71%.
Compared with previous works, CMLMI obtains better results than mi-SVM in both precision and recall, which leads to a gain of 7.54% in the $F_1$ measure.
Figure 5: mAP of CMLMI in comparison with the different standalone methods (SVM-Gist, SVM-Color, MISVM-Color, MISVM-Texture) on the labels "tiger", "horses", and "bear".
Also, CMLMI outperforms ASVM-MIL in recall while obtaining comparable precision (P of 0.30 with CMLMI vs. 0.31 with ASVM-MIL). This results in an improvement of 3.54% of our method over ASVM-MIL in the $F_1$ measure.
Table 2(b) compares CMLMI to SVMs with global features. We can see that CMLMI also obtains better results in $F_1$ and mAP ($F_1$ of 0.38 and mAP of 0.35) compared with SVM-color ($F_1$ of 0.27 and mAP of 0.19) and SVM-Gist ($F_1$ of 0.34 and mAP of 0.28). Among the standalone classifiers (SVM-color, SVM-Gist, MISVM-color, and MISVM-texture), the SVMs with global features outperform the MISVMs with region-based feature extraction. Interestingly, SVM-Gist is even comparable to ASVM-MIL, although image segmentation, which is more expensive than global feature extraction, was used in ASVM-MIL. However, combining the classifiers in our cascading algorithm yields the best results.
6.5 Experimental Results on Sample Foreground Labels

We conducted a careful analysis for "tiger", "horses" and "bear" in Corel5K, since these concepts correspond to foreground objects that might benefit from the finer levels. Figure 5 shows the mAP of the standalone classifiers and of CMLMI for the three labels. It can be seen that individual feature types have different influences on different labels. Except for Gist ($F_1$), which is important for all three labels, the global color histogram ($F_2$) has more impact on annotating images with "horses" and "bear" than with "tiger". The texture feature at level 2 (of MISVM-texture) performs better than the other feature extraction methods only for "tiger". CMLMI significantly outperforms the other standalone classifiers on "tiger" and "bear" while falling slightly behind SVM-color on "horses".
Figure 6: (a) The subregions selected by standalone MISVM-color for the label "tiger", and (b) the subregions selected by MISVM-color of CMLMI from the corresponding images. The numbers under each subregion indicate image IDs.

Figure 7: (a) The subregions selected by standalone MISVM-texture for the label "horses", and (b) the subregions (of the corresponding images) selected by MISVM-texture at the second level of CMLMI.
Interestingly, standalone MISVM-color is comparable to CMLMI for "horses". To investigate the "horses" case, we conducted a detailed analysis and found that MISVM-color and SVM-color captured the grass fields in the background instead of the horses. Indeed, no subregion with the color of a horse was selected by MISVM-color. Thus, the good performance of standalone MISVM-color and SVM-color owes to a peculiarity of the Corel5K dataset, in which horses stand on grass fields in most pictures.

As previously mentioned, the negative examples for the finer levels are drawn based on the ambiguity of the coarser levels, which are able to detect the background better. By considering negative examples with similar backgrounds, we are able to add "negative instances" that usually appear together with the real positive instances of positive examples. As a result, there is more chance to separate the "positive instances" from the "negative instances" in positive examples. Figure 6 and Figure 7 show examples of selecting positive instances from the corresponding positive bags with standalone MISVM and with MISVM of CMLMI. We can see from the figures that MISVM of CMLMI is able to select more relevant subregions. In the case of "tiger", MISVM-color of CMLMI is given more information about the background (grass, forest, stone, water) and has thus successfully avoided selecting background-related instances as positive ones.
7 CONCLUDING REMARKS

In this paper, we have presented an overview of image annotation: its typical problems, feature extraction methods and typical methodologies. By analyzing the main problems of image annotation, we proposed a method based on cascading multi-level multi-instance classifiers, whose main advantages are as follows:

- Our cascade of MLMI classifiers is able to reduce training time, since at finer levels we can remove negative examples that are "easily" detected as negative based on the scene.

- Multi-level feature extraction allows us to annotate images at multiple resolutions. For example, a photo of a tiger might be a close-up or a photo of a tiger in its context; multi-level feature extraction brings more chances to capture all of this variety.
- We also show experimentally that our method is able to reduce the ambiguity of "weakly labeling" in image annotation and to separate the foreground objects from the scene in the finer levels of the cascade.

The experiments show promising results for the proposed method in comparison with several baselines on Corel5K. They suggest that as long as the finer levels bring "new information", they help to obtain better detection of foreground objects. For future work, we would like to focus more on the role of context in reducing the ambiguity of "weakly labeling".
REFERENCES
Akbas, E. and Vural, F. T. Y. (2007). Automatic image an-
notation by ensemble of visual descriptors. In IEEE
Conf. on CVPR, pages 1–8, Los Alamitos, CA, USA.
Andrews, S., Hofmann, T., and Tsochantaridis, I. (2002).
Multiple instance learning with generalized support
vector machines. In 18th AAAI National Conference
on Artificial intelligence, pages 943–944, Menlo Park,
CA, USA.
Barnard, K., Duygulu, P., Forsyth, D., de Freitas, N., Blei, D. M., and Jordan, M. I. (2003). Matching words and pictures. Journal of Machine Learning Research, 3:1107–1135.
Blei, D. M. and Jordan, M. I. (2003). Modeling annotated
data. In Proc. of the 26th ACM SIGIR, pages 127–134.
Carneiro, G., Chan, A. B., Moreno, P. J., and Vasconcelos,
N. (2007). Supervised learning of semantic classes for
image annotation and retrieval. IEEE Trans. PAMI,
29(3):394–410.
Deselaers, T. and Ferrari, V. (2010). A conditional random
field for multiple-instance learning. In Proc. of The
27th ICML, pages 287–294.
Deselaers, T., Keysers, D., and Ney, H. (2008). Features
for image retrieval: an experimental comparison. Inf.
Retr., 11:77–107.
Douze, M., Jégou, H., Sandhawalia, H., Amsaleg, L., and
Schmid, C. (2009). Evaluation of gist descriptors for
web-scale image search. In Proc. of the ACM CIVR,
pages 1–8, New York, NY, USA.
Duygulu, P., Barnard, K., de Freitas, J. F. G., and Forsyth,
D. A. (2002). Object recognition as machine transla-
tion: Learning a lexicon for a fixed image vocabulary.
In Proc. of the 7th ECCV, pages 97–112, London, UK.
Springer-Verlag.
Feng, S. L., Manmatha, R., and Lavrenko, V. (2004). Mul-
tiple bernoulli relevance models for image and video
annotation. In Proc. of the 2004 CVPR.
Hofmann, T. (1999). Probabilistic latent semantic indexing.
In Proc. of the 22nd ACM SIGIR, pages 50–57, New
York, NY, USA.
Jégou, H., Douze, M., and Schmid, C. (2010). Improving
bag-of-features for large scale image search. Int. J.
Comput. Vision, 87(3):316–336.
Jeon, J., Lavrenko, V., and Manmatha, R. (2003). Au-
tomatic image annotation and retrieval using cross-
media relevance models. In Proc. of the 26th int. ACM
SIGIR, pages 119–126.
Jeon, J., Lavrenko, V., and Manmatha, R. (2004). Auto-
matic image annotation of news images with large vo-
cabularies and low quality training data. In Proc. of
ACM Multimedia.
Kennedy, L. S. and Chang, S.-F. (2007). A reranking ap-
proach for context-based concept fusion in video in-
dexing and retrieval. In Proc. of the 6th ACM int. on
CIVR, pages 333–340, New York, NY, USA. ACM.
Lavrenko, V., Manmatha, R., and Jeon, J. (2003). A model
for learning the semantics of pictures. In Advances
in Neural Information Processing Systems (NIPS’03).
MIT Press.
Lazebnik, S., Schmid, C., and Ponce, J. (2009). Object Cat-
egorization: Computer & Human Vision Perspectives,
chapter Spatial Pyramid Matching. Cambridge Uni-
versity Press.
Makadia, A., Pavlovic, V., and Kumar, S. (2010). Base-
lines for image annotation. Int. J. Comput. Vision,
90(1):88–105.
Maron, O. and Lozano-Pérez, T. (1998). A framework for
multiple-instance learning. In Proc. of the Conf. on
Advances in Neural Information Processing Systems,
NIPS ’97, pages 570–576, Cambridge, MA, USA.
MIT Press.
Monay, F. and Gatica-Perez, D. (2007). Modeling semantic
aspects for cross-media image indexing. IEEE Trans.
Pattern Anal. Mach. Intell., 29(10):1802–1817.
Nguyen, C.-T., Kaothanthong, N., Phan, X.-H., and
Tokuyama, T. (2010). A feature-word-topic model for
image annotation. In Proc. of the 19th ACM CIKM,
pages 1481–1484.
Oliva, A. and Torralba, A. (2001). Modeling the shape of
the scene: A holistic representation of the spatial en-
velope. Int. J. of Comput. Vision, 42:145–175.
Schölkopf, B., Burges, C. J. C., and Smola, A. J., editors
(1999). Advances in kernel methods: support vector
learning. MIT Press, Cambridge, MA, USA.
Szummer, M. and Picard, R. W. (1998). Indoor-outdoor
image classification. In Proc. of the 1998 Int. Work-
shop on Content-Based Access of Image and Video
Databases, page 42, Washington, DC, USA.
Torralba, A., Murphy, K. P., and Freeman, W. T. (2010).
Using the forest to see the trees: exploiting context
for visual object detection and localization. Commun.
ACM, 53(3):107–114.
Viola, P. and Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. In Proc. of IEEE CVPR, volume 1, pages 511–518.
Yang, C., Dong, M., and Hua, J. (2006). Region-based
image annotation using asymmetrical support vector
machine-based multiple-instance learning. In Proc. of
the 2006 IEEE CVPR, pages 2057–2063, Washington,
DC, USA.