LEARNING SIMILARITY FUNCTIONS FOR EVENT
IDENTIFICATION USING SUPPORT VECTOR MACHINES
Timo Reuter and Philipp Cimiano
Semantic Computing Group, CITEC, Universität Bielefeld, Bielefeld, Germany
Keywords:
Clustering/classification, Machine learning, Similarity function, Event identification, Weight adjustment, Support vector machine, Data mining.
Abstract:
Every clustering algorithm requires a similarity measure, ideally optimized for the task in question. In this
paper we are concerned with the task of identifying events in social media data and address the question of
how a suitable similarity function can be learned from training data for this task. The task consists essentially
in grouping social media documents by the event they belong to. In order to learn a similarity measure using
machine learning techniques, we extract relevant events from last.fm and match the unique machine tags for
these events to pictures uploaded to Flickr, thus obtaining a gold standard where each picture is assigned to its
corresponding event. We evaluate the similarity measure with respect to accuracy on the task of assigning
a picture to its correct event. We use SVMs to train an appropriate similarity measure and investigate the
performance of different types of SVMs (Ranking SVMs vs. Standard SVMs), different strategies for creating
training data as well as the impact of the amount of training data and the kernel used. Our results show that
a suitable similarity measure can be learned from only a few examples, given a suitable strategy for creating
training data. We also show that Ranking SVMs i) can learn from fewer examples, ii) are more robust compared to standard SVMs in the sense that their performance does not vary significantly for different sizes and samples of training data, and iii) are not as prone to overfitting as standard SVMs.
1 INTRODUCTION
As the amount of data uploaded to social media portals such as Flickr (http://flickr.com), Youtube (http://youtube.com), Panoramio (http://panoramio.com), etc. keeps
proliferating, techniques for structuring this massive
content become crucial. Clustering approaches rep-
resent an important technique to organize social me-
dia data according to topics, for example by group-
ing Flickr data into clusters of pictures describing the
same event (Becker et al., 2010; Firan et al., 2010).
The latter task has been dubbed event identification
(Becker et al., 2010).
While this task can be formulated as a supervised
classification task (Firan et al., 2010), it is most natu-
rally modeled as a clustering task as the set of events
is not fixed, but changes over time. Previous re-
search has in fact shown that incremental clustering
approaches which assign a new data point to the most
similar cluster can be applied successfully to the event
identification problem (Becker et al., 2010).
In order to apply clustering algorithms to social media data, an appropriate similarity measure sim : D × D → [0..1] is needed, where D is the representation of some social media document.
This similarity measure can then be used in a clus-
tering algorithm of our choice. In order to improve
the performance of a clustering approach on a given
problem, it seems natural to optimize the similarity
measure and other parameters of the clustering algo-
rithm for the task at hand. This has been referred to
as supervised clustering (Finley and Joachims, 2005).
While some approaches only optimize the similarity
measure (Eick et al., 2005), other approaches even op-
timize the clustering parameters in a supervised fash-
ion by using a genetic algorithm to search efficiently
for a good clustering (Demiriz et al., 1999).
In this paper we are concerned with the optimiza-
tion of the similarity measure in the context of the
event identification task. Building on state-of-the-
art machine learning techniques, Support Vector Ma-
chines in particular, we analyze different approaches
to learn a similarity measure for the task of event iden-
tification. First of all, we analyze the impact of using
different types of Support Vector Machines, compar-
ing standard SVMs to the RankingSVMs by Joachims
(Joachims, 2002). Further, we show the impact of dif-
ferent strategies for constructing the training dataset
and investigate the performance of the learned models
depending on the number of training examples used.
We also analyze the impact of using a linear vs. RBF
kernel on the task and carry out a feature analysis.
In our experiments, we use a dataset derived from
Flickr consisting of 300,000 pictures. These pictures
have been uploaded to Flickr by users together with
a so-called machine tag that uniquely assigns these
pictures to an event from last.fm
4
. As we have knowl-
edge about which events pictures belong to, this data
is a natural choice to optimize the similarity measure
in a supervised fashion. We evaluate the similarity
function on the task of predicting the right cluster that
a new data item should be assigned to, using a sim-
ilar algorithm as presented by Becker et al. (Becker
et al., 2010). The main evaluation criterion is accuracy,
i.e. the percentage of data points that have been as-
signed to the correct cluster. In particular, we provide
the following contributions in this paper:
1. We directly compare standard SVMs and Ranking
SVMs on the task of inducing an appropriate simi-
larity measure for event identification on the basis
of training data consisting of assignments of pic-
tures to events. We show that the performance of
the RankingSVM is much more robust compared
to the performance of the standard SVM as the
results of the SVM vary significantly depending
on the amount of training data. In particular, we
show that the SVM is more prone to overfitting.
2. We show that Ranking SVMs can learn very good models from a small number of training examples, while a standard SVM needs considerably more examples.
3. We investigate different strategies for creating
training datasets and show that strategies which
use the temporal order of the data are better than a
random strategy for sampling training examples.
4. Finally, we show that linear kernels outperform
RBF kernels on both a standard SVM as well as
the RankingSVMs of Joachims (Joachims, 2002).
The paper is structured as follows: in Section 2
we describe the event identification problem more for-
mally and show how standard SVMs and Ranking
SVMs can be used to optimize the parameters of a
similarity measure. We describe our dataset and show
the results of our experiments in Section 3. In Sections 4 and 5 we discuss related work and conclude.
2 PROBLEM STATEMENT
The problem of event identification in social media has been introduced by Becker et al. (2010). They
formulate this problem as a supervised clustering
problem in which the underlying similarity function
and other parameters of the clustering algorithm are
optimized using labeled data.
Let E be the set of all event clusters. In principle we need a function a_t : D → E(t) which assigns a document d ∈ D to one of the events (clusters) available at time t. As the number of documents and event clusters is not known in advance, we want our decision to depend on a similarity function sim which assigns the document to the event cluster maximizing the similarity:

a_t(d) = arg max_{e ∈ E(t)} sim(d, e)
The similarity function to be learned has the following form:

sim : D × E → [0..1]
A central question is which model is assumed for
such a function. In our approach we follow Becker et
al. who assume a linear model for this function, using
features such as time, location, tags etc. The features
correspond essentially to simple similarity measures
for different dimensions. The similarity between a
picture and a centroid is calculated using the follow-
ing linear model:
sim(d, e) = w · v_sim(d, e),

where w is a weight vector and v_sim(d, e) is the feature vector for a document-centroid pair (see Section 2.1).

Therefore, the problem is to optimize the weight vector w (and thus the function sim) for the task of assigning documents to the right event.
In order to formalize the idea, let us introduce some terminology. Let d_i be a data point to be assigned to some cluster, and let e(d_i) be the correct cluster for d_i according to our labeled training data. We define the following functions:

- a function cent assigning each event a centroid vector averaging over all data items belonging to this event: cent : E → R^n
- a function e assigning each document d_i to its corresponding event: e : D → E (only used for evaluation purposes)
- a function defining the extension ext of a certain event e ∈ E, consisting of all the data items belonging to this event: ext : E → P(D)
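The following minimal sketch illustrates this setup in Python; the document representations and all helper names are illustrative assumptions, not the implementation used in our experiments:

    # Illustrative sketch only: representations and helper names are
    # assumptions, not the code used in the experiments.
    import numpy as np

    def cent(event_docs):
        """cent : E -> R^n, the centroid averaging all items of an event."""
        return np.mean(np.stack(event_docs), axis=0)

    def sim(w, v):
        """Linear similarity model: sim(d, e) = w . v_sim(d, e)."""
        return float(np.dot(w, v))

    def assign(d, events, w, v_sim):
        """a_t(d) = argmax over e in E(t) of sim(d, e); v_sim is the
        pairwise feature function introduced in Section 2.1."""
        return max(events, key=lambda e: sim(w, v_sim(d, e)))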
2.1 Pairwise Features

We use three features to build up the feature vector v_sim(d, c): a timestamp-based, a geographic, and a tag-based similarity. The single features are defined as described in Reuter et al. (2011). Overall, the feature vector for a document-centroid pair looks as follows:

v_sim(d, c) = ( sim_time(d, c), sim_geo(d, c), sim_tags(d, c) )
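For illustration, one plausible instantiation of the three single-feature similarities is sketched below; the concrete decay scales and the Jaccard choice for tags are assumptions made here for readability, while the actual definitions are those of Reuter et al. (2011):

    # Hedged sketch of the pairwise features; scales and the Jaccard
    # choice for tags are assumptions, not the definitions of
    # Reuter et al. (2011).
    import math

    def sim_time(d, c, scale=60 * 60 * 24 * 365):
        """Temporal similarity: 1 for identical timestamps (seconds),
        falling to 0 at a distance of `scale`."""
        return max(0.0, 1.0 - abs(d["time"] - c["time"]) / scale)

    def sim_geo(d, c, scale_km=1000.0):
        """Geographic similarity from the haversine great-circle distance."""
        la1, lo1, la2, lo2 = map(math.radians,
                                 (d["lat"], d["lon"], c["lat"], c["lon"]))
        h = math.sin((la2 - la1) / 2) ** 2 \
            + math.cos(la1) * math.cos(la2) * math.sin((lo2 - lo1) / 2) ** 2
        return max(0.0, 1.0 - 2 * 6371 * math.asin(math.sqrt(h)) / scale_km)

    def sim_tags(d, c):
        """Tag similarity as Jaccard overlap of the tag sets."""
        a, b = set(d["tags"]), set(c["tags"])
        return len(a & b) / len(a | b) if a | b else 0.0

    def v_sim(d, c):
        """Feature vector v_sim(d, c) = (sim_time, sim_geo, sim_tags)."""
        return [sim_time(d, c), sim_geo(d, c), sim_tags(d, c)]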
2.2 Problem Formulation using a
“Standard” SVM
Becker et al. show that using SVMs to learn such a
similarity function for the event identification prob-
lem delivers satisfactory results compared to more
naive strategies. A “standard” SVM determines a hyperplane ⟨w, v_sim(d, c)⟩ + b = 0 which separates two classes in a way that the margin to the hyperplane is maximized. As a perfect separation between positive and negative examples cannot always be achieved, classification errors are allowed, the goal being to determine a hyperplane which minimizes errors and maximizes the margin that separates the data. According to Cortes and Vapnik, this problem can be formulated as follows (Cortes and Vapnik, 1995):
min_{w,ξ}  (1/2) ||w||² + C Σ_{i=1}^{ℓ} ξ_i    (1)

y_i (⟨w, Φ(x_i)⟩ + b) ≥ 1 − ξ_i ,  ξ_i ≥ 0    (2)
As usual, Φ(x_i) is some kernel mapping (the dot product in the simplest case), C is a user-defined constant allowing to trade off margin size against the training error, the ξ_i are slack variables, and y_i ∈ {−1, +1} is the class label.
In our case the training examples x_i are defined as follows:

x_i = ((d_i, e), +1)  if e = e(d_i)
x_i = ((d_i, e), −1)  otherwise
Assuming a functional margin of 1 as typically used in Support Vector Machines, the weight vector w needs to fulfill the following conditions:

∀ d_i :  ⟨w, Φ(d_i, e(d_i))⟩ > 1 (− ξ_i)    (3)

∀ d_i ∀ e' ≠ e(d_i) :  ⟨w, Φ(d_i, e')⟩ < −1 (+ ξ_i)    (4)
In Becker et al. such a function is applied to compare a new document to the centroid of every cluster constructed so far, assigning the new data point to the cluster maximizing this similarity, provided it is over a threshold θ_sim. If the similarity is under the threshold, a new event cluster is created. If a new cluster e' is created at time t, then E(t+1) = E(t) ∪ {e'}. Learning the similarity function is thus formulated essentially as a standard classification problem, separating those pairs of documents that belong to the same event from those that do not.
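A sketch of this incremental step follows, reusing the illustrative sim and v_sim helpers from above; the Cluster class, its centroid bookkeeping, and the threshold value are assumptions for illustration:

    # Sketch of the threshold-based incremental assignment; all names
    # and the value of theta_sim are illustrative assumptions.
    class Cluster:
        """Minimal event cluster; the centroid is recomputed on demand
        (a real implementation would update it incrementally)."""
        def __init__(self, docs):
            self.docs = list(docs)

        def centroid(self):
            n = len(self.docs)
            return {
                "time": sum(d["time"] for d in self.docs) / n,
                "lat": sum(d["lat"] for d in self.docs) / n,
                "lon": sum(d["lon"] for d in self.docs) / n,
                "tags": set().union(*(set(d["tags"]) for d in self.docs)),
            }

    def incremental_assign(d, clusters, w, v_sim, sim, theta_sim=0.5):
        """Assign d to the most similar cluster, or open a new one if the
        best similarity stays below theta_sim, i.e. E(t+1) = E(t) u {e'}."""
        best = max(clusters, key=lambda c: sim(w, v_sim(d, c.centroid())),
                   default=None)
        if best is None or sim(w, v_sim(d, best.centroid())) < theta_sim:
            best = Cluster([d])
            clusters.append(best)
        else:
            best.docs.append(d)
        return best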
2.3 Event Identification as a Ranking
SVM Problem
In the formulation of the event identification problem by Becker et al., the similarity function is used to determine the cluster having the highest similarity to the given data point. This can
be seen in fact as a ranking problem, the task being to
rank all the clusters according to their similarity to the
data point. This would provide us with an alternative
formulation of the problem, i.e. the one of learning a
similarity function that always assigns a higher value
to the right cluster compared to all the other clusters.
The task is to learn a similarity function that satisfies the following condition:

∀ d_i ∀ e' ≠ e(d_i) :  sim(d_i, e(d_i)) > sim(d_i, e')    (5)

or, equivalently:

∀ d_i ∀ e' ≠ e(d_i) :  sim(d_i, e(d_i)) − sim(d_i, e') > 0    (6)

Assuming that we have a linear model for sim with weight vector w, we have the following condition:

∀ d_i ∀ e' ≠ e(d_i) :  w · (v_sim(d_i, e(d_i)) − v_sim(d_i, e')) > 0    (7)
Note that this problem is essentially equivalent to the optimization problem solved by the ranking SVMs of Thorsten Joachims (Joachims, 2002). Assuming a margin of 1 we get the equivalent formulation, which is the one used by Joachims:

∀ d_i ∀ e' ≠ e(d_i) :  w · (v_sim(d_i, e(d_i)) − v_sim(d_i, e')) > 1 (− ξ_i)    (8)
Given this formulation of the problem of learning
an adequate similarity as a learning to rank problem,
we can apply the Ranking SVM of Joachims in an off-
the-shelf manner directly to the problem of learning a
suitable similarity function for the event identification
problem.
Clearly, learning a w that fulfills the conditions in (3) and (4) also implies finding a w that fulfills the condition in (8). In some sense, the conditions in (3) and (4) might be too restrictive for the problem at hand. In this sense, we might be solving a more difficult problem than the one we indeed have to solve when using the standard SVM criteria.
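For illustration, the ranking constraints in (8) can be serialized into the plain-text training format documented for SVMrank, one query group per document d_i with targets 2 and 1 (matching the labels used later in Section 3.1.3); the helper below is a sketch, not our actual tooling:

    # Sketch: serialize ranking constraints for SVM-rank. Each document
    # d_i forms one query group; the correct pair gets target 2, a wrong
    # pair target 1. Lines follow the "target qid:<n> <idx>:<val> ..."
    # format of Joachims' SVMrank.
    def write_svmrank_file(path, pairs):
        """pairs: iterable of (v_pos, v_neg) feature vectors, one per d_i."""
        with open(path, "w") as f:
            for qid, (v_pos, v_neg) in enumerate(pairs, start=1):
                for target, v in ((2, v_pos), (1, v_neg)):
                    feats = " ".join(f"{j}:{x:.6f}" for j, x in enumerate(v, 1))
                    f.write(f"{target} qid:{qid} {feats}\n")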
2.4 Sampling Strategies
In this section we describe three different strategies
to construct the training dataset on the basis of which
the similarity function is learned. Each data split con-
sists of a subset of all the documents in question.
For each document and each centroid representing an event, the corresponding similarity vector v_sim(d, c) is computed. Pairs (d_i, e(d_i)) consisting of a document d_i and the event e(d_i) it belongs to are used as positive examples.
tive examples. For all strategies, we make sure that
the training data set is balanced, generating a num-
ber of negative examples that matches the number of
positive examples. We describe below how this is
achieved for the three different strategies we consider:
Random. Using this method, we retrieve the sub-
set of documents from the training set randomly.
In order to create the negative examples, we com-
pute the similarity vector using the document and
a randomly chosen centroid to which the docu-
ment does not belong.
Time-based. As documents of events are up-
loaded in a time-ordered manner to Flickr, it
seems natural to choose n consecutive documents
from the training dataset. For that, we use the first
documents appearing in each split. To create the
negative examples we use the random strategy as
described above.
Nearest. Here, we also choose n consecutive documents from the training dataset, as in the time-based strategy. In this case, however, the negative example for a document d_i is built from the nearest event class to which the document does not belong:

neg(d_i) = arg max_{e' : d_i ∉ ext(e')} Σ_{j=1}^{3} sim_j(d_i, e')

A sketch of all three strategies is given below.
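The sketch renders the three strategies under the illustrative representations introduced earlier; e_of(d) stands for a hypothetical lookup of the gold cluster of d:

    # Sketch of the three sampling strategies; cluster objects and e_of
    # are the illustrative helpers from the earlier sketches.
    import random

    def sample_time_based(docs_by_time, n):
        """Documents arrive time-ordered, so take the first n of a split."""
        return docs_by_time[:n]

    def negative_random(d, clusters, e_of):
        """Random strategy: a centroid of an event d does not belong to."""
        return random.choice([c for c in clusters if c is not e_of(d)])

    def negative_nearest(d, clusters, e_of, v_sim):
        """Nearest strategy: the wrong event maximizing the summed
        single-feature similarities sim_time + sim_geo + sim_tags."""
        return max((c for c in clusters if c is not e_of(d)),
                   key=lambda c: sum(v_sim(d, c.centroid())))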
3 EXPERIMENTS
In this section we describe our dataset which has been
derived from Flickr. Furthermore, we present our ex-
perimental settings and finally we describe the results
for the experiments.
3.1 Experimental Settings
3.1.1 Dataset
For our experiments we use a dataset derived
from Flickr. We consider only pictures assigned
to a specific event via a so-called machine tag.
Machine tags represent unique event IDs such
as “lastfm:event=679065” originating from last.fm.
Each event on last.fm has a unique ID assigned, which users can thus use when uploading a picture to Flickr to mark it as belonging to a certain
event. This assignment of pictures to events via ma-
chine tags can be used to construct a gold standard by
assuming that all pictures with the same machine tag
belong to the same event.
We downloaded a large number of Flickr pictures
with machine tags that have a lastfm:event as pre-
fix using the Flickr API. In particular, we are inter-
ested in the assigned metadata like the capture time,
geographic information, user-defined tags and many
more. Previous work showed that some types of metadata are more useful for clustering than others (Reuter et al., 2011). Therefore, we concentrate on temporal and
spatial features as well as tags and titles. The titles are
tokenized and the resulting tokens are added to the set
of tags. All tags consisting of only a single character
are removed in order to reduce noise. Furthermore,
we remove all pictures whose timestamp is not plausi-
ble at the moment of download (taken before January
2006 or after April 2011). As we want only feature-complete pictures in the dataset, i.e. pictures that have time and geographic information as well as tags assigned to them, we filter out all pictures which do not contain these features. The final dataset consists of
300,000 documents spread over 11,730 unique event
clusters. An event cluster thus contains 25.6 pictures
on average.
All machine tags were removed from the data and
we use them only to create the gold standard.
3.1.2 Data Splits
The documents in the dataset are ordered by their cap-
ture time. We divide the dataset into three equal parts
which we use for cross validation. This allows us to
train on one split and to test this model on the two
other splits.
3.1.3 Creation of Training Examples
For the creation of our training examples, we con-
sider each data split individually. We choose m posi-
tive and m negative examples from each split accord-
ing to the three sampling strategies presented in sec-
tion 2.4. For m we consider the following values: m ∈ {100, 200, 300, 400, 500, 1000, 2000, 4000, 8000, 16000, 32000, 100000}. In order to be able to compare
the results between the standard SVM and the ranking
one, we use the same training examples for both.
For the standard SVM we use the class labels +1 for positive and −1 for negative examples. For the
ranking SVM we have to group positive and negative examples into a ranked group where the correct document-centroid pair (d_i, e(d_i)) gets a higher rank (label 2) than the pair (d_i, e') in which the same document is compared to the centroid of an event cluster to which it does not belong (label 1).
3.1.4 Leave-One-Out
To conduct our experiments we use a leave-one-out
strategy. In particular, we choose k event clusters
from our test set and take one document out. The cen-
troids of the event clusters are thus recomputed as if the document taken out had never been part of them. We do
this for each data split individually.
Specifically, we select 1000 event clusters from each
data split and choose a random document from each
cluster to be taken out. We end up with 1000 docu-
ments and 1000 clusters. In our experiments we thus
compare every document to every event cluster cen-
troid. As we know the correct assignment for each
document, we are able to measure the correct assign-
ment rate for each method. The accuracy is averaged
over all 6 train-test split pairs.
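A sketch of this evaluation loop, reusing the illustrative Cluster and similarity helpers from Section 2 (the concrete bookkeeping is an assumption):

    # Sketch of the leave-one-out evaluation: per sampled cluster, hold
    # one document out, recompute the centroid without it, and count how
    # often the argmax assignment recovers the correct cluster.
    import random

    def leave_one_out_accuracy(clusters, w, v_sim, sim, k=1000, seed=0):
        rng = random.Random(seed)
        sampled = rng.sample(clusters, k)
        held_out = []
        for c in sampled:
            d = rng.choice(c.docs)
            c.docs.remove(d)      # centroid() now excludes the held-out doc
            held_out.append((d, c))
        correct = sum(
            1 for d, true_c in held_out
            if max(sampled,
                   key=lambda c: sim(w, v_sim(d, c.centroid()))) is true_c
        )
        return correct / len(held_out)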
3.1.5 SVM Training
In our experiments we rely on a ranking SVM and a standard SVM. As ranking SVM we use SVMrank, the implementation by T. Joachims (Joachims, 2002).
For the standard SVM we make use of libsvm (Chang
and Lin, 2001). We feed both SVMs with the same
training data for equal scenarios (both SVMs get the same positive and negative examples). For both SVMs
we use their standard settings. The trade-off between
training error and margin (C) is set to 1.0 for both
SVMs.
We use two kernels for our experiments:

Φ(d, e) = ⟨d, e⟩  (linear kernel)

Φ(d, e) = exp(−γ · ||d − e||²),  γ = 1/3  (RBF kernel)
In order to yield an actual similarity measure, the
output of the SVMs needs to be normalized into the
interval [0..1]. We rely on the probability estimates
provided by libSVM for this purpose.
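The following sketch shows one way to obtain such normalized similarities, using scikit-learn's libsvm wrapper as a stand-in for the libSVM command-line tool; the toy data is a placeholder for the pairwise training examples of Section 3.1.3:

    # Sketch of normalizing SVM outputs to [0..1] via Platt-style
    # probability estimates; the random data below is a placeholder only.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.random((400, 3))           # v_sim feature vectors (placeholder)
    y = np.array([1, -1] * 200)        # same-event vs. different-event
    clf = SVC(kernel="linear", C=1.0, probability=True).fit(X, y)

    def sim_from_svm(v):
        """Similarity = estimated probability of the 'same event' class."""
        pos = list(clf.classes_).index(1)
        return float(clf.predict_proba(np.asarray(v).reshape(1, -1))[0, pos])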
3.2 Results
In this section we present our results. Table 1 and
Table 2 show the average accuracy for both types of
Support Vector Machines using the different sampling
strategies and a different number of training exam-
ples.
Table 1: Average assignment rate using RankSVM with linear kernel.

Training examples   Time-based   Random   Nearest
              200        98.4%    66.9%     98.3%
              400        98.4%    42.7%     98.3%
              600        98.5%    75.6%     98.3%
              800        98.5%    62.5%     98.3%
             1000        98.5%    43.2%     98.3%
             2000        98.5%     8.7%     98.3%
             4000        98.5%    43.6%     98.3%
             8000        98.5%    67.9%     98.3%
            16000        98.5%    58.9%     98.2%
            32000        98.6%    81.6%     98.2%
            64000        98.6%    90.3%     98.3%
           200000        98.9%    98.9%     98.3%
Table 2: Average assignment rate using a standard SVM with linear kernel (libSVM with probability estimates).

Training examples   Time-based   Random   Nearest
              200        98.1%    66.1%     98.0%
              400        97.4%    58.7%     97.9%
              600        98.1%    68.0%     98.0%
              800        98.2%    37.1%     97.7%
             1000        98.0%    36.6%     97.7%
             2000        92.8%    66.1%     98.2%
             4000        87.0%    43.7%     98.8%
             8000        87.7%    67.1%     99.1%
            16000        85.9%    51.1%     99.1%
            32000        90.5%    94.9%     98.7%
            64000        90.2%    92.2%     98.4%
           200000        89.4%    87.4%     98.2%
3.2.1 Standard vs Ranking SVM
Overall, one surprising conclusion of our experiments is that a good similarity function can already be learned from very few examples. While the
Ranking SVM can already learn a model with close
to 98% accuracy with only two examples (one pos-
itive and one negative), the standard SVM needs at
least 10 positive and 10 negative examples to learn
a model with 98% accuracy. While the performance
of the RankingSVM does not vary depending on the
amount of training data, the standard SVM clearly
suffers from overfitting, shown by the drop of perfor-
mance from around 2000 training examples onwards.
The RankingSVM does not suffer from such a perfor-
mance drop, thus showing a more robust behavior.
3.2.2 Impact of Sampling
The results support the conclusion that the sampling strategy is crucial for the quality of the learned model, as the results vary considerably between strategies.

We see that the random sampling strategy does not allow us to learn a good model. A model created from a random selection of positive and negative pairs may occasionally give acceptable results, but in general the results are much worse and unstable.

Both of the other methods work very well. We expected the nearest method to be the most effective, but this does not hold for the ranking SVM. For the standard SVM we see clearly that searching for the nearest wrong pair helps to create a better model. The time-based sampling strategy shows this effect as well: as the maximal temporal distance between consecutive data points is low for small samples, a randomly chosen negative pair is comparable to the nearest one.
3.2.3 Choice of SVM Kernel
Both SVMs always achieve better results if a linear kernel is used. This suggests that a linear kernel is better suited to the problem. In the following, we will thus report results using a linear kernel only.
3.2.4 Feature Analysis
In Figure 1 and Figure 2 we compare the accuracy rates of the different features and all possible combinations. To obtain these figures, we averaged over all accuracy rates for m ∈ {500, 1000, 2000, 4000, 8000, 16000} as well as over the time-based and nearest sampling strategies.
We discovered that the use of only one single fea-
ture is not enough to ensure good predictions. It is
surprising that it is not possible to use textual features,
tags in our case, to determine a correct cluster assign-
ment at all. As there are a lot of events at the same
time or at the same location, it seems to be natural
that these features are not sufficient if they are used
alone. Nevertheless, the accuracy rate when using ge-
ographic information only is high which leads us to
the hypothesis that the uniqueness of locations in our
dataset is high.
It is interesting that any combination of two out of the three features is enough to produce reasonable results. This is independent of their performance as single features: even though the features time and tags used with a standard SVM do not produce any acceptable result on their own, the combination of both achieves very high accuracy. This leads us to the assumption that
Figure 1: Average accuracy rate for different features and feature combinations using a standard SVM.

Time        20.9%
Tags         1.0%
Geo         75.3%
Time+Geo    90.6%
Geo+Tags    93.7%
Time+Tags   93.2%
All         94.2%
Figure 2: Average accuracy rate for different features and feature combinations using SVMrank.

Time        53.0%
Tags         0.0%
Geo         74.2%
Time+Geo    96.8%
Geo+Tags    97.1%
Time+Tags   96.5%
All         98.4%
the availability of all features for a single document is not mandatory and that the lack of a feature can be compensated for.
4 RELATED WORK
In this paper, we compared different methods to learn
similarity functions for event identification. The most
directly related work is Becker et al. who also learn a
similarity function using supervised machine learning
techniques (Becker et al., 2010). In their paper they
use a standard SVM which they do not compare to other types of SVMs such as ranking SVMs. Becker et al. have neither systematically investigated the impact of different strategies for creating training data nor carried out a systematic feature analysis. In
this sense our work closes a gap by providing advice
on how to learn an optimal similarity function for a
similar task.
The task of learning a similarity function from
training data is typically also addressed in the context
of supervised clustering, where some approaches op-
timize or learn the clustering similarity function from
the training data. Eick et al. use a process of interleaving clustering with distance function learning in order to learn a distance function (Eick et al., 2005). Demiriz et al. (1999) use genetic algorithms and an appropriate fitness function to guide the search for an optimal clustering. Rendle and Schmidt-Thieme (2006) have also proposed supervised techniques that use labelled data in the form of must-link and cannot-link constraints to guide the search for a clustering, which have also been applied in the field of record linkage. Basu et al. use labelled
data to create Must-Link and Cannot-Link constraints
in their semi-supervised clustering approach in order
to cluster unlabeled data (Basu et al., 2003).
Learning a similarity function between documents is directly related to the task of learning a ranking function in the area of Information Retrieval.
In our paper we propose to use a method similar to traditional information retrieval, where the system retrieves and ranks documents according to the similarity between the query document and the entries in the document collection. Joachims proposes Rank-
ingSVMs to learn a linear ranking model that can be
exploited to rank documents in information retrieval
scenarios (Joachims, 2002). In his scenario, the tar-
gets are not class labels but a binary ordering rela-
tion. Others like Fakeri-Tabrizi et al. use a Ranking
SVM for an imbalanced classification problem with
image annotation and show that this type of SVM
performs better than a standard SVM (Fakeri-Tabrizi
et al., 2011). Freund et al. propose an efficient boost-
ing algorithm to combine linear rankings by experts
(Freund et al., 2003). These experts correspond to
the similarity functions we use as features in our ap-
proach. Our results indeed show that a ranking SVM
can be successfully applied for the task of finding and
learning a similarity function for the event identifica-
tion problem. It shows a more stable behavior than a standard SVM and is also able to learn the similarity function from fewer training examples. Gao et al.
have developed a similar approach but using a percep-
tron to learn the ranking function instead of a SVM
(Gao et al., 2005).
In the area of event identification, there have been
several attempts to decide whether a document is as-
sociated with an event or not. The task of grouping
documents into clusters describing the same event has
been dubbed event identification (Becker et al., 2010).
Others learn to distinguish Flickr documents repre-
senting an event from those that do not (Rattenbury
and Naaman, 2009). Allan and Papka used an incre-
mental clustering approach for detecting events in text
document streams (Allan et al., 1998). Becker et al.
use a similar method to identify event clusters (Becker
et al., 2010).
Firan et al. provide a formulation of the event
identification problem as a standard classification
task, thus learning a function class : D E (Firan
et al., 2010). They use a Naive Bayes classifier to do this. This is problematic in our view, as it does not account for the dynamic nature of the data: it does not model the fact that new events constantly emerge. In order to accommodate this, a new classifier has to be trained for each new event that is added to the system, and there is no mechanism for detecting new events. Our results have shown that there are on average only 26 data points per event. Given this
average only 26 data points on each event. Given this
sparsity of data, it is not clear if it is possible to learn
appropriate classifiers for each of them.
5 CONCLUSIONS
In this paper we addressed the question of how we
can learn a suitable similarity function from training
data for the task of identifying events in social media
data. We made use of different types of SVMs (Rank-
ing and Standard SVMs) in order to train an appro-
priate similarity measure. We investigated different
strategies for creating training data, the impact of the
amount of training data and the kernel used. We have
shown that it is possible to learn a suitable similarity
measure using both types of SVMs. It is sufficient to use only a few examples to train a model. Using a ranking SVM, one positive and one negative pair already allows the creation
of a suitable similarity measure. Furthermore, the per-
formance of a ranking SVM does not vary signifi-
cantly for different sizes of training data and is thus
more robust than a standard SVM. Moreover, we found that a linear kernel always outperforms an RBF kernel regardless of which type of SVM is used.
We have clearly shown that the sampling strat-
egy is crucial for the success of the training process. The use of well-chosen training examples improves the quality significantly. We have seen that the search for the nearest wrong pair helps to create a good model. The fact that the time-based sampling strategy is comparable to the nearest one for small amounts of data explains the good results of the SVM in the approach of Becker et al., who used a time-based strategy with 500 training examples (Becker et al., 2010). In general, only little training data is needed to create a working model; if too much training data is used, at least the standard SVM suffers from overfitting.
Furthermore, we have discovered in our feature
analysis that the lack of a single feature has no great
impact on the quality of the system. The results are
still compelling with the use of only two features. A
lot of pictures in social media are missing some of the
features but our method is still applicable as we know
from our downloaded data that about 99.8% of all
documents have at least two features assigned. Thus,
the methods we proposed in this paper are applicable in a real-world scenario. In addition, other data such as the author and a description are available, which might also be useful as features.
ACKNOWLEDGEMENTS
The research is funded by the Deutsche Forschungs-
gemeinschaft (DFG), Excellence Cluster 277 “Cogni-
tive Interaction Technology”.
REFERENCES
Allan, J., Papka, R., and Lavrenko, V. (1998). On-line new
event detection and tracking. In Proceedings of the
21st annual international ACM SIGIR conference on
Research and development in information retrieval,
pages 37–45. ACM.
Basu, S., Bilenko, M., and Mooney, R. (2003). Comparing
and unifying search-based and similarity-based ap-
proaches to semi-supervised clustering. In Proceed-
ings of the ICML-2003 Workshop on the Continuum
from Labeled to Unlabeled Data in Machine Learning
and Data Mining, pages 42–49. Citeseer.
Becker, H., Naaman, M., and Gravano, L. (2010). Learn-
ing similarity metrics for event identification in social
media. In Proceedings of the third ACM International
Conference on Web search and Data Mining, pages
291–300.
Chang, C. and Lin, C. (2001). LIBSVM: a library for sup-
port vector machines.
Cortes, C. and Vapnik, V. (1995). Support-vector networks.
Machine learning, 20(3):273–297.
Demiriz, A., Bennett, K., and Embrechts, M. (1999). Semi-
supervised clustering using genetic algorithms. Ar-
tificial neural networks in engineering (ANNIE-99),
pages 809–814.
Eick, C., Rouhana, A., Bagherjeiran, A., and Vilalta, R.
(2005). Using clustering to learn distance functions
for supervised similarity assessment. Machine Learn-
ing and Data Mining in Pattern Recognition, pages
120–131.
Fakeri-Tabrizi, A., Tollari, S., Usunier, N., and Gallinari, P.
(2011). Improving image annotation in imbalanced
classification problems with ranking SVM. Multi-
lingual Information Access Evaluation II. Multimedia
Experiments, pages 291–294.
Finley, T. and Joachims, T. (2005). Supervised clustering
with support vector machines. In Proceedings of the
22nd international conference on Machine learning,
pages 217–224. ACM.
Firan, C., Georgescu, M., Nejdl, W., and Paiu, R. (2010).
Bringing order to your photos: event-driven classi-
fication of flickr images based on social knowledge.
In Proceedings of the 19th ACM international con-
ference on Information and knowledge management,
pages 189–198. ACM.
Freund, Y., Iyer, R., Schapire, R., and Singer, Y. (2003).
An efficient boosting algorithm for combining prefer-
ences. The Journal of Machine Learning Research,
4:933–969.
Gao, J., Qi, H., Xia, X., and Nie, J. (2005). Linear discrim-
inant model for information retrieval. In Proceedings
of the 28th annual international ACM SIGIR confer-
ence on Research and development in information re-
trieval, pages 290–297. ACM.
Joachims, T. (2002). Optimizing search engines using click-
through data. In Proceedings of the eighth ACM
SIGKDD international conference on Knowledge dis-
covery and data mining, pages 133–142. ACM.
Rattenbury, T. and Naaman, M. (2009). Methods for ex-
tracting place semantics from Flickr tags. ACM Trans-
actions on the Web (TWEB), 3(1):1.
Rendle, S. and Schmidt-Thieme, L. (2006). Object identifi-
cation with constraints.
Reuter, T., Cimiano, P., Drumond, L., Buza, K., and
Schmidt-Thieme, L. (2011). Scalable event-based
clustering of social media via record linkage tech-
niques. In Proceedings of the Fifth International AAAI
Conference on Weblogs and Social Media.