EXPLOITING THE LANGUAGE OF MODERATED SOURCES
FOR CROSS-CLASSIFICATION OF USER GENERATED CONTENT
Avaré Stewart and Wolfgang Nejdl
Forschungszentrum L3S, Appelstr 9a, Hannover, Germany
Keywords:
Automatic labeling, Cross-classification, Medical intelligence gathering.
Abstract:
Recent pandemics such as Swine Flu have caused concern for public health officials. Given the ever increas-
ing pace at which infectious diseases can spread globally, officials must be prepared to react sooner and with
greater epidemic intelligence gathering capabilities. However, state-of-the-art systems for Epidemic Intelligence have not kept pace with the growing need for more robust public health event detection. Existing systems are limited in that they rely on template-driven approaches to extract information about public health events from human language text.
In this paper, we propose a new approach to support Epidemic Intelligence. We tackle the problem of detecting
relevant information from unstructured text from a statistical pattern recognition viewpoint. In doing so, we
also address the problems associated with the noisy and dynamic nature of blogs by exploiting the language in moderated sources to train a classifier for detecting victim-reporting sentences in blogs. We refer
to this as Cross-Classification. Our experiments show that without using manually labeled data, and with a
simple set of features, we are able to achieve a precision as high as 88% and an accuracy of 77%, comparable
with the state-of-the-art approaches for the same task.
1 INTRODUCTION
Many factors in today’s changing society contribute
towards the continuous emergence of infectious dis-
eases. In response, Epidemic Intelligence (EI) has
emerged as a type of intelligence gathering which
aims to detect events of interest to public health from unstructured text on the Web.
In a typical EI framework, disease reporting
events (i.e., victim, location, time, disease) are ex-
tracted from raw text. The events are then aggregated
to produce signals, which are intended to be an early
warning against potential public health threats. Epi-
demiologists use them to assess risk, or corroborate
and verify the information locally and with interna-
tional agencies.
Although there are numerous EI systems in exis-
tence, they are limited in two major ways. First, these
systems focus mainly on using news and outbreak re-
ports as a source of information (Hartley et al., 2009).
However, in order to effectively provide a warning
as early as possible, diverse information sources are
needed, such as those from just-in-time crisis information blogs (e.g., http://www.ushahidi.com). Disproportionately, blogs and other
types of social media have not been considered in in-
telligence gathering. Secondly, the algorithms used
in these systems typically detect disease related ac-
tivity by relying upon predefined templates, such as
keywords or regular expressions. The drawback of a template-based approach is that, given the variety of natural language, many patterns may be required, and enumerating all possible patterns is costly. Moreover,
the results typically lead to a low recall for identifying
relevant events.
The first step toward overcoming these limitations is to view the Epidemic Intelligence task in a new
light, by: 1) including more diverse sources, such as
blogs, and 2) using statistical approaches (e.g., statis-
tical pattern recognition), to detect information about
public health events. However, the automatic extrac-
tion of information from blogs remains a challeng-
ing task, because blogs are both noisy and dynamic
(Moens, 2009).
In this paper, we address this twofold challenge first by relying upon comparable text. We say that
comparable text is one in which overlapping topics
are discussed in a similar way. For such text, we
assume that the languages used have common parts,
and therefore, the linguistic structures in one cor-
pus, can be identified in the other. Second, we ex-
ploit this notion by building a binary classifier that has been trained on the victim-reporting sentences of one corpus, in our case outbreak reports. The
training data is gathered through weak labeling, i.e.,
the training set is automatically built. A classifier
trained on this training set is then used to detect
disease-reporting sentences in blogs. We refer to
this approach as Cross-Classification. The contribu-
tions of this work are: 1) an introduction of a Cross-
Classification Framework for Epidemic Intelligence,
and 2) an exploitation of outbreak reports for weak
labeling.
The rest of the paper is structured as follows: the
Cross-Classification approach is described in Section
2 and an evaluation is given in Section 3. In Section 4,
related work is presented and in Section 5, the paper
concludes with a discussion on future work.
2 CROSS-CLASSIFICATION
FRAMEWORK
In this section, we describe our Cross-Classification
approach and outline how the text of outbreak reports
is exploited for weak labeling of blog post sentences.
We also outline the properties that impact the quality
of weak labeled sentences when using such sources.
2.1 Cross-Classification
The Cross-Classification process is depicted in Fig-
ure 1. The comparable texts are two distinct corpora.
We define one as the auxiliary source (outbreak re-
ports) and the other one as the target source (blogs).
Additionally, the figure shows the traditional way of
classification in which the sentences of the target do-
main are labeled manually and used as training data
(Direct Classification).
Figure 1: Overview of Cross-Classification.
An outbreak report is a moderated source that typ-
ically relies on experts to filter and extract information
about health threats. Based on the underlying proper-
ties of the auxiliary data (see Section 2.2), we auto-
matically label the sentences in the auxiliary data as
positive or negative and use them for training a binary
classifier. The resulting model is then applied on the
target corpus in order to classify sentences.
The type of classifier we use is a tree-kernel based
support vector machine (SVM). The reason for this
choice is that for natural language tasks similar to
ours, linguistic representations have proven to be suc-
cessful for detecting disease reporting information
(Zhang, 2008; Conway et al., 2009). One reason is
that kernel methods are capable of generating a high
number of syntactic features, from which the learning
algorithm can select those most relevant for a specific
application (Moschitti, 2006). This can help over-
come the feature engineering needed with linguistic
representations.
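As an illustration, the following sketch shows the cross-classification flow: training on weakly labeled auxiliary sentences and predicting on target sentences. It is only a minimal approximation, not the actual system: in place of the tree-kernel SVM over parse trees used in this work, it substitutes a linear SVM over term vectors (roughly the VEC feature), and the function names and toy sentences are illustrative.

    # Minimal sketch of the cross-classification flow, assuming weakly labeled
    # auxiliary sentences are already available. The paper uses a tree-kernel
    # SVM (SVM-TK) over parse trees; here a linear SVM over term vectors stands
    # in as a simple, runnable approximation of the VEC feature.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    def train_cross_classifier(aux_sentences, aux_labels):
        """Fit a model on weakly labeled auxiliary (outbreak report) sentences."""
        model = make_pipeline(CountVectorizer(lowercase=True), LinearSVC())
        model.fit(aux_sentences, aux_labels)
        return model

    def classify_target(model, target_sentences):
        """Apply the auxiliary-trained model to unlabeled target (blog) sentences."""
        return model.predict(target_sentences)

    if __name__ == "__main__":
        aux = ["Three new cases of avian influenza were confirmed in the province.",
               "For more information, visit the web site listed below."]
        labels = [1, 0]  # 1 = victim-reporting, 0 = other (toy example)
        model = train_cross_classifier(aux, labels)
        print(classify_target(model, ["Two patients were hospitalized with H5N1 last week."]))

In the actual system, the feature extraction step is replaced by parse-tree generation and the tree-kernel SVM of (Moschitti, 2006).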
2.2 Weak Labeling for Training Data
Since training data is unavailable for our classification
task, we apply weak labeling for gathering training
material. In particular, the Sentence Position is ex-
ploited for selecting positive and negative examples.
As in any classification task, the quality of the clas-
sifier is highly dependent on the training data and its
noise. We also investigate further properties that may impact the quality of weak labeling; these include Sentence Length and Sentence Semantics.
Sentence Position. The position of information in
text has been widely exploited in document summa-
rization for news where the first sentences in an article
or paragraph summarize the most important informa-
tion (Lam-Adesina and Jones, 2001). Based on this,
we model the auxiliary corpus as a sentence database,
where each document in the corpus is represented as an ordered sequence of one or more sentences. We adopt an approach to automatically label the sentences in each document, where the TopN sentences in a document are taken as positive cases, for a threshold value of N. Further, we hypothesize that sentences appearing towards the end of the sequence are less relevant, so the BottomN sentences are automatically labeled as negative examples.
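A minimal sketch of this position-based labeling rule is given below, assuming each document has already been split into an ordered list of sentences and contains enough sentences for the TopN and BottomN windows not to overlap. The URL filter anticipates the rule described in Section 3.1; the threshold defaults are illustrative.

    # Position-based weak labeling: TopN sentences are positives, BottomN are
    # negatives. This is a sketch; thresholds and the URL filter are illustrative.
    import re

    URL_RE = re.compile(r"https?://|www\.", re.IGNORECASE)

    def weak_label_document(sentences, top_n=5, bottom_n=5):
        """Return (positive, negative) sentences for one document, in text order."""
        positives = sentences[:top_n]
        # Bottom sentences often just point to other web sites; drop those.
        negatives = [s for s in sentences[-bottom_n:] if not URL_RE.search(s)]
        return positives, negatives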
Sentence Length. The sentences in the auxiliary corpus vary greatly in length, due to conjunctions and phrases. Previous work using tree representations for sentences has shown that longer sentences may contain too many irrelevant features, and over-fitting may occur, thereby decreasing the classification accuracy. In this light, we propose that sentence length is also an important aspect of Cross-Classification and investigate its impact on the quality of Cross-Classification in our experiments.
Sentence Semantics. Finally, we are interested in
identifying victim-reporting sentences, where the def-
inition used for victim reporting is based on the tem-
plate for MedISys Disease Incidents (http://medusa.jrc.it/medisys/helsinkiedition/all/home.html). This template
includes disease, time, location, case, and the status of
victims that have been extracted from the full text of
news articles. We say a sentence is a victim-reporting
one if it contains a medical condition in conjunction
with a victim, time, or location, where the case and
status of a victim may be inferred from the context.
The semantic information is thus represented by the
presence of named entities (NEs) in the sentence. In
our experiments, we also investigate if the presence
of NEs is an important factor in choosing positive ex-
amples.
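A small sketch of this semantic filter is shown below, assuming the named entities of each sentence are already available as a set of type labels (in our experiments they are extracted with OpenCalais); the label strings used here are illustrative rather than the tool's exact type names.

    # Semantic filter for victim-reporting candidates. Sketch only: entity type
    # labels are illustrative, and NE extraction is assumed to have been done.
    MEDICAL_CONDITION = "MedicalCondition"
    CONTEXT_TYPES = {"Victim", "Time", "Location"}

    def is_victim_reporting_candidate(entity_types):
        """entity_types: set of NE type labels found in a single sentence."""
        return MEDICAL_CONDITION in entity_types and bool(entity_types & CONTEXT_TYPES)

    # In the experiments (Section 3.2.3), TopN sentences passing this test are
    # kept as positives, and BottomN sentences without these entities as negatives.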
3 EXPERIMENTS
The goal of our experiments is to measure how well a
Cross-Classifier can detect victim-reporting sentences
within a blog. We train the classifier with three dif-
ferent feature sets and compare the results. Further,
we experiment with three weak labeling properties
for the outbreak reports, namely: Sentence Position,
Sentence Length, and Sentence Semantics. We com-
pare the performance of the Cross-Classifier to that of a traditional classifier and to state-of-the-art results reported for a similar task, which use a considerable number of features and rely exclusively upon labeled data for training.
3.1 Experimental Setting
As target data, we selected the AvianFluDiary blog (http://afludiary.blogspot.com), a well-known source within the “flu bloggers” community.
The data was collected for a one year period: Jan-
uary 1 - December 31, 2009. For the auxiliary data,
we used ProMED-mail (http://www.promedmail.org), a global electronic reporting
system, listing outbreaks of infectious diseases. This
data was collected over a period of eight years: Jan-
uary 1, 2002 - December 31, 2009. Early experiments using the data from a single year as the auxiliary data showed poor results, because there were too few documents (and hence sentences) to support weak labeling. In total, we collected 4,249 documents for
AvianFluDiary and 14,665 for ProMED-mail.
The data for both the auxiliary and target domains
was processed using the Stanford Parser (http://nlp.stanford.edu) to split
and parse each sentence. The total number of sen-
tences for AvianFluDiary was 44,723 and ProMED-
mail 347,822. Sentences were further processed us-
ing OpenCalais (http://www.opencalais.com) to extract NEs. In total, 3,300
documents and 34,752 sentences from AvianFluDi-
ary and 10,026 documents and 127,314 sentences
from ProMED-mail contained entities recognizable
by OpenCalais.
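As a lightweight, runnable stand-in for this preprocessing step, the sketch below splits raw documents into sentences with NLTK; in the actual pipeline the Stanford Parser performs splitting and parsing, and OpenCalais supplies the entities, so this snippet only approximates the sentence-splitting stage.

    # Lightweight sentence splitting as a stand-in for the Stanford Parser step;
    # parse trees and OpenCalais entities would be attached in later steps.
    import nltk

    nltk.download("punkt", quiet=True)  # one-time tokenizer model download

    def split_into_sentences(document_text):
        """Split one raw document into an ordered list of sentences."""
        return nltk.sent_tokenize(document_text)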
Weak labeling for the auxiliary data was con-
structed using the Top5 sentences as positive exam-
ples. Roughly 25,000 sentences were included in the
Top1, and the amount of training data increased by
25,000 sentences, as N increased. Further, we notice
in our corpus that the bottom sentences tend to refer
to additional web sites, so any BottomN sentences containing URLs were eliminated.
SVM Classification. The Cross-Classifier was
based on SVM-TK (Moschitti, 2006), and the clas-
sification features used to build the classifier were
the parts-of-speech parse tree (POS), the term vec-
tor (VEC), and their combination (POSVEC). Experi-
ments using the vector space feature alone performed consistently below POS and POSVEC; thus, we do not report them further in this work. Five random sets from each TopN set were selected to train a clas-
sifier, and the results obtained for each classifier were
averaged over these five trials.
Baseline. As a baseline, we compare the Cross-
Classification approach to the traditional (or Direct
Classification) method (see Figure 1), in which both training and testing are done on manually labeled sentences of AvianFluDiary. Of the 5,328 sentences
manually labeled, 729 were positive cases, showing
a relatively low percentage of information-bearing
sentences within the blog. The direct classifier was
trained with equal amounts of positive and negative
cases using a 10-fold cross-validation. The evalua-
tion measures used were precision (P), recall (R), f1-
measure (F) and accuracy (A), reported using a scale
of 0 to 100%.
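A small sketch of how these measures can be computed is given below, assuming gold labels for the test sentences and binary predictions from the classifier; the helper name is illustrative and scikit-learn is used only for convenience.

    # Evaluation sketch: precision, recall, f1-measure and accuracy on a 0-100 scale.
    from sklearn.metrics import accuracy_score, precision_recall_fscore_support

    def evaluate(y_true, y_pred):
        """Return P, R, F and A for binary victim-reporting predictions."""
        p, r, f, _ = precision_recall_fscore_support(
            y_true, y_pred, average="binary", pos_label=1)
        a = accuracy_score(y_true, y_pred)
        return {name: round(100 * value, 2)
                for name, value in zip(("P", "R", "F", "A"), (p, r, f, a))}

    # Example: evaluate([1, 0, 1, 1], [1, 0, 0, 1])
    # -> {'P': 100.0, 'R': 66.67, 'F': 80.0, 'A': 75.0}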
We next present our experiments: first we evaluate
the Cross-Classification performance with respect to
the three weak labeling properties of outbreak reports;
then, the classification features are examined.
Table 1: Cross-Classification performance for each of the TopN (N=1..5) sentences using training sizes of 1K, 2K and 3K, and the POSVEC feature. Each of the TopN entries shown is the result of averaging over 5 trials.
Size N P R F A
1K 1 81.27 44.55 57.54 67.15
1K 2 79.71 65.51 71.85 74.39
1K 3 78.44 70.34 74.15 75.50
1K 4 77.90 75.75 76.77 77.13
1K 5 75.90 75.56 75.70 75.78
2K 1 82.41 48.34 60.93 69.01
2K 2 80.54 67.30 73.29 75.50
2K 3 78.80 73.17 75.85 76.73
2K 4 76.33 75.58 75.93 76.06
2K 5 76.09 75.50 75.79 75.88
3K 1 82.15 47.98 60.57 68.78
3K 2 79.49 67.63 73.06 75.07
3K 3 77.99 74.29 76.08 76.65
3K 4 76.57 76.16 76.35 76.42
3K 5 76.89 76.30 76.57 76.67
3.2 Weak Labeling Properties
3.2.1 Sentence Position
We evaluate how the position of the sentence within
the document affects the performance of a weak la-
beler when compared against a random selection of
sentences for a direct classifier. Tables 1 and 2 summarize the Cross-Classification performance using the POSVEC and POS features, when varying the training size (Size) from 1,000 (1K) to 3,000 (3K) sentences and using a fixed sentence length of 5 to 199
characters. The bold font in each table shows the
maximum values obtained for each measure. The results clearly show that we obtain very good performance without manual labeling, with precision reaching 82.41% and recall 81.65%. Also, increasing the training size
improves the results, because the classifier has more
examples from which it can learn. Also, in terms of a performance trade-off, the Top1 and Top2 positions prove not to be the best; instead, Top3 and Top4 show a better trade-off, as the precision becomes nearly equal to the recall.
Random Selection. The results for the Cross-Classifier built from a random selection of sentences as the training set, using the POSVEC feature, are presented in Table 3. We notice that the random classifier performs significantly worse than the one in which the weak labeling is used. This clearly suggests that sentence order is a simple yet useful criterion for weak labeling.
Table 2: Cross-Classification performance for each of the TopN (N=1..5) sentences using training sizes of 1K, 2K and 3K, and the POS feature. Each of the TopN entries shown is the result of averaging over 5 trials.
Size N P R F A
1K 1 76.61 43.04 55.10 64.95
1K 2 76.72 64.25 69.86 72.33
1K 3 76.24 68.34 72.07 73.51
1K 4 75.68 73.50 74.54 74.94
1K 5 74.44 72.98 73.67 73.92
2K 1 77.78 46.45 58.16 66.60
2K 2 76.96 65.40 70.66 72.88
2K 3 76.57 71.00 73.67 74.62
2K 4 74.40 73.53 73.93 74.10
2K 5 75.08 73.83 74.44 74.66
3K 1 77.71 46.20 57.93 66.48
3K 2 76.14 65.62 70.46 72.51
3K 3 75.70 71.82 73.70 74.38
3K 4 74.64 73.52 74.05 74.25
3K 5 76.27 74.07 75.14 75.49
Table 3: Cross-Classification performance using a random
selection of sentences and training sizes of 1K, 2K and 3K,
for the POSVEC feature.
Size P R F A
1K 41.37 40.11 40.63 41.61
2K 42.85 43.62 43.11 42.88
3K 33.37 32.66 32.94 33.40
Table 4: Direct Classification performance for the POSVEC
feature.
P R F A
86.50 90.42 88.32 88.06
Direct Classification. In the Direct Classification,
we used the manually labeled data of the target cor-
pus as a training set, and the results are shown in Ta-
ble 4. When only the sentence position is taken into account, we see that the overall performance of the Direct Classifier is significantly better than that of the Cross-Classifier in terms of recall and f1-measure. Yet, in terms of precision, the Cross-Classifier obtains as much as 82.41% (see Table 1), compared with 86.50% for the Direct Classifier. The much higher recall of the Direct Classifier is to be expected, given the errors inherent in weak labeling.
Finally, we compare the Cross-Classification to
reported results for the same task (Zhang, 2008), where the highest f1-measure value obtained is just
above 76%. When considering sentence position
alone, we obtain comparable f1-measures of 76.77%
(see Table 1).
3.2.2 Sentence Length
In order to determine if longer sentences impact the
performance of the weak labeled classifier, we created
a second partition of data based on sentence lengths
in the interval of [200...1024]. Although the range
of this interval is quite large in comparison with the
interval [5...199], the actual number of sentences is
much smaller. Interval [5...199] contains 96,851 sen-
tences in the Top5 set, whereas interval [200...1024]
contains 19,368. Figure 2 shows the average over the
Top5 results for the POSVEC feature.
Figure 2: Cross-Classification performance for sentence
lengths in the ranges of [5...199] and [200...1024] using the
POSVEC feature.
It can be seen that using longer sentences results
in a classifier with lower overall performance. We be-
lieve this to be the case because more noise is intro-
duced as longer sentences include clauses and paren-
thetical information, which are not directly related to
victim reporting.
3.2.3 Sentence Semantics
We evaluate the performance of the Cross-Classifier
in the presence of selected NEs, which are relevant for
victim-reporting. These entity types include: location
and medical condition. In order to make as much distinction as possible between the positive and negative NE examples, only the TopN sentences containing the named entities were chosen as positive training examples, whereas only the BottomN sentences not containing those entities were used as negative examples. In Figure 3, the Cross-Classification re-
sults are obtained using sentence lengths in the inter-
val of [5...199] characters, averaging over the Top5
results for the POSVEC feature and using a training
size of 3K. We notice that filtering weak labeled sen-
tences with respect to the presence of NEs yields a
significantly higher precision when compared with no
NEs. Thus, NEs are useful for filtering noise that is
present in the weak labeling examples.
Figure 3: Cross-Classification performance with Named
Entities (NEs), and without Named Entities (no NEs). The
Top5 (N=5) sentences were used, and averaged over 5 trials,
for a training size of 3K, with the POSVEC feature.
3.3 Discussion
The experiments presented above allow us to see
that Sentence Position, Sentence Length and Sen-
tence Semantics do, in fact, impact the ability of a
weak labeling Cross-Classifier to detect the relevant
victim-reporting sentences and several points should
be noted regarding each.
Sentence Length. Although we have experi-
mented with two different ranges for the sentence
lengths, a closer examination should be made to
determine the minimum and maximum lengths
that optimize the precision and recall. Even so, we can already conclude that shorter sentences (range [5...199]) bring the benefit of being able to distinguish the information-bearing sentences, as shown by our results.
Sentence Position. It should be noted that the
Sentence Position is not independent of Sentence
Length. As mentioned, the shorter sentences that
appear in the Top1 often consist of titles or even
concise summaries of the article. Refinements that optimize sentence length should also take this into account.
4 RELATED WORK
In the area of Epidemic Intelligence, approaches for classifying disease-reporting sentences have been proposed that use a number of features and different types of techniques, such as conditional random fields and Naive Bayes networks (Zhang, 2008; Conway et al., 2009). In all cases, the authors use manually labeled data to build their models. In our
work, we seek to go beyond the human effort associated with building a training set for blogs and so-
cial media, while striving for comparable results with
these state-of-the-art systems.
Transfer Learning. Transfer Learning allows the
domains, tasks and distributions for a classifier’s
training and test data to be different (Pan and Yang,
2009). The sub-area of transfer learning most similar
to our work is transductive transfer learning, where
neither the source nor target data is labeled. In this
case, methods are sought to first automatically label
the source data.
Automatic Labeling. Work has been done in sev-
eral areas (Tomasic et al., 2007; Fuxman et al., 2009)
to reduce the human labeling effort, where automatic labeling has been achieved with weak labeling. In one such work (Tomasic et al., 2007), wild labels
(obtained from observing users) provide the basis for
generating weak labels. Similar to our work, weak
labels are distinguished from gold labels, which are
generated by a human expert. The weakly-labeled
corpus is used to train machine-learning algorithms
that are capable of predicting the sequence and pa-
rameter values for the actions a user will take on a
new request. In other work, automatic labeling is accomplished by first defining a set of criteria a potential corpus must have in order to support the automatic
labeling process (Fuxman et al., 2009). To date, none of the work based on automatic labeling or transfer learning considers the task of Epidemic Intelligence.
5 CONCLUSIONS AND FUTURE
WORK
In this paper we have demonstrated that with our
Cross-Classification framework, it is possible to use
comparable text, such as outbreak reports, as automatically labeled data for training a classifier that is capable of detecting victim-reporting sentences in a blog.
As with any automated labeling process, the ex-
amples are subject to noise and error. We investigated
how this noise can be reduced and evaluated the qual-
ity of such weak labeled sentences using three prop-
erties: Sentence Position, Sentence Length and Sen-
tence Semantics. With no effort in human labeling
and minimalistic feature engineering, we were able to
build a Cross-Classifier, which achieved a precision
as high as 88%. The impact of this work is that the
noisy sentences in blogs, and possibly other types of
social media, can be appropriately filtered to support
epidemic investigation.
Cross-Classification has proven promising for data in which the topic is rather focused. As future work, we will apply the approach to more diverse and topic-drifting blog posts. Further, we seek to gener-
alize the results presented here, describing the con-
ditions under which corpora can be considered com-
parable. This would help in automatically selecting
the appropriate auxiliary and target corpora for Cross-
Classification. In this work, we have assumed the
presence of a high quality data set that lends itself to
weak labeling. As further work, we plan to consider
cases in which a Cross-Classifier can be built from a smaller volume of data, for example, by using a boot-
strapping approach.
REFERENCES
Conway, M., Collier, N., and Doan, S. (2009). Using hedges
to enhance a disease outbreak report text mining sys-
tem. In BioNLP ’09: Proceedings of the Workshop on
BioNLP, pages 142–143, Morristown, NJ, USA. As-
sociation for Computational Linguistics.
Fuxman, A., Kannan, A., Goldberg, A. B., Agrawal, R.,
Tsaparas, P., and Shafer, J. (2009). Improving classi-
fication accuracy using automatically extracted train-
ing data. In KDD ’09: Proceedings of the 15th ACM
SIGKDD international conference on Knowledge dis-
covery and data mining, pages 1145–1154, New York,
NY, USA. ACM.
Hartley, D., Nelson, N., Walters, R., Arthur, R., Yangarber,
R., Madoff, L., Linge, J., Mawudeku, A., Collier, N.,
Brownstein, J., Thinus, G., and Lightfoot, N. (2009).
The landscape of international event-based biosurveil-
lance. Emerging Health Threats.
Lam-Adesina, A. M. and Jones, G. J. F. (2001). Ap-
plying summarization techniques for term selection
in relevance feedback. In Proceedings of the 24th
annual international ACM SIGIR conference on Re-
search and development in information retrieval, SI-
GIR ’01, pages 1–9, New York, NY, USA. ACM.
Moens, M.-F. (2009). Information extraction from blogs. In
Jansen, B. J., Spink, A., and Taksa, I., editors, Hand-
book of Research on Web Log Analysis, pages 469–
487. IGI Global.
Moschitti, A. (2006). Making tree kernels practical for nat-
ural language learning. In Proceedings of the 11th
Conference of the European Chapter of the Associa-
tion for Computational Linguistics.
Pan, S. J. and Yang, Q. (2009). A survey on transfer learn-
ing. IEEE Transactions on Knowledge and Data En-
gineering, 99.
Tomasic, A., Simmons, I., and Zimmerman, J. (2007).
Learning information intent via observation. In WWW
’07: Proceedings of the 16th international conference
on World Wide Web, pages 51–60, New York, NY,
USA. ACM.
Zhang, Y. (2008). Automatic Extraction of Outbreak Infor-
mation from News. PhD thesis, University of Illinois
at Chicago.