Topic and Subject Detection in News Streams for Multi-document Summarization
Fumiyo Fukumoto¹, Yoshimi Suzuki¹ and Atsuhiro Takasu²
¹ Interdisciplinary Graduate School of Medicine and Engineering, Univ. of Yamanashi, Kofu, Japan
² National Institute of Informatics, Chiyoda-ku, Japan
Keywords:
Topic, Subject, Multi-document Summarization, Moving Average Convergence Divergence.
Abstract:
This paper focuses on continuous news streams and presents a method for detecting salient key sentences from stories that discuss the same topic. Our hypothesis about key sentences in multiple stories is that they include words related to the target topic and to the subject of a story. In addition to the TF-IDF term weighting method, we used the result of assigning domain-specific senses to each word in a story to identify its subject. A topic, on the other hand, is identified by using a model of "topic dynamics". We define a burst as a time interval of maximal length over which the rate of change shows positive acceleration. We adapted a stock market trend analysis technique, Moving Average Convergence Divergence (MACD), which shows the relationship between two moving averages of prices and is a popular indicator of trends in dynamic marketplaces, and utilized it to measure topic dynamics. The method was tested on the TDT corpora, and the results showed its effectiveness.
1 INTRODUCTION
With the exponential growth of information on the Internet, it is becoming increasingly difficult for a user to read and understand all the material that is potentially of interest. Multi-document summarization is one approach to this problem. It differs from single-document summarization in that it is important to identify differences and similarities across documents. This can be interpreted as the question of how to identify a topic and a subject in a series of stories. Here, a topic is defined as in the TDT project: something that occurs at a specific place and time associated with some specific actions (Allan, 2003). A subject, on the other hand, refers to the theme of the story itself, i.e., something the writer wished to express. Much of the work on summarization has applied statistical techniques based on word distributions to the target document (Lin and Hovy, 2002). Other approaches explore machine learning or graph-based ranking methods (Marcu and Echihabi, 2002; Wan and Yang, 2008). Wan et al. proposed two models, the Cluster-based Conditional Markov Random Walk model and the Cluster-based HITS model, both of which use the theme clusters in the document set (Wan and Yang, 2008). However, most of these approaches do not deal with the identification of a topic and a subject in a series of stories.
This paper focuses on extractive summarization and presents a method for detecting key sentences from continuous news streams. We assume that a key sentence in multiple documents includes words related to the target topic and to the subject of each story. Moreover, we assume that the sense of a subject word is related to the domain, where domain is used in the traditional text categorization sense. For example, the word "court" in the target topic "Pinochet trial" is related to the subject, "Pinochet appealed his arrest and a London court agreed," and it has the sense of judicature in the legal/criminal domain. Here, "court" has at least two senses in WordNet: judicature and tennis court. If we can find that "court" in a story on the "Pinochet trial" has a domain-specific sense, i.e., the judicature sense, we can identify "court" as a subject word. In addition to the traditional term weighting method TF-IDF, we used the result of assigning domain-specific senses (ADSS) to identify subject words. A topic, on the other hand, is identified by using a model of "topic dynamics". We define a burst as a time interval of maximal length over which the rate of change shows positive acceleration. We used Moving Average Convergence Divergence (MACD), a technique for analyzing stock market trends, to identify topics. It shows the relationship between
two moving averages of prices; we model bursts as intervals of topic dynamics, i.e., positive acceleration.
2 SYSTEM DESIGN
2.1 Assignment of Domain-specific Senses for Subject Detection
We used the result of ADSS to identify subject words. We used the TDT corpus and the WordNet 3.0 thesaurus. The TDT documents are classified into eleven topics (domains), such as "natural disasters" and "elections". We used these topics except for the topic MISC. To assign topics/domains of the TDT corpus to each sense of a word in WordNet, we first assigned each sense of a word to the corresponding domain by using a text classification technique. For each sense s of a word w, we replace w in the training stories assigned to the topic t with its gloss text in WordNet (hereafter referred to as word replacement). If the classification accuracy of the topic t is equal to or higher than that without word replacement, the sense s is regarded as a domain-specific sense. However, sense selection alone is not sufficient for ADSS, because the number of words in a WordNet gloss is not large. As a result, the classification accuracy with word replacement was often equal to that without word replacement (in the experiments, the classification accuracy for more than 50% of the words did not change). We therefore also scored the senses by computing rank scores.
1. Candidate Extraction
The first step in finding domain-specific senses is to extract candidates. We divided the TDT documents into two sets: training data to learn an SVM model and test data to classify documents. For each topic, we collected a set of words W with high TF-IDF values from the TDT corpus. Let TS be the topic set, and let S be the set of senses that a word w ∈ W has. The candidates are obtained as follows:
1. For each sense s ∈ S and for each topic t ∈ TS, we replace w in the training documents assigned to the topic t with its gloss text in WordNet.
2. All the documents in the training and test data are tagged by a part-of-speech tagger, stop words are removed, and the documents are represented as term vectors with frequencies.
3. The SVM is applied to the two types of training documents, i.e., with and without word replacement, and classifiers for each topic are generated.
4. The SVM classifiers are applied to the test data. If the classification accuracy of the topic t is higher than that without word replacement, the sense s of the word w is judged to be a candidate sense in the topic t.
The procedure is applied to all w ∈ W; a schematic sketch of the candidate test follows.
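As an illustrative sketch of this candidate test (not the authors' implementation; the SVM setup is approximated with scikit-learn, and all function and variable names are ours), the comparison for a single (word, sense) pair could look as follows. A full run would tag parts of speech first and compare the accuracy of the topic t specifically; the sketch collapses this into overall accuracy for brevity.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

def classification_accuracy(train_docs, train_labels, test_docs, test_labels):
    # Term-frequency vectors with English stop words removed, as in step 2.
    vec = CountVectorizer(stop_words="english")
    clf = LinearSVC().fit(vec.fit_transform(train_docs), train_labels)
    return accuracy_score(test_labels, clf.predict(vec.transform(test_docs)))

def is_candidate_sense(word, gloss, topic, train_docs, train_labels,
                       test_docs, test_labels):
    # Step 1: replace `word` with its WordNet gloss, but only in the
    # training documents assigned to `topic`.
    replaced = [doc.replace(word, gloss) if label == topic else doc
                for doc, label in zip(train_docs, train_labels)]
    # Steps 3-4: train on both versions of the training data and compare.
    base = classification_accuracy(train_docs, train_labels, test_docs, test_labels)
    repl = classification_accuracy(replaced, train_labels, test_docs, test_labels)
    return repl > base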
2. Scoring Senses by Link Analysis
The next procedure for ADSS is to score each candidate for each topic. We used the MRW model. Given a set of candidates C_t in the topic t, G_t = (S, E) is a graph reflecting the relationships between senses in the candidate set. S is a set of vertices, and each vertex s_i in S is a gloss text assigned from WordNet. E is a set of edges, which is a subset of S × S. Each edge e_{ij} in E is associated with an affinity weight f(i → j) between senses s_i and s_j (i ≠ j). The weight is computed using the standard cosine measure between the two senses. Two vertices are connected if their affinity weight is larger than 0, and we let f(i → i) = 0 to avoid self-transitions. The transition probability from s_i to s_j is then defined as follows:

p(i → j) = f(i → j) / Σ_{k=1}^{|S|} f(i → k)  if Σ_k f(i → k) ≠ 0, and 0 otherwise.  (1)
We used the row-normalized matrix U = (U_{ij})_{|S|×|S|} to describe G, with each entry corresponding to the transition probability, where U_{ij} = p(i → j). To make U a stochastic matrix, the rows with all-zero elements are replaced by a smoothing vector with all elements set to 1/|S|. The matrix form of the saliency score Score(s_i) can be formulated recursively as in the MRW model:

λ = μ U^T λ + ((1 − μ)/|S|) e,  (2)

where λ = [Score(s_i)]_{|S|×1} is the vector of saliency scores for the senses, e is a column vector with all elements equal to 1, and μ is the damping factor, which we set to 0.85. The final transition matrix is given by Eq. (3), and each score of a sense in a specific domain is obtained from the principal eigenvector of the matrix M:

M = μ U^T + ((1 − μ)/|S|) e e^T.  (3)

The procedure is applied to all of the topics. We selected a certain number of words (senses) according to their rank scores as the subject words of a document.
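For illustration, the saliency scores of Eq. (2) can be obtained by power iteration, which converges to the principal eigenvector of the matrix M in Eq. (3). The following is a minimal sketch under our own naming (the paper does not publish code); affinity is assumed to hold the pairwise cosine similarities f(i → j) between gloss texts.

import numpy as np

def mrw_scores(affinity, mu=0.85, n_iter=100, tol=1e-8):
    # affinity: |S| x |S| float array of cosine similarities between senses.
    S = affinity.shape[0]
    A = affinity.astype(float).copy()
    np.fill_diagonal(A, 0.0)                 # f(i -> i) = 0: no self-transition
    row_sums = A.sum(axis=1, keepdims=True)
    # Row-normalize (Eq. (1)); all-zero rows get the uniform smoothing vector 1/|S|.
    U = np.divide(A, row_sums, out=np.full_like(A, 1.0 / S), where=row_sums > 0)
    lam = np.full(S, 1.0 / S)                # initial saliency scores
    for _ in range(n_iter):
        new = mu * U.T @ lam + (1.0 - mu) / S   # Eq. (2); e is the all-ones vector
        if np.abs(new - lam).sum() < tol:       # stop once the scores stabilize
            return new
        lam = new
    return lam

The topmost-ranked senses are then taken as the subject words of a document, as described above.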
TopicandSubjectDetectioninNewsStreamsforMulti-documentSummarization
167
2.2 Topic Detection
Topic Bursts
He et al. proposed a method for finding bursts by using the Moving Average Convergence/Divergence (MACD) histogram, a tool from technical stock market analysis (He and Parker, 2010). The MACD histogram refers to the difference between the MACD and its moving average, and is defined by Eq. (4):

hist(n_1, n_2, n_3) = MACD(n_1, n_2) − EMA(n_3)[MACD(n_1, n_2)].  (4)

EMA(n_3) refers to the n_3-day Exponential Moving Average (EMA). For a variable x = x(t) with a corresponding discrete time series x = {x_t | t = 0, 1, ...}, the n-day EMA is defined by Eq. (5):

EMA(n)[x]_t = α x_t + (1 − α) EMA(n − 1)[x]_{t−1} = Σ_{k=0}^{n} α(1 − α)^k x_{t−k},  (5)

where α is a smoothing factor, often taken to be 2/(n + 1). MACD(n_1, n_2) in Eq. (4) denotes the difference between the n_1-day and n_2-day exponential moving averages, i.e., EMA(n_1) − EMA(n_2). We applied this model to detect topic words.
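A minimal sketch of Eqs. (4) and (5), assuming x is the daily frequency series of a candidate term; the default window sizes (12, 26, 9) are the conventional stock-analysis settings and are placeholders only, since the paper does not state the values used.

import numpy as np

def ema(x, n):
    # n-day exponential moving average, Eq. (5), with alpha = 2 / (n + 1).
    alpha = 2.0 / (n + 1)
    out = np.empty(len(x), dtype=float)
    out[0] = x[0]                            # seed the recursion with the first value
    for t in range(1, len(x)):
        out[t] = alpha * x[t] + (1.0 - alpha) * out[t - 1]
    return out

def macd_histogram(x, n1=12, n2=26, n3=9):
    # Eq. (4): hist = MACD(n1, n2) - EMA(n3)[MACD(n1, n2)],
    # where MACD(n1, n2) = EMA(n1) - EMA(n2).
    macd = ema(x, n1) - ema(x, n2)
    return macd - ema(macd, n3)

Intervals where the histogram is positive correspond to positive acceleration, i.e., the bursts used to build the bursts histogram below.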
Topic Detection
The procedure for topic detection is illustrated in Figure 1. Let A be the set of documents to be summarized. A set of topic words TS is detected as follows:
1. Create a MACD histogram where the X-axis refers to a period of time of length T, and the Y-axis denotes the frequency of documents concerning the target topic (hereafter referred to as the correct histogram, as shown in Figure 1).
2. Each term in the TDT corpus is weighted by using the TF-IDF scheme. For each target topic, the terms within the documents assigned to the target topic are sorted in descending order to make a term list.
3. Given a number k, we extract the topmost k terms from the term list. For each term, we apply the following procedures.
(a) Create a MACD histogram where the X-axis refers to a period of time of length T, and the Y-axis denotes bursts (hereafter referred to as the bursts histogram, as shown in Figure 1).
Figure 1: Similarity between correct and bursts histograms.
(b) As illustrated at the bottom of Figure 1, compute the similarity between the correct and bursts histograms by using the Bhattacharyya distance, ρ(p, q) = Σ_{i=1}^{T} √(p_i q_i), where p and q are the normalized distributions of the correct histogram and the bursts histogram, respectively (we also tested histogram intersection and KL-distance; we report only the Bhattacharyya results, as they were the best). p_i refers to the frequency of documents that arrive at time i, and q_i indicates the burst of the topic at time i. If the value of ρ(p, q) is larger than a certain threshold value, the term t is regarded as a topic word.
In procedure (b), we assume that the bursts histogram of a term t is close to the correct histogram if t is a topic term, because the bursts histogram obtained by procedure (a) reflects the bursts of the term t over the set of documents A. Thus, we can assume that it is similar to the histogram obtained by using the frequency of the documents in A concerning the target topic.
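A small sketch of the similarity computation in step (b), with toy data; the threshold is a free parameter that the paper tunes but does not report.

import numpy as np

def bhattacharyya(p, q):
    # rho(p, q) = sum_i sqrt(p_i * q_i) over the normalized histograms.
    p = np.asarray(p, dtype=float); p = p / p.sum()
    q = np.asarray(q, dtype=float); q = q / q.sum()
    return float(np.sqrt(p * q).sum())

correct_hist = [3, 7, 12, 6, 2, 1]   # documents per time step (toy data)
bursts_hist  = [2, 6, 11, 7, 3, 1]   # bursts of a candidate term (toy data)
if bhattacharyya(correct_hist, bursts_hist) > 0.9:   # illustrative threshold
    print("treat the term as a topic word")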
2.3 Sentence Extraction
Each sentence concerning the target topic is represented using a vector of frequency-weighted words that can be subject or topic words. Similar to the procedure of ADSS, we used the MRW model to compute the rank scores of the sentences: given a document set D, a graph G consists of a set of vertices S, where each vertex s_i in S is a sentence in the document set. After the saliency scores of the sentences have been obtained, a certain number of sentences are chosen according to their rank scores to form the summary.
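Under the same assumptions as the earlier sketches, sentence ranking can reuse the mrw_scores function from Section 2.1: build one frequency vector per sentence over the selected subject and topic words, use pairwise cosine similarities as the affinity matrix, and keep the top-ranked sentences. All names are again illustrative.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def extract_sentences(sentences, keywords, ratio=0.3):
    # Frequency vectors restricted to the chosen subject and topic words.
    vec = CountVectorizer(vocabulary=sorted(set(keywords)))
    X = vec.fit_transform(sentences)
    scores = mrw_scores(cosine_similarity(X))     # MRW over sentence similarities
    k = max(1, int(len(sentences) * ratio))       # e.g., a 30% extraction ratio
    top = sorted(np.argsort(scores)[::-1][:k])    # best k, back in document order
    return [sentences[i] for i in top]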
KEOD2012-InternationalConferenceonKnowledgeEngineeringandOntologyDevelopment
168
Table 1: Topics assigned to categories in the TDT.
Category Topics
Elections U.S. Mid-Term Elections
Scandals/hearings Olympic Bribery Scandal
Legal/criminal Pinochet Trial
Natural disasters Hurricane Mitch
Accidents Nigerian Gas Line fire
Violence or war Car Bomb in Jerusalem
Science and discovery Leonid Meteor Shower
Finances IMF Bailout of Brazil
New laws Anti-Doping Proposals
Sports Australian Yacht Race
Table 2: The number of candidate senses.
Category Docs Total senses Cand. (%)
Elections 518 3,837 1,970(51.3)
Scandals/hearings 108 3,566 1,775(49.7)
Legal/criminal 860 3,396 1,805(53.1)
Natural disasters 286 3,496 1,851(52.9)
Accidents 82 2,828 916(32.3)
Violence or war 234 3,495 1,741(49.8)
Science and discovery 186 3,955 1,969(49.7)
Finances 540 3,428 1,945(56.7)
New laws 7 3,304 1,646(49.8)
Sports news 326 3,428 1,957(57.0)
3 EXPERIMENTS
We used the TDT3 corpus, which comprises a set of eight English news sources collected from October to December 1998 and consists of 34,600 stories. A set of 60 topics was defined for evaluation in 1999, and another 60 topics for evaluation in 2000. Of these topics, we used 78, each of which is classified into one of 10 categories. Table 1 illustrates the categories and some examples of topics assigned to them.
3.1 Assignment of Domain-specific
Senses
We divided the TDT documents into training and test data for text classification. The training data for each category comprise two-thirds of the documents, and the remainder is the test data. All documents were tagged by TreeTagger (Schmid, 1995). For each category, we collected the topmost 500 nouns with high TF-IDF weights from the TDT3 corpus. We used WordNet 3.0 to assign senses. Table 2 shows the number of training documents, the total number of senses, and the number of candidate senses (Cand.) for which the classification accuracy of the category was higher than the result without word replacement. We used these senses as the input of the MRW model.
Table 3: The result against SFC resource.
Cat ADSS SFC SFC & TDT Recall
Finances 390 125 81 0.648
New law 358 1,628 193 0.437
Science 389 671 176 0.699
Sports 395 1,947 8 1.000
There are no existing sense-tagged data for these 10 categories that could be used for evaluation. Therefore, we selected a limited number of words and evaluated them qualitatively. To this end, we used the Subject Field Codes (SFC) resource (Magnini and Cavaglia, 2000), which annotates WordNet 2.0 synsets with domain labels. The SFC consists of 115,424 words assigned 168 hierarchically organized domain labels, including the labels "finances", "laws", "science", and "sports". We used these four labels and their child labels in the hierarchy, and compared the results with the SFC resource. The results are shown in Table 3. "ADSS" shows the number of senses assigned by our approach. "SFC" refers to the number of senses appearing in the SFC resource. "SFC & TDT" denotes the number of words (senses) appearing in both the SFC and the TDT corpus. We note that the corpus we used was the TDT corpus, while the SFC assigns domain labels to the words appearing in WordNet. Therefore, we used recall as the evaluation measure: the number of senses matched by our approach and the SFC, divided by the total number of senses appearing in both the SFC and the TDT. "Recall" in Table 3 refers to the best performance among the varying numbers of senses according to the rank scores. As we can see from Table 3, the degree to which word replacement improved text classification performance varied: the F-score improvement was 0.06 in the former case, while it was only 0.01 in the latter. One reason is the length of the gloss texts in WordNet; the average length of the gloss texts assigned to "law" was 5.75, while that for "sports" was 8.96. The method of assigning senses depends on the size of the gloss texts in WordNet. Efficacy could be improved if we could assign example sentences to WordNet based on corpus statistics. This is a rich space for further exploration.
It is interesting to note that some senses of words that were obtained correctly by our approach did not appear in the SFC resource because of the difference in WordNet versions, i.e., we used WordNet 3.0 and the TDT corpus for ADSS, while the SFC is based on WordNet 2.0. Table 4 illustrates some examples obtained by our approach that did not appear in the SFC. These observations support the usefulness of our automated method.
For evaluating subject detection, we randomly selected one topic for each category, and manually checked whether the words assigned domain-specific senses are subject words or not.
TopicandSubjectDetectioninNewsStreamsforMulti-documentSummarization
169
Table 4: Some examples obtained by ”ADSS”.
Cat Example of words and their senses
New law fire: intense adverse criticism
break: an escape from jail
Sports era: (baseball) a measure of a
pitcher’s effectiveness
Table 5: Performance of subject detection.
Cat ADSS Subject F F(TF-IDF)
Finances 86 88 0.678 0.302
New law 63 79 0.577 0.208
Science 72 102 0.609 0.319
Sports 92 132 0.705 0.412
Table 5 shows the result of subject detection. "ADSS" refers to the number of subject words identified by our approach that also appeared in the documents assigned to the target topic. "Subject" denotes the number of correct subject words identified by two human judges (a word is determined to be correct if the two judges agree), and "F" shows the F-score. We compared the result obtained by our approach with a simple term weighting method, TF-IDF; "F(TF-IDF)" shows the F-score obtained by using the TF-IDF weighting method. As can be seen clearly from Table 5, the results obtained by our approach were much better than those of the simple TF-IDF term weighting.
3.2 Sentence Extraction
Finally, we report the results of sentence extraction. Because the evaluation data were created manually, we used in the experiment 32 of the 78 topics, namely those having fewer than 646 sentences in their documents. The evaluation was made by two humans. We set the extraction ratio to 30%, and compared our method with the following three approaches to examine how the results of subject and topic detection affect sentence extraction. The average number of sentences at the extraction ratio of 30% was 42.5.
1. Apply MRW model to noun words (Noun)
The method applied the MRW model to the sentences consisting of noun words.
2. Apply MRW model to topic words (Topic)
The method applied the MRW model to the results of the MACD method.
3. Apply MRW model to subject words (Subject)
In contrast to "Apply MRW model to topic words", the method applied the MRW model to the results of ADSS.
Table 6: ROUGE-1 score for 22 topics.
Method Max Min Ave
Noun 0.428 0.180 0.258
Subject 0.485 0.202 0.269
Topic 0.529 0.362 0.421
Subject & Topic 0.750 0.318 0.485

For each of the four methods, including our approach, we divided the 32 topics into two sets: 10 topics to train the
optimal numbers of subject and topic words, and the remaining 22 topics to test sentence extraction performance using the estimated numbers. The best performance obtained by ADSS (Subject) on the training data was a ROUGE-1 score of 0.332, with 10 subject words per topic. Similarly, the best performance of our approach (Subject & Topic) was a ROUGE-1 score of 0.442, with 10 subject words and 5 topic words. Therefore, we used the topmost 10 subject and 5 topic words, and evaluated each method on the test data. The results are shown in Table 6, where "Max" and "Min" denote the maximum and minimum ROUGE-1 scores, respectively, and "Ave" refers to the average score over the 22 topics.
As shown in Table 6, the use of subject words alone did not contribute to sentence extraction performance: the average ROUGE-1 score was 0.269, which was not a significant improvement over the baseline, i.e., the method applying the MRW model to the sentences consisting of noun words. The use of topic words alone did contribute to sentence extraction performance compared to the use of subject words alone. Moreover, the results obtained by using both subject and topic words, i.e., our approach, attained an average ROUGE-1 score of 0.485, and we found that it always outperformed the other methods, regardless of how many documents (sentences) were used. These results clearly support the usefulness of our method.
Figure 2 illustrates the topmost three sentences extracted by each method for the topic "Ukraine Mining Accidents". Sentences on which the system and human judges agree are marked in the figure. Words marked with {} and underlined words refer to subject words and topic words identified by the system, respectively. Figure 2 shows that terms such as "Ukraine", "mines", and "accident", which make up the topic name, are correctly extracted by the "Topic" and "Subject & Topic" methods. Similarly, "Donetsk" and "region", which appear in a particular document, are correctly extracted as subject words by the "Subject" method. Our method correctly extracted salient sentences, while the methods using subject or topic words alone did not achieve perfect extraction. Moreover, the results obtained by "Subject" show that sentences 1 and 2 have similar contents, i.e., the extracted sentences include redundant information, while those obtained by "Subject & Topic" show that sentences 1 and 2 are in a subsumption relationship, i.e., sentence 2 includes additional information, "about 100 more than over the same period last year". These observations again clearly support the usefulness of our method.
KEOD2012-InternationalConferenceonKnowledgeEngineeringandOntologyDevelopment
170
Method: Noun
1 Pieces of coal fell in a deep mine shaft in eastern Ukraine on Monday, killing a miner, officials said.
2 Ukraine’s cash-strapped mines lack the funds needed to improve safety conditions and modernize equipment, which
has led to a steady increase in work-related fatalities in recent years.
3 Two mine workers were killed in separate accidents in eastern Ukraine, pushing this year’s number of mine fatalities
to more than 300, officials said Wednesday.
Method: Subject
1 Two {miners} were killed in {Ukraine’s} crumbling {coal} {mines} over the weekend, bringing to at least 305 the
number of {coal} industry {workers} who have died on the job so far this {year}, officials said Monday.
2 Two {mine} {workers} were killed in separate accidents in eastern {Ukraine}, pushing this {year’s} number of
{mine} fatalities to more than 300, officials said Wednesday.
3 The worst single accident this {year} was in April, when a methane gas blast killed 63 {workers} at a {mine} in the
eastern {Donetsk} {region}.
Method: Topic
1 The accident occurred at the depth of 540 meters (1,782 feet) at the Olkhovatska mine in the coal-rich Donetsk region,
the Emergency Situations Ministry reported.
2 The death toll in coal mine accidents in Ukraine has been rising for years as the cash-strapped government has been
unable to modernize deteriorating mines and improve safety conditions.
3 Work and safety conditions at the former Soviet republic’s coal mines have deteriorated in recent years as the cash-
strapped government has failed to provide sufficient funds to support the largely unreformed industry.
Method: Subject & Topic
1 Two {mine} {workers} were killed in separate accidents in eastern {Ukraine}, pushing this {year’s} number of
{mine} fatalities to more than 300, officials said Wednesday.
2 A total of 301 people have died in {mine} accidents in {Ukraine} so far in 1998, about 100 more than over the same
period last {year}, according to government statistics.
3 The number of {miners} killed on the job has been increasing steadily in recent {years} because {mines} lack the
funds needed to modernize equipment and improve safety conditions.
Figure 2: Topmost three sentences extracted by each method ("Ukraine Mining Accidents").
4 CONCLUSIONS
We have developed an approach to multi-document summarization from continuous news streams. The results showed the effectiveness of the method. Future work will include: (i) comparison with other topic models such as the hierarchical Pachinko Allocation Model (Mimno et al., 2007) and the Two-Tiered Topic Model (Celikylmaz and Hakkani-Tur, 2011), (ii) comparison with other term weighting methods such as Information Gain and the χ² statistic, and (iii) applying the method to other data such as DUC2004 and DUC2007 for quantitative evaluation.
REFERENCES
Allan, J., editor (2003). Topic Detection and Tracking.
Kluwer Academic Publishers.
Celikylmaz, A. and Hakkani-Tur, D. (2011). Discovery of
Topically Coherent Sentences for Extractive Summa-
rization. In Proc. of the 49th ACL, pages 491–499.
He, D. and Parker, D. S. (2010). Topic Dynamics: An Alter-
native Model of Bursts in Streams of Topics. In Proc.
of the 16th ACM SIGKDD, pages 443–452.
Lin, C.-Y. and Hovy, E. H. (2002). From Single to Multi-
Document Summarization: A Prototype System and
its Evaluation. In Proc. of the 40th ACL, pages 457–
464.
Magnini, B. and Cavaglia, G. (2000). Integrating Subject Field Codes into WordNet. In Proc. of the 2nd LREC.
Marcu, D. and Echihabi, A. (2002). An Unsupervised Approach to Recognizing Discourse Relations. In Proc. of the 40th ACL, pages 368–375.
Mimno, D., Li, W., and McCallum, A. (2007). Mixtures of Hierarchical Topics with Pachinko Allocation. In Proc. of the 24th ICML, pages 633–640.
Schmid, H. (1995). Improvements in Part-of-Speech Tag-
ging with an Application to German. In Proc. of the
EACL.
Wan, X. and Yang, J. (2008). Multi-Document Summariza-
tion using Cluster-based Link Analysis. In Proc. of
the 31st ACM SIGIR, pages 299–306.
TopicandSubjectDetectioninNewsStreamsforMulti-documentSummarization
171