DETECTING CORRELATIONS BETWEEN HOT DAYS IN NEWS FEEDS
Raghvendra Mall, Nahil Jain and Vikram Pudi
Centre of Data Engineering, IIIT Hyderabad, Hyderabad, India
Keywords: Deriving hot topics, Derived hotness, Correlated hot events.
Abstract:
We use text mining mechanisms to analyze Hot days in news feeds. We build upon earlier work used to detect Hot topics and assume that we have already obtained the Hot days. In this paper we identify the most relevant documents of a topic on a Hot day. We construct a similarity-based technique for identifying and ranking these documents. Our aim is to automatically detect chains of hot correlated events over time. We develop a scheme using similarity measures like cosine similarity and KL-divergence to find correlation between these Hot days. For the 'U.S. Presidential Elections' topic, the presidential debates, which spanned over a week, were one such chain of events.
1 INTRODUCTION
News feeds are an important source of information
that provide relevant news through electronic media.
The purpose of this paper is to throw light on textual
analysis of Hot days, i.e., days when a particular topic in the news receives a high amount of coverage. Some
work has already been done to obtain Hot days for a
topic in news feeds (Shewart and Wasson, 1999).
Firstly, we identify the most relevant documents
of a topic on a Hot day. Throughout the paper we use
the topic of ‘U.S. Presidential Elections’ as an exam-
ple. We represent each document on a Hot day by
a vector space model comprising noun and adjec-
tive phrases. We then compare the document’s vec-
tor with the vector corresponding to the combination
of all the documents related to that topic represent-
ing the background information. We use the standard
cosine similarity measure for comparison to estimate
the relevance of each document. The documents are
ranked based on their extent of similarity with the
background, which indicates their relevance. Consider, for example, a day with 1000 documents, of which around 100 are relevant to the topic 'Presidential Elections'. We rank these documents according to their relevance and select only the top k percent. By selecting just the most relevant documents we restrict attention to the major events related to the topic on that Hot day.
Secondly, we make efforts to extend the concept
of Derived Hot days as developed in (Mall et al.,
2009). The purpose of our work is to observe chains
of Hot correlated events based on the concept of De-
rived Hot days as in (Mall et al., 2009). We use the
k percent of the most relevant documents to build a
vector space model for that Hot day. We compare
this with the vector space models of the k most rel-
evant documents of the corresponding Derived Hot
Days using measures of similarity like cosine simi-
larity and KL-divergence. This helps us to estimate
whether contextually there is correlation between the
Hot day and corresponding Derived Hot days. For ex-
ample, the Presidential debates which spanned over a
period of one week and included three debates were
discovered as correlated events using our methodol-
ogy. Our contributions are the following:
• We use NLP techniques to develop ranking methods for the documents relevant to Hot days.
• We compare the similarity of ranked relevant documents between Hot days and Derived Hot days.
• We identify chains of Hot correlated days.
Detailed information about detecting trends and variations in news feeds is provided in (Mall et al., 2009). We use the AG's corpus of news articles (Gulli, 2005), with 'U.S. Presidential Elections' as the topic under consideration. The corpus contains more than 1 million news articles from at least 2000 different news sources. There were about 10,000 documents relevant to 'U.S. Presidential Elections' spanning a period of 120 days. In the next section, we
introduce our proposed technique to identify and rank
the most important documents on the basis of their
relevance. We also discuss measures used to detect
similarity between Hot Days and Derived Hot days.
2 PROPOSED APPROACH
We first explain the use of NLP methodologies to
detect the most important documents and rank these
documents on the basis of their relevance. Then we
compare the similarity of the ranked relevant documents between pairs of Hot days and between Hot days and Derived Hot days to determine the correlation between them. This helps us to estimate chains
of Hot correlated days.
2.1 NLP Techniques to Identify and
Rank Important Documents
Once we know the days on which the topic of interest is Hot, we process all the documents of that day to identify the most important ones. A standard tagger, TnT, is used for part-of-speech tagging, with the default model shipped with it. For each Hot day we select the relevant documents (those whose score is greater than 0.5). We tokenize each document so that there is only one token per line, making it suitable for the TnT tagger. The tokens of each document are kept in a single file; this file is used as the test file to be tagged, and we use the trigram model to provide a tag for each token.
Once we have provided the tags we select noun
noun (NN) phrases and adjective noun (JN) phrases
for each document. These are the most important
phrases or concepts which can replicate the main
content of the documents efficiently. The same NN
phrase or JN phrase can occur multiple times in a doc-
ument for a given Hot day. So we maintain the fre-
quency of occurrence of each phrase in a document.
We can now estimate the total occurrence of each such
phrase in all the documents for that day.
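To make the phrase extraction concrete, below is a minimal Python sketch of this step. It is not our exact implementation: it substitutes NLTK's default part-of-speech tagger for the TnT trigram tagger, and the function names and bigram heuristic are illustrative assumptions.

    from collections import Counter
    import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' data are downloaded

    def extract_phrases(text):
        """Return frequencies of noun-noun (NN) and adjective-noun (JN) bigram phrases."""
        tagged = nltk.pos_tag(nltk.word_tokenize(text))  # stand-in for the TnT trigram tagger
        phrases = Counter()
        for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
            if t2.startswith('NN') and (t1.startswith('NN') or t1.startswith('JJ')):
                phrases[(w1.lower(), w2.lower())] += 1  # NN or JN phrase
        return phrases

    def background_phrases(documents):
        """Total occurrence of each phrase over all relevant documents of a Hot day."""
        total = Counter()
        for doc in documents:
            total.update(extract_phrases(doc))
        return total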
We have a bag of phrases model for each relevant
document and for all relevant documents (background
information) on a Hot day. Now, we define a mech-
anism to rank the relevant documents. We convert
the bag of phrases to a vector space model assigning
0 for those phrases not present in the document and
assigning frequency of the phrase for those present in
the document. We then compute the cosine similarity between the vector for a given document and the vector corresponding to the relevant background information. Mathematically, it is represented as:
D_i = vector of phrases present in document D_i
N = vector of phrases present in all documents

Score(D_i) = CosineSimilarity(D_i, N)    (1)
The greater the score for a document D_i, the greater the importance or relevance of that document for that given Hot day: the greater the similarity with the background information of that day, the higher the rank.
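As an illustration of Equation (1), the following Python sketch scores and ranks documents against the background vector. It is a simplified sketch, not our exact implementation; the sparse Counter representation and helper names are assumptions.

    import math
    from collections import Counter

    def cosine_similarity(a: Counter, b: Counter) -> float:
        """Cosine similarity between two sparse phrase-frequency vectors."""
        dot = sum(a[p] * b[p] for p in a.keys() & b.keys())
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def rank_documents(doc_vectors, background):
        """Rank a Hot day's documents by Score(D_i) = CosineSimilarity(D_i, N)."""
        scored = [(cosine_similarity(doc, background), i) for i, doc in enumerate(doc_vectors)]
        return sorted(scored, reverse=True)  # most relevant documents first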
2.2 Similarity between Hot Days and
Derived Hot Days
We now try to estimate whether Hot days are actually correlated or not, based on context. For this
purpose we use the variable parameter k which is used
to limit the selection of the ranked relevant docu-
ments. For example, if we set k = 10 then only the top
ranking 10% of the relevant documents are used. We
then construct a vector from the top ranking k percent
of the documents in a manner similar to that described
in the previous subsection. We then consider consecutive Hot days in pairs; as there are 31 Hot days, we have 30 such pairs. We calculate the cosine similarity between each such pair of days: the greater the similarity, the stronger the actual correlation between those Hot days. The parameter k plays an important role in the quality of the correlation.
A similar process is followed to determine correlation between Hot days and Derived Hot days. Here it is important to mention that the existence of a Derived Hot day corresponding to a given Hot day depends on a statistical criterion. So once we have identified
the Hot Day and Derived Hot day pair, we again con-
struct the vector using the top ranking k percent of the
relevant documents for those days and calculate the
cosine similarity or the KL divergence.
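A minimal sketch of this day-level comparison, reusing the phrase-Counter conventions and the cosine_similarity helper of the previous sketches (the k-percent cut-off and helper names are illustrative):

    from collections import Counter

    def day_vector(ranked_doc_vectors, k_percent):
        """Merge the phrase counts of the top k percent of ranked documents into one day vector."""
        top_n = max(1, int(len(ranked_doc_vectors) * k_percent / 100))
        merged = Counter()
        for doc in ranked_doc_vectors[:top_n]:
            merged.update(doc)
        return merged

    # Correlation between a Hot day and the next Hot day (or a Derived Hot day):
    # score = cosine_similarity(day_vector(docs_day_a, k), day_vector(docs_day_b, k))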
3 EXPERIMENTAL RESULTS
We performed our experiments on AG's corpus of news articles, using 'U.S. Presidential Elections' as the topic of focus. The presidential elections were actually held during the period when the data was collected, and this was thus chosen as the topic under consideration. To identify chains of Hot correlated days, we
used cosine similarity and KL divergence measures.
3.1 Cosine Similarity Results
Figure 1: Cosine Similarity Measure.

From Figure 1, we see that the square brackets represent the Hot days, for which the value is 1. The other symbols are used to represent the percentage of top-ranked
relevant documents selected for that Hot day. We can
see that their values represent the cosine similarity be-
tween the previous Hot day and that Hot day. We
observe that the cosine similarity values increase as we increase the value of k from 5 to 20 percent. This follows logically: the more ranked documents we include for a given Hot day, the better the representation of the concepts of that day. An important remark is that for k >= 10, we have good cosine similarity values between consecutive Hot days and can effectively predict whether the Hot days are correlated or not. For example, there is very high correlation (nearly 0.8) between the 75th and 76th days for k = 20%.
When we observe the cosine similarity between Hot days and Derived Hot days, we see that most of the Derived Hot days which have been estimated by the criterion (UCL_m) defined in (Mall et al., 2009) are actually correlated to their previous Hot days. Any cosine similarity value greater than 0.25 can be considered good correlation between the days. For example, there is good correlation between the 48th and 50th Hot days (the days of the first Presidential debate) and the 53rd and 54th Derived Hot days (the days of the second Presidential debate), even when the threshold k is set to as low as 5 percent. Similarly, there is high correlation between the 91st day, which is the Hot day whose main events are post-election controversies about tampering with votes, passing new bills, etc., and the 96th and 97th days, which are Derived Hot days where the electoral controversies continue to evolve and new bills are passed. Thus we can identify the quality of similarity between the Hot days, which helps to determine chains of Hot correlated events over time.
3.2 KL Divergence Results
The KL divergence mathematically represents the
average logarithmic difference between the proba-
bilities P and Q. Here P represents the probability
distribution for the phrases of a previous Hot day
and Q represents the probability distribution of the
phrases of a given Hot day when comparing the diver-
gence between two Hot days. While P represents the
probability distribution of the phrases for a Hot day,
Q represents the same for its Derived Hot Day when
estimating correlation between them. The probability
distribution comprises only those phrases which are common to both the Hot day and the Derived Hot day. The probability of each phrase for a given day is calculated as the ratio of the occurrence of the phrase in the top k percent of ranked relevant documents to the total frequency of all the phrases in the top k percent of ranked relevant documents. We also handle the special case of zero probability, i.e., the case when the two distributions P and Q have no phrase in common. In other words, when there are no matching terms between those days, we associate a divergence of 1.0 (no correlation) with that pair of days. Mathematically, it is represented as:
D_KL(P||Q) = Σ_i P(i) log(P(i) / Q(i))    (2)
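A minimal sketch of this computation, assuming the phrase-frequency Counters of the previous sketches and the zero-overlap convention described above (helper names are illustrative):

    import math
    from collections import Counter

    def kl_divergence(day_p: Counter, day_q: Counter) -> float:
        """D_KL(P||Q) of Equation (2), summed over phrases common to both days."""
        common = day_p.keys() & day_q.keys()
        if not common:
            return 1.0  # no shared phrase: treat as maximum divergence (no correlation)
        total_p = sum(day_p.values())  # normalise by the day's total phrase frequency
        total_q = sum(day_q.values())
        divergence = 0.0
        for phrase in common:
            p = day_p[phrase] / total_p
            q = day_q[phrase] / total_q
            divergence += p * math.log(p / q)
        return divergence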
From Figure 2, we observe that KL divergence results
are relatively better than cosine similarity results. The
Hot days are represented by square boxes, Derived
Hot days by + symbols (attaining a value of 0.7)
and the other symbols represent the similarity mea-
sures for different values of k. We see that in the case of consecutive Hot days, k = 15% gives the lowest divergence values when days are actually correlated and gives a divergence of 1.0 when they are not. The reason is that as we allow a larger percentage of the ranked relevant documents to be selected, i.e., k > 15, some phrases with very low probability may be shared between Hot days, but most of them are not; as a result, the divergence value is higher. Any value less than 0.2 can be considered good correlation, since the smaller the divergence, the greater the similarity and the stronger the correlation.
Figure 2: KL Divergence Measure.

In the case of Hot days and Derived Hot days, the method works extremely well and identifies as Derived Hot days only those days which have actually
very low divergence from their corresponding Hot
day. So a lot of statistically determined Derived Hot
days are actually not correlated to Hot days. However, our examples of the 48th and 50th Hot days and the 53rd and 54th days are among those which satisfy the KL divergence criterion and are indeed highly correlated.
3.3 Evaluation of Correlations
Figure 1 and Figure 2 depict the correlations between
Hot Days and Derived Hot Days. The cosine similar-
ity and KL divergence models are estimated for dif-
ferent values of k. This k represents the percentage of
the top ranked most relevant documents on that day
which help to build the models. If we use a high value of k, we can have over-fitting, meaning that terms which are not so relevant may crop up in the models. On the other hand, if we use a very low value of k, we can have under-fitting, where many relevant terms may be missing from the models. So we select k = 15% for our analysis.
Cosine similarity values less than 0.25 denote less significant correlations. We notice that there are some pairs of days (e.g., the 62nd and 68th days) which have a similarity value of 0, i.e., no phrase in common, which denotes that contextually these days have no correlation. So these days have been incorrectly labelled as Derived Hot days. However, weak correlations on the topic exist for some pairs of days (e.g., the 14th and 19th days, with similarity 0.19). A KL divergence value of 1.0 means that there is maximum divergence, or no correlation, between a pair of days; such days are again incorrectly labelled as Derived Hot days. The KL divergence method shows extremely low divergence values for the 91st Hot day and the corresponding 96th and 97th Derived Hot days, indicating very high correlation between that Hot day and its Derived Hot days, which follow up on its events. According to the cosine similarity and KL divergence measures, there are 8 pairs of strongly correlated Hot days and Derived Hot days. However, cosine similarity indicates weak correlation for 4 such pairs of days, while KL divergence indicates feeble correlation for only 1 such pair. Thus the KL divergence metric is more effective than the cosine similarity measure in identifying strongly correlated days.
4 CONCLUSIONS
In this paper, we built upon the concept of Hot days identified by means of the UCL_m metric. We obtained the
most relevant documents for a particular topic on a
given Hot day. We used NN phrases and JN phrases
to identify concepts from relevant documents for a
Hot day. Using these phrases and cosine similarity we
developed a ranking methodology to order the docu-
ments on the basis of their relevance. We then estimated
whether there is actual correlation between Hot days
and between Hot days and Derived Hot days using
these ranked relevant documents. We used measures
of similarity like cosine similarity and KL divergence, and observed that KL divergence provided better-quality results for the estimation of correlation. This helps to estimate chains of Hot correlated events.
REFERENCES
Gulli, A. (2005). AG's corpus of news articles. http://www.di.unipi.it/gulli/AG corpus of news articles.html.
Mall, R., Bagdia, N., and Pudi, V. (2009). Variations and trends in hot topics in news feeds. In Fifteenth International Conference on Management of Data.
Shewart, M. and Wasson, M. (1999). Monitoring a newsfeed for hot topics. In Fifth International Conference on Knowledge Discovery and Data Mining.