NEWS RECOMMENDING BASED ON TEXT SIMILARITY

AND USER BEHAVIOUR

Dušan Zeleník and Mária Bieliková

Institute of Informatics and Software Engineering, Faculty of Informatics and Information Technologies

Slovak University of Technology, Ilkovičova 3, 842 16 Bratislava, Slovakia

Keywords: Recommendation, Personalization, Behaviour, Monitoring, Similarity, News web portal, News, Readers.

Abstract: In this paper we describe a method for recommending news on a news portal based on our novel

representation by a similarity tree. Our method for recommending articles is based on their content. The

recommendation employs a hierarchical incremental clustering which is used to discover additional

information for effective recommending. The important and novel part of our method is an approach to

discovering the interests of individual readers using tree structure created according to similarity of articles.

We concentrate on enabling the recommendations in any time, i.e. we discover user’s interests real-time.

Our method discovers specific interests of the reader using information gained from monitoring his

activities in the news portal. We describe the mechanisms for recommending up-to-date and relevant

articles. It is based on known solutions, but incorporates unique representation of user interests by binary

tree. Moreover, our aim was to provide recommendations in real-time. Recommendations are thus generated

depending on the actual reader’s interest. We also present an evaluation of recommendations in the

experiment where we use accounts of real readers and their history of reading.

1 INTRODUCTION

Making personalized recommendations is nowadays

becoming increasingly popular topic. There are

several reasons for this. The main reason is the size

of information space containing sources to be

recommended and the inability of a human to

browse this space in full in order to find relevant

information. Our main concern is to facilitate the

exploring activity of the user through the

information space by recommendations with stress

on changes of the user interests in time.

Recommendations are also a perfect tool for

marketers. Targeted advertising is linked to the

analysis and consumer needs. Thus, especially in e-

shops and web business, recommendation of goods

commonly takes place.

Area of news is a typical example of a

comprehensive information space. Online

newspapers aim at keeping their readers interested.

They dedicate effort to search for improvements, in

particular, to bring comfort. Amounts of articles

which are added daily are thus processed to be

recommended to users according to their needs.

In the field of news, we should consider time

sensitivity of articles. We can expect that our content

changes dynamically and our users change their

interests in time too. Recommending articles is then

time sensitive and should be done real time to

preserve relevancy of the news.

In this paper we describe our proposal of

content-based recommending. Recommendations are

made based primarily on a history

of the reading.

For recommendation decision we use the individual

user activity (recent articles read) to predict the

content he is interested in. We choose content-based

recommender due to the increasing opportunities in

the processing of content. Besides, regarding news,

content is definitely important and valuable.

Our method for news recommending uses

incremental hierarchical clustering. Clustering is

carried out using textual similarities and is adapted

to allow rapid up-to-date and personalized

recommendations.

As a part of the sme.fiit project (Barla et al.,

2010), which aims at news recommendation in

largest electronic Slovak newspaper (www.sme.sk)

we present in this paper a recommender called

TRECOM. TRECOM is based on monitored user

behaviour and processed news articles represented

302

Zeleník D. and Bieliková M..

NEWS RECOMMENDING BASED ON TEXT SIMILARITY AND USER BEHAVIOUR .

DOI: 10.5220/0003339403020307

In Proceedings of the 7th International Conference on Web Information Systems and Technologies (WEBIST-2011), pages 302-307

ISBN: 978-989-8425-51-5

 2011 SCITEPRESS (Science and Technology Publications, Lda.)

effectively considering the news similarity. Data

used for the recommendation are taken from the web

site SME.sk and contain news texts and the user

history (logs of news reading). We employ a

reader’s activities history from the news portal

without any feedback, but the intention is similar to

the one presented in the related work (Carvalho et

al., 2005) where web logs and user history is used.

2 RELATED WORK

Generating personalized recommendations is related

to the observation of a single user behaviour

followed by items suggesting using either content-

based or collaborative filtering methods (Su and

Khoshgoftaar, 2009). In advance there are also

hybrid techniques, which could be used to

recommend items (Burke, 2002). These methods

often combine both principles to avoid negative

aspects in both types.

Collaborative filtering methods for news

recommending are based on presumption that the

majority of similar users have found something

interesting for the rest (Suchal and Návrat, 2010).

Actually, this is more about predicting the behaviour

than about discovering the interest or needs. This

approach has one advantage in comparison to

content-based approach. We are able to surprise the

user and keep the relevancy of the article in the same

time (Ge and Delgado-battenfeld, 2010).

Mooney (Money and Roy, 2000) proposed a

method for book recommendation where each text is

processed and represented using text categorization.

They claim that content-based recommenders are

best at recommending unpopular items when there is

not sufficient information about users, but content

information is easy to obtain. In our case, we have

news relatively easy to process.

There are more options how to calculate

similarity for texts. As it was mentioned, in related

work (Tintarev and Masthoff, 2006) even simple

solutions like Bag Of Words are accurate enough for

news recommending, considering the fact that it has

low complexity. Complexity is important if we want

to provide real-time recommendations.

There is always a problem with unknown or new

users (Adomavicius and Tuzhilin, 2005). These

users have not been monitored to enable any

recommender to estimate their interests or needs.

Typical solution is to recommend random, the

newest or the most popular items.

Serious problem is also overspecialization which

happens especially in content-based recommenders.

This could be solved by randomly generated

recommendations and omitting the items which are

very similar to those which the reader already saw

like in DailyLearner (Billsus and Pazzani, 2000).

There are also other aspects of user state which

should be considered when we want to recommend

news. As it was mentioned in paper on news

recommending (Jancsary et al., 2010) there are

context-sensitive features. To involve these aspects

we need to find them in real-time. Respectively, we

have to affect the recommendations in real-time.

3 NEWS PORTAL

We recommend articles which are at the web-based

news portal. We have to face time sensitivity,

variety and amounts of articles and readers. Our

users are readers of this news called SME. There are

 around 350 thousands of visits every day;

 authors add around 250 new articles every day in

430 different categories (combination of category

and section);

 average user reads 2 articles per a day and

spends almost 17 minutes at this site a day.

Articles comprise information about the time of

publication, author, section, category and more.

Time of publication is important attribute. It defines

time sensitivity for this domain. Old articles lose

importance over time, despite of the relevance of

their content for specific user. Generally we have to

find personalized and the most recent articles.

For recommending recent articles in dynamic

environment of news we should use a representation

of articles which allows incremental adding articles.

Retrieving articles and searching user interests has to

be based on algorithms with low complexity to be

able recommend in real time with preserving

recency of the news. Besides mentioned time

sensitivity we should reflect changing user interests.

Readers of the news do not want to be

overwhelmed by the same information. There is a

need to vary articles which are recommended. The

reader gradually uses the recommendations, so it is

appropriate to vary these recommendations over

time. Not because of the news recency only, but also

because of the changing or deepening user interest.

Constrained list of recommended items should cover

majority of momentary user’s interests.

NEWS RECOMMENDING BASED ON TEXT SIMILARITY AND USER BEHAVIOUR

303

4 METHOD FOR NEWS

RECOMMENDING

The first phase is to discover the interests of

individual users. This is done by monitoring the

activity of each reader. Articles that readers display

are located in a hierarchical structure we designed.

This structure keeps relations between similar

articles. We discover user interests using the records

of user activity and the hierarchy. We describe the

way how to locate articles that are appropriate to the

reader. Another task is to compile list of articles.

The number of recommendation should be

constrained to a limited number. Therefore, we need

to find the equilibrium between recency and

relevancy to maximize precision.

4.1 Discovering Interests

A prerequisite for our method is that each individual

has some interests. This can be easily verified using

a history of readings of particular readers. We can

follow the interest in certain categories or sections of

news portal. Figure 1 presents the selected reader

and records of his activities during the period of 15

days. We can see that the interest of certain

categories prevails over the others.

Figure 1: Top (of 40) categories displayed by the reader.

Similarly, there are identifiable fields of interests

for each reader. It makes sense to explore more

interests based on the calculated similarity between

articles. We substituted this metadata (categories)

made by editors by the hierarchy of similarity

relations which provides its own metadata.

We use a hierarchy of relations, which is

incrementally built, similarly to the hierarchy

presented by Sahoo (Sahoo et al., 2005). In our

hierarchy, we can rely on the repository which

contains current articles and also assume that they

are properly organized. Set of words extracted from

articles and normalized are used as features to

compute similarities among articles. Each node in

the tree is labelled by a set of features. Edges in the

tree represent the hierarchy which keeps similar

articles nearby. We designed our representation as a

hierarchy where

 real articles are placed at the lowest level of the

tree (leaf nodes),

 features are spread to the meta level of the

structure,

 similarity is kept in the hierarchy.

There are several options how to calculate similarity

based on the content itself. There are also

sophisticated methods, which are able to determine

semantic similarity (Gabrilovich and Markovitch,

2007). However, simple text similarity is often used

in news recommending and it gives good results

(Kroha and Baeza-Yates, 2005). We use Jaccard’s

similarity to calculate articles similarity.

Figure 2 shows a way of discovering the reader's

interests. We use the tree structure created using the

similarity of articles and records of user activity. We

have a hierarchy of nodes which effectively

represent similarity of real articles even without

actual calculation between particular pairs. Thick

edges are paths from the displayed article to the root

of the tree. Nodes where are thick edges merged are

fields of interests.

Figure 2: Discovering interests. Black nodes represent

already displayed articles.

In this manner, we discover interests for each

user using his history of reading. One user interest is

one node in the tree which is used to define the set

of articles belonging to this interest (articles in the

subtree). Since we use a tree structure we work with

hierarchy of these interests.

4.2 Retrieving Suggestions

When considering reader’s fields of interests we can

compare fineness of the interests depending on the

depth where the node of the tree is located. Fields of

interest that are closer to the leaves of the tree are

more focused on a particular topic (e.g. articles

about hockey). Fields of interests that are closer to

the root are dedicated to more general

topics (e.g. articles about the sport). This structure

has some useful properties.

WEBIST 2011 - 7th International Conference on Web Information Systems and Technologies

304

Recommending specific articles using our

proposed hierarchical structure is a matter of

selecting articles from more interests. We find the

relevant interest for a particular reader. The

relevance of the interest is calculated as the ratio of

articles displayed from this interest and all articles

belonging to the interest.

Thus, we are able to sort interests and prepare for

the selection of appropriate articles. There are

obviously plenty of appropriate articles, since rapid

growth of the news dataset (250 new articles per

day). Figure 3 presents the selection of interesting

articles for a specific reader.

Highlighted articles are those which the user has

read. These articles are used to determine the

relevance of his interest. Other articles are

potentially interesting for the reader. We selected the

articles, which are the subject of further

recommendations.

Figure 3: Selecting interesting articles.

Because of the need to avoid overspecialization,

we penalize very similar interests. Otherwise,

recommended articles would have been closely

similar to those already displayed by the reader.

4.3 Compiling Recommendations

The reader has sometimes problem also with long

list of recommendations (Bollen et al., 2010).

Therefore, we choose articles that cover all relevant

interests of the reader but are from distinct fields.

We also integrate the time as an attribute in the

compilation of the list of recommendations. Time is

an important attribute, which could indicate whether

the interest that we discovered is outdated or not.

We introduce the additional information that is

maintained in a hierarchical structure. We can find

the latest article which was added for each branch of

the tree. Time attribute is spread as maximum of two

sub-branches. This way we can efficiently identify

the most recent article and the time when it was

published. Interest is then as relevant as the last

added article. We gain the possibility to combine

time relevancy and the content relevancy. To create

a list of recommended articles we considered both

attributes. The method is described in the following

steps and Figure 4.

1. Selection of articles displayed by a reader

2. Discovery of areas of interests in the tree

3. Selection of unread articles for each interest

4. Sorting articles by time in particular interest

5. Creation of a matrix containing interest

6. Linking the columns of the matrix into a list as

illustrated in Figure 4.

Figure 4: Compiling the mix of recommendations. Articles

in rows belong to the same field of interest. The most

relevant interest is at the top. The most recent articles for

each interest are on the left side. Articles 1-9 are formed

into list of recommendations by columns.

The list covers interests of the user. Our method

is designed to recommend 10 items for one request

to avoid choice overload (Bollen et al., 2010).

Articles are not only from one theme but cover more

topics. Articles are even up-to-date.

We are able to create the list in every moment

and in real-time. This is mainly because of the

hierarchy which we use to represent relations

between articles. Operations are fast enough to

generate the matrix of recommendations. Actually,

we change the whole matrix only in cases where the

user initiates a new session, or exhausts the list.

5 EVALUATION

We performed several experiments conducted in the

real environment of the news portal. This brings real

data as articles, readers and their activity (thousands

of articles and readers).

One way of evaluation, we made, is the real

usage of the method and a user feedback. We

selected few readers and formed the controlled

group. This group had to evaluate recommended

articles and become familiar with our recommender.

NEWS RECOMMENDING BASED ON TEXT SIMILARITY AND USER BEHAVIOUR

305

Users have had a chance to use our recommender

through the browser plug-in. This plug-in works as

an extension to the news portal. We enriched the

news portal with a list of recommendations. We also

added a simple voting control to each article only for

the evaluation purposes.

The experiment was conducted with 10 people

who rated 88 recommended articles during ordinary

reading. Readers evaluated recommended articles

using binary values (appropriate, inappropriate).

Reader had a list which was changed every hour. It

was not obligate to evaluate every article from the

list. We took 88 rated articles and 62 articles were

positively evaluated. Our accuracy with this

controlled group was 70%. It means that 70% of

recommended items covered interests of readers.

Our second experiment was based on synthetic

tests. In this case we simulated the feedback

received from readers on the basis of their actual

behaviour in past. Since we are talking about

simulating the evaluation, we use many more

readers than in previous experiment. The entire test

was executed with a set of 1, 000 active readers and

their reading records (5 days, 20 articles per day for

average reader).

We divided the records from complete history of

readers into two smaller intervals. The first interval

is denoted as training interval and second as a test

interval. The first interval is used to generate

recommendations using our method. It is the same

list of recommendations which would appear when

using the website at the end of this period. Real

history in test interval is then compared with

recommendations generated using training interval.

However, the reader displays thematically

similar articles to the recommended articles. To

examine whether recommendations cover reader’s

interests or not, we did not compared exact articles.

We compared articles using the similarity. We did

not use our relations to compare articles. We used

pair of section and category provided by the news

website to be objective.

We used 1,000 active readers and their history.

To be accurate, there are around 430 valid

combinations of sections and categories. It means

there are around 430 options to pick correct

combination.. We evaluated if the recommendation

is the same combination of section and category as

the article in testing period.

Figure 5 indicates the precision and recall for

more testing intervals. We can compare the length of

the intervals used to calculate the recommendations.

We see that the precision is growing up to 60%.

The recommendation is correct in 60% of the cases.

Figure 5: Precision and recall plotted in the chart.

To make a better picture we have compared our

results with results of the other content-based

recommender which uses the same dataset and

evaluation method (Kompan et al., 2010). We have

observed that our method has significantly higher

recall for shorter testing intervals. Our recommender

was able to cover user’s interests also in 1 hour

interval. This happens because our recommender

uses composition to cover as much user’s interests

as it is possible.

From a user’s perspective it is often important to

know the method for recommending and how these

recommendations are calculated (Ahn et al., 2010).

Our user is willing to accept advice if he knows how

the machine discovered this advice. We discovered

that sometimes just an outline of the solution helps.

We found out that we are logging also articles

already recommended. These records are used again

for recommending. Discovering of interests is

inappropriate when the calculation works with these

articles. In fact, recommenders should not replace

the standard navigation, but they should satisfy the

user with an additional functionality. Otherwise, the

information space may be undesirable narrowed by

these recommendations. One solution could be the

addition of random articles, which would allow the

user to navigate into these hidden areas.

We also observed the problem with non-active

readers. This means that the interval used for

recommending was not sufficient to discover

interests. We used only 5 days of the history of

reading. The average user reads only two articles per

a day. Shorter intervals are not sufficient because

our method is not able to discover enough interests.

6 CONCLUSIONS

In this paper we described a method for personalized

news recommending. We focused on the content of

articles and the user’s interests. We used an effective

WEBIST 2011 - 7th International Conference on Web Information Systems and Technologies

306

representation of the similarity relations between the

articles. Advantages of this hierarchy include

logarithmical complexity, metadata which are

generated using content of the articles and

incremental approach. This is useful if we need real-

time calculation and the metadata provided by the

authors of the news are not sufficient. On the other

hand, a disadvantage is that the tree structure could

not provide relations which are not transitive (i.e.

text similarity of news).

We use properties of the hierarchical

representation in our method. The results thus meet

the requirements of the recommender system.

Hierarchical clustering has low, logarithmical

complexity of storing and retrieving articles. The

hierarchy enables us to discover interests for every

moment using the history of reading.

Our main contribution is utilization of

hierarchical structure, which incrementally generates

metadata about articles. Meta-documents which are

created this way have inheritance relations. These

relations represent similarity between real articles.

The advantages of our recommender systems are

linked to this representation. We are able to discover

user’s interests in real-time, even if we use vast

information space to recommend news.

We focused in our work on real-time content-

based recommending. Our future work includes

considering the context of the user’s interests. We

plan to improve our recommender to consider the

actual interests of a user. We have a presumption

that interests change in time, with location, mood or

emotions. Since we are able to recommend news in

real-time, this is mainly a matter of recognizing the

behavioural patterns and contexts.

ACKNOWLEDGEMENTS

This work was supported by the Scientific Grant

Agency of SR, grants No. VG1/0508/09 and

VG1/0675/11, and it is a partial result of the

Research & Development Operational Program for

the project Support of Center of Excellence for

Smart Technologies, Systems and Services II, ITMS

25240120029, co-funded by ERDF.

REFERENCES

Ahn, J., Brusilovsky, P., Grady, J., He, D., and Syn, S. Y.

2007. Open user profiles for adaptive news systems:

help or harm?. In Proc. of the 16th int. Conf.on World

Wide Web. WWW '07. ACM, New York, NY, 11-20.

Adomavicius, G. and Tuzhilin, A. 2005. Toward the Next

Generation of Recommender Systems. IEEE Trans. on

Knowl. and Data Eng. 17, 6, 734-749.

Barla, M. et al., 2010. News recommendation. In Proc. of

the 9th Znalosti, Jindrichuv Hradec., 171-174.

Billsus, D., Pazzani, M. 2000. User Modeling for Adaptive

News Access. User Modeling and User-Adapted

Interaction, vol. 10, nos. 2-3, (Feb. 2000),147-180.

Bollen, D., Knijnenburg, B. P., & Graus, M. 2010.

Understanding Choice Overload in Recommender

Systems Categories and Subject Descriptors. In Proc.

of 4th ACM Conf. on Recommender Systems.

Barcelona, Spain, 63-70.

Burke, R. 2002. Hybrid Recommender Systems: Survey

and Experiments. User Modeling and User-Adapted

Interaction 12, 4 (Nov. 2002), 331-370.

Carvalho, C., Jorge, A. M., and Soares, C. 2006.

Personalization of E-newsletters Based on Web Log

Analysis and Clustering. In Proc. of the

IEEE/WIC/ACM Int. Conf. on Web intelligence. IEEE

Computer Society, WDC, 724-727.

Gabrilovich, E., Markovitch, S. 2007. Computing

semantic relatedness using Wikipedia-based explicit

semantic analysis. In Proc. of the 20th int. Joint Conf.

on Artificial Intelligence, Hyderabad., India, 1606-

1611.

Ge, M., Delgado-battenfeld, C. 2010. Beyond Accuracy:

Evaluating Recommender Systems by Coverage and

Serendipity. In Proc. of 4th ACM Conf. on

Recommender. Systems. Barcelona, Spain, 257-260.

Jancsary, J., Neubarth, F., Trost, H. 2010. Towards

Context-Aware Personalization and a Broad

Perspective on the Semantics of News Articles. In

Proc. of 4th ACM Conf. on Recommender Systems,

Barcelona, Spain, 289-292.

Kroha, P.,Baeza-Yates, R., 2005. News classification

based on term frequency. In Proc. of the 16th Conf. on

Database and Expert Sys. Apps, 428–432.

Kompan, M., Bieliková, M., 2010. Content-Based News

Recommendation. In Proc. of the 11th Conf. EC-WEB.

Springer-Verlag, Bilbao, Spain, 61-72.

Mooney, R. J. and Roy, L. 2000. Content-based book

recommending using learning for text categorization.

In Proc. of the 5th Conf. on Digital Libraries, TX,

USA, 195-204.

Sahoo, N., Callan, J., Krishnan, R., Duncan, G., Padman,

R. 2006. Incremental hierarchical clustering of text

documents. In Proc. of the 15th ACM int. Conf. on

Information and knowledge management, NY, USA,

357-366.

Su, X. and Khoshgoftaar, T. M. 2009. A survey of

collaborative filtering techniques. Adv. in AI, 36-55.

Suchal, J., Navrat, P. 2010. Full text search engine as

scalable k-nearest neighbor recommendation system.

In Proc. of the AI in Theory and Practice 2010. WCC.

IFIP AICT 331, Springer, Boston, 165-173.

Tintarev, N., Masthoff, J. 2006. Similarity for news

recommender systems. In Proc. of the AH’06

Workshop on Recommender Systems and Intelligent

User Interfaces, 1-8.

NEWS RECOMMENDING BASED ON TEXT SIMILARITY AND USER BEHAVIOUR

307