EVALUATION OF COLLABORATIVE FILTERING ALGORITHMS
USING A SMALL DATASET
Fabio Roda, Leo Liberti
LIX, École Polytechnique, 91128 Palaiseau, France
Franco Raimondi
School of Engineering and Information Sciences, Middlesex University, London, U.K.
Keywords:
Recommender systems, TMW, KNN.
Abstract:
In this paper we report our experience in implementing three collaborative filtering algorithms (user-based k-nearest neighbour, Slope One and TMW, our original algorithm) to provide a recommendation service on an existing website. We compare them by means of a typical metric, namely accuracy (RMSE). Evaluations of these kinds of algorithms are usually carried out off-line, withholding values from a dataset and trying to predict them using the remaining portion of the dataset (the so-called "leave-n-out" approach). We instead adopt a "live" method on an existing website: when a user rates an item, we also store in parallel the predictions of the algorithms for the same item. We obtained some unexpected results. In the following sections we describe the algorithms, the benchmark and the testing method, and discuss the outcome of this exercise. Our contribution is a report on the initial phase of a Recommender Systems project, with a focus on some possible difficulties in the interpretation of the initial results.
1 INTRODUCTION
Recommender systems have become an important research area in the field of information retrieval; many approaches have been developed in recent years and many empirical studies have appeared. The evaluation of most works has been carried out using "artificial" datasets provided by well-known research groups, such as GroupLens, or by the Netflix prize. This situation standardizes to some extent the way results are evaluated, but it can also lead to tuning algorithms to work better with datasets that have specific features. In general, from the practitioner's point of view, there is no precise delimitation of the domain of applicability of the algorithms, and it is not entirely clear what the minimal dataset size for a given algorithm is. Small and poor datasets can "waste" good algorithms; in addition, most algorithms are likely to have very similar performance on such datasets, thus rendering initial design choices very difficult. The aim of this paper is to provide elements of discussion on this subject.
To ground our claims we make use of a concrete online application which suggests places in the real world (restaurants, bars, shops, etc.) to registered users, trying to match users' tastes by digging into the community's network of friends and places to get the best possible responses. Using an accepted terminology (Vozalis and Margaritis, 2003), we can classify the application as a Collaborative Filtering System. We have implemented our in-house algorithm for this application, called TMW, which tries to overcome the issue of fragmented information, and we have compared it to two classical algorithms: user-based k-nearest neighbour and Slope One. These three algorithms constitute the basis for the experimental evaluation presented in this paper.
The remainder of this paper is organised as follows. In Section 2 we describe the three algorithms. In Section 3 we report the experimental results obtained when running these algorithms in a production environment. In Section 4 we discuss the results.
2 BACKGROUND
In this section we describe the algorithms used for the
evaluation: k-nearest neighbour (KNN), Slope One,
and TMW. For a given user u and a given item (place)
i we denote with Pr(u, i) the outcome of the algorithm
(the prediction).
2.1 KNN
In the well-known KNN approach to collaborative filtering (Breese et al., 1998), input data are expressed by a matrix R, where each entry r_{ui} represents the rating of user u for item i. Rows and columns (representing users or items) are compared to identify "similarities". In particular, in the so-called "user-based" approach we look for correlations between users. Users who rate the same items in a similar way form a "neighbourhood" (we call them "mentors") that we can use to estimate a prediction. We use a to identify the "active user" (the one whose ratings we want to anticipate) and sim_{au} to denote the similarity between the active user a and a generic user u. The simplest way to use the mentors' ratings is a weighted average, so the prediction function is:

prediction(a, i) = \frac{\sum_u r_{ui} \, sim_{au}}{\sum_u |sim_{au}|}
There are many ways to compute the similarity
between users, for example, a well-established tech-
nique is “Cosine Similarity”. The similarity between
two users, u, and a, is obtained by calculating the co-
sine of the angle formed between the corresponding
vectors (rows in the matrix).
cosine\_similarity(u, a) = \frac{\sum_i r_{ui} \, r_{ai}}{\sqrt{\sum_i r_{ui}^2} \, \sqrt{\sum_i r_{ai}^2}}
Different similarity criteria can be used; for example, "Pearson's Correlation" is a frequent alternative to cosine similarity. This form of KNN is quite primitive, and several authors have proposed various improvements. However, as the Netflix prize showed, these improvements cannot overturn extremely negative results. We therefore decided to implement two basic versions of KNN to get an idea of their initial performance. In the following, KNN 1 is based on Cosine Similarity and KNN 2 on Pearson's Correlation.
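As an illustration, the following is a minimal sketch of KNN 1 (the weighted-average prediction with cosine similarity); the dense matrix layout, the use of 0 to mark missing ratings, and all function names are assumptions of ours, not the implementation used on the website.

import numpy as np

def cosine_similarity(R, a, u):
    # Cosine of the angle between the full rating vectors of users a and u;
    # unrated items are stored as 0 and do not contribute to the dot product.
    ra, ru = R[a], R[u]
    norm = np.linalg.norm(ra) * np.linalg.norm(ru)
    return float(ra @ ru) / norm if norm > 0 else 0.0

def predict(R, a, i):
    # Weighted average of the other users' ratings for item i,
    # weighted by their similarity to the active user a.
    num = den = 0.0
    for u in range(R.shape[0]):
        if u == a or R[u, i] == 0:   # skip the active user and non-raters
            continue
        s = cosine_similarity(R, a, u)
        num += R[u, i] * s
        den += abs(s)
    return num / den if den > 0 else 0.0

# Toy example: predict user 0's rating for item 2.
R = np.array([[5, 3, 0],
              [4, 3, 2],
              [1, 5, 4]], dtype=float)
print(predict(R, a=0, i=2))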
2.2 Slope One
The Slope One algorithm compares similar items rather than similar users and determines the "popularity differential" between them. Essentially, it calculates the difference δ between the average ratings of two items, j_1 and j_2, and then uses this value to predict the rating for j_2 of a given user u_i in the following way:

prediction(u_i, j_2) = r_{i1} + \delta
If we look at this process in terms of linear regression, we can say that, instead of applying a transformation of the form f(x) = ax + b, it uses the reduced form f(x) = x + b. When we have more than one known r_i we can get several predictions, and the proposed best one is the average of them all:

prediction = \frac{\sum_i (r_{i1} + \delta)}{|R_i|}

where R_i is the set of all relevant items. It has been shown in (Lemire and Maclachlan, 2005) that this algorithm can provide good accuracy.
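A minimal sketch of this scheme follows, assuming ratings are stored as a {user: {item: rating}} dictionary; the data layout and function names are ours.

from collections import defaultdict

def average_deltas(ratings):
    # deltas[j][k]: average of (r_uj - r_uk) over users who rated both j and k.
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for user_ratings in ratings.values():
        for j in user_ratings:
            for k in user_ratings:
                if j != k:
                    sums[j][k] += user_ratings[j] - user_ratings[k]
                    counts[j][k] += 1
    return {j: {k: sums[j][k] / counts[j][k] for k in sums[j]} for j in sums}

def predict(ratings, deltas, user, target):
    # Average of (r_uk + delta[target][k]) over the user's known items k.
    known = [k for k in ratings[user]
             if k != target and k in deltas.get(target, {})]
    if not known:
        return None
    return sum(ratings[user][k] + deltas[target][k] for k in known) / len(known)

ratings = {"u1": {"a": 5, "b": 3}, "u2": {"a": 4, "b": 2, "c": 4}}
deltas = average_deltas(ratings)
print(predict(ratings, deltas, "u1", "c"))   # -> 5.0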
2.3 TMW
The idea underlying TMW is to form and maintain a heterogeneous graph whose nodes are users and places. This idea was proposed by (Aggarwal et al., 1999), and in (Huang et al., 2004) the authors propose using a "two-layer graph" (one layer for users and one for items). Since this approach is not mainstream in the recommender systems community, we briefly describe the basic ideas of our TMW algorithm below (details can be found in our previous work (Roda et al., 2009)).
2.3.1 Formalization
Let U (for users) and P (for places) be two finite sets and V = U ∪ P the resulting vertex set. We introduce the following two graphs:

R = (V, A) is a bipartite digraph; A ⊆ U × P is weighted by a function ρ : A → [−1, 1] (weights represent ratings).

S = (U, B) is a graph weighted by a function γ : B → [0, 1] (it represents the known social network, and weights represent the "trust" between users).

G = R ∪ S is the union of the two graphs, which models a mixed network (ratings and social relationships). It is the basis for inferring new relations and modifying existing ones, exploiting the fact that users sometimes like (almost) the same places in (almost) the same way. As the system acquires more information (i.e. new ratings are expressed by old and new users), new relations can be established and the social network grows (formally, G evolves into a richer graph G′). In order to produce a prediction for a given user and a given place, we have to find the best "stream of information" in this augmented and dynamic network.
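The sketch below shows one possible in-memory representation of the mixed network G; the class and method names are illustrative assumptions of ours and only mirror the formalization above.

from dataclasses import dataclass, field

@dataclass
class MixedNetwork:
    # R: bipartite rating arcs, weighted by rho : A -> [-1, 1]
    ratings: dict = field(default_factory=dict)   # (user, place) -> rho
    # S: social edges, weighted by gamma : B -> [0, 1]
    trust: dict = field(default_factory=dict)     # frozenset({u, v}) -> gamma

    def rate(self, user, place, rho):
        assert -1.0 <= rho <= 1.0, "ratings live in [-1, 1]"
        self.ratings[(user, place)] = rho

    def befriend(self, u, v, gamma):
        assert 0.0 <= gamma <= 1.0, "trust lives in [0, 1]"
        self.trust[frozenset((u, v))] = gamma

g = MixedNetwork()
g.rate("alice", "cafe_1", 0.8)
g.befriend("alice", "bob", 0.6)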
2.3.2 Identification of Maximum Confidence
Paths
We treat this problem as an optimization problem and consider this situation as a stream of information which flows in the social network, constrained by the level of trust users share with each other. Intuitively, we look for the best path between the active user and the best mentor. We make the assumption that the confidence of a path is defined by its lowest-confidence arc, thus trusting the "word of mouth" only if it flows through trusted users, and we discard a path if it contains an "unreliable" agent.
The problem reduces to finding the maximum confidence path between a user and the best mentor: this is the same as finding a path whose arc of minimum weight is maximum among all the path minima, which is precisely the so-called maximum capacity path problem. Algorithms are known that solve this problem in time linear in the number of arcs (Punnen, 1991). When the best path is found, we take the last-but-one vertex m_{best} as the best mentor and the weight of the very last arc (i.e., its rating) as the prediction.
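The sketch below illustrates the bottleneck computation with a simple Dijkstra-style variant using a max-heap; it is not the linear-time algorithm of (Punnen, 1991), and the graph encoding and names are assumptions of ours.

import heapq

def widest_path_capacities(graph, source):
    # graph: {node: {neighbour: trust}}. For every node reachable from
    # source, compute the largest achievable bottleneck (minimum-arc) trust.
    best = {source: float("inf")}
    heap = [(-best[source], source)]          # max-heap via negated capacities
    while heap:
        neg_cap, node = heapq.heappop(heap)
        cap = -neg_cap
        if cap < best.get(node, 0.0):
            continue                          # stale heap entry
        for nxt, trust in graph.get(node, {}).items():
            bottleneck = min(cap, trust)      # path confidence = weakest arc
            if bottleneck > best.get(nxt, 0.0):
                best[nxt] = bottleneck
                heapq.heappush(heap, (-bottleneck, nxt))
    return best

trust = {"a": {"b": 0.9, "c": 0.4}, "b": {"c": 0.7}, "c": {}}
print(widest_path_capacities(trust, "a"))     # {'a': inf, 'b': 0.9, 'c': 0.7}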
In parallel, we compute the value of the bottleneck, which represents a trust level and allows us to choose trusted recommendations. When the trust level is lower than a certain threshold, we replace the TMW prediction with a value computed as the linear regression between the user's ratings and the average community rating.
3 EVALUATION OF THE
ALGORITHMS
We use a simple metric to evaluate our system, in order to have clear results that are easy to communicate to non-technical stakeholders. Given the prominence of the Netflix competition, we chose the same metric used there: accuracy computed as the square root of the averaged squared difference between each prediction and the actual rating (the root mean squared error, or "RMSE"). Let r_{ui} denote the actual rating provided by a certain user u for an item i, with i = 1, 2, ..., n_u (n_u ≤ n, where n is the number of all available items), and let p_{ui} denote the prediction generated by a certain algorithm for the same user and the same item. The RMSE relating to user u is defined by:

RMSE_u = \sqrt{\frac{\sum_{i=1}^{n_u} (r_{ui} - p_{ui})^2}{n_u}}

The total RMSE is obtained as the average of the RMSE over all users.
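As a concrete illustration, here is a minimal sketch of this computation; the log layout ({user: [(rating, prediction), ...]}) and the function names are assumptions of ours.

from math import sqrt

def user_rmse(pairs):
    # Root mean squared error over one user's (rating, prediction) pairs.
    return sqrt(sum((r - p) ** 2 for r, p in pairs) / len(pairs))

def total_rmse(log):
    # Average of the per-user RMSE values, as defined above.
    return sum(user_rmse(pairs) for pairs in log.values()) / len(log)

log = {"u1": [(4.0, 3.5), (2.0, 2.5)], "u2": [(5.0, 4.0)]}
print(total_rmse(log))   # (0.5 + 1.0) / 2 = 0.75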
As mentioned above, the evaluation of RMSE is typically performed using the "leave-n-out" approach, where a part of the dataset is hidden and the rest is used as a training set for the recommender, which tries to predict the withheld ratings. Here we employ a different method, made possible by the opportunity to record ratings and predictions live. TMW is currently used as the main recommendation algorithm for a website to which we have authorized access; in this way we do not need to simulate the prediction process, because we can compare predictions with real ratings from users. On the website, users enter ratings for places (such as restaurants, bars, etc.) on a scale from 0 to 5, and can receive recommendations based on their voting history.
For our experiment we proceed as follows: every time a user provides a rating, we calculate the predictions of all three algorithms described above using the entire dataset (with the exception of the rating just entered), and store all the prediction results. Basically, for every rating R(u_i, p_k), we record a line in a log file with the predictions for that item and user produced by TMW, KNN 1 (using cosine similarity), KNN 2 (using Pearson correlation) and Slope One, respectively, as shown in Table 1.
Table 1: Log line for the experiment.

rating (real)    TMW        KNN 1        KNN 2        Slope 1
R(u_i, p_k)      Pr_{tmw}   Pr_{knn_1}   Pr_{knn_2}   Pr_{sl1}
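The logging step itself can be sketched as follows; the predictor interface, the CSV format and the file name are all assumptions of ours, not the website's actual code.

import csv

ALGORITHMS = ("TMW", "KNN 1", "KNN 2", "Slope 1")

def log_rating(user, place, rating, predictors, path="predictions.log"):
    # predictors: {"TMW": fn, ...}, each fn(user, place) -> predicted rating,
    # computed on the dataset without the rating being logged.
    row = [user, place, rating] + [predictors[name](user, place)
                                   for name in ALGORITHMS]
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(row)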
As a benchmark against which to evaluate all the algorithms, we employ the community average for a given item, with the aim of measuring how much each algorithm improves on the simple community recommendation. Thus, we also compute the RMSE of the community recommendation with respect to the actual ratings provided by users.
The results reported in the following tables refer to the analysis we performed using the ratings collected over the first four months of activity of the website. We have 315,463 ratings, entered by 69,794 users on 147,319 items (places). Table 2 reports the overall RMSE computed over this period.
Table 2: RMSE for all ratings.
comm. TMW KNN 1 KNN 2 Slope 1
1.0710 0.9865 0.9872 0.9884 0.9859
In Table 3 we express this result in relative terms by providing the rate of improvement with respect to the community-average predictor: for instance, TMW's RMSE is 7.89% lower than that of the community average.
Table 3: Improvement over community average.

Algorithm    Improvement over community avg.
TMW          7.89 %
KNN 1        7.82 %
KNN 2        7.71 %
Slope One    7.95 %
4 DISCUSSION AND
CONCLUSIONS
A first consideration is that none of the algorithms supplies predictions dramatically more accurate than the community average. Even though it is by now well known (see the Netflix prize results) that an improvement of a few percentage points of accuracy is hard to obtain, in absolute terms the RMSE still seems somewhat excessive (recall that ratings range from 0 to 5). Indeed, in all cases the RMSE is greater than 0.9, representing an average error of the order of 20% of the rating scale.
A second observation is that our results do not clearly show which algorithm works best. TMW provides slightly better results than KNN, but their outcomes are very similar. We find it somewhat surprising that the KNN variant using cosine similarity performs better than the one based on Pearson Correlation, and that Slope One performs better than both of them. Indeed, in line with previous work, we expected KNN based on Pearson Correlation to perform best. We identify two possible reasons. First, we considered a plain version of KNN and did not investigate possible improvements. Second, TMW only looks for the best mentor available, instead of carrying out a full neighbourhood formation, which could pose problems in a sparse dataset.
We suspect that the results obtained in our experiments are due to the structure and size of the dataset. Indeed, the Netflix training dataset consists of 100 million ratings provided by over 480 thousand users on nearly 18 thousand movie titles. GroupLens provides three datasets: one of 100,000 ratings by 943 users for 1,682 movies, another of 1 million ratings by 6,040 users for 3,900 movies, and a third of 10 million ratings and 100,000 tags by 71,567 users for 10,681 movies. The Jester Joke dataset has 4.1 million ratings by 73,496 users on 100 jokes. By contrast, our dataset is sparser, and thus most of the RMSE analyses found in the literature do not apply to our case.
The main issue arising from our experience is that it is not clear what the minimal size of a dataset is for it to be a reliable basis on which to build a test bed. In our opinion this is a very important question, and our impression is that it has been underestimated in the literature. Of course, if a dataset is untrustworthy, an alternative is to use public datasets, but this may be risky: it may not be ideal to tune a system that recommends places on a dataset originating from a system that recommends movies. Additionally, even if the collaborative filtering approach works with generic items, new difficulties may occur later when the system is extended with content-based capabilities.

In conclusion, this experiment has shown that, from the practitioner's point of view, finding the best algorithm is equivalent to finding a reliable dataset and test bed, and that this issue has not been addressed adequately in the literature.
REFERENCES
Aggarwal, C., Wolf, J., Wu, K.-L., and Yu, P. (1999). Horting hatches an egg: a new graph-theoretic approach to collaborative filtering. In Fayyad, U., Chaudhuri, S., and Madigan, D., editors, Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 201–212, New York. ACM.

Breese, J., Heckerman, D., and Kadie, C. (1998). Empirical analysis of predictive algorithms for collaborative filtering. Technical Report MSR-TR-98-12, Microsoft Research.

Huang, Z., Chung, W., and Chen, H. (2004). A graph model for e-commerce recommender systems. Journal of the American Society for Information Science and Technology, 55(3):259–274.

Lemire, D. and Maclachlan, A. (2005). Slope one predictors for online rating-based collaborative filtering. In Proceedings of SIAM Data Mining (SDM'05).

Punnen, A. (1991). A linear time algorithm for the maximum capacity path problem. European Journal of Operational Research, 53:402–404.

Roda, F., Liberti, L., and Raimondi, F. (2009). Combinatorial optimization based recommender systems. In Proceedings of the 8th Cologne-Twente workshop (CTW09) on Graphs and Combinatorial Optimization, Paris.

Vozalis, E. and Margaritis, K. (2003). Analysis of recommender systems algorithms. In Lipitakis, E., editor, The 6th Hellenic European Conference on Computer Mathematics & its Applications, pages 732–745, Athens. Athens University of Economics and Business.