Wiki-LDA: A Mixed-Method Approach for Effective Interest Mining on
Twitter Data
Xiao Pu¹,², Mohamed Amine Chatti³, Hendrik Thüs³ and Ulrik Schroeder³

¹École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
²Idiap Research Institute, Martigny, Switzerland
³Informatik 9 (Learning Technologies), RWTH Aachen University, Aachen, Germany
Keywords: Learning Analytics, Educational Data Mining, Personalization, Adaptation, Learner Modelling, Interest Mining, Topic Modelling, Twitter.
Abstract: Learning analytics (LA) and educational data mining (EDM) have emerged as promising technology-enhanced learning (TEL) research areas in recent years. Both areas deal with the development of methods that harness educational data sets to support the learning process. A key area of application for LA and EDM is learner modelling. Learner modelling enables adaptive and personalized learning environments that take into account the heterogeneous needs of learners and provide them with a tailored learning experience suited to their unique needs. As learning increasingly happens in open and distributed environments beyond the classroom, and access to information in these environments is mostly interest-driven, learner interests need to constitute an important learner feature to be modelled. In this paper, we focus on the interest dimension of a learner model and present Wiki-LDA, a novel method to effectively mine users' interests on Twitter. We apply a mixed-method approach that combines Latent Dirichlet Allocation (LDA), text mining APIs, and Wikipedia categories. Wiki-LDA has proven effective at the task of interest mining and classification on Twitter data, outperforming standard LDA.
1 INTRODUCTION
Recently, there has been increasing interest in learning
analytics (LA) and educational data mining (EDM).
LA focuses on the development of methods for an-
alyzing and detecting patterns within data collected
from educational settings, and leverages those meth-
ods to support the learning experience. A system-
atic overview of LA and its key concepts is pro-
vided by (Chatti et al., 2012) and (Chatti et al.,
2014) through a reference model for LA based on
four dimensions, namely data, environments, and context
(what?), stakeholders (who?), objectives (why?), and
methods (how?).
EDM is concerned with developing methods to
explore the unique types of data that come from ed-
ucational settings and with using these methods to bet-
ter understand students and the settings in which they
learn (Romero et al., 2010). From a technical per-
spective, EDM is the application of data mining tech-
niques to educational data (Baker, 2010).
There are many applications or tasks in educa-
tional environments that have been addressed in LA
and EDM research. A key area of application is
learner (student) modelling, as a result of a focus on
adaptive intelligent web-based educational systems,
including intelligent tutoring systems (ITS) and adap-
tive hypermedia systems (AHS) (Baker, 2010; Chatti
et al., 2012; Romero et al., 2010). A learner model
represents information about a learner's characteris-
tics or states, such as knowledge, motivation, meta-
cognition, and attitudes (Baker, 2010). A learner
model is also a representation of information about
an individual learner that is essential for adaptation
and personalization tasks (Chatti, 2010). The six most
popular and useful features in learner modelling in-
clude the learner’s knowledge, interests, goals, back-
ground, individual traits, and context (Brusilovsky
and Millan, 2007). Different data mining techniques
have been used to build a learner model (Romero
et al., 2010). The majority of the proposed ap-
proaches, however, have focused on the modelling of
the learner’s knowledge. This can be explained by
the fact that knowledge has constituted the most im-
portant part of the learner model in ITS and AHS.
In contrast, these systems have paid little attention to
learner's interests.

Pu, X., Chatti, M., Thüs, H. and Schroeder, U. Wiki-LDA: A Mixed-Method Approach for Effective Interest Mining on Twitter Data. In Proceedings of the 8th International Conference on Computer Supported Education (CSEDU 2016) - Volume 1, pages 426-433. ISBN: 978-989-758-179-3. Copyright © 2016 by SCITEPRESS Science and Technology Publications, Lda. All rights reserved.
We believe that future learner modelling applica-
tions will increasingly focus on the interest dimension
of a learner model, as a result of a shift in focus in the
last few years from centralized learning systems (e.g.,
ITS, AHS, LMS) to open and networked learning en-
vironments, such as personal learning environments
(PLEs) and massive open online courses (MOOCs).
These environments deal with large volumes of data
from a wide variety of sources beyond the ITS/LMS.
The data comes from formal as well as informal learn-
ing channels (Chatti et al., 2012). As access to in-
formation in these environments is mostly interest-
driven, learner interests need to constitute an impor-
tant learner feature to be modelled in order to help
learners overcome the information overload problem
as well as to support adaptation, personalization, and
recommendation tasks.
Detecting learner’s interest is also crucial for life-
long learner modelling. (Kay and Kummerfeld, 2011)
define a lifelong learner model as a store for the col-
lection of learning data about an individual learner.
The authors note that to be useful, a lifelong learner
model should be able to hold many forms of learning
data from diverse sources. This data can come in dif-
ferent formats, distributed across space, time, and me-
dia. The capacity to mine learner’s interests across
different learning contexts would provide more ef-
fective personalized learning experiences for lifelong
learners.
Recognizing the importance of the interest dimen-
sion in the learner modelling task, we propose in
this paper an innovative approach to effectively mine
learner’s interests in social networks (e.g. Twitter).
We apply a mixed-method approach that combines
Latent Dirichlet Allocation (LDA), text mining
APIs, and Wikipedia categories.
2 RELATED WORK
Recently, the ability to discover topics or interests of
Internet users from information provided on their per-
sonal profiles on social media has become increas-
ingly important and necessary. In particular, and relevant
to our own work, there have been a few recent and for the
most part different approaches to discovering users' top-
ics of interest on Twitter. Content analysis on Twit-
ter introduces unique challenges to the efficacy of
topic models on short, messy text. Tweets are con-
strained to 140 characters in length and are written
in informal language with misspelling, acronyms and
non-standard abbreviations, unlike the standard writ-
ten English on which many supervised models in ma-
chine learning and natural language processing (NLP)
are trained and evaluated (Mehrotra et al., 2013; Ra-
mage et al., 2010). Hence, effectively modeling con-
tent on Twitter requires techniques that can adapt to
this uncommon data. In the following, we give an
overview of related work in this field of research.
(Michelson and Macskassy, 2010) present a sim-
ple non-machine learning approach to discover Twit-
ter users’ topics of interest by examining the entities
they mention in their tweets. Their approach lever-
ages a knowledge base to disambiguate and catego-
rize the entities in the tweets, and then develops a “topic
profile” which characterizes users' topics of interest,
by discerning which categories appear frequently and
cover the entities. In their work, the goal is to sup-
port clustering and searching of Twitter users based
on their topics of interest. The authors, however, note
that the noisy and ambiguous nature of Twitter makes
finding the entities within the tweets quite challeng-
ing.
(Puniyani et al., 2010) perform an exploratory
analysis of the content of Twitter, using Latent Dirich-
let Allocation (LDA) (Blei et al., 2003) to uncover
latent semantic themes. They show that these latent
topics are predictive of the network structure. The la-
tent topics predict which other microbloggers a user
is likely to follow, and to whom microbloggers will
address messages.
(Mehrotra et al., 2013) state that the application of
standard LDA to Twitter content produces mostly in-
coherent topics. The authors propose that a solution
to this problem is tweet pooling; i.e. merging related
tweets together and presenting them as a single doc-
ument to the LDA model. They investigate different
tweet pooling schemes to improve topics learned from
Twitter content without modifying the basic machin-
ery of LDA. Finally, they compare the schemes and con-
clude that the novel scheme of Hashtag-based pool-
ing leads to drastically improved topic modelling over
Unpooled and other schemes.
(Zhao et al., 2011) note that standard LDA does
not work well with the messy form of Twitter content.
The authors present a Twitter-LDA model slightly dif-
ferent from the standard LDA to discover topics from
a representative sample of the entire Twitter. They
propose to use one topic per tweet, and argue that this
is better than the basic LDA scheme and the author-
topic model. The authors then use the proposed model
to empirically compare the content of Twitter and a
traditional news medium - the New York Times. They
note that Twitter can be a good source of topics that
have low coverage in traditional news media. And al-
though Twitter users show relatively low interests in
world news, they actively help spread news of impor-
tant world events.
(Ramage et al., 2010) propose Labeled LDA (L-
LDA) as a variation of LDA based on a partially su-
pervised learning model. Unlike LDA, which returns
topics that are latent (i.e., simply numbered distri-
butions over words), L-LDA associates a document
with easily-interpretable topics. The authors apply L-
LDA to map the content of the Twitter feed into di-
mensions. These dimensions correspond roughly to
substance, style, status, and social characteristics of
posts.
(Quercia et al., 2012) focus on the task of doc-
ument classification in Twitter (i.e., given a Twitter
profile and a set of possible topics, determine which
topics best fit the profile’s tweets). The authors use
Labeled LDA (L-LDA) and compare it to the com-
petitive baseline of Support Vector Machines (SVM).
They determine the possible topics in the training doc-
uments by using text classification APIs. As a result,
they conclude that L-LDA generally performs as well
as SVM, and it clearly outperforms SVM when train-
ing data is limited, making it an ideal classification
technique for infrequent topics and for (short) pro-
files of moderately active users. L-LDA can accu-
rately classify a profile with topics for which it has
seen only small amounts of training data and greatly
outperforms SVM at determining how similar a pair
of profiles is, implying that L-LDA's inference tech-
niques are preferable to the linear classification of
SVM when dealing with rich, mixed-topic documents
such as Twitter profiles.
This related research suggests a number of inter-
esting methods that could be used for content anal-
ysis on Twitter. However, these methods are only
capable of generating single-word interests. For in-
stance, it is not possible to generate the keyphrase ed-
ucational data mining as a possible interest. Instead,
only single keywords - in our example educational,
data, and mining - could be generated. This is in gen-
eral a key limitation of standard LDA and its varia-
tions in the literature to date. In this paper, we pro-
pose Wiki-LDA as a novel method for significantly im-
proving LDA topic modelling on Twitter. Wiki-LDA
leverages LDA, text mining APIs, and Wikipedia cat-
egories in order to produce meaningful single-word
as well as keyphrase interests and accurately classify
them into related topics.
3 CONCEPTUAL APPROACH
Our overall approach breaks into nine high level steps,
as depicted in Figure 1:
Figure 1: Wiki-LDA: Conceptual Approach.

1. Collect and store tweets from Twitter as a training
set. This was done by crawling tweets from pop-
ular user accounts which are listed under the ma-
jor topic classifications on Twitter. This training
data set was then pre-processed and indexed via
the Lucene¹ text information retrieval library;
2. Transform the text data to a vector representation
and finally to a Mahout²-readable matrix format.
Implement the Latent Dirichlet Allocation (LDA)
algorithm for training. From this process a “bag-
of-words” collection of data is obtained and then
stored in a MySQL database for inference;
3. Implement a dynamic crawling method for test
users, in which their crawled tweets are used for
prediction. Pre-processing of the crawled tweets
was also done with the Lucene library;
4. Implement LDA prediction for each test user
based on the “bag-of-words” result from the LDA
training process to predict possible topic distribu-
tion;
5. Use text analysis APIs to generate keywords and
keyphrases from the test user profiles;
6. Send results from the respective APIs to
Wikipedia in order to obtain all the related cate-
gories of each specific keyword or keyphrase;
¹ http://lucene.apache.org/core
² http://mahout.apache.org/
7. Collect the analyzed categories from Wikipedia
and use them as input for LDA in order to de-
termine the possible topic distribution for each
keyphrase generated from the APIs;
8. Combine all results for each specific topic;
9. Visualize the final results to the user.
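The nine steps above can be sketched as a single prediction routine. The following Python sketch is illustrative only: the function names and toy inputs are our own, not the paper's Java implementation, in which pre-processing is done with Lucene, LDA with Mahout, keyphrase extraction with the text analysis APIs, and category lookup with the Wikipedia API.

```python
def predict_interests(tweets, preprocess, lda_topics,
                      extract_keyphrases, wiki_categories, classify_words):
    """Return {topic: [interests]} for one user (steps 3-8)."""
    words = preprocess(tweets)                          # step 3
    interests = {}
    for word, topic in lda_topics(words):               # step 4
        interests.setdefault(topic, []).append(word)
    for phrase in extract_keyphrases(tweets):           # step 5
        categories = wiki_categories(phrase)            # step 6
        topic = classify_words(categories)              # step 7
        interests.setdefault(topic, []).append(phrase)  # step 8
    return interests                                    # step 9 visualizes this

# Toy stand-ins for the real components:
result = predict_interests(
    ["google releases an android update"],
    preprocess=lambda ts: " ".join(ts).split(),
    lda_topics=lambda ws: [(w, "Sci&Tech") for w in ws
                           if w in {"google", "android"}],
    extract_keyphrases=lambda ts: ["percy harvin"],
    wiki_categories=lambda p: ["American football wide receivers"],
    classify_words=lambda cats: "Sport",
)
print(result)  # {'Sci&Tech': ['google', 'android'], 'Sport': ['percy harvin']}
```

The single-word interests from standard LDA and the keyphrase interests routed through Wikipedia end up in one merged topic-to-interests mapping, which is what the final visualization step consumes.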
4 IMPLEMENTATION
The implementation of our approach can be roughly
divided into three major parts:
Training Set Creation - Crawling of user data from
popular social networks, which in our specific
case was Twitter. This part was implemented via
the Twitter API³ in Java. The API enabled us to
collect user tweets to form both the training and
test data.
Training Process - Training of the LDA machine
learning algorithm using the crawled data.
Prediction Process - Constructing a model to pre-
dict single-word as well as keyphrase interests of
new users.
4.1 Training Set Creation
As a first step in the training phase, we selected the 9
most popular abstract topics as published on the Twit-
ter website. These include: Art & Literature, Busi-
ness, Food & Drink, Government, Health, Science &
Technology, Music, Sport, and Travel. We then crawled
tweets from about 4-5 users in each topic to form
the training set. Due to the limitation of the crawl-
ing API from Twitter, in which the maximum number
of tweets that can be crawled for a single user per re-
quest is restricted to the 20 most recent ones, we manually
crawled about 150-200 additional tweets for each user
over a time span of one month. We chose users for
each topic based on the recommended popular users
in each topic provided by Twitter. For instance, we
chose @FinancialTimes, @BBCBusiness, etc. for the
topic “Business”. Hence, we had a corpus of
about 800-1000 tweets in each topic that could be used
for training.
4.2 Training Process
Our goal was to automatically identify the topics
that Twitter users are interested in based on their
tweets. We mainly used the Latent Dirichlet Alloca-
tion (LDA) (Blei et al., 2003) for this purpose. LDA is
³ https://dev.twitter.com
Table 1: Top 10 words analyzed in our LDA implementation.

Art & Literature: review, books, theater, cartoon, book, novel, art, library, museum, writer
Business: percent, bank, market, business, trade, prices, bills, boss, opinion, financial
Food & Drink: restaurant, food, recipe, dinner, recipes, dish, cooking, cheese, soup, chefs
Government: president, obama, insurance, immigration, economy, care, leaders, government, coverage, enrollment
Music: album, music, rock, songs, rocking, sound, hear, jazz, piano, band
Science & Technology: google, app, search, apple, online, android, startup, computer, internet, update
Sport: game, NBA, team, sport, league, basketball, soccer, trade, lakers, crazy
Travel: travel, traveler, destinations, city, visiting, hotel, tips, beach, passengers, weather
Health: healthy, insurance, calories, risk, care, fats, weight, sleep, cancer, help
an unsupervised machine learning technique to iden-
tify latent topic information from large document col-
lections. It uses a “bag of words” assumption, which
treats each document as a vector of word counts.
Based on this assumption, each document is repre-
sented as a probability distribution over some topics,
while each topic is represented as a probability distri-
bution over a number of words.
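The bag-of-words assumption can be illustrated in a few lines of Python (a toy example, not part of our Java/Mahout pipeline):

```python
from collections import Counter

# Under the bag-of-words assumption, word order is discarded and each
# document is reduced to a vector of word counts.
tweet = "google app google android app google"
counts = Counter(tweet.split())
print(counts["google"], counts["app"], counts["android"])  # 3 2 1
```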
We ran LDA with 1000 iterations of Gibbs sam-
pling with predefined K = 9. The topic-word distribu-
tion was updated in each Gibbs sampling step until
the distribution converged. Table 1 shows
an excerpt of the topic-word distribution that we ob-
tained as a result of the training phase.
4.3 Prediction Process
After training the LDA model using the training set
above, the next step was to predict possible topic dis-
tributions for test users depending on the resulting
topic-word distribution. As pointed out in section 3, a
key limitation of standard LDA is that it does not al-
low generating keyphrase interests. This leads
to an interest list that is less coherent and inter-
pretable. To address this issue, we developed a novel
method, called Wiki-LDA for significantly improving
LDA topic modelling on Twitter. Wiki-LDA extends
the standard LDA by leveraging text analysis APIs
and Wikipedia categories. In the following, we dis-
cuss the Wiki-LDA approach in more detail.
Since Twitter is too sparse for traditional topic
modelling, we followed an author-wise pooling ap-
proach to gather together all the tweets from a given
user into a single document (Mehrotra et al., 2013).
Our model thus learns the latent topics that characterize users, rather than tweets.

Table 2: List of keywords extracted from APIs for @google.

AlchemyAPI: android, google, disney, googleio, techcrunch, seattle, googleplay, asia, googlesearch, obama, percy harvin
OpenCalais: System software, Software, PlayOnLinux, Web 2.0, Cloud clients, Cross-platform software, Embedded Linux, Smartphone, Google, Gmail, Android
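Author-wise pooling can be sketched as follows; the tweets here are toy stand-ins, and in our system the pooled documents are subsequently pre-processed with Lucene:

```python
from collections import defaultdict

# Merge all tweets of each author into one document, so the topic model
# learns per-user topic mixtures rather than per-tweet ones.
tweets = [("@google", "new android update"),
          ("@NBA", "game tonight"),
          ("@google", "search tips for travelers")]

pooled = defaultdict(list)
for user, text in tweets:
    pooled[user].append(text)
documents = {user: " ".join(parts) for user, parts in pooled.items()}

print(documents["@google"])  # new android update search tips for travelers
```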
We then applied online text analysis tools to the
collected tweets of the test user. We used the Alchemy
API⁴ and the OpenCalais API⁵ to extract possible key-
words and keyphrases from the tweet data. Table
2 presents the results extracted for Twitter user
“@google” by the mentioned text analysis tools.
The next step was to classify the extracted key-
words and keyphrases into related topics using LDA.
This would be a straightforward exercise if the
keyphrases contained words which exist in the
LDA training results, but the process presents
problems if the words are totally new to LDA.
Our aim was to increase the probability that a gen-
erated keyword or keyphrase is accurately classified
by LDA. To achieve this, we used the Wikipedia API,
which provides all possible categories corresponding
to a particular keyword or keyphrase. After crawling
all possible categories based on a given keyword or
keyphrase query, we collect these categories and use
them as input for LDA.
Figure 2: Sample classification process for keyphrase
”percy harvin”.
Figure 2 illustrates a sample process for the clas-
sification of the extracted keyphrase “percy harvin”
(i.e., a keyphrase generated by Alchemy API for Twit-
ter user “@google”) by a combination of Wikipedia and
LDA. The complete classification procedure works as
follows:
1. If the system finds extracted keywords/keyphrases
⁴ http://www.alchemyapi.com/api/
⁵ http://www.opencalais.com/documentation/opencalais-documentation
Table 3: Comparison of classification results with standard LDA and Wiki-LDA.

Standard-LDA - most related topic: Sci. & Tech.; classified words in this topic: google, coming
Wiki-LDA - most related topic: Sci. & Tech.; classified words in this topic: System software (OpenCalais), Smartphone (OpenCalais), PlayOnLinux (OpenCalais), google (AlchemyAPI), googleio (AlchemyAPI), techcrunch (AlchemyAPI), coming (LDA), google (LDA)
(in our example “percy harvin”) from text analysis
APIs which cannot be analyzed by the original LDA,
it automatically inputs these keywords/keyphrases
to the Wikipedia API;
2. The Wikipedia API returns all categories asso-
ciated with “percy harvin” to the system. Here
the Wikipedia categories associated with “percy
harvin” include: “American football wide re-
ceivers”, “Sports clubs established in 1961”,
“Sports in Minneapolis Minnesota”, etc.;
3. The system receives these categories and splits
them into single words; in our example the collec-
tion of all words for “percy harvin” is: [“Amer-
ican”, “football”, “wide”, “receivers”, “sports”,
“clubs”, “established”, etc.];
4. The system uses this collection of words as input
to LDA. After calculation, LDA gives one topic
distribution for each word. Here, for the word
collection derived from the categories of “percy
harvin”, the distribution is: Sports 0.67, Govern-
ment 0.03, Music 0.13, etc.;
5. Finally, we choose the topic with the highest prob-
ability from the distribution provided by LDA (in
our example “Sports”) as the possible topic of the
original input keyphrase “percy harvin”.
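The classification procedure can be sketched in a few lines of Python. The topic-word probabilities below are invented toy values (in the real system they come from the trained LDA model), and the additive scoring is a crude stand-in for LDA inference:

```python
# Toy topic-word probabilities (invented; stand-in for the trained model).
TOPIC_WORDS = {
    "Sports":     {"sports": 0.30, "football": 0.25, "clubs": 0.10},
    "Government": {"president": 0.30, "government": 0.25},
    "Music":      {"music": 0.40, "album": 0.20},
}

def classify_keyphrase(categories):
    # Step 3: split the Wikipedia categories into single words.
    words = [w.lower() for cat in categories for w in cat.split()]
    # Step 4: score each topic for the bag of category words (a crude
    # stand-in for running LDA inference on the word collection).
    scores = {topic: sum(dist.get(w, 0.0) for w in words)
              for topic, dist in TOPIC_WORDS.items()}
    # Step 5: choose the topic with the highest score.
    return max(scores, key=scores.get)

categories = ["American football wide receivers",
              "Sports clubs established in 1961",
              "Sports in Minneapolis Minnesota"]
print(classify_keyphrase(categories))  # Sports
```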
Table 3 shows a sample result of the related keywords
and keyphrases for the topic “Science & Technology”
from user “@google”. The standard LDA algorithm
could only analyze single keywords (e.g., “coming”,
“google”). The term extraction step using the
Alchemy API and OpenCalais API resulted in more
keywords and keyphrases that couldn't be directly an-
alyzed by standard LDA (e.g., “PlayOnLinux”, “Sys-
tem software”). Harnessing Wikipedia categories as
explained above has led to an accurate classification
of these keywords and keyphrases into the topic “Sci-
ence & Technology”. The analysis and classification
results are visualized through a graphical user inter-
face, as depicted in Figure 3.
The complete process of the Wiki-LDA approach
for interest mining and classification in Twitter is
shown in Figure 4. The system uses the Twitter API
Figure 3: Visualization of interests related to topic Sci&Tech for test user @google.

Figure 4: Overall prediction process for test user @google (topic distribution from LDA and from Wiki-LDA).
to collect the tweets of a user. After a pre-processing
step which uses the Lucene library for tokenization,
removal of stop words, and stemming of the input
data, the system uses the result data set simultane-
ously as input for the LDA prediction algorithm and
the text extraction APIs. The standard LDA predic-
tion part produces the topic distribution for the user
based on the input data. The APIs extract keywords
and keyphrases which are then used by the system
as input for the Wikipedia API to gather all possible
categories for each extracted keyword and keyphrase.
The bag of category words are then given to LDA
again in order to determine the possible topic for each
extracted keyword and keyphrase as discussed in the
example above. The analysis results from standard
LDA and Wiki-LDA are then merged into a single in-
terest list representing the final topic distribution for
the Twitter user.
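The pre-processing step can be illustrated with a toy stand-in for Lucene's analyzers; the stop-word list and the naive suffix-stripping "stemmer" below are simplifications of our own, not Lucene's actual analysis chain:

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}

def preprocess(text):
    tokens = re.findall(r"[a-z0-9@#']+", text.lower())        # tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]       # stop words
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t   # naive "stem"
            for t in tokens]

print(preprocess("The latest updates of the Android apps"))
# ['latest', 'update', 'android', 'app']
```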
Figure 5 depicts a comparison between the anal-
ysis and classification results generated with standard
LDA and Wiki-LDA. It shows that Wiki-LDA extracts
and accurately classifies more interest key-
words and keyphrases compared to the naive appli-
cation of LDA.
Figure 5: Comparison of analysis and classification results
with standard LDA and Wiki-LDA for test user @google.
5 EXPERIMENTAL EVALUATION
In this section we describe the details of the experi-
ment conducted to gauge the quality of the analysis
and classification results achieved by Wiki-LDA. The
experimental evaluation was performed through quan-
titative measures as well as personal interviews with
Twitter users.
5.1 Classification Evaluation
We selected four Twitter users for evaluation of the
Wiki-LDA approach, as shown in Table 4.
We ran the Wiki-LDA algorithm over the Tweets
of the four users and extracted the possible topic dis-
tribution for each test user, with related keywords and
Table 4: Tweets from test users.

@Oldaily: 128 collected tweets
@BarackObama: 141 collected tweets
@DailyHealthTips: 320 collected tweets
@NBA: 112 collected tweets
keyphrases. To evaluate the performance of Wiki-
LDA, we manually computed the precision and recall
of topics for each test user, where precision is the percent-
age of the extracted interests that are indeed correct,
and recall is the percentage of the relevant
interests that are correctly extracted. We then com-
bined precision and recall in the composite metric of
the F-measure (or F1 score): F1 = (2 · precision · recall) / (precision + recall). Ta-
ble 5 summarizes a comparison between the F1 scores
achieved by standard LDA and Wiki-LDA. The eval-
uation shows that Wiki-LDA has led to significantly
improved interest mining results on the Twitter data
used for our experiment.
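As a worked example of the F1 computation, with invented counts (the experiment reports only the final scores):

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Suppose 9 of 12 extracted interests are correct, out of 11 relevant ones:
precision = 9 / 12   # fraction of extracted interests that are correct
recall = 9 / 11      # fraction of relevant interests that were extracted
print(round(f1(precision, recall), 3))  # 0.783
```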
Table 5: Classification Evaluation (F1 score).

@oldaily: Standard-LDA .696, Wiki-LDA .918
@DailyHealthTips: Standard-LDA .936, Wiki-LDA .979
@BarackObama: Standard-LDA .931, Wiki-LDA .985
@NBA: Standard-LDA .746, Wiki-LDA .850
In order to show the improvement of results more
clearly, Table 6 shows the topic with the high-
est probability for each test user, and the top-5 relevant
words extracted by both original LDA and
Wiki-LDA. From the results, we can see that, in the
prediction part of the system, the Wiki-LDA model
can correctly analyze and classify not only single key-
words, but also keyphrases, thus making the interest
mining task more accurate and meaningful.
Table 6: Extraction and classification results with standard LDA and Wiki-LDA.

@oldaily (most related topic: Sci.&Tech.) - Standard-LDA: google, learn, online, create, research; Wiki-LDA: Online education, Educational software, George Siemens, E-learning, learn
@BarackObama (Government) - Standard-LDA: coverage, insurance, senate, act, covered; Wiki-LDA: Patient Protection, Affordable care act, primary campaign, health, president
@DailyHealthTips (Health) - Standard-LDA: foods, worst, diet, hair, healthy; Wiki-LDA: hair care, baldness, human skin color, human skin, weight
@NBA (Sport) - Standard-LDA: game, score, NBA, season, lakers; Wiki-LDA: National basketball association, Cleveland Cavaliers, lakers, game
5.2 Personal Interviews
Personal interviews were conducted with four Twit-
ter users. Table 7 shows an excerpt of the interests
of each user extracted by both standard LDA and
Wiki-LDA.

Table 7: Extracted interests for test users.

@sadiksha (hobby: Computer Science) - Standard-LDA: comming, google, photo, ...; Wiki-LDA: google, PlayOnLinux, System software, ...
@Xia41258659 (hobbies: Cuisine, Travelling, Music) - Standard-LDA: restaurant, around, food, ...; Wiki-LDA: laura jansen, Matcha, dinner, ...
@LZYuan 1981 (hobbies: Reading, Travelling, Music) - Standard-LDA: case, books, writer, ...; Wiki-LDA: Culture, library, song, ...
@vividxiao (hobbies: Music, Travelling) - Standard-LDA: listening, rock, deal, ...; Wiki-LDA: Pairs, Ladygaga, piano, ...

The list of interests generated by Wiki-
LDA included not only keywords but also keyphrases,
in addition to keywords which did not appear in the
training set. These interests were presented to the
users who were asked to gauge the consistency of
the results. In general, the user feedback was that
the interests generated by Wiki-LDA are more accu-
rate, meaningful, and coherent than those generated
by standard LDA. This result was further confirmed
by the computation of F1 score based on the users’ re-
sponses, as summarized in Table 8. Overall, the eval-
uation results indicate that the Wiki-LDA model is a
better choice than standard LDA for interest mining
on Twitter data.
Table 8: Results from personal interviews (F1 score for interest analysis).

@Sadiksha: Standard-LDA .143, Wiki-LDA .571
@Xia41258659: Standard-LDA .574, Wiki-LDA .857
@LZYuan 1981: Standard-LDA .588, Wiki-LDA .824
@vividxiao: Standard-LDA .256, Wiki-LDA .749
6 CONCLUSION AND FUTURE
WORK
Learning Analytics (LA) and Educational Data Min-
ing (EDM) are concerned with developing methods
for exploring data coming from educational environ-
ments to resolve educational research issues. Learner
modelling is a crucial task in these emerging research
areas. In this paper, we focused on the interest di-
mension in the learner modelling task, which is cru-
cial in today’s learning environments characterized by
openness and autonomy. We presented the concep-
tual, implementation, and evaluation details of Wiki-
LDA, as a mixed-method interest mining approach
that combines Latent Dirichlet Allocation (LDA), text
extraction APIs, and Wikipedia categories in order to
effectively mine users' interests on Twitter. Through
the combination of machine learning, information re-
trieval, and knowledge bases, we were able to mit-
igate the obvious limitation of the small size of the
training data set and to extract not only keywords but
also keyphrases as possible interests.
Overall, the evaluation results showed that Wiki-
LDA clearly outperforms standard LDA in terms of
the meaningfulness and coherence of the extracted in-
terests as well as the accuracy of the classification of
the interests in related topics. Hence, this work pro-
vides a novel method for significantly improving in-
terest mining on Twitter data.
While our early results are encouraging for gen-
erating the interest profile of a Twitter user, there are
still a number of areas we would like to improve. The
first, and most important, is defining a larger train-
ing corpus, which is crucial for a machine learning
task. We have crawled tweets from 3-4 user accounts
from Twitter for each abstract topic as a training set. A
logical next step is hence to gather many
more tweets from more users, and to broaden the range
of possible abstract topics in order to classify more
latent words.
Moreover, the Wiki-LDA algorithm still has room
for improvement. One technical limitation of LDA
is the need to fix the number of topics K be-
fore learning. To improve on this, one can consider
letting K be effectively infinite in LDA and determining
the number of topics through a separate learning
process.
Another important area to improve is our evalu-
ation. We plan to perform a larger scale experiment
in a real learning environment which will allow us to
thoroughly evaluate our interest mining approach.
ACKNOWLEDGEMENTS
The first author acknowledges the sup-
port of the Swiss National Science Founda-
tion through the MODERN Sinergia Project
(www.idiap.ch/project/modern).
REFERENCES
Baker, R. (2010). Data mining for education. International
Encyclopedia of Education, 7:112–118.
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent
dirichlet allocation. Journal of Machine Learning Re-
search, 3:993–1022.
Brusilovsky, P. and Millan, E. (2007). User models for
adaptive hypermedia and adaptive educational sys-
tems. In Brusilovsky, P., Kobsa, A., and Nejdl, W., ed-
itors, The Adaptive Web, LNCS 4321, chapter 1, pages
3–53. Springer-Verlag Berlin Heidelberg.
Chatti, M. A. (2010). The laan theory. In Personalization
in Technology Enhanced Learning: A Social Software
Perspective, pages 19–42. Aachen, Germany: Shaker
Verlag.
Chatti, M. A., Dyckhoff, A. L., Schroeder, U., and Thüs,
H. (2012). A reference model for learning analytics.
International Journal of Technology Enhanced Learn-
ing, 4(5/6):318-331.
Chatti, M. A., Lukarov, V., Thüs, H., Muslim, A., Yousef,
A. M. F., Wahid, U., Greven, C., Chakrabarti, A., and
Schroeder, U. (2014). Learning analytics: Challenges
and future research directions. e-learning and educa-
tion journal (eleed), 10.
Kay, J. and Kummerfeld, B. (2011). Lifelong learner mod-
eling. In Durlach, P. J. and Lesgold, A. M., editors,
Adaptive Technologies for Training and Education,
pages 140–164. Cambridge University Press.
Mehrotra, R., Sanner, S., Buntine, W., and Xie, L. (2013).
Improving lda topic models for microblogs via tweet
pooling and automatic labeling. In Proceedings of
the 36th international ACM SIGIR conference on
Research and development in information retrieval,
pages 889–892. ACM.
Michelson, M. and Macskassy, S. A. (2010). Discovering
users’ topics of interest on twitter: a first look. In
Proceedings of the fourth workshop on Analytics for
noisy unstructured text data, pages 73–80. ACM.
Puniyani, K., Eisenstein, J., Cohen, S., and Xing, E. P.
(2010). Social links from latent topics in microblogs.
In Proceedings of the NAACL HLT 2010 Workshop on
Computational Linguistics in a World of Social Media,
pages 19–20. Association for Computational Linguis-
tics.
Quercia, D., Askham, H., and Crowcroft, J. (2012). Tweet-
lda: supervised topic classification and link prediction
in twitter. In Proceedings of the 3rd Annual ACM Web
Science Conference, pages 247–250. ACM.
Ramage, D., Dumais, S. T., and Liebling, D. J.
(2010). Characterizing microblogs with topic models.
ICWSM, 10:1–1.
Romero, C., Ventura, S., Pechenizkiy, M., and Baker, R. S.
(2010). Handbook of educational data mining. CRC
Press.
Zhao, W. X., Jiang, J., Weng, J., He, J., Lim, E.-P., Yan, H.,
and Li, X. (2011). Comparing twitter and traditional
media using topic models. In Advances in Information
Retrieval, pages 338–349. Springer.