Using Academic Genealogy for Recommending Supervisors

Gabriel Madeira (1, a), Eduardo N. Borges (1, b), Giancarlo Lucca (1, c), Washington Carvalho-Segundo (2, d), Jonata C. Wieczynski (1, e), Helida Santos (1, f) and Graçaliz Dimuro (1,3, g)

(1) Centro de Ciências Computacionais, Universidade Federal do Rio Grande, Rio Grande, RS, Brazil

(2) Instituto Brasileiro de Informação em Ciência e Tecnologia, Brasília, DF, Brazil

(3) Departamento de Estadística, Informática y Matemáticas, Universidad Pública de Navarra, Pamplona, Spain

Keywords: Recommender Systems, Academic Genealogy, Academic Supervising, Nearest Centroid Classification.

Abstract: Selecting an academic supervisor is a complicated task. Masters and Ph.D. candidates usually select the most prestigious universities in a given region, investigate the graduate programs in a research area of interest, and analyze the professors' profiles. This choice is a manual task that requires extensive human effort, and the result is often unsatisfactory. In this paper, we propose a Recommender System that helps a candidate choose an academic supervisor based on the supervisor's academic genealogy. We used metadata of theses and dissertations and applied the nearest centroid model to perform the recommendation. The results showed high precision of the recommendations, which supports the hypothesis that the proposed system is a useful tool for graduate students.

1 INTRODUCTION

One of the first steps in acquiring an academic degree, either Masters or Ph.D., is choosing the theme to be investigated and then an academic supervisor, who will be in charge of helping the student achieve his/her goals. When choosing an academic supervisor, the supervisor's experience in the theme's field of study is significant. However, this choice is not trivial. It should include a thorough analysis of each professor's curriculum, including the list of scientific publications and all theses and dissertations advised, which may be spread across multiple distributed research repositories.

Ray and Marakas (Ray and Marakas, 2007) assert that students' usual criteria are professors' reputation, knowledge, and matching of interests, among others. However, this choice is often made in an unplanned manner, which can become one of the reasons for regret, lack of motivation, and poor quality of research output. The authors proposed an analytical hierarchy process for selecting a thesis supervisor, which shows that the number of theses supervised is the least important criterion for both junior and senior graduate students. Besides matching interests, the professors' social network and relationships with other professors inside and outside the same institute were pointed out as essential criteria.

(a) https://orcid.org/0000-0001-8348-3498

(b) https://orcid.org/0000-0003-1595-7676

(c) https://orcid.org/0000-0002-3776-0260

(d) https://orcid.org/0000-0003-3635-9384

(e) https://orcid.org/0000-0002-8293-0126

(f) https://orcid.org/0000-0003-2994-2862

(g) https://orcid.org/0000-0001-6986-9888

In this paper, we developed a Recommender System (RS) that extracts knowledge from a set of descriptive metadata of theses and dissertations supervised throughout the advisors' careers, considering social aspects extracted from their academic genealogy trees. Our methods can adequately represent the profile and research area of a young professor who has mentored few or no students. Given the title and abstract of a thesis/dissertation proposal, the system returns a ranking of the most compatible advisors for the chosen theme. The major contributions of this paper are the following: a novel content-based recommendation approach for selecting academic supervisors; the use of academic genealogy trees (Sugimoto, 2014) (see an example in Figure 4) to model the supervisors' profiles; and the experimental evaluation of the proposed RS using real data from a networked digital library of theses and dissertations.

Madeira, G., Borges, E., Lucca, G., Carvalho-Segundo, W., Wieczynski, J., Santos, H. and Dimuro, G.

Using Academic Genealogy for Recommending Supervisors.

DOI: 10.5220/0010442608850892

In Proceedings of the 23rd International Conference on Enterprise Information Systems (ICEIS 2021) - Volume 1, pages 885-892

ISBN: 978-989-758-509-8

The experiments were conducted using a dataset

containing more than 79,000 advisors from more than

600,000 theses and dissertations. Our system was

able to recommend the correct advisors, on average,

in the third position of the suggested ranking.

The rest of this paper is organized as follows. Section 2 presents the preliminary concepts necessary to understand our methods. In Section 3, we discuss related work. Section 4 presents our approach to recommending academic supervisors. In Section 5, we discuss the obtained results. Finally, in Section 6, we draw our conclusions.

2 PRELIMINARY CONCEPTS

2.1 Recommender Systems

RSs provide suggestions for information related to several decision-making processes. The recommendations are offered as ranked lists of information items, which are personalized for each user. Besides filtering the most suitable information, RSs organize it with a high probability of relevance based on user preferences and constraints (Ricci et al., 2011).

Among the features pointed out by (Bobadilla et al., 2013) that define an RS, we highlight: type of data, e.g. ratings, content for items, social relationships and location-aware information; filtering algorithm, e.g. content-based, collaborative, context-aware or hybrid; techniques, e.g. probabilistic algorithms and fuzzy models; sparsity level of the database and the desired scalability; objective, i.e. predictions or top-n recommendations; quality evaluation, e.g. novelty, coverage and precision (Ge et al., 2010).

Content-based filtering (Salter and Antonopoulos, 2006) makes recommendations based on the user's past choices, using the similarity between the content of these items and those to be recommended. Demographic filtering (Krulwich, 1997) computes the similarity among users, based on the principle that individuals with common personal attributes will also have common preferences. Examples of these attributes are gender, age, location and language. Collaborative filtering (Bobadilla et al., 2012) allows users to give ratings (explicitly or implicitly) on information items, which can be used to recommend content to other users with similar profiles. Hybrid filtering (Chen et al., 2018) combines multiple filtering algorithms.

In this paper, we adapted a well-known content-based, neighborhood-based recommendation technique (Desrosiers and Karypis, 2011) for suggesting academic supervisors, making use of the content of the descriptive metadata of theses and dissertations and of the advising relationships. We focus on system precision, disregarding scalability, to perform top-n recommendations.

2.2 Vector Space Model

Proposed by Salton in 1968 (Salton, 1968), the Vector Space Model (VSM) is a classic information retrieval model implemented in many search engines. It uses a bag-of-words representation and allows documents to be retrieved in order of their similarity to the query. Let $D$ be a collection of documents represented by vectors of weights associated with the terms contained in the collection vocabulary. The similarity between a query $q$ and a document $d \in D$ is given by Eq. (1), which computes the cosine of the angle between the vectors.

\[
\mathrm{sim}(\vec{q}, \vec{d}) = \cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} \tag{1}
\]

The most common weighting scheme is called TF-IDF, represented by $tf \times idf$ (Manning et al., 2008) and defined by Eq. (2), where $t$ is a term of the vocabulary $V$, $d$ is a document, $n_{t,d}$ is the frequency of the term $t$ in the document $d$, $N$ is the size of the collection, i.e. the number of documents, and $df_t$ is the number of documents containing $t$.

\[
tf_{t,d} \times idf_t = \frac{n_{t,d}}{\sum_{t' \in V} n_{t',d}} \times \log\left(\frac{N}{df_t}\right) \tag{2}
\]
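The two formulas of this subsection can be illustrated with a minimal sketch in plain Python. It is not the implementation used in the paper: the function names `tfidf_vectors` and `cosine`, the toy documents, and the use of the natural logarithm are assumptions made only for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document, following Eq. (2)."""
    n_docs = len(docs)
    # df[t]: number of documents that contain term t
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())  # sum of term frequencies in this document
        vectors.append({
            t: (n / total) * math.log(n_docs / df[t])  # natural log is an assumption
            for t, n in counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors, following Eq. (1)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a zero vector has no direction
    return dot / (norm_u * norm_v)

# Hypothetical toy collection of three tokenized documents
docs = [
    ["fuzzy", "logic", "control"],
    ["fuzzy", "sets", "theory"],
    ["graph", "mining", "theory"],
]
vecs = tfidf_vectors(docs)
```

Documents 0 and 1 share the term "fuzzy", so their similarity is positive, while documents 0 and 2 share no term and have similarity zero.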

A VSM can be efficiently implemented using an inverted index, which maps each term of the vocabulary to a list of postings, each containing the identifier of a document that contains the term and additional information such as frequency.
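As a rough illustration of the inverted index just described, the sketch below stores, for each term, a postings list of (document identifier, term frequency) pairs. The names `build_inverted_index` and `candidates` and the toy documents are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Map each term to a postings list of (doc_id, term_frequency) pairs."""
    index = defaultdict(list)
    for doc_id, tokens in enumerate(docs):
        for term, freq in sorted(Counter(tokens).items()):
            index[term].append((doc_id, freq))
    return dict(index)

def candidates(index, query_tokens):
    """Return the ids of documents sharing at least one term with the query."""
    return sorted({doc_id
                   for term in set(query_tokens)
                   for doc_id, _ in index.get(term, [])})

# Hypothetical toy collection
docs = [
    ["fuzzy", "logic", "fuzzy"],
    ["fuzzy", "sets"],
    ["graph", "mining"],
]
index = build_inverted_index(docs)
```

Only documents returned by `candidates` can have a non-zero cosine similarity with the query, which is why the index avoids scanning the whole collection.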

2.3 Nearest Centroid Classiﬁcation

Nearest Centroid (Tibshirani et al., 2002; Manning et al., 2008) is a model that classifies test samples according to their distance to the centroids of the data classes. For text classification, let $n$ be the number of documents in a set $D$, and let $s_i$, $1 \le i \le n$, be a sample defined by $(\vec{x}_i, y_i)$, where $\vec{x}_i$ is a document represented in the VSM using the $tf \times idf$ weighting scheme, $Y$ is the set of class labels, and $y_i \in Y$ is the class label of this sample.

In the training phase, the algorithm sets the centroid $\vec{\mu}(l)$ for each distinct class label $l \in Y$ by computing the vector average, or center of mass, of its members. Equation (3) defines $\vec{\mu}(l)$, where $D_l$ is the set of documents whose class label is $l$.

\[
\vec{\mu}(l) = \frac{1}{|D_l|} \sum_{\vec{x}_i \in D_l} \vec{x}_i \tag{3}
\]

The prediction function in Eq. (4) returns the class label $\hat{y}$ that minimizes the Euclidean distance between the associated centroid $\vec{\mu}(l)$ and the test instance $\vec{x}$. Alternatively, Eq. (5) defines how the label can be predicted using the cosine similarity, previously presented in Eq. (1).

\[
\hat{y} = \underset{l \in Y}{\arg\min}\; |\vec{\mu}(l) - \vec{x}| \tag{4}
\]

\[
\hat{y} = \underset{l \in Y}{\arg\max}\; \mathrm{sim}(\vec{\mu}(l), \vec{x}) \tag{5}
\]
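Equations (3) to (5) can be sketched in a few lines of plain Python. This is an illustrative toy version with dense vectors, not the scikit-learn based implementation used in the experiments; the function names and the two-dimensional samples are assumptions.

```python
import math

def centroid(vectors):
    """Component-wise mean of equally sized dense vectors, as in Eq. (3)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def fit(samples):
    """samples: list of (vector, label) pairs. Returns {label: centroid}."""
    by_label = {}
    for x, y in samples:
        by_label.setdefault(y, []).append(x)
    return {label: centroid(xs) for label, xs in by_label.items()}

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def predict_euclidean(centroids, x):
    """Eq. (4): label whose centroid is closest to x."""
    return min(centroids, key=lambda label: euclidean(centroids[label], x))

def predict_cosine(centroids, x):
    """Eq. (5): label whose centroid is most similar to x."""
    return max(centroids, key=lambda label: cosine(centroids[label], x))

# Hypothetical two-dimensional training samples
samples = [
    ([1.0, 0.0], "advisor1"),
    ([0.8, 0.2], "advisor1"),
    ([0.0, 1.0], "advisor2"),
    ([0.2, 0.8], "advisor2"),
]
centroids = fit(samples)
```

For a test instance such as `[0.7, 0.3]`, both prediction functions return `"advisor1"` in this toy setting, although in general the two criteria may disagree when vectors differ strongly in length.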

2.4 Mean Average Precision

The Mean Average Precision (MAP) (Manning et al., 2008) is a well-known metric for evaluating Information Retrieval systems. It is the mean of the average precision scores calculated for several queries.

Figure 1 shows an example in which the MAP is computed. Relevant documents are represented in bold. For the first query, there are three relevant documents that are returned in the second, third and fifth positions of the ranking. The average precision for query 1 is $avg_1 = (1/2 + 2/3 + 3/5)/3 = 0.59$. Query 2 retrieves two relevant documents in the first and third positions. The average precision for query 2 is $avg_2 = (1/1 + 2/3)/2 = 0.83$. Therefore, $MAP = (avg_1 + avg_2)/2 = 0.71$.

In classification problems, the queries are the test instances submitted to the classification model.

[Figure 1 (diagram): two example rankings of five documents each.
Ranking #1: relevant documents at positions 2, 3 and 5; precision per position 0/1, 1/2, 2/3, 2/4, 3/5; average precision for query 1 = (1/2 + 2/3 + 3/5) / 3 = 0.59.
Ranking #2: relevant documents at positions 1 and 3; precision per position 1/1, 1/2, 2/3, 2/4, 2/5; average precision for query 2 = (1 + 2/3) / 2 = 0.83.
Mean average precision = (0.59 + 0.83) / 2 = 0.71.]

Figure 1: An example of MAP considering two queries.
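The arithmetic of the Figure 1 example can be reproduced with a short sketch. The function names are hypothetical, and the code assumes that every relevant document of a query appears somewhere in its ranking, which holds in the example.

```python
def average_precision(relevance):
    """relevance: one boolean per ranked position (True means relevant)."""
    hits = 0
    precisions = []
    for position, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / position)  # precision at this relevant position
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(rankings):
    """Mean of the average precision over all queries."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

# Rankings from the Figure 1 example
ranking_1 = [False, True, True, False, True]   # relevant at positions 2, 3, 5
ranking_2 = [True, False, True, False, False]  # relevant at positions 1, 3
ap_1 = average_precision(ranking_1)
ap_2 = average_precision(ranking_2)
map_score = mean_average_precision([ranking_1, ranking_2])
```

Rounded to two decimals, `ap_1`, `ap_2` and `map_score` match the values 0.59, 0.83 and 0.71 given in the text.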

3 RELATED WORK

(Husain et al., 2019) review the literature about ex-

pert ﬁnding systems between 2010 and 2019. These

systems have been proposed in different domains and

environments, such as medicine, enterprise, ques-

tion answering communities, and social networks.

Academia was the largest domain, comprising 44

studies (65% of the sample). The majority of these

systems were developed for speciﬁc academic tasks

like paper reviewing, research collaborations, ﬁnding

similar experts, and industry or university collabora-

tions. Only one study addressed ﬁnding a suitable su-

pervisor (Alarfaj et al., 2012). The authors proposed

a simple database-driven approach that selects a su-

pervisor from the university’s academic staff, and a

data-driven approach where candidates are extracted

from pages returned by a web search engine.

(Hasan and Schwartz, 2018) developed RecAdvi-

sor, a criteria-based Ph.D. supervisor recommender

for Florida State University (FSU). The prototype col-

lects information from four different sources: Mi-

crosoft Academic Graph, Computing Research and

Education Association (CORE), professors’ CVs and

FSU’s digital repository. The proﬁles are indexed us-

ing Elasticsearch (Gormley and Tong, 2015).

Selecting an academic supervisor is not a popular research theme. Most of the related work we could find was proposed for very specific scenarios, such as finding scientific articles and papers, recommending academic courses, and suggesting researchers for collaboration.

Docear’s RS (Beel et al., 2013) is part of a liter-

ature management software. The system allows a re-

searcher to search, read, make annotations and orga-

nize scientiﬁc articles, besides drafting manuscripts.

Docear suggests citations from a digital library con-

taining around 1.8 million research articles from var-

ious disciplines.

Champiri et al. (Champiri et al., 2015) published

a survey analysing if incorporating contextual infor-

mation in recommender systems is an effective ap-

proach to create more accurate and relevant recom-

mendations in digital libraries. They highlight RSs

with the purpose of exploring a research area and ﬁnd-

ing relevant research sources.

In order to recommend the most relevant courses to its users, the RARE system (Bendakir and Aïmeur, 2006) combines the benefits of both former students' experience, learned in the data mining process, and current students' ratings. It is a hybrid filtering approach based on association rules. The authors used the Apriori association algorithm implemented in the Weka tool to train the model. With a similar purpose, O'Mahony and Smyth (O'Mahony and Smyth, 2007) developed an RS for the on-line enrolment application of University College Dublin. Users can search by entering keywords or specific core module IDs. The output is a list of elective modules that match the search criteria and the user's profile. The authors used an item-based collaborative filtering algorithm (Karypis, 2001). Another strategy has recently been proposed for domain-aware grade prediction and top-n course recommendation (Elbadrawy and Karypis, 2016).

Rodrigues et al. (Rodrigues et al., 2018) use different strategies to suggest scientific collaborations to researchers based on their interests. The authors model the similarity between researchers using data from the ResearchGate social network. They exploit co-authorship attributes and paper reading records with a hybrid approach that combines content-based and collaborative filtering. Experimental results showed that the content-based strategy outperforms neighborhood-based collaborative filtering strategies by up to 21.16% in F-measure for the top-20 recommendation lists.

Mendonça et al. (Mendonça et al., 2020) present a systematic mapping of RSs based on scientific publications. They found that Machine Learning algorithms and the Vector Space Model representation are the most used in content-based RSs for the academic field. On the other hand, for collaborative filtering approaches, common methods are neighborhood-based, such as k Nearest Neighbors. The most frequently used databases were CiteULike, DBLP, Microsoft Academic Search (MAS), CiteSeerX, PubMed and Web of Science.

The RS proposed in this paper differs from most related work in the following aspects. Instead of recommending collaborations based on co-authorship, we use advising relationships to suggest academic supervisors. The supervisor profiles are learned using their academic genealogy trees, built from Electronic Theses and Dissertations (ETDs) repositories. We quantitatively evaluate our system using data from more than 79,000 professors and 600,000 students.

4 A NOVEL ACADEMIC SUPERVISOR RS

In this section we present our novel approach for rec-

ommending academic supervisors. Figure 2 shows

the architecture of the proposed RS.

The first process collects data and builds the academic genealogy. From a repository of ETDs, the RS selects a set of documents of interest. After that, the researchers, i.e. supervisors and authors, are extracted from the selected theses and dissertations. A deduplication method is applied to identify each unique researcher. From the relationships between unique researchers, the genealogy graph is built using the method proposed in (Madeira et al., 2020).

[Figure 2 (diagram): the pipeline starts from a repository of ETDs, from which documents are selected.]

Figure 2: Architecture of the proposed academic supervisor RS.

Using machine learning, in the second process, we fit a model that will be used to perform recommendations. Textual data from the academic genealogy are pre-processed with standard operations. We transform the data features into the VSM using the $tf \times idf$ term-weighting strategy, presented in Section 2.2. Next, using the transformed data, we train a classifier based on the Nearest Centroid algorithm (Section 2.3) using the cosine similarity and build the recommendation model.

Finally, users can query the system by entering the title and abstract of the research proposal. This information is transformed using the same pre-processing scheme as before and is then used as input to the classifier, which returns a final ranking of recommendations composed of the most suitable supervisors ordered by relevance. Each step is detailed in the following subsections.

4.1 Data Source

The proposed RS can handle different data sources.

Repositories of ETDs must support some interoper-

ability features, such as the OAI-PMH protocol (De-

varakonda et al., 2011), or have an API available for

harvesting metadata.

In this study, we used a Brazilian repository known as Biblioteca Digital Brasileira de Teses e Dissertações (BDTD), available at http://bdtd.ibict.br. This networked digital library contains metadata from more than 600 thousand documents. BDTD integrates and disseminates, in a single website, the complete content of theses and dissertations produced in Brazilian universities. Additionally, its access is open and free of charge.

This digital library also contributes to increasing the amount of Brazilian theses and dissertations available on the internet, raising the visibility of the national technological and scientific production. Moreover, BDTD also provides greater visibility and management of the investments made in graduate programs.

From the metadata fields available in BDTD, we picked:

• network_acronym_str – acronym for the university;

• network_name_str – name of the origin repository;

• title – document's title;

• description – document's abstract;

• author – document's author;

• advisor – document's supervisor;

• author_lattes – URL of the author's curriculum in the Lattes Platform;

• topic – related topics of the document;

• citation – how to cite the document;

• language – language of the document (mostly Portuguese);

• publishDate – year the document was published;

• format – indicates if a document is a Ph.D. thesis or a Masters dissertation;

• url – URL of the document in its original repository.

We collected 612,714 theses or dissertations from

BDTD. For each document, we extracted the re-

searchers (author and supervisor) and their relation-

ship to build the academic genealogy. The giant com-

ponent of the graph had more than 300 thousand ver-

tices connected by more than 350 thousand edges.
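To make the graph construction concrete, the following sketch builds a small genealogy graph from (advisor, author) pairs and extracts its largest connected component. It is only an illustration under simplifying assumptions: researcher names are taken as already deduplicated, the function names and records are hypothetical, and the actual method is the one from (Madeira et al., 2020).

```python
from collections import defaultdict, deque

def build_genealogy(records):
    """records: iterable of (advisor, author) pairs, one per thesis or dissertation."""
    children = defaultdict(set)   # advisor -> advised authors (directed edges)
    neighbors = defaultdict(set)  # undirected view, used only for connectivity
    for advisor, author in records:
        children[advisor].add(author)
        neighbors[advisor].add(author)
        neighbors[author].add(advisor)
    return children, neighbors

def giant_component(neighbors):
    """Return the largest connected component as a set of researcher names."""
    seen, best = set(), set()
    for start in neighbors:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:  # breadth-first search from an unvisited researcher
            node = queue.popleft()
            for other in neighbors[node]:
                if other not in component:
                    component.add(other)
                    queue.append(other)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

# Hypothetical records: student_b was advised by advisor_a and later advised student_c
records = [
    ("advisor_a", "student_b"),
    ("student_b", "student_c"),
    ("advisor_x", "student_y"),
]
children, neighbors = build_genealogy(records)
largest = giant_component(neighbors)
```

In this toy input the largest component contains the three researchers linked through `student_b`, while `advisor_x` and `student_y` form a separate, smaller component.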

4.2 Data Pre-processing

The number of documents collected and the volume of data stored can be large. Data can have different notations and language particularities. Thus, pre-processing is necessary to standardize the information. We applied the following operations to the textual data. First, the title, description, author and advisor metadata fields are tokenized. Tokens are normalized by turning characters to lowercase and removing accents and non-alphanumeric symbols. We also removed stopwords in Brazilian Portuguese using the NLTK toolkit (Loper and Bird, 2002). This operation reduces data volume and speeds up the system without significantly affecting the quality of the results. After that, the feature space of the title and description is transformed into the VSM, using the $tf \times idf$ term-weighting strategy.

Lattes Platform: an information system maintained by the Brazilian National Council for Scientific and Technological Development (CNPq) that integrates databases of curricula, research groups and institutions. Available at http://lattes.cnpq.br.

[Figure 3 (plot): theses and dissertations of advisor1 and advisor2 plotted by the tf-idf weights of word1 and word2, with the centroid of each advisor and a new_document; resulting ranking: advisor2, then advisor1.]

Figure 3: An example of how the Nearest Centroid algorithm works with the VSM.

In addition, we applied a cleaning process, removing duplicates and documents with missing information. We also corrected misreported information, resulting in 579,486 pre-processed theses and dissertations.
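The tokenization and normalization operations described in this subsection can be sketched as follows. The paper relies on the NLTK stopword list for Brazilian Portuguese; to keep the example self-contained, the small `STOPWORDS` set below is a hypothetical stand-in, and the function names are likewise assumptions.

```python
import re
import unicodedata

# Tiny illustrative stopword list (accent-free); the paper uses NLTK's
# Brazilian Portuguese list instead.
STOPWORDS = {"de", "da", "do", "em", "para", "um", "uma", "e", "a", "o"}

def strip_accents(text):
    """Remove diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def preprocess(text):
    """Lowercase, strip accents, drop non-alphanumeric symbols and stopwords."""
    text = strip_accents(text.lower())
    tokens = re.findall(r"[a-z0-9]+", text)  # keeps only alphanumeric runs
    return [t for t in tokens if t not in STOPWORDS]

tokens = preprocess("Um Sistema de Recomendação para Orientação Acadêmica!")
```

Because stopwords are filtered after accent removal in this sketch, the stopword list itself must be accent-free; a different ordering of the steps would be equally plausible.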

Finally, an additional pre-processing operation sets the class label using the advisor metadata field, composing the training samples with the structure $s_i = (\vec{x}_i, y_i)$, where $\vec{x}_i$ holds the $tf \times idf$ weights of the title and description, and $y_i$ is the class label of sample $i$. The conducted experiments used only one level of the tree, but the system can be parameterized to reach any depth by adding the vector components of the supervisors' theses to the training samples.

4.3 Training the Classifier for Recommendation

Due to the good results presented in (Han and Karypis, 2000), we chose the Nearest Centroid classification algorithm to learn the supervising profiles. Figure 3 shows an example of how the algorithm works with two candidate advisors and a new test instance (Ph.D. or Masters proposal). Each circle or triangle refers to a thesis or dissertation in the multidimensional term space. Dots are the centroids of the clusters formed by all works supervised by each advisor, i.e. the supervisors' profiles. The user receives as a recommendation a profile list composed of the class labels (distinct advisors) of the n = 2 nearest centroids.
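The ranking step can be sketched with sparse vectors stored as dictionaries, which mirrors the situation of Figure 3 with two terms and adds a third, unrelated advisor. The weights, the advisor names and the functions `sparse_centroids` and `recommend` are hypothetical; the sketch only illustrates the idea of returning the n most similar profiles.

```python
import math
from collections import defaultdict

def sparse_centroids(samples):
    """samples: list of ({term: weight}, advisor). Returns {advisor: mean vector}."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for vector, advisor in samples:
        counts[advisor] += 1
        for term, weight in vector.items():
            sums[advisor][term] += weight
    return {a: {t: w / counts[a] for t, w in vec.items()} for a, vec in sums.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def recommend(centroids, query, n=2):
    """Return the n advisors whose centroids are most similar to the query."""
    ranked = sorted(centroids, key=lambda a: cosine(centroids[a], query), reverse=True)
    return ranked[:n]

# Hypothetical tf-idf weights for works supervised by three advisors
samples = [
    ({"word1": 0.9, "word2": 0.1}, "advisor1"),
    ({"word1": 0.7, "word2": 0.3}, "advisor1"),
    ({"word1": 0.2, "word2": 0.8}, "advisor2"),
    ({"word1": 0.1, "word2": 0.9}, "advisor2"),
    ({"word3": 1.0}, "advisor3"),
]
centroids = sparse_centroids(samples)
new_document = {"word1": 0.3, "word2": 0.7}
ranking = recommend(centroids, new_document, n=2)
```

With these weights the new document is closest to the profile of `advisor2`, followed by `advisor1`, and `advisor3` is left out of the top-2 list.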


5 EXPERIMENTAL EVALUATION

In this section, we explain how we evaluate the qual-

ity of the proposed academic supervisor RS, we re-

port implementation details, and we also present the

obtained results.

5.1 Validation

To evaluate the proposed RS, we used the k-fold cross-validation technique, which consists of splitting the available dataset into k folds and calculating the evaluation metrics k times, where in each iteration one of the folds is used for testing and the others are used for training the model.
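A minimal sketch of the k-fold splitting procedure is shown below. The function name `k_fold_splits`, the shuffling with a fixed seed and the sample count are assumptions for illustration; the default of k = 10 matches the ten models mentioned in Section 5.3.

```python
import random

def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # fixed seed for reproducibility
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Hypothetical dataset with 25 samples
splits = list(k_fold_splits(n_samples=25, k=10))
```

Each sample is used for testing exactly once and for training in the remaining k - 1 iterations, so the k metric values can be averaged into a single score.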

We used the evaluation metric Mean Average Precision (MAP), presented in Section 2.4, applied to the output of the predict function, which is a ranking with more than 79 thousand positions. Note that a student may have two different advisors, one for the Masters dissertation and another for the Ph.D. thesis. In this case, both were considered correct recommendations, and they may appear in distinct positions of the ranking.

5.2 Implementation

The implementation was coded in Python using the

scikit-learn library (Pedregosa et al., 2011). The

Nearest Centroid algorithm was adapted from the

scikit-learn implementation to return an ordered list

of centroids. This list is the ranking of the most suit-

able academic supervisors from BDTD for a Ph.D. or

Masters proposal in the query.

We highlight that, due to the dataset volume, we needed to modify the source code of the Python library. Specifically, scikit-learn uses a float64 matrix to store the documents represented in the VSM, which slows down the system. To avoid this problem, we changed it to scipy.sparse.lil_matrix, which implements a row-based list-of-lists sparse matrix.

Experiments ran on a dual-socket quad-core Intel Xeon L5420 2.5 GHz CPU with 32 GB of memory.

5.3 Results

In this section, we start by presenting a query example and the returned recommendations. Afterwards, we focus on the evaluation considering MAP and how often the correct advisor is ranked near the top.

Figure 4 shows the academic genealogy tree of the researcher Marília Abrahão Amaral. Note that she had two different advisors, one for her Masters (M) and another for her Ph.D. (D). Relationship labels also include the publication year of the thesis or dissertation. Moreover, we can also observe that this researcher has already advised a Masters student, in 2017.

[Figure 4 (tree diagram, titled "Marília Abrahão Amaral Tree"): nodes Anne Caroline Lesinhovski, Marília Abrahão Amaral, Neri Dos Santos, Vania Ribas Ulbricht and Jose Palazzo Moreira De Oliveira; edge labels D-1997, M-2002, D-2008 and M-2017.]

Figure 4: An example of an academic genealogy tree, where each arrow represents a Ph.D. (D) or Masters (M) advising relationship.

In the proposed RS, using as input the title and ab-

stract of her thesis, we obtain a ranking of supervi-

sor recommendations. The ﬁrst ﬁve positions are pre-

sented in Table 1.

Note that a correct answer (Marília's Masters advisor, J. Palazzo) was returned in the fifth position. Considering that we have more than 79 thousand possible candidates, the obtained result is fairly good.

Moreover, if we take into account the remaining recommendations, all suggestions are indeed related to Marília's research field and can also be considered relevant recommendations. The first two researchers (J. Valdeni and R. Vicari), for instance, work at the same university as J. Palazzo. Those three professors have been part of the same research group within the same graduate program. Therefore, any of them could have been Marília's advisor, as their profiles are strongly related to her topic of interest.

Table 1: Returned recommendations using the title and abstract of Marília Abrahão Amaral's Ph.D. thesis.

Position  Name
1         Jose Valdeni de Lima
2         Rosa Maria Vicari
3         Alex Sandro Gomes
4         José Dutra de Oliveira Neto
5         Jose Palazzo Moreira de Oliveira
...       ...


Notwithstanding the above, the analysis of a single returned ranking is not enough to evaluate the system quality for this query. Thus, considering all ten models fitted in cross-validation, the mean position of this advisor was 5.2, reinforcing the quality of the recommendation.

To evaluate the general quality of the proposed system, we performed a study considering all 573,671 instances. Table 2 presents the results of the cross-validation process, which achieved a MAP of 32.41%. This means that our system was able to suggest the correct advisors, on average, in about the third position of the recommended ranking, since 1/0.3241 ≈ 3.09.
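The link between a MAP of 0.3241 and the third position can be made explicit under a simplifying assumption that is not stated in the paper: each query $q$ in the set of queries $Q$ has exactly one correct advisor, returned at rank $r_q$. The average precision of a query is then $1/r_q$, and the reciprocal of the MAP is the harmonic mean of the ranks:

```latex
\[
MAP = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{r_q}
\quad\Longrightarrow\quad
\frac{1}{MAP} = \frac{|Q|}{\sum_{q \in Q} 1/r_q} = \frac{1}{0.3241} \approx 3.09
\]
```

Since the harmonic mean never exceeds the arithmetic mean, the arithmetic mean rank may be larger than 3.09, and queries with two correct advisors (Masters and Ph.D.) deviate from this simplified reading.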

Table 2: Cross-validation evaluation considering the MAP metric.

Fold  MAP     Fold  MAP
1     0.3259  6     0.3245
2     0.3266  7     0.3256
3     0.3216  8     0.3223
4     0.3236  9     0.3242
5     0.3239  10    0.3228
Avg.  0.3241

To clarify the effectiveness of the method, we present a graphical analysis of the overall results. Figure 5 shows a histogram of the positions at which a correct advisor was returned in the ranking. The x axis represents the first 100 positions of the ranking, and the y axis the number of correct recommendations at each position.

Figure 5: Number of queries that returned the correct advi-

sor for each position of the ranking.

It can be noted that more than 15,000 queries returned a correct advisor in the first position. Moreover, the majority of the queries returned the advisor within the first 10 positions. Beyond that point, the frequencies stabilize, and after the 100th position the number of correct recommendations tends to zero.

6 CONCLUSION

In this work we developed a recommender system that

extracts knowledge from a set of descriptive meta-

data of theses and dissertations. We proposed a novel

content-based recommendation approach for suggest-

ing academic supervisors using academic genealogy

to model their proﬁles.

Unlike most of the related work, which helps in finding scientific literature, academic courses, or researchers for collaboration, our system recommends supervisors for thesis and dissertation proposals. Since choosing an adequate advisor can be a hard task, such a system seems to be an important assisting tool.

Experiments were conducted using real data from a repository containing more than 600 thousand theses and dissertations. The evaluation shows that our system was able to recommend a correct advisor, on average, in the third position of the suggested ranking.

In future work, we intend to integrate an academic genealogy tree viewer with the recommender system in a Web platform. Additional filters, such as the location and the university acronym, will also be included.

ACKNOWLEDGMENTS

This study was supported by CAPES Financial

Code 001, PNPD/CAPES (464880/2019-00), CNPq

(301618/2019-4), and FAPERGS (19/2551-0001279-

9, 19/2551-0001660).

REFERENCES

Alarfaj, F., Kruschwitz, U., Hunter, D., and Fox, C. (2012).

Finding the right supervisor: Expert-ﬁnding in a uni-

versity domain. In Proceedings of the Conference of

the North American Chapter of the Association for

Computational Linguistics: Human Language Tech-

nologies: Student Research Workshop, NAACL HLT

’12, page 1–6, USA. Association for Computational

Linguistics.

Beel, J., Langer, S., Genzmehr, M., and Nürnberger, A.

(2013). Introducing docear’s research paper rec-

ommender system. In Proceedings of the 13th

ACM/IEEE-CS Joint Conference on Digital Libraries,

JCDL ’13, page 459–460, New York, NY, USA. As-

sociation for Computing Machinery.

Bendakir, N. and Aïmeur, E. (2006). Using association rules

for course recommendation. In Beck, J. E., Aimeur,

E., and Barnes, T., editors, Proceedings of the AAAI

Workshop on Educational Data Mining, pages 1–10,

Palo Alto, California, USA. Association for the Ad-

vancement of Artiﬁcial Intelligence.


Bobadilla, J., Hernando, A., Ortega, F., and Gutiérrez, A.

(2012). Collaborative ﬁltering based on signiﬁcances.

Information Sciences, 185(1):1–17.

Bobadilla, J., Ortega, F., Hernando, A., and Gutiérrez, A.

(2013). Recommender systems survey. Knowledge-

Based Systems, 46:109–132.

Champiri, Z. D., Shahamiri, S. R., and Salim, S. S. B.

(2015). A systematic review of scholar context-aware

recommender systems. Expert Systems with Applica-

tions, 42(3):1743 – 1758.

Chen, R., Hua, Q., Chang, Y.-S., Wang, B., Zhang, L., and

Kong, X. (2018). A survey of collaborative ﬁltering-

based recommender systems: From traditional meth-

ods to hybrid methods based on social networks. IEEE

Access, 6:64301–64320.

Desrosiers, C. and Karypis, G. (2011). A comprehensive

survey of neighborhood-based recommendation meth-

ods. In Recommender systems handbook, pages 107–

144. Springer, Boston, MA.

Devarakonda, R., Palanisamy, G., Green, J. M., and Wilson,

B. E. (2011). Data sharing and retrieval using oai-

pmh. Earth Science Informatics, 4(1):1–5.

Elbadrawy, A. and Karypis, G. (2016). Domain-aware

grade prediction and top-n course recommendation.

In Proceedings of the 10th ACM Conference on Rec-

ommender Systems, RecSys ’16, page 183–190, New

York, NY, USA. Association for Computing Machin-

ery.

Ge, M., Delgado-Battenfeld, C., and Jannach, D. (2010).

Beyond accuracy: Evaluating recommender systems

by coverage and serendipity. In Proceedings of the

Fourth ACM Conference on Recommender Systems,

RecSys ’10, page 257–260, New York, NY, USA. As-

sociation for Computing Machinery.

Gormley, C. and Tong, Z. (2015). Elasticsearch: the deﬁni-

tive guide: a distributed real-time search and ana-

lytics engine. O’Reilly Media, Inc., Sebastopol, CA,

USA.

Han, E.-H. S. and Karypis, G. (2000). Centroid-based

document classiﬁcation: Analysis and experimental

results. In European conference on principles of

data mining and knowledge discovery, pages 424–

431, Department of Computer Science / Army HPC Research Center, University of Minnesota, Minneapolis. Springer.

Hasan, M. A. and Schwartz, D. G. (2018). Recadvisor:

Criteria-based ph.d. supervisor recommendation. In

The 41st International ACM SIGIR Conference on Re-

search & Development in Information Retrieval, SI-

GIR ’18, page 1325–1328, New York, NY, USA. As-

sociation for Computing Machinery.

Husain, O., Salim, N., Alias, R. A., Abdelsalam, S., and

Hassan, A. (2019). Expert ﬁnding systems: A sys-

tematic review. Applied Sciences, 9(20):4250.

Karypis, G. (2001). Evaluation of item-based top-n recom-

mendation algorithms. In Proceedings of the Tenth

International Conference on Information and Knowl-

edge Management, CIKM ’01, page 247–254, New

York, NY, USA. ACM.

Krulwich, B. (1997). Lifestyle ﬁnder: Intelligent user pro-

ﬁling using large-scale demographic data. AI Maga-

zine, 18(2):37.

Loper, E. and Bird, S. (2002). NLTK: the natural language

toolkit. CoRR, cs.CL/0205028.

Madeira, G., Borges, E. N., Lucca, G., Santos, H., and

Dimuro, G. (2020). A tool for analyzing academic genealogy. In Filipe, J., Śmiałek, M., Brodsky, A., and

Hammoudi, S., editors, Enterprise Information Sys-

tems, pages 443–456, Cham. Springer International

Publishing.

Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to information retrieval. Cambridge University Press, Cambridge, England.

Mendonça, F. C., Gasparini, I., Schroeder, R., and Kemczinski, A. (2020). Recommender systems based on scientific publications: A systematic mapping. In Proceedings of the 22nd International Conference on Enterprise Information Systems - Volume 1: ICEIS, pages 735–742, Setúbal, Portugal. INSTICC, SciTePress.

O’Mahony, M. P. and Smyth, B. (2007). A recommender

system for on-line course enrolment: An initial study.

In Proceedings of the 2007 ACM Conference on Rec-

ommender Systems, RecSys ’07, page 133–136, New

York, NY, USA. Association for Computing Machin-

ery.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,

Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P.,

Weiss, R., Dubourg, V., Vanderplas, J., Passos, A.,

Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(85):2825–2830.

Ray, S. and Marakas, G. (2007). Selecting a doctoral dis-

sertation supervisor: Analytical hierarchy approach to

the multiple criteria problem. International journal of

doctoral studies, 2(1):23–32.

Ricci, F., Rokach, L., and Shapira, B. (2011). Introduc-

tion to recommender systems handbook. In Rec-

ommender systems handbook, pages 1–35. Springer,

Boston, MA.

Rodrigues, M. W., Brandão, W. C., and Zárate, L. E. (2018). Recommending scientific collaboration from ResearchGate. In 7th Brazilian Conference on Intelligent Systems (BRACIS), pages 336–341, New York, NY, USA. IEEE.

Salter, J. and Antonopoulos, N. (2006). Cinemascreen

recommender agent: combining collaborative and

content-based ﬁltering. IEEE Intelligent Systems,

21(1):35–41.

Salton, G. (1968). Automatic Information Organization and

Retrieval. McGraw Hill Text, New York, NY, USA.

Sugimoto, C. R. (2014). Academic genealogy. In Beyond

bibliometrics: Harnessing multidimensional indica-

tors of scholarly impact, pages 365–380. MIT Press,

Cambridge, MA, USA.

Tibshirani, R., Hastie, T., Narasimhan, B., and Chu,

G. (2002). Diagnosis of multiple cancer types by

shrunken centroids of gene expression. Proceedings

of the National Academy of Sciences, 99(10):6567–

6572.
