GRASP: Graph-based Mining of Scientific Papers
Navid Nobani 1,2, Mauro Pelucchi 3, Matteo Perico 4, Andrea Scrivanti 3 and Alessandro Vaccarino 3
1 Dept. of Informatics, Systems & Communication, University of Milan-Bicocca, Milan, Italy
2 Digital Attitude, Milan, Italy
3 CRISP Research Center, University of Milan-Bicocca, Milan, Italy
4 Orobix, Bergamo, Italy
alessandro.vaccarino@gmail.com
Keywords: Graph Networks, Scientific Documents, Information Retrieval, Literature Review.
Abstract: Over the past two decades, academia has witnessed numerous tools and search engines that facilitate the retrieval procedure in the literature review process and help researchers review the literature with more ease and accuracy. These tools mostly work from a simple textual input which supposedly encapsulates the primary keywords of the desired research areas. Such tools mainly suffer from the following shortcomings: (i) they rely on textual search queries that are expected to reflect all the desired keywords and concepts, and (ii) they return shallow results, which makes following a paper through time via citations a cumbersome task. In this paper, we introduce GRASP, a search engine that retrieves scientific papers starting from a sub-graph query provided by the user, offering (i) a ranked list of papers based on the query and (ii) a graph with papers and authors as vertices, and cited and published-by relations as edges. GRASP has been built on a Neo4j graph database, populated with the DBLP and AMiner corpora provided through their APIs. By carrying out a performance evaluation with ten computer science experts, we demonstrate how GRASP can efficiently retrieve and rank the most related papers based on the user's input.
1 INTRODUCTION
Nowadays, thanks to the various online libraries which index and provide scientific papers in different fields, researchers seldom have problems finding the desired sources. Paradoxically, however, the abundance of such sources, while in theory working in the researcher's favor, in practice presents her with hundreds or thousands of new resources which, in the best case, demand a considerable amount of time to be processed manually.
A common literature review procedure may consist of the following iterative steps: (i) finding preliminary keywords related to the research field; (ii) using these keywords to find new papers and articles; (iii) filtering out the unrelated sources by gradually going through the contents of each source (e.g. abstract, conclusion); (iv) finding articles which cite, or are cited by, each retained work.
Following this procedure requires a considerable amount of time, since the third and fourth steps are not always straightforward. Moreover, to label the articles as related or unrelated, one often needs to read at least sections such as the abstract or the conclusion.
In this paper, we propose GRASP, a research tool which facilitates researchers' literature review and exploration process. Starting from the user input (a customized sub-graph query and a list of partitions), GRASP builds a network of related documents by mining and scoring a corpus of articles. The aim is to provide a set of articles (connected via citations), both as a ranked list and as a graph, containing the works that are interconnected and influential with respect to the user's query. Figure 1 depicts a graphical overview of the GRASP tool.
Figure 1: A graphical overview of GRASP.
Contribution. The contribution of this work is twofold:
1. Knowledge Graph: we provide a rich network of papers connected through citation relations, which in turn can be used as the backbone of graph-based scholarly tools providing various applications such as search engines and recommender systems.
2. Working Demo (Video Available): we implemented our tool and evaluated its performance through a user study among university researchers. A video of a working example is available online at https://youtu.be/waKVJgTwqf4.
The paper is organized as follows. Section 2 discusses related work, section 3 presents an overview of the GRASP system, and section 4 introduces concepts, definitions and problems. Section 5 looks at the big scholarly data by describing the data warehouse of GRASP. Section 6 discusses the retrieval technique, the scoring method and some implementation hints about GRASP. Section 7 presents the experimental evaluation of our technique, while section 8 draws conclusions and outlines future work.
2 RELATED WORK
GRASPworks in the context of Big Scholarly Data.
Nowadays, many online repositories for Big Schol-
arly Data are available, and many powerful analysis
tools were developed to help researchers and students
to access them and run queries to extract the required
information (Xia et al., 2017).
In the Big Data context, storing and processing a high volume of unstructured data is a recurring challenge. For tools dealing with this type of data, data analytics solutions that allow researchers to quickly extract information from these repositories become particularly relevant.
GRASP can be considered a hybrid application since, on the one hand, it applies text analysis tools to retrieve the required information from the text and, on the other hand, it uses a knowledge graph (see section 4) to generate a ranked list of documents matching the user's criteria. For this reason, it falls into the categories of Research Management and Recommendation Systems as outlined by (Khan et al., 2017).
In the past four decades, the mining of scientific texts has been targeted by numerous researchers, mostly in the fields of computer science and computational linguistics. These works can be categorized into four parts:
(i) Information Extraction, Classification, and Clustering of the Documents. For instance, (Schäfer et al., 2008) use NLP techniques to extract factual relations from scientific texts, while (Williams et al., 2014) introduce a method for information extraction and document clustering of academic texts. (Schäfer and Kiefer, 2011) introduce semantic search and citation classification of scientific papers.
(ii) Computing Similarity Ranks of Academic Papers, such as computing citation-based impact measures (Hoang et al., 2010) and relatedness scores (Effendy et al., 2014).
(iii) Focusing on Authors, their Works and the Networks they belong to (Citation Graphs and Citation Networks). For example, (Cesarini et al., 2018) model DBLP as a graph and explore authors, their relations and their similarity with other authors. In a similar work, (Mercorio et al., 2019) create a graph network of authors and their academic publications. (Tran et al., 2012) use LDA-based topic modelling to facilitate author matching between two different databases. Both (Za and Spagnoletti, 2013) and (Zhao and Strotmann, 2015) present network citation solutions that use graph modelling to investigate citation patterns and sub-networks.
(iv) Big Scholarly Data Integration and Automatic Extraction of Information. (Williams et al., 2014) and (Ororbia et al., 2015) describe CiteSeerX and how it integrates data from across the Web and performs automatic extraction, clustering, entity linking and name disambiguation on that data.
A recent work incorporating a network of papers selected according to a topic is Connected Papers (https://www.connectedpapers.com/), which creates a graph of similar papers based on mutual citations and allows researchers to have a navigable overview of the existing works related to a specific field of study. The main difference between Connected Papers and GRASP is the scope: GRASP is not a general-purpose system but rather is built to support the literature review process within a scientific framework. GRASP is able to run tailored queries on specific partitions of a big scholarly data warehouse and to output a network of related documents based on mining and scoring a corpus of articles. Moreover, there are substantial differences in the ranking algorithm, which helps and guides our users to start the literature review process from the most influential works related to the initial query.
While some former works address generating a graph database or network from scientific papers, aiming at creating relations among authors and their works, to the best of our knowledge none of them elaborates on the retrieval of papers linked via citations through time.
3 THE GRASP SYSTEM
As described in Figure 1, the user provides as input to the system a sub-graph query and a set of partitions. GRASP uses these inputs to produce as output a citation graph and a paper list.
Sub-graph Query. A sub-graph query (or a group of Family items) is a JSON document which consists of one or more components (see Figure 2). These components aim to narrow down the search results by providing a recipe which helps GRASP arrive at the most related results, through relative, alternative, reinforcing and not-related terms (see Definition 3 for more details).
Partitions. Partitions are keywords (i.e. fields of
study) which aim to reduce the number of documents
analysed and ranked by the system.
Scoring. After applying a preliminary filter using the partitions, GRASP utilizes the sub-graph query to score the remaining documents based on the provided families, considering the title and the body of each document (see section 6.1).
Once the documents in the database are scored, the following outputs are generated:
Citation Graph. A visual output representing a graph which depicts how authors and their publications are connected through publish and reference edges (see Figure 4).
Paper List. As the second output, GRASP provides a list of publications sorted by the scoring function described above, together with their rank and a link to the publication source.
4 CONCEPTS, DEFINITIONS,
PROBLEM
Before introducing the concepts, we go through some definitions about the data model behind GRASP. The data warehouse of our system is built on a graph database, as graph databases represent relationships more efficiently, especially when dealing with interconnected data, and offer flexibility thanks to their schema-free nature (Angles and Gutierrez, 2008).
Definition 1: Directed Labeled Multi-graph. Simplifying, a directed labeled multi-graph G is defined by the following tuple:

G = (N, E, l_n, l_e)

where:
N is the set of nodes;
E is the set of edges;
l_n is the set of node labels (each node can have one or more labels);
l_e is the set of edge labels.

Each node n ∈ N has a set of properties P'. An edge e ∈ E represents a relation between a node n' and a node n''. Each e has a set of properties P''.
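
To make the definition concrete, the following minimal Python sketch models the same structure; the names and types are purely illustrative, since GRASP stores this graph in Neo4j rather than in application code.

from dataclasses import dataclass, field

# Illustrative model of the directed labeled multi-graph G = (N, E, l_n, l_e).

@dataclass
class Node:
    node_id: str
    labels: set[str]                                 # l_n: one or more labels
    properties: dict = field(default_factory=dict)   # P' (e.g. title, abstract)

@dataclass
class Edge:
    source: str                                      # id of node n'
    target: str                                      # id of node n''
    label: str                                       # one label from l_e
    properties: dict = field(default_factory=dict)   # P''

@dataclass
class MultiGraph:
    nodes: dict[str, Node] = field(default_factory=dict)
    edges: list[Edge] = field(default_factory=list)  # parallel edges allowed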
Definition 2: Knowledge Base. With KB = {d_1, d_2, d_3, ...}, we refer to the Knowledge Base of documents. Each d is described by a set of properties (e.g. title, abstract, ...) and is represented as a node n ∈ N in G (see Definition 1).
As described in (Jarke et al., 1989), a KB is a representation of heuristic and factual information, in the form of facts, assertions and deduction rules. GRASP needs two features to extract information from its KB:
An inference engine that plays the role of an interpreter;
A man-machine interface that transfers queries from, and answers to, the user.
GRASP uses a particular form of KB, a Knowledge Graph. (In the 1960s, Semantic Networks were defined as representation frameworks that can capture a wide range of entities; a Knowledge Graph is a variant of semantic network with added constraints, whose distinguishing feature is the ability to encode structured information about entities and their rich relations.) In the GRASP Knowledge Graph, nodes are entities with different types and attributes, while edges are relations of different types.
Definition 3: Sub-graph Query. With

O = {f_1: {F, I, N, D, V, A}, f_2: {F, I, N, D, V, A}, ...}

we define a Sub-graph Query O. The Sub-graph Query O defines a domain of research interests composed of a list of items f_i.

f_i --> {p_1, p_2, p_3, ...} = P_i

Each item f_i generates a list of phrases using alternatives and variations; we call each generated list of phrases a Phrases Set P_i. Each p_j identifies a relevant phrase for the domain of research interests. The concept of relevant phrase is a key element of the technique: the search space is expanded by adding information related to the relevant phrases, on the basis of the relations present in the knowledge graph.
O = {
"classifi": {
"family": "MODEL",
"importance": 0.8,
"needs": ["ai"],
"drug":["expert systems", "learning system", "path"],
"alternatives": ["artificial intelligence",
"machine learning",
"artificial intelligence system"],
"variations": ["er", "cation"]
}
}
Figure 2: Example of sub-graph Query for the domain of
artificial intelligence.
To generate the list of phrases p, each f_i is characterized by the following specification, planned by the researcher during the initial phase of the literature review:
(i) F is the family name related to the field of study;
(ii) I ∈ (0, 1] (Importance) identifies the weight of f;
(iii) N (Needs) describes the mandatory context in which f is considered valid for the research purpose;
(iv) D (Drugs) identifies the non-mandatory terms that reinforce the relation between an item and F;
(v) V (Variations) are variations with respect to the field of study;
(vi) A (Alternatives) are alternatives to the defined field of study.
Figure 2 presents an example of a small sub-graph query for mining papers related to the domain of artificial intelligence. In this case our research domain is artificial intelligence, where the family F is represented by MODEL, which here identifies a model in our domain as representing the context expressed by the needs (i.e. ai); this family has importance 0.8 for our research domain; expert systems, learning system and path are non-mandatory terms that reinforce the relationship between the retrieved documents and the MODEL family; artificial intelligence, machine learning and artificial intelligence system are alternatives for the MODEL family in the AI domain; and, with respect to the field of study, we use suffixes such as er and cation to calculate new variations (e.g. classifier, classification) for our research domain.
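
As a rough illustration of how an item of the sub-graph query could be expanded into its Phrases Set, the following Python sketch applies the variations as suffixes of the item key and adds the alternatives as stand-alone phrases; the exact expansion logic of GRASP is not spelled out here, so this semantics is an assumption.

def phrases_set(key: str, item: dict) -> set[str]:
    """Sketch of Phrases Set generation (assumed semantics): variations act
    as suffixes of the item key, while alternatives enter the set as
    stand-alone phrases."""
    variations = item.get("variations", [])
    phrases = {key + v for v in variations} if variations else {key}
    phrases |= set(item.get("alternatives", []))
    return phrases

model_item = {
    "family": "MODEL", "importance": 0.8, "needs": ["ai"],
    "drug": ["expert systems", "learning system", "path"],
    "alternatives": ["artificial intelligence", "machine learning",
                     "artificial intelligence system"],
    "variations": ["er", "cation"],
}
print(sorted(phrases_set("classifi", model_item)))
# ['artificial intelligence', 'artificial intelligence system',
#  'classification', 'classifier', 'machine learning']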
Definition 4: Family. With

F = {(P, N, D, I) s.t. P is a Phrases Set}

we define a Family F. A Family F is a set of research documents characterized by the same field of study (i.e. the same Family name F).
Each F is defined by the following attributes:
(i) P is a Phrases Set generated by an item f of the Sub-graph Query with its variations V and alternatives A;
(ii) N represents the set of Needs N of the item f;
(iii) D is the set of Drugs D of the item f;
(iv) I is the importance of each phrase generated by f within its Family F.
We can now specify the concept of Query, to illustrate our approach to sub-graph query-based mining of a Knowledge Base of scientific papers.
Definition 5: Query. Given a document d and the sub-graph query O, a query q is the couple q = <d, O>.
Problem 1: Given a knowledge base KB of scientific documents and a query q, return the result set RS = {d_1, d_2, ...} ⊆ KB that contains the documents d_i retrieved in KB such that d_i satisfies the query q.
5 DATA
To build the knowledge base of scientific papers we used the citation network dataset from DBLP (https://dblp.org/) and ArnetMiner (https://www.aminer.org/); the former provides the citation network, while the latter adds further data, such as the field of study, by searching and performing data mining operations against academic publications on the Internet, using social network analysis to identify connections between researchers, conferences and publications (Tang et al., 2008). This allows ArnetMiner (or AMiner) to provide services such as expert finding, geographic search, trend analysis, reviewer recommendation, association search, course search, academic performance evaluation, and topic modeling.
For our work we used the v11 version (Sinha et al., 2015), containing 4,107,340 papers and 36,624,464 citation relationships.
As specified in section 4, we chose a graph-based approach that enables us to generate a straightforward representation of relationships while achieving high performance and flexibility.
To this end, we chose to build the knowledge base on top of Neo4j (Robinson et al., 2015), which uses the Cypher declarative query language (Francis et al., 2018) to query the graph.
Figure 3: Data Model.
The knowledge base data model is represented in Figure 3. Raw data are processed and transformed to fit the data model structure and loaded into a Neo4j graphDB instance; in this way we are able to use Cypher to query the knowledge base.
The relationships shown in the data model are created as specified in (Sinha et al., 2015), directly from the dataset; we then have the following labels for each type of node:
Author: an author of one or more Papers; identified by an id, and has a name;
Paper: the main label of the model; can reference other Papers, is published on a Venue, and can have Fields of Study (FOS); identified by an id, and its most relevant attributes are title, year and abstract;
Venue: can be a conference, journal, workshop or book, with their specific attributes; common attributes are id and name;
Field of Study (FOS): a field of study, coming from the MAG database (Sinha et al., 2015) and extracted using NLP techniques; a Paper can have more than one FOS, each with an associated score.
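
Once loaded, the knowledge base can be queried with Cypher. The following Python sketch, using the official Neo4j driver, retrieves citation pairs within a partition; the relationship type names (HAS_FOS, REFERENCES), the FOS label and the connection credentials are assumptions for illustration, not the exact schema of the GRASP warehouse.

from neo4j import GraphDatabase

# Sketch: retrieve citation pairs inside a partition (field of study).
# Relationship and property names are assumed from the data model above.
driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

CITATIONS_IN_PARTITION = """
MATCH (p:Paper)-[:HAS_FOS]->(f:FOS)
WHERE f.name IN $partitions
MATCH (p)-[:REFERENCES]->(cited:Paper)
RETURN p.title AS title, p.year AS year, cited.title AS cited_title
LIMIT $limit
"""

with driver.session() as session:
    result = session.run(CITATIONS_IN_PARTITION,
                         partitions=["computer vision", "deep learning"],
                         limit=25)
    for record in result:
        print(record["year"], record["title"], "->", record["cited_title"])

driver.close()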
6 METHODOLOGY
This section describes the formulas used to calculate the scoring values of a document with respect to a phrase and to a Family.
Document. A document d is a couple formed by a title and a body:

Title_d = {(p_1, ..., p_t) s.t. p_i is a phrase}
Body_d = {(p_1, ..., p_c) s.t. p_i is a phrase}

Thus,

Document d = (Title_d, Body_d)
For scoring purposes, we applied the TF-IDF (term frequency-inverse document frequency) formula to the corpus of documents:

TFIDF_d = {(p_i, r(p_i)) s.t. p_i ∈ Body_d and r(p_i) is the TF-IDF rank of p_i in Body_d}
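
As an illustration of how the TFIDF_d ranking could be computed, the following Python sketch assumes each document body has already been reduced to a list of phrases; GRASP's actual phrase extraction and weighting pipeline is not detailed here.

import math
from collections import Counter

def tfidf_ranks(bodies: list[list[str]]) -> list[dict[str, int]]:
    """For each document body, map each phrase p to its rank r(p):
    rank 1 = the phrase with the highest TF-IDF score in that body."""
    n_docs = len(bodies)
    # Document frequency: in how many bodies each phrase appears.
    doc_freq = Counter(p for body in bodies for p in set(body))
    all_ranks = []
    for body in bodies:
        tf = Counter(body)
        score = {p: (tf[p] / len(body)) * math.log(n_docs / doc_freq[p])
                 for p in tf}
        ordered = sorted(score, key=score.get, reverse=True)
        all_ranks.append({p: r for r, p in enumerate(ordered, start=1)})
    return all_ranks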
We recall that a Family F is a collection of Phrases Sets P with related features: Needs, Drugs and Importance.

F = {(P, N, D, I) s.t. P is a Phrases Set}
6.1 Scoring
The definition of the Document scoring formula considers the composition of four weight functions:
Title scoring function;
Body scoring function;
Needs coefficient;
Drugs coefficient.
Title scoring function of a Phrases Set P and document d:

T(P, d) = |Title_d ∩ P| / |Title_d|

Body scoring function of a Phrases Set P and document d:

B(P, d) = (1/|P|) · Σ_{p ∈ P} (1 − r(p)/|TFIDF_d|)   if Body_d ∩ P ≠ ∅
B(P, d) = 0   otherwise
Finally, the Needs and Drugs coefficients related to a Phrases Set P and document d can be represented as:

N(d) = |N ∩ Body_d| / |N|   if N ≠ ∅
N(d) = 1   otherwise

D(d) = 1 + |D ∩ Body_d| / |D|   if D ≠ ∅
D(d) = 1   otherwise

Thus the overall scoring function of the document d with respect to an element of a Family F given by (P, N, D, I) (here represented using only P) is:

s(P, d) = (T(P, d) + B(P, d)) · I · D(d) · N(d)
Once the scoring function for the elements of a family is defined, it is possible to introduce the scoring function of the document d with respect to an entire family F. To avoid unbalances related to different family sizes, only elements of the family whose Phrases Set has a non-empty intersection with the body of the document are taken into account in this evaluation. For simplicity, we represent an element (P, N, D, I) using only P. Given a family F and a document d, we define the scoring function S of the document d with respect to the family F as:

S(F, d) = Σ_{P ∈ F} s(P, d) · 1/|{P s.t. Body_d ∩ P ≠ ∅}|
In addition, for queries containing multiple families, we can define a Total Rank for a document d as the product of the scoring functions over all families F ∈ O:

TotalRank(d) = Π_{F ∈ O} max(S(F, d), ε),   with ε > 0
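
Putting the formulas of this section together, the following is a minimal Python sketch of the scoring pipeline. It assumes documents are already reduced to phrase sets, with TF-IDF ranks precomputed for every body phrase, and the value of ε is illustrative; the sketch mirrors the formulas above rather than GRASP's actual implementation.

import math

EPSILON = 1e-6  # the ε of the Total Rank; the actual value is not specified here

def T(P: set, title: set) -> float:
    # T(P, d) = |Title_d ∩ P| / |Title_d|
    return len(title & P) / len(title) if title else 0.0

def B(P: set, body: set, ranks: dict) -> float:
    # B(P, d) = (1/|P|) · Σ_{p ∈ P} (1 − r(p)/|TFIDF_d|) if Body_d ∩ P ≠ ∅.
    # `ranks` is assumed to map every phrase of the body to its TF-IDF rank.
    if not (body & P):
        return 0.0
    return sum(1 - ranks[p] / len(ranks) for p in P if p in body) / len(P)

def N_coeff(needs: set, body: set) -> float:
    return len(needs & body) / len(needs) if needs else 1.0

def D_coeff(drugs: set, body: set) -> float:
    return (1 + len(drugs & body) / len(drugs)) if drugs else 1.0

def s(P, N, D, I, title, body, ranks) -> float:
    # s(P, d) = (T(P, d) + B(P, d)) · I · D(d) · N(d)
    return ((T(P, title) + B(P, body, ranks)) * I
            * D_coeff(D, body) * N_coeff(N, body))

def S(family, title, body, ranks) -> float:
    # S(F, d): element scores summed, normalized by the number of elements
    # whose Phrases Set has a non-empty intersection with the body.
    n_hits = sum(1 for el in family if el["P"] & body)
    if n_hits == 0:
        return 0.0
    return sum(s(el["P"], el["N"], el["D"], el["I"], title, body, ranks)
               for el in family) / n_hits

def total_rank(families, title, body, ranks) -> float:
    # Product over all families F ∈ O of max(S(F, d), ε).
    return math.prod(max(S(f, title, body, ranks), EPSILON)
                     for f in families)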
7 RESULTS
7.1 User Evaluation Methodology
GRASP was developed and tested on several use cases. The results of those cases were passed to a pool of experts (PhD students and researchers) for a model evaluation. Two sets of metrics are adopted, both inspired by Information Retrieval performance metrics (see (Büttcher et al., 2016; Dupret, 2011)).
The first one focuses on the overall consistency of the documents extracted by GRASP, regardless of the degree of coherence. It is based on the following metrics:
Precision at k Documents (P@k);
Precision at R (P@R);
Average Precision (AP).
The second one focuses more on the degree of properness and appropriateness of the documents extracted by GRASP, and is based on the following metrics (a sketch of both metric sets follows the list):
Discounted Cumulative Gain (DCG);
Normalized Discounted Cumulative Gain
(nDCG).
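
For reference, the following Python sketch shows how these metrics can be computed from the experts' judgments: binary relevance for the precision-based metrics and graded relevance for the gain-based ones. The grading scale used by the evaluators is not specified here, so the inputs are assumptions.

import math

def precision_at_k(rels: list[int], k: int) -> float:
    """P@k over binary relevance judgments (1 = relevant). P@R is the
    same measure with k set to R, the number of relevant documents."""
    return sum(rels[:k]) / k

def average_precision(rels: list[int]) -> float:
    hits, total = 0, 0.0
    for i, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0

def dcg_at_k(gains: list[float], k: int) -> float:
    """DCG over graded relevance judgments, discounted by log2 of position."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))

def ndcg_at_k(gains: list[float], k: int) -> float:
    """nDCG: DCG normalized by the ideal ordering (iDCG)."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal else 0.0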
7.2 Evaluated Use Cases
In this section, some user evaluation cases are re-
ported with associated evaluation metrics. Each use
case adopts its own sub-graph.
The evaluation process of the system is structured as follows:
(i) Sub-graph query authoring;
(ii) Choice of partitioning keys;
(iii) Execution of a selection query on the knowledge database;
(iv) Application of our scoring technique;
(v) Evaluation of results with ten university researchers.
The research questions we addressed can be summarized as follows: can the result be effectively used for the literature overview process? Are the documents returned by the system actually relevant to the search domain and query? Is the position of each document within the list returned by GRASP correct? In other words, is the ranking of documents consistent with their relevance to the research domain?
As far as evaluation is concerned, the results of the scoring phase were validated by ten university researchers, who calculated the indicators described in section 7.1. The evaluation phase focuses on the 10 most relevant papers extracted by GRASP.
Figure 4: Working example - Computer Vision and
Deep Learning (the original figure is available here
http://tiny.cc/93gwtz).
GRASP was then evaluated on eight different scenarios:
(i) AI & Natural language processing;
(ii) Computer Vision and Deep Learning (Figure
4);
(iii) Data Quality and KDD Processes;
(iv) Robotics and Autonomous systems;
(v) Function as a Service and cloud computing;
(vi) Big Data and New Data warehouse;
(vii) Map-Reduce for econometrics;
(viii) Programming language and type-safe chal-
lenge.
For each scenario, each expert provided evaluations on coherence and properness. For example, for Computer Vision and Deep Learning, the following Families, partitions and ontology have been provided:
Families = ["COMPUTER VISION", "DEEP LEARNING"]
Partition = ["computer vision", "image recognition", "deep learning"]

Ont = {
    "image": {
        "family": "COMPUTER VISION", "importance": 1,
        "needs": ["analysis", "reconstruction", "processing",
                  "segmentation", "enhancement", "recognition"],
        "drug": ["text"],
        "alternatives": ["frame", "signal"],
        "variations": []},
    "pattern": {
        "family": "COMPUTER VISION", "importance": 1,
        "needs": ["recognition"], "drug": [],
        "alternatives": ["analysis"],
        "variations": []},
    "deep learning": {
        "family": "DEEP LEARNING", "importance": 1,
        "needs": [], "drug": ["text"],
        "alternatives": ["machine learning", "neural network"],
        "variations": []},
    "information": {
        "family": "COMPUTER VISION", "importance": 0.5,
        "needs": ["retrieval"],
        "drug": ["system"], "alternatives": [],
        "variations": []},
    "graphics": {
        "family": "COMPUTER VISION", "importance": 0.6,
        "needs": ["computer"],
        "drug": [], "alternatives": [],
        "variations": []}
}
Table 1 reports the output.
Table 1: Top 5 output papers for the example described above.

TITLE | TOTAL RANK | COMPUTER VISION | DEEP LEARNING
A massively parallel architecture for a self-organizing neural pattern recognition machine | 9295 | 98.2472 | 94.6083
Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis | 8091.76 | 98.298 | 82.3186
Measuring Invariances in Deep Networks | 7202.72 | 90.2564 | 79.803
Recognizing lower face action units for facial expression analysis | 6422.5 | 73.8933 | 86.9158
Tracking faces | 6112.99 | 73.4586 | 83.2168
7.3 Evaluation Results
Results have been summarized by evaluating the mean and standard deviation of:
Precision@R (P@R);
Discounted Cumulative Gain (DCG);
Ideal Discounted Cumulative Gain (iDCG);
Normalized Discounted Cumulative Gain (nDCG).
Summarized results are presented in Table 2; the full evaluation is available at http://tiny.cc/93gwtz.
Table 2: Summarized results for the GRASP evaluation.
P@R DCG@k iDCG@k nDCG
Case 1 0.83±0.14 4.51±1.09 5.37±1.00 0.84±0.11
Case 2 0.92±0.15 7.43±1.43 8.34±0.61 0.88±0.11
Case 3 0.84±0.07 7.02±0.70 7.54±0.57 0.93±0.02
Case 4 0.93±0.05 7.72±0.98 7.91±0.87 0.98±0.02
Case 5 0.95±0.04 6.95±1.05 7.25±0.91 0.96±0.03
Case 6 0.80±0.18 6.27±1.98 7.31±1.26 0.84±0.12
Case 7 0.83±0.07 4.50±1.00 5.34±1.24 0.85±0.07
Case 8 0.89±0.05 6.42±1.10 6.78±0.97 0.95±0.04
P@R: We can observe that the system performs well on all the use cases, with positive peaks at Case 2, Case 4 and Case 5, and the lowest values for Case 6.
P@R tends to be high (greater than 0.90) when the domain, and the sub-graph describing it, is narrower.
When the search domain is wider and the sub-graph's contours are less defined, it tends to drop to around 0.83. For example, the Big Data and New Data warehouse domain is wider and includes several sub-domains (e.g. machine learning, data models, NoSQL databases, which are not in the scope of the case) that decrease precision.
DCG emphasizes the position of each single document in the array of returned documents. As the results show, we have very good DCG values (> 7) on 37.5% of the results; the negative cases are Case 1, Case 7 and Case 8. Normalized Discounted Cumulative Gain measures our effectiveness across the various scenarios. The nDCG metric does not penalize bad documents, so our scoring algorithm is able to achieve high performance, with values above 0.9 in 50% of cases. In general, the GRASP evaluation highlights two main patterns of our scoring method:
the algorithm achieves its best results when we evaluate the overall list of documents returned by the system, because GRASP is always able to correctly detect and sort relevant documents;
GRASP assigns a wrong rank to documents that are relevant for only a part of the domain (i.e. only for a single family with a high relevance value); it does not perform well in distinguishing between uniform and skewed relevance across the domain of interest.
8 CONCLUSION AND FUTURE
WORK
In this paper, we addressed the problem of performing an efficient literature review given the large quantity of articles, papers and documents to be retrieved and mined, filtering them based on their usefulness and retrieving the other sources that refer to, or are cited by, the target document. The proposed solution is able to identify the path of linked sources which can contribute to the research topic provided by the researcher as input, in the form of a simple sub-graph.
As future work, we are currently working to improve GRASP by:
increasing the number of user study participants
to increase the robustness of the evaluations;
a more in-depth analysis of the competitor solu-
tions;
enhancing the way the sub-graph query is mapped
on the graph;
considering more robust information retrieval
methods utilizing zone indexes and n-gram tok-
enizing;
generating time-anchored graphs which show the
path through the years.
REFERENCES
Angles, R. and Gutierrez, C. (2008). Survey of graph
database models. CSUR.
Büttcher, S., Clarke, C. L. A., and Cormack, G. V. (2016). Information Retrieval: Implementing and Evaluating Search Engines. MIT Press.
Cesarini, M., Mercorio, F., Mezzanzanica, M., Moscato,
V., and Picariello, A. (2018). GraphDBLP Re-
leased: Querying the Computer Scientists Network as
a Graph. CEUR.
Dupret, G. (2011). Discounted cumulative gain and user
decision models. In Grossi, R., Sebastiani, F., and Sil-
vestri, F., editors, String Processing and Information
Retrieval, pages 2–13, Berlin, Heidelberg. Springer
Berlin Heidelberg.
Effendy, S., Jahja, I., and Yap, R. H. (2014). Relatedness
measures between conferences in computer science: a
preliminary study based on dblp. In WWW.
Francis, N., Green, A., Guagliardo, P., Libkin, L., Lin-
daaker, T., Marsault, V., Plantikow, S., Rydberg, M.,
Selmer, P., and Taylor, A. (2018). Cypher: An evolv-
ing query language for property graphs. In SIGMOD.
Hoang, D. T., Kaur, J., and Menczer, F. (2010). Crowd-
sourcing scholarly data. Proc. Web Science Con-
ference: Extending the Frontiers of Society On-Line
(WebSci).
Jarke, M., Neumann, B., Vassiliou, Y., and Wahlster, W.
(1989). Kbms requirements of knowledge-based sys-
tems. In Foundations of knowledge base management,
pages 381–394. Springer.
Khan, S., Liu, X., Shakil, K. A., and Alam, M. (2017). A
survey on scholarly data: From big data perspective.
IP&M.
Mercorio, F., Mezzanzanica, M., Moscato, V., and Sperlì, G. (2019). A Tool for Researchers: Querying Big Scholarly Data through Graph Databases. ECML PKDD.
Ororbia, A. G., Wu, J., Khabsa, M., Williams, K., and Giles,
C. L. (2015). Big scholarly data in citeseerx: Informa-
tion extraction from the web. In WWW.
Robinson, I., Webber, J., and Eifrem, E. (2015). Graph databases: new opportunities for connected data. O'Reilly Media, Inc.
Schäfer, U. and Kiefer, B. (2011). Advances in deep parsing of scholarly paper content. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 6699 LNCS:135–153.
Schäfer, U., Uszkoreit, H., Federmann, C., Marek, T., and Zhang, Y. (2008). Extracting and querying relations in scientific papers on language technology. LREC.
Sinha, A., Shen, Z., Song, Y., Ma, H., Eide, D., Hsu, B.-
J. P., and Wang, K. (2015). An overview of microsoft
academic service (mas) and applications. In WWW.
Tang, J., Zhang, J., Yao, L., Li, J., Zhang, L., and Su, Z.
(2008). Arnetminer: Extraction and mining of aca-
demic social networks. In SIGKDD.
Tran, D. H., Takeda, H., Kurakawa, K., and Tran, M. T.
(2012). Combining topic model and co-author net-
work for KAKEN and DBLP linking. Lecture Notes in
Computer Science (including subseries Lecture Notes
in Artificial Intelligence and Lecture Notes in Bioin-
formatics), 7198 LNAI.
Williams, K., Wu, J., Choudhury, S. R., Khabsa, M., and Giles, C. L. (2014). Scholarly big data information extraction and integration in the CiteSeerX digital library. Proceedings - International Conference on Data Engineering.
Xia, F., Wang, W., Bekele, T. M., and Liu, H. (2017). Big
scholarly data: A survey. IEEE Transactions on Big
Data.
Za, S. and Spagnoletti, P. (2013). Knowledge creation
processes in information systems and management:
Lessons from simulation studies. In Organizational
Change and Information Systems, pages 191–204.
Springer.
Zhao, D. and Strotmann, A. (2015). Analysis and visualiza-
tion of citation networks. Synthesis lectures on infor-
mation concepts, retrieval, and services, 7(1):1–207.