Biomedical Dataset Recommendation
Xu Wang, Frank van Harmelen and Zhisheng Huang
Vrije University Amsterdam, De Boelelaan 1105, 1081 HV Amsterdam, The Netherlands
Keywords:
Dataset Recommendation, Scientific Datasets.
Abstract:
Dataset search is a special application of information retrieval, which aims to help scientists find the
datasets they want. Current dataset search engines are query-driven, which implies that the results are limited
by the ability of the user to formulate the appropriate query. In this paper we aim to solve this limitation by
framing dataset search as a recommendation task: given a dataset by the user, the search engine recommends
similar datasets. We solve this dataset recommendation task using a similarity approach. We provide a simple
benchmark task to evaluate different approaches for this dataset recommendation task. We also evaluate the
recommendation task with several similarity approaches in the biomedical domain. We benchmark 8 different
similarity metrics between datasets, including both ontology-based techniques and techniques from machine
learning. Our results show that recommending scientific datasets based on the meta-data found in realistic dataset collections is a hard task. None of the ontology-based methods manages to perform well on this task, and they are outscored by the majority of the machine-learning methods. Of these ML methods, only one approach performs reasonably well, and even then it only reaches 70% accuracy.
1 INTRODUCTION
Dataset search is an application of dataset re-
trieval, which is a specialization of information retrieval (Kunze and Auer, 2013). Dataset search aims to help people find the datasets they want. Scien-
tists, journalists and decision makers would be typi-
cal users of such a dataset search service. A dataset
is “a set of related observations which are organized
and formatted for a specific purpose” (Chapman et al.,
2020). A dataset can be of widely different for-
mats: tables, files, images, structured objects or oth-
ers (Google Developers, 2021).
As one particular application case of dataset
search, a dataset search engine returns datasets based
on a user query. A number of such dataset search
engines are now operating in practice, for example Google Dataset Search (https://datasetsearch.research.google.com/) or the Mendeley Data search engine (https://data.mendeley.com/) operated by Elsevier. Such query-driven search engines come with the limitation that the search engine is highly reliant on an appropriate search query presented by the user.
In (Wang et al., 2020), we provided a recom-
mendation paradigm for dataset search, which can be
paraphrased as "if you like this dataset, you’ll also
like that one". This recommendation paradigm gives
us the motivation for this paper, with bringing a new
paradigm "if user like a query or a dataset, the dataset
search engine could recommend a similar dataset to
the user". This paradigm could be applied to a dataset
search engine in two scenarios: a) when users can
provide a search query with an appropriate descrip-
tion, the engine could recommend datasets similar to
the search query; b) when user cannot provide that
query, the user can provide a dataset as a query, and
the search engine will recommend a similar dataset.
In this paper we aim to answer the following research questions: 1) How to precisely define the
task of recommending similar datasets from queries
or from other datasets? 2) How to evaluate different
approaches to this recommendation task? 3) Can we
set up experiments to test this evaluation task in a spe-
cific scientific domain?
There is considerable related work on dataset recommendation. In (Leme et al., 2013), the authors address a dataset recommendation task by using links between datasets and the bibliographic domain (e.g. DBLP). In (Ellefi et al., 2016), dataset recommendation works by considering schema overlap between datasets. In (Patra et al., 2020), the authors provide a dataset recommendation approach with the help of publications and the interests of researchers. In (Singhal et al., 2013), the authors used the context of users' interests to find interesting research datasets. All these approaches differ from the approach we take in this paper, namely providing dataset recommendations based only on the meta-data descriptions of datasets as provided by the authors of the datasets.
Figure 1: Overview of main works in this paper.
We give an overview of our paper in Figure 1,
which shows the two parts of our work: the recom-
mendation task and the evaluation of different ap-
proaches for this task. The recommendation task will
try to recommend similar datasets chosen from can-
didate datasets for a given target (which could be
a dataset or a query). The evaluation will evaluate
the recommendation work by checking if the recom-
mended dataset belongs to the same scientific do-
main as the target. This choice of success-criterion
(do we manage to recommend a dataset from the
same domain as the target) may sound like a low bar. However, our experimental results in Section
5 will actually show that even this seemingly simple
success-criterion is far from trivial to achieve.
The main contributions of this paper are: 1) We
provide a novel dataset recommendation task. This
task recommends similar datasets from a set of can-
didate datasets given a target that consists of an-
other dataset or a query. We also provide several
approaches for this similar dataset recommendation
task, by applying a number of similarity approaches.
2) We provide a procedure for evaluating the differ-
ent approaches to the similar dataset recommendation
task. Different from expert judgement, our (simpler)
evaluation task is to compare the research domain be-
tween the given dataset/query (the target) and the rec-
ommended dataset (the output). As argued above,
this seemingly easy success-criterion will turn out to
be already very hard to achieve by many of the ap-
proaches we test. 3) We run experiments to test sev-
eral approaches to the similar dataset recommenda-
tion task for targets in the biomedical domain. The re-
sults show that a single approach (BM25) outperforms the others in our experiments.
2 PRELIMINARIES
In this section we will introduce the different similar-
ity definitions we will use in this paper.
Wu-Palmer. Wu-Palmer is an edge-based semantic similarity metric to calculate the similarity between concepts in a hierarchical structure or ontology (Wu and Palmer, 1994). The Wu-Palmer similarity between two concepts $C_A$ and $C_B$ is defined as
$$\mathrm{Sim}_{\mathrm{WuP}}(C_A, C_B) = \frac{2\,\mathrm{Path}(R, \mathrm{LCS}(C_A, C_B))}{\mathrm{Path}(R, C_A) + \mathrm{Path}(R, C_B)}$$
where $\mathrm{LCS}(C_A, C_B)$ is the least common subsumer of $C_A$ and $C_B$, and $\mathrm{Path}(R, n)$ is the length of the path between the ROOT node $R$ and node $n$ in the ontology structure.
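To make the definition concrete, here is a minimal Python sketch of Wu-Palmer similarity over a toy is-a hierarchy; the concept names and the single-parent tree are illustrative assumptions, not part of UMLS or any real ontology.

```python
# Toy single-parent hierarchy (illustrative only).
parent = {
    "disease": "root",
    "cancer": "disease",
    "breast_cancer": "cancer",
    "prostate_cancer": "cancer",
    "stroke": "disease",
}

def path_to_root(concept):
    """Concepts on the way from `concept` up to the root, inclusive."""
    path = [concept]
    while path[-1] != "root":
        path.append(parent[path[-1]])
    return path

def depth(concept):
    """Path length (number of edges) from the root down to the concept."""
    return len(path_to_root(concept)) - 1

def wu_palmer(c_a, c_b):
    ancestors_b = set(path_to_root(c_b))
    lcs = next(c for c in path_to_root(c_a) if c in ancestors_b)  # least common subsumer
    return 2 * depth(lcs) / (depth(c_a) + depth(c_b))

print(wu_palmer("breast_cancer", "prostate_cancer"))  # 2*2 / (3+3) ≈ 0.67
```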
Resnik. Resnik is a content-based semantic similar-
ity metric to calculate the similarity between concepts
with the help of their information content (Resnik,
1995). The Resnik similarity between two concepts $C_A$ and $C_B$ is defined as:
$$\mathrm{Sim}_{\mathrm{Res}}(C_A, C_B) = -\log p(\mathrm{LCS}(C_A, C_B)) = -\log \frac{\sum_{n \in \mathrm{words}(\mathrm{LCS}(C_A, C_B))} \mathrm{count}(n)}{N}$$
where $\mathrm{LCS}(C_A, C_B)$ is the least common subsumer of $C_A$ and $C_B$; $p(\mathrm{LCS}(C_A, C_B))$ is the probability of $\mathrm{LCS}(C_A, C_B)$ in the sense of information content; $\mathrm{words}(\mathrm{LCS}(C_A, C_B))$ is the set of concepts which are subsumed by $\mathrm{LCS}(C_A, C_B)$ in the ontology structure; and $N$ is the total number of concepts in the ontology structure.
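A corresponding minimal sketch of the Resnik score is shown below. The subsumption sets and the pre-computed LCS are illustrative assumptions, and each concept is simply counted once, which is a simplified, corpus-free reading of the count term above; in practice the LCS would be derived from the ontology as in the Wu-Palmer sketch.

```python
import math

# Toy subsumption sets (illustrative only): concept -> all concepts it subsumes, incl. itself.
subsumed = {
    "disease": {"disease", "cancer", "breast_cancer", "prostate_cancer", "stroke"},
    "cancer": {"cancer", "breast_cancer", "prostate_cancer"},
    "breast_cancer": {"breast_cancer"},
    "prostate_cancer": {"prostate_cancer"},
    "stroke": {"stroke"},
}
N = len(subsumed)  # total number of concepts in the toy ontology

def information_content(concept):
    p = len(subsumed[concept]) / N   # probability of seeing the concept or a descendant
    return -math.log(p)

def resnik(c_a, c_b, lcs):
    # Resnik similarity is the information content of the least common subsumer.
    return information_content(lcs)

print(resnik("breast_cancer", "prostate_cancer", lcs="cancer"))  # -log(3/5) ≈ 0.51
```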
LDA. LDA (latent Dirichlet allocation) is a generative probabilistic model for a corpus, which represents a document by a set of topics (Blei et al., 2003). The LDA similarity between two content fields is calculated by first extracting topics from each content field and then checking the overlap between the two sets of topics.
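As an illustration, the sketch below uses gensim (an assumption; the paper does not name its LDA implementation) to derive topic distributions for two content fields and compares them with cosine similarity over the full distributions, a common variant of comparing topic sets. The tiny corpus and topic count are placeholders.

```python
import numpy as np
from gensim import corpora, models

# Tiny placeholder corpus of meta-data texts; a real run would train on many datasets.
corpus_texts = [
    "breast cancer dna methylation profiles",
    "prostate cancer gene expression study",
    "mri images of liver fibrosis patients",
]
tokenized = [t.lower().split() for t in corpus_texts]
dictionary = corpora.Dictionary(tokenized)
bow_corpus = [dictionary.doc2bow(t) for t in tokenized]
lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=2, random_state=0)

def topic_vector(text):
    """Dense topic distribution of a content field under the trained LDA model."""
    bow = dictionary.doc2bow(text.lower().split())
    vec = np.zeros(lda.num_topics)
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        vec[topic_id] = prob
    return vec

def lda_similarity(text_a, text_b):
    a, b = topic_vector(text_a), topic_vector(text_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(lda_similarity("breast cancer methylation", "prostate cancer expression"))
```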
Doc2Vec. Doc2vec is a method to create a numeric representation of a document (Le and Mikolov, 2014). The Doc2vec similarity metric between two content fields is calculated by first mapping each content field into a pretrained vector space and then calculating the distance between the two content fields in this vector space.
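A minimal Doc2vec sketch, again assuming gensim; the tiny training corpus and the hyper-parameters are placeholders for the much larger training corpus described in Section 5.

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Placeholder training corpus of dataset meta-data texts.
train_texts = [
    "breast cancer dna methylation profiles",
    "prostate cancer gene expression study",
    "mri images of liver fibrosis patients",
]
tagged = [TaggedDocument(words=t.lower().split(), tags=[i]) for i, t in enumerate(train_texts)]
model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40, seed=0)

def doc2vec_similarity(text_a, text_b):
    """Cosine similarity between the inferred vectors of two content fields."""
    a = model.infer_vector(text_a.lower().split())
    b = model.infer_vector(text_b.lower().split())
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(doc2vec_similarity("breast cancer methylation", "liver fibrosis mri"))
```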
BM25. BM25 (Best Match 25) is a ranking func-
tion that scores documents based on their relevance to
a given text query (Robertson et al., 1995). The BM25
approach is used by search engines to rank matching
documents according to their relevance to a search
query. In this paper, we consider the textual content of our target as the "query" and use BM25 to find relevant content from the candidate datasets.
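The sketch below is a compact, self-contained implementation of the Okapi BM25 scoring function with common default parameters (k1 = 1.5, b = 0.75); the paper does not state which BM25 implementation or parameter settings it used, so treat these as assumptions. In our setting, the "query" would be the target's textual meta-data and the "documents" the candidates' meta-data.

```python
import math
from collections import Counter

def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score every document (a list of tokens) against the query (a list of tokens)."""
    n_docs = len(documents)
    avg_len = sum(len(d) for d in documents) / n_docs
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))            # in how many documents each term occurs
    scores = []
    for doc in documents:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm_tf = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc) / avg_len))
            score += idf * norm_tf
        scores.append(score)
    return scores

docs = [t.split() for t in ["breast cancer methylation", "liver fibrosis mri", "stroke mice model"]]
print(bm25_scores("breast cancer dna methylation".split(), docs))
```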
Bert. Bert (Bidirectional Encoder Representations
from Transformers) is a multi-layer neural language
representation model (Devlin et al., 2019). We use Bert to compute the similarity between two content fields by first encoding each content field into a vector with Bert's encoder and then calculating the cosine similarity between the two vectors. This cosine score is used as the similarity between the two content fields.
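A sketch of this Bert-based similarity using the Hugging Face transformers library; the choice of bert-base-uncased and of mean pooling over token embeddings are assumptions, since the paper only states that it uses Bert_Base with cosine similarity.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(text):
    """Mean-pooled BERT token embeddings for one content field."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state        # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)          # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # (1, 768)

def bert_similarity(text_a, text_b):
    return torch.nn.functional.cosine_similarity(embed(text_a), embed(text_b)).item()

print(bert_similarity("breast cancer dna methylation", "prostate cancer gene expression"))
```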
3 DATASET RECOMMENDATION
In this section we will introduce the dataset rec-
ommendation task and a number of different ap-
proaches to this task based on similarity metrics be-
tween datasets.
3.1 Concept-based Similarity between
Datasets
Based on the similarity metrics we introduced above,
here we will introduce the similarity approach be-
tween datasets. Our concept-based similarity ap-
proach between datasets only focuses on the ontol-
ogy concepts extracted from the textual content of the
meta-data fields of datasets. These extracted ontology
concepts are the concepts which appear both in the
ontology and in the textual content of the meta-data.
Restricting the meta-data fields to the concepts
from ontologies will enable the use of ontology-based
similarity approaches. In particular, if the ontology
captures the terminology of a research domain, the
concept extraction of a dataset will contain the con-
cepts from that domain, and will ignore the "noise"
terms in the meta-data fields of datasets, which are
useless for the similarity computation. For example, if we use the UMLS ontology (https://www.nlm.nih.gov/research/umls/index.html) for concept extraction, the extracted concepts will always belong to the biomedical research domain, and concepts from other domains (e.g. Computer Science) will be ignored. This will then benefit us when we generate recommendations for biomedical datasets.
After the introduction of our concept extraction
approach, we will now introduce concept-based sim-
ilarity between datasets based on the Wu-Palmer and
Resnik metrics.
Def. 1 (Concept-based Similarity between Datasets). Given two datasets $D_A$ and $D_B$, and given a similarity approach between concepts, denoted as sim, the concept-based similarity $\mathrm{ConSim}(D_A, D_B)$ between $D_A$ and $D_B$ is defined as follows:
$$\mathrm{ConSim}(D_A, D_B) = \frac{\sum_{c_A \in D_A,\, c_B \in D_B} \mathrm{sim}(c_A, c_B)}{|D_A| \cdot |D_B|}$$
where $c_A$ is a concept from the textual meta-data content of dataset $D_A$ (and similarly $c_B$); $\mathrm{sim}(c_A, c_B)$ is the similarity between $c_A$ and $c_B$, calculated by the similarity metric sim; and $|D_A|$ is the number of all concepts in the meta-data of $D_A$ (and similarly $|D_B|$).
We can instantiate this generic definition with spe-
cific similarity metrics by replacing sim with Wu-
Palmer or Resnik.
We also define concept-based similarity between a
query and a dataset. In this definition, we extract con-
cepts from the query and then apply the same concept-
based similarity calculation as above. The definition
of concept-based similarity between a query Q and a
dataset D is then as follows:
$$\mathrm{ConSim}(Q, D) = \frac{\sum_{c_Q \in Q,\, c_D \in D} \mathrm{sim}(c_Q, c_D)}{|Q| \cdot |D|} \qquad (1)$$
where $c_Q$ is a concept from the content of query $Q$ and $|Q|$ is the number of all concepts occurring in $Q$.
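Reading the numerator of Def. 1 and Eq. (1) as the sum of all pairwise concept similarities (which is what the division by the product of the concept-set sizes suggests), a minimal sketch of the concept-based similarity is:

```python
from itertools import product

def con_sim(concepts_a, concepts_b, sim):
    """Average pairwise similarity between two extracted concept sets (Def. 1 / Eq. 1).
    `sim` can be any concept-level metric, e.g. the wu_palmer sketch from Section 2."""
    if not concepts_a or not concepts_b:
        return None   # no concepts extracted, so no meaningful score (cf. Section 5.3)
    total = sum(sim(c_a, c_b) for c_a, c_b in product(concepts_a, concepts_b))
    return total / (len(concepts_a) * len(concepts_b))

# Example with the toy hierarchy from Section 2:
# con_sim({"breast_cancer", "stroke"}, {"prostate_cancer"}, wu_palmer)
```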
3.2 Content-based Similarity between
Datasets
Here we will introduce the content-based similarity
between datasets which could be used with ML-based
and IR-based approaches. The content we consider
here is still the textual meta-data content of datasets,
namely the title and description.
Def. 2 (Content-based Similarity between Datasets). Given two datasets $D_A$ and $D_B$, and given a similarity approach between contents, denoted as sim, the content similarity $\mathrm{ContSim}(D_A, D_B)$ between $D_A$ and $D_B$ is the similarity between $C_A$ and $C_B$, where $C_A$ is the textual content of $D_A$ and $C_B$ is the textual content of $D_B$.
We can then instantiate this definition of content-
based similarity with content metrics such as LDA,
Doc2vec, BM25 and Bert. For LDA, BM25 and Bert,
we can directly apply them to content-based similar-
ity because all of them are similarity metrics between
contents. For Doc2vec, we consider the whole tex-
tual meta-data content of a dataset as a "document"
and then calculate the similarity between such "docu-
ments" of datasets.
Similar to the concept-based similarity between a query and a dataset, we also define content-based similarity between a query $Q$ and a dataset $D$, as follows:
$$\mathrm{ContSim}(Q, D) = \mathrm{sim}(C_Q, C_D) \qquad (2)$$
where $C_Q$ is the content of query $Q$; $C_D$ is the textual meta-data content of dataset $D$; and $\mathrm{sim}(C_Q, C_D)$ is a similarity metric between contents, which can be LDA, Doc2vec, BM25 or Bert.
3.3 The Similar Dataset
Recommendation Task
After introducing the two types of approaches for
the similarity between datasets (concept-based and
content-based), we will now introduce the task of rec-
ommending a similar dataset.
Def. 3 (Similar Dataset Recommendation Task SDR for a Dataset). Given a target dataset $D_T$ and a list of candidate datasets $CL_D$, the similar dataset recommendation task SDR is to return a sorted list of candidate datasets $SCL_D = \{D_1, D_2, \ldots, D_n\}$ such that $\mathrm{Sim}(D_T, D_1) \geq \mathrm{Sim}(D_T, D_2) \geq \ldots \geq \mathrm{Sim}(D_T, D_n) > 0$, where $D_1, D_2, \ldots, D_n \in CL_D$. Then $D_1 \in SCL_D$ is the dataset recommended by this task for $D_T$, denoted as $RD^{SDR}_{D_T}$.
We also define the similar dataset recommenda-
tion task for a query.
Def. 4 (Similar Dataset Recommendation Task SDR for a Query). Given a target query $Q$ and a list of candidate datasets $CL_D$, the similar dataset recommendation task SDR is to return a sorted list of candidate datasets $SCL_D = \{D_1, D_2, \ldots, D_n\}$ such that $\mathrm{Sim}(Q, D_1) \geq \mathrm{Sim}(Q, D_2) \geq \ldots \geq \mathrm{Sim}(Q, D_n) > 0$, where $D_1, D_2, \ldots, D_n \in CL_D$. Then $D_1 \in SCL_D$ is the dataset recommended by this task for $Q$, denoted as $RD^{SDR}_{Q}$.
After introducing the similar dataset recommendation task for both a dataset and a query, we can now turn to a variety of similar dataset recommendation approaches by applying different similarity metrics to this task.
We identify three types of recommendation approaches: Concept-based, Content-based, and Hybrid. The Concept-based similar dataset recommendation approach applies concept-based similarity metrics (Wu-Palmer or Resnik) to the dataset recommendation task. The Content-based approach, on the other hand, applies content-based similarity metrics (LDA, Doc2vec, BM25 or Bert) to the dataset recommendation task. Finally, the Hybrid approach combines the concept-based and LDA metrics.
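All three families only differ in the similarity function they plug into the ranking step of Defs. 3 and 4. A minimal sketch of that shared ranking step is shown below; the dictionary field name "text" for the candidates' meta-data is a hypothetical convention, not part of the paper.

```python
def recommend(target_text, candidates, sim, top_n=1):
    """Rank candidate datasets by similarity to the target (a dataset's meta-data
    or a query text) and return the top_n with a positive score (Defs. 3 and 4).
    `candidates` is assumed to be a list of dicts with a "text" meta-data field."""
    scored = [(sim(target_text, c["text"]), c) for c in candidates]
    scored = [(s, c) for s, c in scored if s is not None and s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_n]]

# Content-based example: recommend(query, candidates, sim=bert_similarity, top_n=10)
# Concept-based example: wrap con_sim with a concept-extraction step first.
```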
4 EVALUATION FOR DATASET
RECOMMENDATION
In this section we will introduce both our evaluation
method and the gold standard we will use in the eval-
uation.
We deliberately choose an evaluation task that is
at first sight very easy to achieve: we count a dataset
recommendation as correct, if the recommendation is
in the same scientific domain as the original target. As
we will see, even this apparently rather easy evalua-
tion task will turn out to be difficult to achieve by al-
most all of the methods that we test in our experiment.
A benefit of this choice of evaluation task is that a clear gold standard is available: in many online open-source dataset repositories, such as figshare (https://figshare.com/) or Harvard Dataverse (https://dataverse.harvard.edu/), there are categories or subjects that show the subject-area or research domain and topics of datasets. Two examples are shown in Figure 2 and Figure 3.
Since in our evaluation procedure we will take all
our targets from biomedical datasets or queries, the
evaluation task comes down to predicting which of
the candidate datasets (some 200,000 datasets from
all kinds of scientific domains) belong to the biomed-
ical domain.
We can now give a more precise definition of our
evaluation task:
Def. 5 (Simple Evaluation Task on Similar Dataset Recommendation for a Dataset). Given a list of target datasets $L_{D_T}$; a list of datasets $L_{D_C}$ as recommendation candidate datasets; a list of similar dataset recommendation approaches $L_{SDR}$; and the gold standard scientific domain $\mathrm{Standard}_{D'_T}$ for each $D'_T$ in $L_{D_T}$. The evaluation task for similar dataset recommendation is to find the approach $SDR'$ from $L_{SDR}$ such that for every $SDR'' \in L_{SDR}$ with $SDR'' \neq SDR'$,
$$|\{RD^{SDR'}_{D'_T} \subseteq_{Domain} \mathrm{Standard}_{D'_T}\}| \;\geq\; |\{RD^{SDR''}_{D'_T} \subseteq_{Domain} \mathrm{Standard}_{D'_T}\}|,$$
where $\subseteq_{Domain}$ denotes the subset relationship on domain areas, and $RD^{SDR'}_{D'_T} \subseteq_{Domain} \mathrm{Standard}_{D'_T}$ means that the domain of dataset $RD^{SDR'}_{D'_T}$ is a subset of $\mathrm{Standard}_{D'_T}$.
Figure 2: Gold Standard Example 1 (figshare).
Figure 3: Gold Standard Example 2 (Harvard Dataverse).
and similarly for evaluating recommendations based
on a target that is a query:
Def. 6 (Simple Evaluation Task on Similar Dataset Recommendation for a Query). Given a list of target queries $L_Q$; a list of datasets $L_D$ as recommendation candidate datasets; a list of similar dataset recommendation approaches $L_{SDR}$; and the gold standard scientific domain $\mathrm{Standard}_{Q'}$ for each $Q'$ in $L_Q$. The evaluation task for similar dataset recommendation is to find the approach $SDR'$ from $L_{SDR}$ such that for every $SDR'' \in L_{SDR}$ with $SDR'' \neq SDR'$,
$$|\{RD^{SDR'}_{Q'} \subseteq_{Domain} \mathrm{Standard}_{Q'}\}| \;\geq\; |\{RD^{SDR''}_{Q'} \subseteq_{Domain} \mathrm{Standard}_{Q'}\}|,$$
where $\subseteq_{Domain}$ denotes the subset relationship on domain areas, and $RD^{SDR'}_{Q'} \subseteq_{Domain} \mathrm{Standard}_{Q'}$ means that the domain of dataset $RD^{SDR'}_{Q'}$ is a subset of $\mathrm{Standard}_{Q'}$.
When matching two domains in our evaluation, we also consider synonyms of the domains. This means that we do not only perform exact matching, but also match one domain to a synonym of the other domain.
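A sketch of this success criterion is given below, assuming a hypothetical synonyms table mapping each domain name to the set of names treated as equivalent (the paper does not specify where its synonym list comes from).

```python
def domain_match(recommended_domains, target_domains, synonyms):
    """True if any subject-area of the recommended dataset equals a domain of the
    target, or a synonym of such a domain (Defs. 5 and 6, with synonym matching)."""
    accepted = set()
    for d in target_domains:
        accepted.add(d)
        accepted.update(synonyms.get(d, set()))
    return any(d in accepted for d in recommended_domains)

# Example (hypothetical labels):
# domain_match({"Medicine"}, {"Biomedical Sciences"},
#              {"Biomedical Sciences": {"Medicine", "Biomedicine"}})   # -> True
```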
5 EXPERIMENTS
In this section we will introduce our experiments.
5.1 Preparation
First off, we will introduce the datasets and the gold
standard used in our experiments.
Target Datasets. Thanks to our colleagues from Elsevier, all the datasets we used are from the Elsevier data repository (https://data.mendeley.com/research-data/?repositoryType=NON_ARTICLE_BASED_REPOSITORY). All the target datasets in our experiments are 190 datasets derived from Elsevier DataSearch (https://datasearch.elsevier.com/) based on 19 biomedical queries (shown in Table 1), with the top 10 datasets per query. These queries all come from a carefully validated set of test queries for Elsevier's Mendeley data search engine. As a result, we can be certain that all of these 190 datasets are indeed from the biomedical domain, as they are the top 10 results on 19 validated biomedical queries.
Candidate Datasets. We use 209,998 datasets from Mendeley Data (https://data.mendeley.com/). These datasets were randomly selected from about 18 million datasets in the Elsevier data repository. Recommended datasets in our experiments must be selected from these 210k candidate datasets. There are about 600 domains for these candidate datasets, without considering synonyms of domains.
These datasets contain textual meta-data (the title and description of the dataset), topic meta-data (the scientific domain of the dataset) and other fields. In the recommendation approaches, we only consider the textual meta-data (title and description) of the datasets. The "topic" meta-data (the "subjectArea" field in the meta-data schema of the datasets) is used only in the evaluation, because it records the scientific research domain of a dataset, which is exactly what our evaluation criterion checks. Therefore, we only consider the topic meta-data in the evaluation, not in the recommendation approaches.
Training Datasets. For training the ML-based dis-
tance metrics described above, we used 21M
datasets (21,110,372 to be precise) as our training cor-
pus. We obviously took care that this training set is
disjoint from the 210k candidate datasets, so as to
avoid data leakage from the training phase in our ex-
periment.
Table 1: The corpus of 19 biomedical queries.
Query Content Query Content
E2 Protein Degradation mechanisms E54 glutamate alcohol interaction
E7 oxidative stress ischemic stroke E66 calcium signalling in stem cells
E8 middle cerebral artery occlusion mice E67 phylogeny cryptosporidium
E17 Risk factors for combat PTSD E68 HPV vaccine efficacy and safety
E26 mab melting temperature E78 c elegans neuron degeneration
E28 mutational analysis cervical cancer E79 mri liver fibrosis
E31 metformin pharmacokinetics E80 Yersinia ruckeri enteric red mouth disease
E35 prostate cancer DNA methylation E89 Electrocardiogram variability OR ECG variability
E50 EZH2 in breast cancer E94 pinealectomy circadian rhythm
E88 Flavonoids cardiotoxicity
Gold Standard. As discussed in the previous section, we consider the categories or subjects of datasets as the gold standard for the evaluation task. Among the 210k candidate datasets, there are 60,484 datasets which are labeled with a "None" subject or as "Uncategorized".
Evaluation. We check by hand whether the domain of a recommended dataset is the same as, or similar to, the domain of the target dataset. This is feasible and easy to do, because checking whether two domains are the same or similar is not hard for a human, who does not even need to be a domain expert.
5.2 Experimental Set-up
In this part we will introduce the set-up of our two
evaluation experiments. In total we use eight dataset
recommendation approaches for these experiments:
Two concept-based recommendation approaches (with Wu-Palmer and Resnik)
Two hybrid recommendation approaches (with LDA+Wu-Palmer and LDA+Resnik)
Four content-based recommendation approaches (with LDA, Doc2vec, BM25 and Bert_Base)
Exp1: Evaluation of Dataset Recommendation
Approaches from Dataset Targets. In this exper-
iment, we will evaluate the dataset recommendation
approaches when given a dataset as the input tar-
get. Given a list of recommendation approaches, the
pipeline of Exp1 is:
1. Using 19 biomedical queries to get 190 target
candidates: We send each query to the Elsevier
Datasearch API, and we select the top 10 returned
datasets as target candidates.
2. For each recommendation approach SDR in the
list of recommendation approaches given above:
(a) For each target dataset, using recommenda-
tion approach SDR to find the Top1, Top5 and
Top10 similar datasets from 210k candidate
datasets.
(b) We count a dataset recommended by SDR as a success when the domain of this dataset matches the domain of the target dataset.
3. The recommendation approach with the best per-
formance is the one with the highest success-
count over all 190 target candidates.
Exp2: Evaluation of Dataset Recommendation
Approaches from Query Targets. Different from
Exp1, we will here recommend similar datasets from
a given query as target. Given a list of recommenda-
tion approaches, the pipeline of Exp2 is:
1. Using the 19 biomedical queries as target queries.
2. For each recommendation approach SDR from the
list of approaches given above:
(a) For each target query, we use SDR to find the
Top1, Top5 and Top10 similar datasets from
210k candidate datasets.
(b) We count a dataset recommended by SDR as a success when the domain of this similar dataset matches the domain of the target query.
3. The recommendation approach with the best performance is the one with the highest success-count over the 19 target queries (a sketch of this success-fraction computation is given below).
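Both pipelines reduce to the same success-fraction computation. The sketch below shows it, taking the recommend and domain_match functions sketched in Sections 3 and 4 as parameters; the "text", "subjectArea" and gold-standard structures are hypothetical conventions.

```python
def success_fraction(targets, candidates, sim, gold_domains, synonyms,
                     recommend_fn, match_fn, top_n=10):
    """Fraction of Top-N recommendations whose subject-area matches the target's
    gold-standard domain (the 'fraction' columns of Tables 2 and 3)."""
    hits, judged = 0, 0
    for target in targets:                       # a target is a dataset's meta-data or a query
        recs = recommend_fn(target["text"], candidates, sim, top_n)
        for rec in recs:
            judged += 1
            if match_fn(rec["subjectArea"], gold_domains[target["id"]], synonyms):
                hits += 1
    return hits / judged if judged else 0.0
```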
5.3 Results and Analysis
In this section we will show the results of the two ex-
perimental scenarios that we set up above. In Table 2
and Table 3, we show the results of two experimental
scenarios for both Exp1 and Exp2:
Scenario 1: consider all TopN datasets, whether or not they contain subject-area meta-data.
Scenario 2: consider only TopN datasets with
subject-area metadata.
Scenario 1. In the first experimental scenario
(shown in Table 2), we consider all the TopN simi-
lar datasets for each target dataset or query, even if
some similar datasets do not contain the subject-area
in their meta-data. In such a case, we take the conser-
vative position, and consider it a failure if the recom-
mendation approach recommends a dataset that has no subject-area in its meta-data.
Table 2: Scenario 1: Top10, Top5 and Top1 results.
Experiment  Measure  Top10 (success / total judged / fraction)  Top5 (success / total judged / fraction)  Top1 (success / total judged / fraction)
Exp1 (Dataset pair)
Wu-Palmer 772 1660 0.47 464 830 0.56 92 166 0.55
Resnik 859 1660 0.52 514 830 0.62 96 166 0.58
LDA+Wup 674 1700 0.4 506 850 0.6 41 170 0.24
LDA+Res 429 1700 0.25 359 850 0.42 19 170 0.11
BM25 1351 1900 0.71 744 950 0.78 143 190 0.75
Bert_Base 1465 1900 0.77 800 950 0.84 159 190 0.84
LDA 1360 1900 0.72 767 950 0.81 154 190 0.81
Doc2vec 941 1900 0.5 582 950 0.61 144 190 0.76
Exp2 (Query-dataset pair)
Wu-Palmer 83 190 0.44 44 95 0.46 6 19 0.32
Bert_Base 118 190 0.62 66 95 0.69 11 19 0.58
BM25 151 190 0.79 83 95 0.87 17 19 0.89
Resnik 64 190 0.34 39 95 0.41 6 19 0.32
LDA 60 190 0.32 34 95 0.36 8 19 0.42
Doc2vec 123 190 0.65 76 95 0.8 16 19 0.84
Table 3: Scenario 2: Top10, Top5 and Top1 results (only considering datasets with subject-area meta-data).
Experiment  Measure  Top10 (success / total with subject-area / fraction)  Top5 (success / total with subject-area / fraction)  Top1 (success / total with subject-area / fraction)
Exp1 (Dataset pair)
Wu-Palmer 772 1253 0.62 464 760 0.61 92 163 0.56
Resnik 859 1306 0.66 514 795 0.65 96 164 0.59
LDA+Wup 674 1264 0.53 506 839 0.6 41 170 0.24
LDA+Res 429 1126 0.38 359 817 0.44 19 170 0.11
BM25 1351 1715 0.79 744 948 0.78 143 190 0.75
Bert_Base 1465 1736 0.84 800 950 0.84 159 190 0.84
LDA 1360 1694 0.8 767 950 0.81 154 190 0.81
Doc2vec 941 1539 0.61 582 940 0.62 144 190 0.76
Exp2 (Query-dataset pair)
Wu-Palmer 83 155 0.54 44 93 0.47 6 19 0.32
Bert_Base 118 167 0.71 66 95 0.69 11 19 0.58
BM25 151 169 0.89 83 95 0.87 17 19 0.89
Resnik 64 154 0.42 39 95 0.41 6 19 0.32
LDA 60 159 0.38 34 95 0.36 8 19 0.42
Doc2vec 123 150 0.82 76 95 0.8 16 19 0.84
In Table 2, there
is a difference between the "total" number of judge-
ments performed for each approach. This is because
in the concept-based approaches, no concept could be
extracted for some datasets or queries. We have re-
moved these cases from the counts in the table, since
no meaningful similarity score can be computed. Ta-
ble 2 shows that the Bert-based approach outperforms
all others in Exp1, reaching 80%, while the LDA-based and BM25-based approaches also perform
well by reaching about 80% accuracy. The table also
shows that the concept-based approaches all seriously
underperform on the task when using datasets as the
original target.
The table also shows the reason why we test the
Hybrid approach by combining LDA with concept-
based approaches. As Table 2 shows, the LDA ap-
proach performs very well on Exp1 (third best perfor-
mance), which means that LDA can extract relatively
correct topics in the majority of cases. This prompted
the hypothesis: extracted topics of datasets by LDA
could be used as concepts to help the concept-based
approach. However, the data in Table 2 shows that this
hypothesis was false: the hybrid approach of LDA
plus either Wu-Palmer or Resnik performs no better
(and sometimes even substantially worse) than Wu-
Palmer or Resnik on their own.
In Exp2 in Table 2, we test six approaches, leaving out the Hybrid approaches: because the LDA approach does not perform well in Exp2, we do not expect it to boost the concept-based approaches.
In Exp2, the BM25-based approach outperforms oth-
ers by reaching 85% accuracy. The Bert-based and
Doc2vec-based approaches also perform well. Again,
we see a substantial underperformance of both concept-based approaches.
Scenario 2. Scenario 2 is a revised version of sce-
nario 1, in which we only consider TopN datasets that
mention the subject-area in their meta-data, so that
we can be certain about the correctness of the rec-
ommendation. For Exp1 of this scenario, the results
of Table 3 show that (again) both the Bert-based and
the LDA-based approach perform well, reaching 80%
accuracy, with the Bert-based approach outperform-
ing others. Also, the BM25-based approach performs
well by reaching 75%. For Exp2, the BM25-based ap-
proach even outperforms all others by reaching about
89% accuracy. The Doc2vec-based approach also
performs well, reaching about 80% accuracy.
Overall, we conclude that the BM25-based approach performs best in both Exp1 and Exp2 among all the approaches we tested.
6 CONCLUSION
In this paper, we provided a novel task for recom-
mending scientific datasets. This task recommends
similar datasets based on a target that is either a query
or another dataset. Based on this task, we intro-
duced several approaches, using some popular sim-
ilarity methods. Also, we executed experiments to
evaluate these approaches on biomedical datasets.
There are a number of lessons that we can draw
from our experiments: 1) it is notable that the task
of recommending similar datasets based on only the
meta-data from these datasets, and possibly a query,
is much harder than one might expect, with even the
best performing methods rarely scoring higher than
80%. 2) we see that the semantic, ontology-based
methods are not capable of solving this task, and that
the statistical methods from machine learning or in-
formation retrieval far outperform the semantic meth-
ods. Even boosting the ontology-based methods with
a machine learning method did not give an accept-
able result. 3) our results show that the BM25-based
approach performs well on the task of dataset recom-
mendation from both a query target and a dataset tar-
get, reaching 70% accuracy in our experiments.
In this paper, we have used only textual meta-data
for dataset recommendation. Given the high vari-
ability in the syntax and semantics of the content of
datasets (ranging from gene sequences to geograph-
ical maps to spreadsheets with financial data), it is
nearly impossible to use the dataset contents themselves. Never-
theless, there are other signals that could be consid-
ered for similar dataset recommendation. Authors of
datasets would be one of these signals: the co-author
network could be used for dataset recommendation
if we could match dataset’s author into this network.
The Open Academic Graph (OAG) is a very popular
and large knowledge graph that unifies two separate
billion-scale academic graphs MAG and AMiner (Mi-
crosoft, 2021; Sinha et al., 2015; Tang et al., 2008).
We will investigate in future work if this resource can
be exploited to improve the task of recommending
similar datasets that we defined in this paper.
ACKNOWLEDGEMENT
This work has been funded by the Netherlands Sci-
ence Foundation NWO grant nr. 652.001.002 which
is also partially funded by Elsevier. The first author
is funded by the China Scholarship Council (CSC)
under grant number 201807730060.
REFERENCES
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.
Chapman, A., Simperl, E., Koesten, L., Konstantinidis, G.,
Ibáñez, L.-D., Kacprzak, E., and Groth, P. (2020).
Dataset search: a survey. The VLDB Journal,
29(1):251–272.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT 2019.
Ellefi, M. B., Bellahsene, Z., Dietze, S., and Todorov, K.
(2016). Dataset recommendation for data linking: An
intensional approach. In ESWC 2016, pages 36–51.
Springer.
Google Developers (2021). Dataset | google search
central. https://developers.google.com/search/docs/
data-types/dataset.
Kunze, S. R. and Auer, S. (2013). Dataset retrieval. In 7th
IEEE Int. Conf. on Semantic Computing, ICSC ’13,
page 1–8.
Le, Q. V. and Mikolov, T. (2014). Distributed representa-
tions of sentences and documents. In ICML.
Leme, L. A. P. P., Lopes, G. R., Nunes, B. P., Casanova,
M. A., and Dietze, S. (2013). Identifying candidate
datasets for data interlinking. In Web Engineering,
pages 354–366. Springer.
Microsoft (2021). Open academic graph - microsoft re-
search. https://www.microsoft.com/en-us/research/
project/open-academic-graph/.
Patra, B. G., Roberts, K., and Wu, H. (2020). A
content-based dataset recommendation system for re-
searchers—a case study on Gene Expression Omnibus
(GEO) repository. Database, 2020.
Resnik, P. (1995). Using information content to evaluate se-
mantic similarity in a taxonomy. In IJCAI, IJCAI’95,
page 448–453.
Robertson, S., Walker, S., Jones, S., Hancock-Beaulieu,
M. M., and Gatford, M. (1995). Okapi at trec-3.
In Overview of the Third Text REtrieval Conference
(TREC-3), pages 109–126.
Singhal, A., Kasturi, R., Sivakumar, V., and Srivastava, J.
(2013). Leveraging web intelligence for finding inter-
esting research datasets. In WI-IAT 2013, pages 321–
328. IEEE.
Sinha, A., Shen, Z., Song, Y., Ma, H., Eide, D., Hsu, B.-
J. P., and Wang, K. (2015). An overview of microsoft
academic service (mas) and applications. In WWW
Conference, page 243–246. ACM.
Tang, J., Zhang, J., Yao, L., Li, J., Zhang, L., and Su, Z.
(2008). Arnetminer: Extraction and mining of aca-
demic social networks. In SIGKDD, page 990–998.
ACM.
Wang, X., Huang, Z., and van Harmelen, F. (2020). Evalu-
ating similarity measures for dataset search. In WISE
2020, pages 38–51. Springer.
Wu, Z. and Palmer, M. (1994). Verb semantics and lexical
selection. In 32nd Annual Meeting of the Association
for Computational Linguistics, pages 133–138.