Ontology-based Methods for Classifying Scientific Datasets into Research
Domains:
Much Harder than Expected
Xu Wang, Frank Van Harmelen and Zhisheng Huang
Vrije University Amsterdam, De Boelelaan 1105, 1081 HV Amsterdam, The Netherlands
Keywords:
Ontology Classification, Domain Classification, Semantic Similarity, Data Science, Google Distance.
Abstract:
Scientific datasets are increasingly stored, published, and re-used online. This has prompted major search
engines to start services dedicated to finding research datasets online. However, to date such services are limited
to keyword search, and provide little or no semantic guidance. Determining the scientific domain for a given
dataset is a crucial part in dataset recommendation and search: ”Which research domain does this dataset
belong to?”. In this paper we investigate and compare a number of novel ontology-based methods to answer
that question, using the distance between a domain-ontology and a dataset as an estimator for the domain(s)
into which the dataset should be classified. We also define a simple keyword-based classifier based on the
Normalized Google Distance, and we evaluate all classifiers on a hand-constructed gold standard. Our two main
findings are that the seemingly simple task of determining the domain(s) of a dataset is surprisingly much harder
than expected (even when performed under highly simplified circumstances), and that (again surprisingly), the
use of ontologies seems to be of little help in this task, with the simple keyword-based classifier outperforming
every ontology-based classifier. We constructed a gold-standard benchmark for our experiments which we
make available online for others to use.
1 INTRODUCTION
Scientific datasets play a crucial role in scientific re-
search. Dataset search engines collect many scientific
datasets online, and provide these to researchers. Some
existing dataset search engines that aim to satisfy this
demand are Google Dataset Search (https://toolbox.google.com/datasetsearch), Mendeley Data (https://data.mendeley.com/) and Elsevier DataSearch (https://datasearch.elsevier.com/).
Determining the research domain of a dataset is a key point for researchers when reusing that dataset, because topical relevance is very important information to consider for secondary data (Gregory et al., 2020).
If we represent each candidate domain by a domain-
specific ontology, the task of domain-classification
turns into the task of ontology-selection: which ontol-
ogy (and therefore which domain) should be selected
based on the description of the dataset? Ontology
selection is the process of selecting and ranking a
list of ontologies, sorted by how well they meet a
certain ontology evaluation task (Sabou et al., 2006).
Existing ontology selection approaches can be classi-
fied into three types: based on popularity (Patel et al.,
2003) (Ding et al., 2005) (Buitelaar et al., 2004), based
on richness of knowledge (Alani and Brewster, 2005)
and based on topic coverage (Lopez et al., 2006). We introduce a new ontology selection task: in this paper, the task is to find the ontology which best describes a given dataset. Ontology selection thus becomes a process of finding an ontology for a given dataset, which is why we call our new task "ontology classification".
We develop and test a number of ontology-based methods for classifying a dataset into a particular domain, using ontology-based similarity measures. Many such measures exist for calculating the similarity between terms, such as those of Wu and Palmer (1994), Resnik (1995) and Lin (Lin et al., 1998). We compare our ontology-based classification methods against a simple domain-name classifier using the Normalized Google Distance. To our surprise, the simple keyword-based classifier outperforms all the ontology-based approaches.
2 MOTIVATION
A number of dataset search engines exist to help researchers find datasets provided by others. Users
of such dataset search engines often want to classify
datasets by research domain, and then explore the
datasets from the particular domain which they are
interested in. The most obvious approach would be
to rely on domain-labelling provided by the author of
the dataset. However, user-provided labels are known
to be notoriously unreliable (Hovy and Lavid, 2010).
For datasets from scientific papers, the domain of the
paper or the domain of the journal or conference of the
paper could be a good way to determine the domain
of a dataset. However, this approach obviously only
applies to datasets that have an associated publication
in a journal or conference.
An inspection of three popular dataset search en-
gines, Google Dataset Search, Mendeley Data and
Elsevier DataSearch reveals that we can easily sort datasets by source, data type, date and so on. However, none of them considers the domain of the datasets.
This is because many dataset providers do not anno-
tate their dataset with a clear domain. Consequently,
in this paper we aim to find an effective approach to
automatically classify scientific datasets into the right
domain.
3 PRELIMINARIES
In this section, we introduce our approach to keyword
extraction from the dataset description, as well as the
similarity measures used in our classifiers.
3.1 Keyword Extraction Approach
In this paper, we extract keywords from text (the description of the dataset), without having any pre-trained model for keyword extraction available. Consequently, unsupervised keyword extraction approaches are our only choice. There are many popular unsupervised keyword extraction approaches, such as TextRank (Mihalcea and Tarau, 2004), RAKE (Rose et al., 2010), TF-IDF (Salton and Buckley, 1988) and so on. We choose TextRank because it considers not only the context but also recursive information of the text, and we use the TextRank implementation from Gensim (https://radimrehurek.com/gensim/summarization/keywords.html).
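As an illustration, a minimal sketch of this keyword extraction step is given below. It assumes Gensim 3.x, where the TextRank-based gensim.summarization module is still available (it was removed in Gensim 4); the example text is a shortened dataset description, not the paper's actual input.

```python
# Keyword extraction with Gensim's TextRank implementation (Gensim 3.x assumed).
from gensim.summarization import keywords

description = (
    "Data for: Distribution network prices and solar PV: resolving rate "
    "structures for distributed generation in detached households."
)

# Extract up to 10 keywords/keyphrases from the title and description text.
dataset_keywords = keywords(description, words=10, split=True, lemmatize=True)
print(dataset_keywords)
```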
3.2 Similarity Measures
We use several similarity metrics for calculating the
similarity between a dataset and an ontology. The cov-
erage metric is a simple measure, which just considers
the intersection of two sets. The Jaccard metric con-
siders not only the intersection but normalises this by
the union of the two sets. The Normalized Google Distance (NGD) is a semantic similarity measure based on the number of co-occurrences in the Google search engine. Word2Vec measures are based on a text corpus converted into a set of vectors, and return the cosine similarity between two word vectors.
Jaccard Similarity. Given two sets of keywords A and B, the Jaccard similarity between A and B is:

$$\mathrm{Jaccard}(A,B) = \frac{|A \cap B|}{|A \cup B|} \qquad (1)$$
Google Distance.
The Google Distance between a
dataset and an ontology is based on the Normalized
Google Distance (NGD) (Cilibrasi and Vitanyi, 2007),
which is a semantic similarity measure computed from
results of the Google search engine. The NGD be-
tween two terms a and b is defined as:
$$\mathrm{NGD}(a,b) = \frac{\max\{\log f(a), \log f(b)\} - \log f(a,b)}{\log M - \min\{\log f(a), \log f(b)\}} \qquad (2)$$

where f(a) is the number of Google hits of a; f(a,b) is the number of co-occurrences of a and b on the same web page; and M is the total number of web pages searched by Google times the average number of search terms occurring on a page (estimated to be 25 x 10^9). Roughly speaking, this computes the normalised probability of two terms co-occurring on a web page (adjusted logarithmically for scale). Using NGD, we can define the Google Distance GD between two sets of keywords A and B as:

$$\mathrm{GD}(A,B) = \frac{\sum\{\mathrm{NGD}(a,b) \mid a \in A,\ b \in B\}}{|A| \cdot |B|} \qquad (3)$$

where |A| and |B| are the sizes of A and B, respectively.
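To make Equations (2) and (3) concrete, the sketch below computes NGD and GD from pre-obtained hit counts. It is illustrative only: the counts f(a), f(b) and f(a,b) would have to come from a web search API, and the dictionaries used here are hypothetical stand-ins.

```python
import math

M = 25e9  # rough estimate of indexed pages times average terms per page, as in the paper

def ngd(fa, fb, fab, m=M):
    """Normalized Google Distance (Eq. 2) between two terms, given hit counts."""
    if fa == 0 or fb == 0 or fab == 0:
        return float("inf")  # terms that never (co-)occur are treated as maximally distant
    num = max(math.log(fa), math.log(fb)) - math.log(fab)
    den = math.log(m) - min(math.log(fa), math.log(fb))
    return num / den

def gd(hits_a, hits_b, cohits):
    """Average NGD over all keyword pairs (Eq. 3).

    hits_a, hits_b: dicts mapping keyword -> hit count;
    cohits: dict mapping (keyword_a, keyword_b) -> co-occurrence count.
    """
    pairs = [(a, b) for a in hits_a for b in hits_b]
    return sum(ngd(hits_a[a], hits_b[b], cohits[(a, b)]) for a, b in pairs) / len(pairs)
```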
Word2Vec.
Word2Vec (Mikolov et al., 2013) is a
very popular NLP algorithm, which produces a vector
space of words based on a given corpus. With the help
of this vector space, similarity measures between two
vectors can be calculated, such as the Cosine distance
or Euclidean distance. We use the DL4J-Word2Vec library (http://deeplearning4j.org/docs/latest/deeplearning4j-nlp-word2vec) for learning the word embeddings of all the keywords in our experiments. The training corpus we used for Word2Vec is the Google News corpus (https://code.google.com/archive/p/word2vec/), enriched with a corpus trained on the Mendeley datasets that we used in this paper. We use the cosine similarity measure to calculate the similarity between terms.
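The paper trains its embeddings with the DL4J-Word2Vec library (Java). For illustration only, the Python sketch below shows the same cosine-similarity computation using Gensim over the pre-trained Google News vectors; the local file name is an assumption.

```python
from gensim.models import KeyedVectors

# Load pre-trained Google News embeddings (the path/file name is an assumption).
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True)

# Cosine similarity between two keyword vectors.
print(vectors.similarity("tariff", "household"))
```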
4 ONTOLOGY CLASSIFIER AND
DOMAIN-NAME CLASSIFIER
We will now introduce our approaches to classify the
research domain for scientific datasets. The domain
classifier is a simple baseline method that finds the
domain for a dataset by calculating the similarity be-
tween the dataset and the name of the domain (e.g.
”Computer Science”). Beyond this baseline method,
the ontology classifiers will consider the ontology of a research domain; in other words, we reduce the problem of domain classification to the problem of ontology classification.
4.1 Domain-name Classifier Approach
As a baseline, we use a simple classifier that calculates
the Google Distance between the keywords from the
metadata of the dataset and a single term that repre-
sents the domain name (e.g. ”Computer Science”).
To allow comparison with the ontology classifiers,
we ensure that this domain name includes everything
that is covered by the ontology. Therefore, the defini-
tion of domain in this paper is the broadest term for
each scientific domain, which means that ”Semantic
Web” or ”Machine Learning” are not domain names
in this paper but ”Computer Science” and ”Physics”
are, ensuring that the domain name has the same cov-
erage as the corresponding ontology. We use this very
simple approach as a baseline to compare with all
the ontology-based approaches. Different from the
ontology-based classifiers, the domain-name classi-
fier just considers keywords from (the meta-data of)
the datasets and the domain names. Intuitively, the
Google search engine can be considered as a huge
”knowledge source”, which covers most concepts and
relationships across every research field. Using the
Google search engine, the domain-name classifier can
show how close a dataset is to each domain by cal-
culating the similarity between the description of the
dataset and the name of the domain.
Def. 1 (Google Distance between Dataset and Domain-name). Given a set of keywords K_D extracted from dataset D and a domain which has name Domain_N, the Google Distance between these is:

$$\mathrm{GD}(D, \mathit{Domain}_N) = \frac{\sum\{\mathrm{NGD}(d, \mathit{Domain}_N) \mid d \in K_D\}}{|K_D|} \qquad (4)$$

where |K_D| is the number of keywords extracted from dataset D.
Then we can introduce our domain-name classifier algorithm, denoted as DnC(D, List_Domain).

Algorithm 1: DnC(D, List_Domain).
Input: D: a dataset, List_Domain: a list of domain names
Output: most similar domain Domain_D
  Sim_max ← 0.0;
  Domain_D ← empty;
  foreach domain name DN ∈ List_Domain do
      Sim_{D,DN} ← GD(D, DN);
      if Sim_{D,DN} > Sim_max then
          Domain_D ← DN;
          Sim_max ← Sim_{D,DN};
      end
  end
  return Domain_D;
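A minimal Python sketch of Algorithm 1 is shown below. The gd argument stands for the Google Distance of Definition 1; its name and signature are illustrative assumptions, not the paper's actual code.

```python
# Sketch of Algorithm 1 (DnC): following the pseudocode above, the domain name
# with the maximal score is returned.
def dnc(dataset_keywords, domain_names, gd):
    best_domain, best_score = None, 0.0
    for dn in domain_names:
        score = gd(dataset_keywords, dn)  # score between the keywords and the domain name
        if score > best_score:
            best_domain, best_score = dn, score
    return best_domain
```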
4.2 Ontology Classifier Approaches
Different from the ontology selection introduced in
(Sabou et al., 2006), our approach to ontology selec-
tion is to find a suitable ontology based on the sim-
ilarity between the keywords from a dataset and the
keywords from the ontology. We use the keyword extraction method from the Gensim implementation of TextRank introduced above to extract keywords from the title and the description of the datasets.
In order to apply the similarity metrics defined
above, we need to extract the keywords of the candi-
date ontologies (each representing a particular scien-
tific domain). However, for an ontology with many concepts, it is not a good choice to consider all concepts as keywords: especially for large ontologies, many concepts will be irrelevant for any specific dataset, adversely affecting the distance metric even for datasets belonging to the same domain as the ontology. Additionally, the calculation of the similarity between a dataset and an ontology is more efficient when not all concepts from the ontology are considered. We therefore introduce a new notion, the "ontology specific view", to calculate the similarity between an ontology and a dataset more effectively and efficiently. Given a dataset D and an ontology O, the ontology specific view of D on O is the set of keywords from D which match the name of some concept in O. To retrieve such names, we used the commonly used semantic web property rdfs:label. In other words, the ontology specific view gives a way to recognize keywords that match concepts from an ontology.
Informally, just like looking at the world with col-
ored glasses, we consider the ontology specific view
as the ”colored glasses”. The ontology we use deter-
mines the ”color of the glasses”, and we see the set of
keywords only through this "ontological color". The "colored glasses" that give the best view of the set of keywords correspond to the best selection.
Def. 2 (Ontology Specific View). Given a dataset D and an ontology O, the ontology specific view OSV_{D,O} of D based on O is the set of concepts which are both concepts from the ontology O and keywords appearing in the dataset D:

$$\mathrm{OSV}_{D,O} = \{c \mid c \in W_D \cap C_O\} \qquad (5)$$

where C_O is the set of concepts from ontology O and W_D is the set of keywords from dataset D.
We consider the ontology specific view as the "domain-specific keywords of a dataset". We can then calculate the similarity between a dataset and an ontology using the keywords of the dataset and the ontology specific view, as sketched below.
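A possible sketch of Definition 2, assuming the ontology is available as an RDF file with rdfs:label annotations and using rdflib; the file path and function name are placeholders.

```python
from rdflib import Graph, RDFS

# Sketch of the ontology specific view: intersect the dataset's keywords with
# the rdfs:label values of the ontology's concepts.
def ontology_specific_view(dataset_keywords, ontology_path):
    g = Graph()
    g.parse(ontology_path)  # e.g. "cso.ttl"; any serialisation rdflib can read
    concept_labels = {str(label).lower()
                      for _, _, label in g.triples((None, RDFS.label, None))}
    return {kw for kw in dataset_keywords if kw.lower() in concept_labels}
```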
Def. 3 (Similarity between Dataset and Ontology). Given a dataset D and an ontology O, the similarity sim_{D,O} between D and O is the average of the similarities between the keywords from D and OSV_{D,O}:

$$\mathit{sim}_{D,O} = \frac{\sum_{i=1}^{|K_D|} \sum_{j=1}^{|\mathrm{OSV}_{D,O}|} \mathit{sim}(d_i, o_j)}{|K_D| \cdot |\mathrm{OSV}_{D,O}|}, \quad d_i \in K_D,\ o_j \in \mathrm{OSV}_{D,O} \qquad (6)$$

where K_D is the set of keywords of dataset D.
Before we introduce the definition of the ontology classifier, we first look back at the similarity measures introduced in the previous section. All the similarity measures are defined between two sets of terms, and can be
applied to determine the similarity between a dataset
(reduced to the ontology-specific view of its extracted
keywords) and an ontology. This is because both the
keywords from a dataset and from the ontology spe-
cific view are sets of terms. This results in the follow-
ing definitions of similarity measures:
Def. 4 (Jaccard Similarity between Dataset and Ontology). Given a dataset D and an ontology O, the Jaccard similarity between D and O is:

$$\mathrm{Jaccard}(K_D, \mathrm{OSV}_{D,O}) = \frac{|K_D \cap \mathrm{OSV}_{D,O}|}{|K_D \cup \mathrm{OSV}_{D,O}|} \qquad (7)$$

where K_D is the set of keywords from D, and OSV_{D,O} is the ontology specific view of D based on O.
Def. 5 (Google Distance between Dataset and Ontology). Given a dataset D and an ontology O, the Google Distance GD_{D,O} of D and O is:

$$\mathrm{GD}(D,O) = \frac{\sum\{\mathrm{NGD}(d,o) \mid d \in K_D,\ o \in \mathrm{OSV}_{D,O}\}}{|K_D| \cdot |\mathrm{OSV}_{D,O}|} \qquad (8)$$
We also provide a simple ontology classifier ap-
proach with coverage similarity. Coverage similarity
just considers the size of the ontology specific view,
which means that it only considers the coverage of the
ontology concepts.
Def. 6 (Coverage Similarity between Dataset and Ontology). The coverage similarity between a dataset D and an ontology O measures the size of the ontology specific view OSV_{D,O} of D relative to the size of O:

$$\mathrm{Cover}(D,O) = \frac{|\mathrm{OSV}_{D,O}|}{|O|} \qquad (9)$$

where |O| is the number of concepts in ontology O.
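For illustration, Equations (7) and (9) can be sketched as two small set-based functions; the function names are ours, not from the paper.

```python
# Jaccard similarity between the dataset keywords and the ontology specific view (Eq. 7).
def jaccard(keywords, osv):
    keywords, osv = set(keywords), set(osv)
    union = keywords | osv
    return len(keywords & osv) / len(union) if union else 0.0

# Coverage similarity: size of the ontology specific view relative to the ontology (Eq. 9).
def coverage(osv, n_ontology_concepts):
    return len(osv) / n_ontology_concepts if n_ontology_concepts else 0.0
```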
Using these similarity measures between dataset
and ontology, we can now provide the definition of an
ontology classifier.
Def. 7 (Ontology Classifier Task). Given a dataset D and a list of ontology candidates List_O, an ontology classifier should find the ontology O_i ∈ List_O such that Sim_{D,O_i} ≥ Sim_{D,O_j} for each O_j ∈ List_O.

Based on the ontology classifier task, we can introduce our algorithm OC(D, List_O, Sim) for an ontology classifier, in which the similarity measure Sim can be any one of the similarity measures between dataset and ontology.
Algorithm 2: OC(D, List_O, Sim).
Input: D: a dataset, List_O: a list of ontology candidates, Sim: similarity measure between dataset and ontology
Output: most similar ontology O
  Sim_max ← 0.0;
  O ← empty;
  foreach ontology O' ∈ List_O do
      if Sim_{D,O'} > Sim_max then
          O ← O';
          Sim_max ← Sim_{D,O'};
      end
  end
  return O
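A minimal sketch of Algorithm 2 in Python is given below; sim is any of the dataset-ontology similarity measures defined above, passed in as a callable (an interface assumption made for illustration).

```python
# Sketch of Algorithm 2 (OC): return the candidate ontology with the maximal
# similarity to the dataset, for a pluggable similarity measure `sim`.
def oc(dataset_keywords, ontologies, sim):
    best_ontology, best_score = None, 0.0
    for candidate in ontologies:
        score = sim(dataset_keywords, candidate)
        if score > best_score:
            best_ontology, best_score = candidate, score
    return best_ontology
```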
5 EXPERIMENTS AND RESULTS
In this section we introduce the datasets and ontology candidates used in the experiments, the experimental pipeline, the evaluation method, and the results.
5.1 Experiments Setup
Dataset. The datasets we used in our experiments are from Mendeley Data (https://data.mendeley.com/). We chose 960 datasets from Mendeley Data, each of which is associated with a published paper in a known journal. The distribution of research domains across these datasets is:
60 datasets from the biomedical domain.
33 datasets from the computer science domain.
180 datasets from the physics domain.
683 datasets from the finance domain.
4 datasets from the environment domain.
The URIs of all of these datasets are made available by us online (https://github.com/eva01wx/WISE ClassifiactionPaper Datasets).
We chose these 960 datasets for the following reasons. First, all of these datasets are retrieved from Mendeley, which means they are scientific datasets actually shared by scientists; secondly, these datasets are all annotated with a link to an associated paper in their metadata, which means we can retrieve the gold-standard label through the link to the journal of the paper associated with the dataset. The first reason ensures the ecological validity of our benchmark, the second ensures that we have a gold standard to evaluate our results against.
On inspection of the 960 datasets in our gold standard, we found a strong bias in the distribution of the domains of these datasets, with 70% labelled with "finance". To compensate for this, we added a balanced-distribution experiment to check whether this bias influences the experimental results. In the balanced-distribution experiment, we chose 217 datasets (60 from biomedical, 60 from physics, 60 from finance, 33 from computer science and 4 from environment).
id: 109531211225411812111911998: MENDELEY DATA,
title: "Data for: Distribution network prices and solar PV: Resolving rate ...",
description: "Abstract of associated article: 1 in 4 detached households in ...",
subjectAreas: Finance,
Keywords_CSO: [article, household, rate, ...],
Keywords_Physics: [article, rate, distribution, ...],
Keywords_FINANCE: [article, household, rate, ...],
Keywords_Envo: [article, solar, rate, distribution, ...],
Keywords_Bio: [signal recognition particle 7 srna, ...],
Keywords: [rate, tariff, household, network, ...],
dataset_url: https://data.mendeley.com/datasets/bwwyv6zy5m,
DOI: 10.17632/bwwyv6zy5m.1,
licence: CC BY NC 3.0

Figure 1: Meta-Data of a Mendeley Dataset in JSON.
We give an example of the meta-data of a Mendeley
dataset in Figure 1. The Mendeley collection contains descriptive metadata (id, title, description, etc.) and administrative metadata (licence). The metadata-field "id" is the unique identifier used to index the dataset. The metadata-fields "title" and "description" give a description of the content and usage of the dataset. We use these to extract the keywords of a dataset and to compute the ontology specific view. The metadata-
field "extractedKeywords" is the set of keywords of a dataset given in Mendeley. We compute five further ontology-specific keyword fields, such as "extractedKeywords_CSO", which is the ontology specific view of this dataset based on the computer science ontology CSO, and similarly for the other ontologies. The metadata-field "dataset_url" is the URL of the dataset in the Mendeley Data search engine. Through this URL, one can find the description of the dataset (such as the title, the associated paper, etc.).
We only use the title and description of datasets for
our classification task, without considering any other
information from the dataset itself. This is because we treat the dataset itself as a "black box" from which we cannot obtain any information except for the
title and description. Many scientific datasets have
highly specialised data formats (gene sequences, im-
ages, geo-coordinates, etc.), and these are not suitable
for extracting information in a general purpose search
engine. So we chose to take the hardest case possible,
namely assuming that no information can be gained
from the dataset itself, and all we have are the human
readable descriptions.
Ontology Candidates. We use five ontology candidates from five domains for our ontology selection task: FIBO (Finance, https://spec.edmcouncil.org/fibo/), UMLS (Biomedical, https://www.nlm.nih.gov/research/umls/), CSO (Computer Science, https://cso.kmi.open.ac.uk/home), ENVO (Environment, http://environmentontology.org/) and OPB+physics (Physics, https://sites.google.com/site/semanticsofbiologicalprocesses/projects/the-ontology-of-physics-for-biology-opb and http://www.astro.umd.edu/eshaya/astro-onto/owl/physics.owl). We chose these ontology candidates because they are the richest or most popular ontologies in their domains. For the physics domain, since no existing ontology covers most concepts of the domain, we combined a physics-for-biology ontology with a physics-for-astronomy ontology.
Figure 2: Pipeline for Ontology Classifier Experiment.
Pipeline for Ontology Classifier Experiment. The whole pipeline for the ontology classifier experiment is depicted in Figure 2. Given a list List_D of Mendeley datasets and a list List_O of ontology candidates, the process of the ontology classifier experiment is as follows (a code sketch follows this list):
1. Extract keywords K_D from each Mendeley dataset D ∈ List_D.
2. For each ontology O ∈ List_O, extract the ontology-specific view OSV_{D,O} based on O from each D ∈ List_D.
3. Calculate the similarity between K_D and OSV_{D,O} using the different similarity metrics, and consider this as the similarity between dataset D and ontology O.
4. Choose the most suitable ontology from List_O for each D ∈ List_D, based on the similarity between D and each O ∈ List_O.
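The four steps can be tied together as in the sketch below. The callables extract_keywords, build_osv and sim stand in for the components sketched earlier; their names and signatures are illustrative assumptions rather than the paper's actual code.

```python
# Sketch of the experimental pipeline: keyword extraction, ontology specific views,
# similarity computation, and selection of the best-matching ontology per dataset.
def classify_datasets(datasets, ontologies, extract_keywords, build_osv, sim):
    """datasets: dict id -> title+description text; ontologies: dict name -> ontology object."""
    results = {}
    for dataset_id, text in datasets.items():
        kd = extract_keywords(text)                                   # step 1: K_D
        views = {name: build_osv(kd, onto)                            # step 2: OSV_{D,O}
                 for name, onto in ontologies.items()}
        scores = {name: sim(kd, osv) for name, osv in views.items()}  # step 3: similarities
        results[dataset_id] = max(scores, key=scores.get)             # step 4: best ontology
    return results
```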
5.2 Evaluation
We use the associated paper of each of our datasets as
the gold standard for our evaluation. This is because
Mendeley does not list a domain for the datasets in
the above collection. Instead, we constructed a gold standard by following, for each of the listed datasets, the link to the journal in which the associated paper was published (the paper is mentioned in the dataset's metadata), and then determining by hand the appropriate domain based on the information about the journal.
Figure 3: Mendeley dataset (https://data.mendeley.com/
datasets/bwwyv6zy5m).
As we can see in Figure 1, each Mendeley dataset used in our experiment is associated with a Mendeley dataset link. Through this link, we can find the associated paper (Figure 3), and from there the journal in which the associated paper was published. It is straightforward to decide which domain a journal belongs to. So, with the help of the associated journal, we can decide the gold-standard domain of all Mendeley datasets. We publish this gold standard online (https://github.com/eva01wx/WISE ClassifiactionPaper Datasets).
Using the gold standard above, we use two measures to evaluate the experimental results. The first is simple accuracy: the number of datasets classified into the right domain, divided by the total number of datasets. The second is the F1-measure (Chinchor, 1992), which is commonly used to evaluate the results of classification methods. In our experiments, the F1-measure is used to evaluate the results for each domain.
For the F1-measure in our experiments, given a particular domain, we define True-Positives, True-Negatives, False-Positives and False-Negatives as follows (a code sketch of the resulting per-domain F1 computation follows this list):
True-Positives: the datasets that are assigned to the given domain both by the gold standard and by the classification measure.
True-Negatives: the datasets that are assigned to the given domain neither by the gold standard nor by the classification measure.
False-Negatives: the datasets that are assigned to the given domain by the gold standard but not by the classification measure.
False-Positives: the datasets that are assigned to the given domain by the classification measure but not by the gold standard.
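To make the per-domain evaluation concrete, here is a minimal sketch (not the code used in the paper) that computes precision, recall and F1 for one domain from two label lists; the function name and arguments are illustrative assumptions.

```python
# Hypothetical helper: per-domain F1 from gold and predicted domain labels.
# `gold` and `predicted` are equal-length lists of domain names for the same datasets.
def f1_for_domain(gold, predicted, domain):
    tp = sum(g == domain and p == domain for g, p in zip(gold, predicted))
    fp = sum(g != domain and p == domain for g, p in zip(gold, predicted))
    fn = sum(g == domain and p != domain for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```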
We also introduce an approach to evaluate the F1-measure results against a random baseline. In a classification task there are always several classification targets, i.e. the candidates into which the given data/datasets can be classified. For instance, if we want to classify datasets into research domains and we have three domain candidates, then these three domain candidates are the classification targets.
Based on the number of classification targets, we can compute the random accuracy.
Table 1: Simple Accuracy Results.
Measures                              Unbalanced   Balanced
Google Distance (with Domain Name)    72.5%        63.5%
Coverage                              76.8%        47.5%
Jaccard                               22.4%        31.8%
Coverage + Jaccard                    48.9%        30.2%
Word2Vec (Google News)                30.6%        27.5%
Word2Vec (Self-training)              29.6%        16.4%
Google Distance                       36.1%        26.2%
For example, with three classification targets, the random accuracy for classifying a dataset into the right domain is 33.3% (1/3). From the random accuracy, we can compute the random F1 score:

$$\mathrm{Random\ F1} = 1/n \qquad (10)$$

where n is the number of classification targets. Continuing the example above: with a random accuracy of 33.3% (1/3) over three classification targets, the random F1 score is also 33.3% (1/3). This is because a random accuracy of 1/3 implies that the True-Positive, True-Negative, False-Negative and False-Positive rates are 11.1% (1/9), 44.4% (4/9), 22.2% (2/9) and 22.2% (2/9), respectively, which gives a precision of 33.3% (1/3) and a recall of 33.3% (1/3), and therefore an F1 score of 33.3% (1/3). We compare the F1 scores of our experiments against this random F1 score.
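As a quick sanity check of the worked example above (n = 3 classification targets), the following snippet recomputes precision, recall and F1 from the stated rates:

```python
# Check of the n = 3 example: TP = 1/9, TN = 4/9, FN = 2/9, FP = 2/9.
tp, tn, fn, fp = 1 / 9, 4 / 9, 2 / 9, 2 / 9
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)  # each evaluates to 1/3, matching Random F1 = 1/n
```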
5.3 Results
We run two versions of our experiments: with the
unbalanced distribution of the datasets (with a strong
bias in favour of the finance domain), and a balanced
distribution of datasets which compensates for this
bias, as described in the section on our experimental
setup. Both experiments aim to find out the best metric
to use for classifying datasets into right domain. The
balanced experiment is to see if the bias of distribution
will impact the performance of these metrics.
All results are given as the accuracy of the domain classification, obtained by comparing it with the gold standard. As Table 1 shows, we tested 7 different approaches for both the balanced and the unbalanced scenario, including the domain-name classifier and the different ontology-based classifiers. In the unbalanced experiment, two measures reach a high accuracy (>70%). In the balanced experiment, only one measure reaches 60% accuracy.
We also split out these results for each of the different domains, again for both the balanced and the unbalanced scenario, in Table 2, comparing against the random F1 score, which is 20% in our experiment.
Unsurprisingly, in the unbalanced scenario in Table 2, all methods have a good F1-score on the finance domain and outperform the random F1 score. But disappointingly, for the other domains only the Google Distance (distance from the domain name) and the Coverage metric achieve a better-than-random F1 score on any of them. For the balanced scenario, also shown in Table 2, the scores are even lower: only the Google Distance from the Domain Name performed reasonably well, outperforming the random score in three domains. The Coverage metric managed to do this in two domains, as did the mixed measure of Coverage + Jaccard in the same domains.
Summarising, across both the unbalanced and the
balanced scenario, the simple domain-name classifier
based on Google Distance outperforms the Coverage-
based ontology classifier approach, which by itself was
already the best performing among all the ontology-
based approaches.
6 CONCLUSION
In this paper we have defined the novel task of domain classification for research datasets. We ran several experiments not only with ontology-based classifiers, but also with a simple domain-name classifier, to test the performance of these approaches. Our surprising finding is that the simple domain-name classifier outperforms all the ontology-based approaches when classifying the research domain for a collection of datasets for which we had obtained gold-standard answers. This is contrary to our initial intuition: we had expected that the rich vocabulary contained in a high-quality domain-specific ontology would provide a better classifier than simply the single-word name of the research domain.
There are several possible improvements for future work. In this paper we only considered the title and description of datasets for classification. Other parts of a dataset could be considered in future work, such as other metadata and the actual underlying data (such as a figure or a table). Considering this additional information could improve the classification outcome. In this paper we ran experiments with 960 datasets from Mendeley, and only 217 datasets in the balanced scenario. In future work, we aim to run our classification experiments on larger collections of datasets.
We intend to use this domain classification for further steps in future work. Users often publish their datasets without mentioning the domain (as is clear from the datasets on Mendeley). A service that reliably determines the domain of a dataset (our current score is over 70%) will make datasets much easier to find by other scientists.
Table 2: F1-Score Results for Unbalanced and Balanced Scenario.
                                 Unbalanced scenario                           Balanced scenario
Measure                          CS    Physics  Finance  Bio   Environment    CS    Physics  Finance  Bio   Environment
Google Distance (Domain Name)    0.03  0.24     0.71     0.05  0.01           0.11  0.31     0.37     0.26  0.02
Coverage                         0.0   0.27     0.76     0.01  0.0            0.0   0.38     0.38     0.02  0.0
Jaccard                          0.04  0.06     0.25     0.05  0.0            0.16  0.12     0.09     0.21  0.0
Coverage + Jaccard               0.01  0.18     0.55     0.01  0.01           0.03  0.21     0.24     0.03  0.03
Word2Vec (Google News)           0.01  0.13     0.36     0.02  0.0            0.03  0.19     0.18     0.10  0.0
Word2Vec (Self-trained)          0.0   0.10     0.38     0.01  0.01           0.0   0.08     0.15     0.05  0.03
Google Distance                  0.01  0.14     0.43     0.02  0.01           0.02  0.17     0.19     0.09  0.01
Once we have classified a dataset into the correct domain, we can try to find similar datasets from the same domain. This will be an important support function to help researchers find more datasets for their research.
ACKNOWLEDGEMENTS
This work has been funded by the Netherlands Science
Foundation NWO grant nr. 652.001.002 which is also
partially funded by Elsevier. The first author is funded by the China Scholarship Council (CSC) under grant number 201807730060.
REFERENCES
Alani, H. and Brewster, C. (2005). Ontology ranking based
on the analysis of concept structures. In Proc. of KCAP
2005, pages 51–58. ACM.
Buitelaar, P., Eigner, T., and Declerck, T. (2004). Ontoselect:
A dynamic ontology library with support for ontology
selection. In ISWC Demo session. Citeseer.
Chinchor, N. (1992). Muc-4 evaluation metrics. In Proc.
the 4th Conf. on Message Understanding, MUC4 ’92,
page 22–29. ACL.
Cilibrasi, R. L. and Vitanyi, P. M. B. (2007). The google
similarity distance. IEEE Transactions on Knowledge
and Data Engineering, 19:370–383.
Ding, L., Pan, R., Finin, T., Joshi, A., Peng, Y., and Kolari,
P. (2005). Finding and ranking knowledge on the se-
mantic web. In ISWC 2005, pages 156–170. Springer.
Gregory, K., Groth, P., Scharnhorst, A., and Wyatt,
S. (2020). Lost or found? discovering data
needed for research. Harvard Data Science Review.
https://hdsr.mitpress.mit.edu/pub/gw3r97ht.
Hovy, E. and Lavid, J. (2010). Towards a 'science' of corpus annotation: a new methodological challenge for corpus linguistics. International Journal of Translation, 22:13–36.
Lin, D. et al. (1998). An information-theoretic definition of
similarity. In ICML, volume 98, pages 296–304.
Lopez, V., Motta, E., and Uren, V. (2006). Poweraqua:
Fishing the semantic web. In Sure, Y. and Domingue, J.,
editors, The Semantic Web: Research and Applications,
pages 393–410. Springer.
Mihalcea, R. and Tarau, P. (2004). TextRank: Bringing order
into text. In Conf. on Empirical Methods in Natural
Language Processing, pages 404–411. ACL.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and
Dean, J. (2013). Distributed representations of words
and phrases and their compositionality. In Advances
in neural information processing systems, pages 3111–
3119.
Patel, C., Supekar, K., Lee, Y., and Park, E. (2003). OntoKhoj: A semantic web portal for ontology searching, ranking and classification. In Proceedings of the International Workshop on Web Information and Data Management, pages 58–61.
Resnik, P. (1995). Using information content to evaluate
semantic similarity in a taxonomy. In IJCAI’95, pages
448–453.
Rose, S., Engel, D., Cramer, N., and Cowley, W. (2010).
Text Mining: Applications and Theory, chapter Auto-
matic Keyword Extraction from Individual Documents,
pages 1 – 20. Wiley.
Sabou, M., Lopez, V., Motta, E., and Uren, V. (2006). Ontol-
ogy selection: ontology evaluation on the real semantic
web. In WWW Conference 2006.
Salton, G. and Buckley, C. (1988). Term-weighting ap-
proaches in automatic text retrieval. Information Pro-
cessing and Management, 24(5):513 – 523.
Wu, Z. and Palmer, M. (1994). Verbs semantics and lexical
selection. In ACL, pages 133–138.