Blending Topic-based Embeddings and
Cosine Similarity for Open Data Discovery
Maria Helena Franciscatto¹, Marcos Didonet Del Fabro¹, Luis Carlos Erpen de Bona¹, Celio Trois² and Hegler Tissot³
¹ Department of Informatics, Federal University of Paraná, Curitiba, Brazil
² Technology Center, Federal University of Santa Maria, Santa Maria, Brazil
³ Drexel University, Philadelphia, U.S.A.
Keywords:
Source Discovery, Open Data, LDA, Word2Vec, Cosine Similarity, Machine Learning.
Abstract:
Source discovery aims to facilitate the search for specific information, whose access can be complex and
dependent on several distributed data sources. These challenges are often observed in Open Data, where users
experience lack of support and difficulty in finding what they need. In this context, Source Discovery tasks
could enable the retrieval of a data source most likely to contain the desired information, facilitating Open
Data access and transparency. This work presents an approach that blends Latent Dirichlet Allocation (LDA),
Word2Vec, and Cosine Similarity for discovering the best open data source given a user query, supported by
the joint use of the methods’ semantic and syntactic capabilities. Our approach was evaluated on its ability
to discover, among eight candidates, the right source for a set of queries. Three rounds of experiments were
conducted, alternating the number of data sources and test questions. In all rounds, our approach showed
superior results when compared with the baseline methods separately, reaching a classification accuracy above
93%, even when all candidate sources had similar content.
1 INTRODUCTION
Open data repositories make information public to
all citizens, promoting the monitoring and evaluation
of government actions, data reuse, and improvement
of services provided to the population. Such ini-
tiatives empower citizens, not only by making them
more informed, but also by allowing them to transform data
into something else, which is the true value of Open
Data (Nikiforova and McBride, 2021).
However, the global increase of Open Data has
led to the need to maintain numerous databases for
storing important information, making the manipula-
tion of this data a non-trivial task (Zhang and Yue,
2016). In other words, retrieving the information
the user expects requires accessing structured and un-
structured data, which lose significance if they are not
presented clearly and meaningfully (Beniwal et al.,
2018). With respect to Open Data, the information
is usually available through data tables or CSV files,
and when the amount and diversity of data are ele-
vated, the visualization becomes confusing, affecting
the users’ capability of performing comparisons and
evaluations (Porreca et al., 2017). Thus, existing open
data portals are considered complex by non-technical
users, either by the format in which the data are pre-
sented, or the difficulty in finding the desired infor-
mation (Osagie et al., 2017; Attard et al., 2015).
In fact, users usually cannot easily analyze Open
Data without expert assistance, and when they do, the
required information may be scattered over several
data tables or data sources (Djiroun et al., 2019). As a
consequence, it becomes their responsibility to spend
time finding, downloading, and evaluating datasets
without the proper support from open data portals
and platforms (Helal et al., 2021). The difficulty in
finding the desired information not only affects the
general user experience, but impacts scientific coop-
eration regarding the analysis and integration of het-
erogeneous data sources, which can be applied for
solving complex problems in several areas (Sowe and
Zettsu, 2015). These issues could be alleviated if the
data source that best fits the user need was retrieved at
first-hand, decreasing the manual work when search-
ing for the right information; this task is known as
Source Discovery (Abelló et al., 2014).
Source Discovery tasks have potential to lever-
age Open Data access and provide support to the
end user, by identifying, among several candidate sources, which one is most likely to contain the information needed. Several discovery methods have been
proposed to measure, e.g., table relatedness (Narge-
sian et al., 2018), or find similarity joins between data
collections (Xu et al., 2019). Although these studies
focus on finding similarities between pairs of datasets,
data tables or attribute columns, user-centered ap-
proaches that consider the query context for dataset
recommendation are sparse, especially concerning
Open Data (Dawes et al., 2016). With Open Data
Discovery, the overall user experience when query-
ing portals could be improved, favoring the access
to public data and consequently their use in human-
oriented applications. Therefore, there is a need to
investigate approaches focused on reducing the com-
plexity in discovering relevant open data sets.
This work presents a Source Discovery approach
that blends topic-based embeddings and Cosine Sim-
ilarity for inferring an optimal open data source,
given an input query. Specifically, we apply a hy-
brid method named LDA-W2V, which uses LDA for
detecting the most representative content of a data
source, and Word2Vec for measuring how semanti-
cally close it is to a user query. Complementarily, Cosine Similarity is applied as a syntactic similarity measure between source and query, so that the best source is retrieved based on both its structure and context.
Our approach was evaluated on its ability to dis-
cover the right source for a query among a set of eight
candidates. To demonstrate its consistency, we conducted
three rounds of experiments, alternating the number
of data sources and test questions, and comparing it
with Cosine measure and LDA-W2V separately. The
results showed that our approach was superior in all
rounds of experiments, reaching a classification ac-
curacy above 93%. This rate demonstrates that our
approach based on LDA-W2V and Cosine Similarity
is able to discover the most related data source for a
user question, even when all candidate sources have
similar content.
2 BACKGROUND
This section presents related work and concepts in-
volved in the present study, including Source Discov-
ery, Cosine Similarity, and LDA-W2V algorithm.
2.1 Source Discovery
A data discovery problem occurs when users and an-
alysts spend more time looking for relevant data than
analyzing it (Fernandez et al., 2018). So, Source Dis-
covery is a process that aims to mitigate this obsta-
cle, finding one or more relevant data sources (among
many possible sources) suitable to a user query.
The concept has been widely investigated in sev-
eral domains, especially in Business Intelligence (BI),
assuming that data sources must be discovered on-
the-fly for dealing with real time and situational
queries (Abelló et al., 2013). Considering this need,
many organizations have been encouraged to build
a navigational data structure to support source dis-
covery or to use tools for deriving insights from
datasets (Helal et al., 2021). With respect to Open
Data, Source Discovery is able to facilitate the access
to publicly available datasets, designed to be reused
for human benefit.
Several studies propose Source Discovery mecha-
nisms to handle user queries. We can mention, e.g.,
the Aurum system (Fernandez et al., 2018), which
allows people to flexibly find relevant data through
properties of the datasets or syntactic relationships
between them. The DataMed approach (Chen et al.,
2018) also includes a Source Discovery task for find-
ing relevant biomedical datasets from heterogeneous
sources, making them searchable through a web-
based interface.
Source Discovery implementation may be sup-
ported by several tasks such as schema discovery and
query reformulation (Hamadou et al., 2018). Mostly,
similarity methods are also used to determine the
source that best relates to the user query. Some exam-
ples of these methods are presented in the following.
2.2 Cosine Similarity
Cosine Similarity measures the similarity between two vectors as the cosine of the angle between them, assuming that each word in a text or document corresponds to a dimension in a multidimensional space (Gomaa et al., 2013). When comparing documents, the smaller the angle, the higher the similarity. So, considering that the cosine of 0° is 1, two vectors are said to be similar when their Cosine Similarity is 1 (Gunawan et al., 2018). The Cosine Similarity calculation is represented in
Equation 1, where $\vec{A}$ and $\vec{B}$ are attribute vectors.

$\cos(\vec{A}, \vec{B}) = \dfrac{\vec{A} \cdot \vec{B}}{|\vec{A}|\,|\vec{B}|}$   (1)
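For illustration, a minimal Python sketch of Equation 1 follows; the toy vectors and values are ours, not taken from the paper.

    import math

    def cosine_similarity(a, b):
        # Equation 1: dot product divided by the product of the vector norms.
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    doc_a = [2, 1, 0, 3]
    doc_b = [4, 2, 0, 6]                           # same direction as doc_a (angle of 0 degrees)
    print(cosine_similarity(doc_a, doc_b))         # 1.0
    print(cosine_similarity(doc_a, [0, 0, 5, 0]))  # 0.0 (orthogonal vectors)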
Cosine Similarity is perhaps the most fre-
quently applied proximity measure in information re-
trieval (Korenius et al., 2007). It has been widely
studied in several application domains, such as med-
ical diagnosis (Rafiq et al., 2019) and recommenda-
tion systems (Kotkov et al., 2018). In this study, it is
investigated for finding related data sources in a dis-
covery task. Concerning this goal, the next subsection presents a hybrid approach based on LDA and Word2Vec.
2.3 Hybrid LDA-W2V
Latent Dirichlet Allocation (LDA) is a generative
probabilistic model of a corpus, based on the idea that
documents are represented as random mixtures over
latent topics, and each topic is characterized by a dis-
tribution over words (Blei et al., 2003). Thus, given
a document, paragraph, or sentence, LDA determines
the most relevant topics, based on words in a topic
that appear more often in the narrative compared to
the words related to other topics (Bastani et al., 2019).
In (Jedrzejowicz and Zakrzewska, 2020), the au-
thors propose a hybrid model based on LDA. Specif-
ically, the model joins LDA with Word2Vec algo-
rithm (Goldberg and Levy, 2014), considering that
word embeddings allow to capture semantics when
processing vast amounts of linguistic data. First, it
assumes a set of documents, which are preprocessed
for obtaining a list of representative words or to-
kens. Each document set of tokens is given as in-
put to the LDA algorithm, which predicts the top-
ics (i.e., words and their proportions) that best de-
scribe the document content. For example, a document containing information on universities could be represented by a topic containing the following words and proportions: 0.048*“high” + 0.047*“education” + 0.033*“students” + 0.033*“university” + 0.032*“course” + 0.032*“public” + 0.032*“federal” + 0.031*“private” + 0.031*“administrative”.
The next step of the approach performs a Word2Vec (W2V) Extension, which aims to extend the words in source topics with similar words acquired from a Word2Vec model. The similarity is measured by Cosine Similarity (see Subsection 2.2): supposing a topic word W_n, the W2V model is traversed to find similar words [w2vWord_1, ..., w2vWord_n]. Then, the similarity score between a pair [W_n, w2vWord_n] is multiplied by the LDA proportion score for W_n, obtaining a derived proportion DP_n. Each similar word found represents an extended word EW_n. Following the previous topic example, the W2V model retrieves the word institution as similar to the topic word university, with a similarity score of 0.71. This score is multiplied by the LDA proportion score for university, 0.033, thus obtaining DP_n = 0.023. institution becomes an extended word EW_n, and generates a tuple [W_n, DP_n, EW_n], e.g., [university, 0.023, institution]. After word extension, the approach tries to classify an input sentence (the test set) into a topic, based on the probabilities (DP_n) within all possible tuples. In short, the input sentence is assigned to the topic that contains the highest probabilities for each sentence word.
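To make the derivation concrete, the following minimal Python sketch reproduces only the arithmetic of the extension step; the neighbour words and similarity scores are hard-coded stand-ins for a real Word2Vec model, so this illustrates how DP_n is obtained rather than the authors' implementation.

    # Illustrative sketch of the W2V Extension arithmetic (values are examples only).
    lda_topic = {"university": 0.033, "education": 0.047}   # W_n -> LDA proportion P_n
    w2v_neighbours = {                                       # stand-in for a Word2Vec model
        "university": [("institution", 0.71), ("campus", 0.58)],
        "education": [("teaching", 0.66)],
    }

    extended_tuples = []                                     # tuples [W_n, DP_n, EW_n]
    for word, proportion in lda_topic.items():
        for neighbour, similarity in w2v_neighbours[word]:
            dp = proportion * similarity                     # DP_n = P_n x similarity score
            extended_tuples.append((word, round(dp, 3), neighbour))

    print(extended_tuples)
    # [('university', 0.023, 'institution'), ('university', 0.019, 'campus'),
    #  ('education', 0.031, 'teaching')]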
LDA has become the most popular topic modeling algorithm, due to its applicability in several
contexts and ability to analyze large documents (Got-
tfried et al., 2021). We leveraged the benefits from
both LDA and Word2Vec, applying this hybrid solu-
tion for source discovery involving open datasets.
3 SOURCE DISCOVERY
APPROACH BASED ON COSINE
AND LDA-W2V
Accessing Open Data often represents an obstacle for
regular users, due to the amount of data made avail-
able, its format and diversity. A Source Discovery
mechanism could replace the users’ manual work when querying, by retrieving the desired data and improving
the overall experience. Thus, this section presents a
Source Discovery approach based on LDA-W2V (Je-
drzejowicz and Zakrzewska, 2020) and Cosine Sim-
ilarity for recommending the open data source that
best fits a user question.
Cosine Similarity was chosen due to its simplic-
ity, as it only requires term-frequency vectors from
two sets being compared to calculate similarity. Also,
it is commonly applied to measure similarity be-
tween a query and an item, based on common fea-
tures between them (Ristoski et al., 2014; Di Noia
et al., 2012). However, when context information
is not available, this measure may fail in determin-
ing similarity (Orkphol and Yang, 2019). Thus,
our approach is complemented with a hybrid LDA
approach based on (Jedrzejowicz and Zakrzewska,
2020), which combines semantic capabilities from
both LDA and Word2Vec for classification tasks. The
LDA model has a clear internal structure that allows efficient inference, and it is independent of the number of training documents, thus being suitable for handling large-scale corpora (Liu et al., 2011). The combination
LDA-W2V + Cosine allows us to apply both syntactic and semantic capabilities for discovering which data source is most adequate to answer an input question.¹ Our approach is demonstrated in Figure 1.
The approach takes as input a set of candidate sources (represented by CS1 to CS4 in the figure) and a user input sentence. First, we extract from each candidate source its schema information², preprocessing it to remove special characters and stopwords from the text. The preprocessing stage results in a set of tokens for each source (T_CSn) and for the input sentence (T_IS). The source tokens are given as input to our LDA-W2V-based implementation, where a Topic Detection module outputs several topics for each source. In other words, after LDA topic detection, each source is represented by several topics composed of words (W_n) and their proportions (P_n), which determine word relevance in the analyzed text. The word extension through Word2Vec occurs as described in Subsection 2.3: words that are similar to a topic word W_n are captured along with their similarity score, in order to form tuples containing extended words EW_n and probabilities DP_n derived from multiplication with P_n.
After W2V Extension, we obtain tuples [W_n, DP_n, EW_n] for each candidate source. So, the Matching step³ of our LDA-W2V algorithm receives and traverses all tuples, aiming to find the best correlations for an input sentence, i.e., a query. In other words, the query tokens (T_IS, extracted after preprocessing) are used to search for equivalent W_n or EW_n with higher DP_n. When a match is found, a probability array PA_CSn is created for each candidate source, containing one DP_n for each query token in T_IS. Otherwise, if a query token is not found in the extension, a default probability (0.000001) is inserted in PA_CSn.
While the LDA-W2V process occurs, the source tokens T_CSn and the input sentence tokens T_IS are sent to the Cosine Similarity module, responsible for vectorizing them for application in the Cosine formula (see Equation 1). At this step, each source, as well as the input sentence, is converted to term-frequency vectors V_CSn and V_IS, so the Cosine Similarity cos_CSn is the measure computed between these vectors. The output from this module is the Cosine Similarity for each candidate source, which is inserted in the probability array PA_CSn, along with the LDA-W2V derived probabilities. After completing the LDA-W2V and Cosine tasks, each candidate source is represented by an array of joint probabilities, and the average probability for each array is calculated. Finally, the source most likely to meet the user question is selected by considering the array with the highest average.
¹ Other syntactic and semantic methods could be joined for similar purposes.
² We consider as schema information all metadata contained in a CSV file, or text in a data dictionary file, which summarizes the source content or its purpose.
³ Since the original LDA-W2V approach was focused on classifying texts into different topics, we implemented an adapted version of the Matching step, where the text (i.e., the input sentence) is classified into one source based on a probability array.
Figure 1: Overview of the source discovery approach based on Cosine Similarity and LDA-W2V. (The diagram shows the candidate sources CS1–CS4 and the input sentence flowing through Preprocessing, Topic Detection (LDA), W2V Extension, Matching, Vectorizer and Cosine Similarity modules; each candidate source ends with a probability array PA_CSn = avg[DP_n for the T_IS tokens, cos_CSn], and SelectedSource = max(PA_CS1, ..., PA_CS4).)
Practical Example: Suppose two candidate sources, CS1 and CS2, and a user input sentence IS “How many rural schools have computers?”. After preprocessing, the query tokens T_IS are [rural, school, computer]. After W2V Extension, each candidate source is represented by several tuples [W_n, DP_n, EW_n], in which T_IS will be searched. If all query tokens are found in the CS1 extension, the array PA_CS1 for this source could be, e.g., [0.0024, 0.032, 0.0096]. The same query tokens for CS2 could originate an array PA_CS2 = [0.0012, 0.0018, 0.000001], considering that the “computer” token was not found in the extension. Thus, each source array will contain different probabilities, one for each query token. After the Cosine Similarity calculation, cos_CS1 is 0.85 for CS1, whereas cos_CS2 is 0.7 for CS2. Both values are included in their respective probability arrays, resulting in PA_CS1 = [0.0024, 0.032, 0.0096, 0.85] and PA_CS2 = [0.0012, 0.0018, 0.000001, 0.7]. After the arrays are arranged, the average probability is calculated for each source, and the source with the highest average is chosen as the most likely one to meet the input query. In this example, CS1 would be the selected one.
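A minimal Python sketch of the matching and selection steps for this example follows; the helper function name, the extended words, and the CS1 tuples are illustrative assumptions, not taken from the paper's implementation.

    from statistics import mean

    DEFAULT_PROB = 0.000001  # assigned when a query token has no match in a source's tuples

    def probability_array(query_tokens, tuples, cosine_score):
        # Build PA_CSn: one DP_n per query token (best matching tuple), plus cos_CSn.
        array = []
        for token in query_tokens:
            matches = [dp for word, dp, extended in tuples if token in (word, extended)]
            array.append(max(matches) if matches else DEFAULT_PROB)
        array.append(cosine_score)
        return array

    query = ["rural", "school", "computer"]
    cs1_tuples = [("rural", 0.0024, "countryside"),   # illustrative [W_n, DP_n, EW_n] tuples
                  ("school", 0.032, "college"),
                  ("computer", 0.0096, "laptop")]

    pa_cs1 = probability_array(query, cs1_tuples, cosine_score=0.85)
    pa_cs2 = [0.0012, 0.0018, DEFAULT_PROB, 0.7]      # as given in the example above

    averages = {"CS1": mean(pa_cs1), "CS2": mean(pa_cs2)}
    print(max(averages, key=averages.get))            # CS1 (average ~0.22 vs ~0.18)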
4 EXPERIMENTS
In this section we present how our approach was im-
plemented to deal with candidate open data sources
and input queries.
We applied our approach to discover the data
source that best matches an input question, consid-
ering eight possible datasets, i.e., candidate sources,
named FIES⁴, INEP⁵, PROUNI⁶, CadUnico⁷, School Census⁸, IBGE⁹, DataSUS¹⁰, and Ibama¹¹. All candidate sources were extracted from Brazilian Open
Data datasets. Half of them (INEP, FIES, PROUNI,
and School Census) contain educational data from
different contexts. The other four datasets are very
distinct: CadUnico contains socioeconomic informa-
tion on low-income citizens and families, IBGE ag-
gregates social, economic and environmental indica-
tors from Brazilian cities and states, DataSUS man-
ages health information from Brazilian healthcare
networks, and the Ibama dataset contains information related to environmental actions and the use of natural resources.
In the next subsection, we describe implementa-
tion details concerning LDA-W2V and Cosine Simi-
larity methods included in our approach.
4.1 Implementation Details
First of all, each candidate source’s schema information was manually extracted either from a CSV or a data dictionary in its respective open data portal. The extracted information was placed in an auxiliary CSV file that summarizes metadata from all sources, so that each row contains information from a different source. Each source’s content within this file was preprocessed (see Figure 1) through lowercasing, removal of special characters, removal of stopwords, and stemming, resulting in a set of meaningful to-
kens (represented by T_CSn in the Figure). LDA-W2V and Cosine Similarity were implemented in Python¹², taking the CSV file as input, along with test questions (the input sentences). Each test question was preprocessed the same way as the sources’ content, resulting in a list of tokens T_IS. Examples of test questions are shown in Table 1.

Table 1: Examples of Test Questions (English Translation).
Query | Source
How many rural schools have computers? | School Census
How many families in CadUnico receive income greater than R$1000? | CadUnico
How many students that benefited from racial quotas dropped out of a course in 2018? | INEP
What is the territory size of the Amazonas state? | IBGE

⁴ http://dadosabertos.mec.gov.br/fies
⁵ https://www.gov.br/inep/pt-br/acesso-a-informacao/dados-abertos/microdados/censo-da-educacao-superior
⁶ https://dados.gov.br/dataset/mec-prouni
⁷ https://dados.gov.br/dataset/microdados-amostrais-do-cadastro-unico
⁸ https://www.gov.br/inep/pt-br/acesso-a-informacao/dados-abertos/microdados/censo-escolar
⁹ https://www.ibge.gov.br/
¹⁰ https://opendatasus.saude.gov.br/dataset
¹¹ http://www.ibama.gov.br/dados-abertos
¹² https://github.com/mariahelenaaf/OpenDataDiscovery
For the LDA-W2V module of our approach, the
Gensim implementation¹³ was used, and each T_CSn was sent as a corpus to the LDA model. Multiple runs were performed with LDA, alternating the num_topics parameter of the model, which determines the number of latent topics to be extracted from each corpus. The parameter value ranged from 8 to 10, based on the coherence measure¹⁴ for each source, which evaluates how coherent the produced topics are, by capturing their interpretability on the LDA distribution.
Since all candidate sources are from Brazilian
open data portals, we implemented W2V Exten-
sion (see Figure 1) with a Portuguese pre-trained
model from FastText¹⁵. The model receives the LDA
topic distribution (words and proportions) from each
source, and searches for similar words for the exten-
sion. For measuring the similarity between a topic
word W_n and a model word, we used Word2Vec’s similar_by_vector method, which finds the top-N similar words given a word vector. The N parameter was set to 50, and the retrieval of model words similar to topic words was based on a similarity threshold, represented as K, which was set to 0.45. Only similarity scores above K were multiplied by the W_n proportion score to derive the proportion DP_n (see Subsection 2.3).
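As a rough sketch of how these two steps can be wired together with Gensim, the listing below runs topic detection and word extension for one source. The passes and random_state settings and the model file name cc.pt.300.vec are assumptions of ours; only num_topics, the top-N of 50, and the 0.45 threshold come from the text, so treat this as an outline rather than the published implementation.

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel, KeyedVectors

    def detect_topics(source_tokens, num_topics=8):
        # Topic Detection: run LDA over the schema tokens of one candidate source.
        dictionary = Dictionary([source_tokens])
        corpus = [dictionary.doc2bow(source_tokens)]
        lda = LdaModel(corpus=corpus, id2word=dictionary,
                       num_topics=num_topics, passes=10, random_state=0)
        # Returns (topic_id, [(word, proportion), ...]) pairs.
        return lda.show_topics(num_topics=num_topics, num_words=10, formatted=False)

    def extend_topics(topics, kv, topn=50, threshold=0.45):
        # W2V Extension: derive DP_n for every neighbour above the similarity threshold K.
        tuples = []
        for _, words in topics:
            for word, proportion in words:
                if word not in kv:
                    continue
                for neighbour, score in kv.similar_by_vector(kv[word], topn=topn):
                    if score >= threshold and neighbour != word:
                        tuples.append((word, proportion * score, neighbour))
        return tuples

    # kv = KeyedVectors.load_word2vec_format("cc.pt.300.vec")   # assumed file name
    # tuples = extend_topics(detect_topics(source_tokens), kv)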
For measuring the Cosine Similarity between a query and a candidate source, we extracted the term-frequency vectors for each query and source. For this task, the Python Counter tool¹⁶ was applied to the preprocessed query and source tokens (T_IS and T_CSn). The Counter maps the tokens in T_IS and T_CSn to dictionary keys, and their counts to dictionary values, thus obtaining term-frequency vectors V_IS and V_CSn in the format {“word”: 4, “otherword”: 2}. The vectors are sent to the Cosine function, which performs the calculation shown in Equation 1.
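A small sketch of this vectorization and comparison step, assuming already preprocessed token lists; the function name and example tokens are ours.

    import math
    from collections import Counter

    def cosine_from_tokens(query_tokens, source_tokens):
        # Build term-frequency vectors with Counter, then apply Equation 1 over shared terms.
        v_is, v_cs = Counter(query_tokens), Counter(source_tokens)
        dot = sum(v_is[t] * v_cs[t] for t in set(v_is) & set(v_cs))
        norm_is = math.sqrt(sum(c * c for c in v_is.values()))
        norm_cs = math.sqrt(sum(c * c for c in v_cs.values()))
        return dot / (norm_is * norm_cs) if dot else 0.0

    print(cosine_from_tokens(["rural", "school", "computer"],
                             ["school", "census", "rural", "computer", "teacher"]))  # ~0.77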
¹³ https://radimrehurek.com/gensim/models/ldamulticore.html
¹⁴ https://radimrehurek.com/gensim/models/coherencemodel.html
¹⁵ https://fasttext.cc/
¹⁶ https://docs.python.org/3/library/collections.html
After executing both LDA-W2V and Cosine mod-
ules, we obtain a probability array PA_CSn for each can-
didate source, as demonstrated in Figure 1. Next, we
present the results of our approach performing Source
Discovery tasks.
5 RESULTS AND DISCUSSION
In this section we describe the results obtained with
our approach performing Source Discovery tasks,
evaluating its capability to choose the most likely
open dataset to answer an input query. The evaluation
involved eight candidate datasets (see Section 4) and
test sets containing 48 and 74 questions, respectively,
to which a source should be inferred. All questions
and their correct answers (i.e., sources names) were
defined manually, based on indicators available on
the data monitoring platforms SIMOPE
17
and LDE
18
.
Examples of the test questions are shown in Table 1.
The evaluation was performed in three rounds, al-
ternating the number of datasets and questions for in-
vestigating possible changes in classification results.
Also, in each evaluation round, our approach was
tested against Cosine and LDA-W2V individually for
observing accuracy variation. The comparison be-
tween all methods’ accuracies is shown in Table 2.
In the first evaluation round, we conducted the
evaluation with all data sources (i.e., Prouni, School
Census, FIES, INEP, CadUnico, IBGE, Ibama, and
DataSus) and the test set containing 48 questions.
Considering Cosine individually, the right source was
chosen for 87.5% of the test questions, whereas for
LDA-W2V, the accuracy rate reached 79.17%. We
can observe, in Table 2, that our approach blending
the two methods improved the classification accuracy
considerably, reaching 93.75% (an increase of 6.25%
in the individual Cosine result, and 14.28% in the
LDA-W2V result).
To assess our approach’s consistency, we conducted a second round of evaluations, using the test set containing 74 questions and the same eight open data sources. Although a small improvement was observed for LDA-W2V (79.73%), Cosine accuracy was negatively impacted (83.78%), hence also impacting our model’s accuracy (87.74%). This can be explained by the structure of some test questions, whose content was less specific. E.g., if a question
is mostly composed of words that are common in many candidate sources, a misclassification may occur.

Table 2: Classification accuracy in three evaluation rounds with different sources (S) and questions (Q).
S | Q | Cosine Sim. | LDA-W2V | Our Approach
8 | 48 | 87.5 | 79.17 | 93.75
8 | 74 | 83.78 | 79.73 | 87.74
5 | 48 | 87.5 | 81.25 | 93.75

¹⁷ https://seppirhomologa.c3sl.ufpr.br/
¹⁸ https://dadoseducacionais.c3sl.ufpr.br

The accuracy reduction can also be due to the stemming method applied in the preprocessing stage, since the reduction of some Portuguese words may originate ambiguous tokens, thus confusing the classifier. Despite that, our approach had superior results
than the isolated methods: the right source was cho-
sen for 87.74% of the questions, meaning 4.05% of
increase when compared to Cosine, and 8.11% when
compared to LDA-W2V.
In the third evaluation round, the consistency of
our approach was estimated from another perspec-
tive: we removed three of the most distinctive data
sources from the pool of candidate sources, leaving
only the datasets Prouni, School Census, FIES, INEP,
and CadUnico. From these, CadUnico is the only
source that does not contain education data. Our ob-
jective was to verify whether the classification accu-
racy would be satisfactory with data sources contain-
ing very similar content (i.e., metadata). The results
are summarized in Table 2 and Figure 2.
Cosine and LDA-W2V reached 87.5% and
81.25% of accuracy, respectively, whereas our ap-
proach reached 93.75%. The rates had little varia-
tion when compared to the ones in the first evaluation
round, which is a very promising result. Indeed, four
out of five datasets contained information on Educa-
tion, representing a more complex classification task
compared to predicting distinct classes; yet, our approach’s accuracy rate was over 93%. Moreover, as we can see in Figure 2, there was a significant prediction improvement for each of the five datasets in comparison to the Cosine and LDA-W2V predictions, especially for the School Census and CadUnico datasets, where no incorrect classification occurred.
We observed that LDA-W2V was the method with
the lowest accuracy (around 80%) in all executions.
This can be explained by the high variability caused
by the probabilities assigned in the W2V Extension step (see Subsection 2.3). As an example, let us consider the input query “Number of students who have housing assistance at federal universities”. Although
the source that best answers this question is INEP,
FIES source was the predicted one, since its proba-
bility array contained higher values for some tokens
such as assistance and federal. This specific behav-
ior may have impacted our model’s accuracy, especially
considering that half of the databases used contained
educational information.
Figure 2: Classification accuracies in the third evaluation
round.
Despite that, the experiments involving Open
Data Discovery tasks with our approach have shown
promising results: the accuracy rates in all evalua-
tion rounds were above 87%, where half of the open
datasets used were of very similar subject (i.e., Educa-
tion). From these results, we infer that possible weak-
nesses of each method individually (Cosine or LDA-
W2V) were overcome by their combination, allow-
ing to explore both semantic and syntactic features
for discovery tasks. It is important to highlight that,
in our blended model, only metadata and/or sources
descriptions were used to classify the test questions,
which reinforces the quality of the results and their
potential for Open Data Discovery domain.
6 CONCLUSIONS
This paper presented a Source Discovery approach
that joins topic-based embeddings from the LDA-
W2V algorithm and Cosine Similarity for inferring an
open data source that is most likely to answer an input
query. By blending both measures, we leverage syn-
tactic and semantic advantages for detecting meaning-
ful content. We evaluated the approach by conducting
three rounds of Source Discovery experiments involv-
ing eight candidate open datasets and alternating test sets: for each test question, our approach had to infer which of the candidate sources was the most suitable. The classification results showed an accuracy
above 93%, i.e., a superior rate when compared with
LDA-W2V and Cosine separately. From these exper-
iments, we conclude that our hybrid approach is able
to discover the most related data source for a user
question. The encouraging results also demonstrate
that our approach has potential to improve Open Data
transparency and user support.
As future work, we aim to implement a case-based recommendation strategy combined with Source Discovery,
since it can leverage user feedback on the suggested
source to improve future classifications. Also, our ob-
jective is to investigate other approaches and voting
mechanisms that can be applied to Source Discovery
tasks, for building a unified solution that combines
different advantages. Finally, we aim to evaluate a
Source Discovery solution with real users, in order to
promote Open Data access through a good querying
experience.
ACKNOWLEDGMENTS
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior, Brazil (CAPES), Finance Code 001, and Fundo Nacional de Desenvolvimento da Educação (FNDE), through research project “Pesquisa e Inovação na Integração e Visualização de Dados do Programa Nacional do Livro Didático”.
REFERENCES
Abelló, A., Darmont, J., Etcheverry, L., Golfarelli, M., Mazón, J.-N., Naumann, F., Pedersen, T., Rizzi, S. B., Trujillo, J., Vassiliadis, P., et al. (2013). Fusion cubes: Towards self-service business intelligence. International Journal of Data Warehousing and Mining (IJDWM), 9(2):66–88.
Abelló, A., Romero, O., Pedersen, T. B., Berlanga, R., Nebot, V., Aramburu, M. J., and Simitsis, A. (2014). Using semantic web technologies for exploratory olap: a survey. IEEE Transactions on Knowledge and Data Engineering, 27(2):571–588.
Attard, J., Orlandi, F., Scerri, S., and Auer, S. (2015). A
systematic review of open government data initiatives.
Government information quarterly, 32(4):399–418.
Bastani, K., Namavari, H., and Shaffer, J. (2019). Latent
dirichlet allocation (lda) for topic modeling of the cfpb
consumer complaints. Expert Systems with Applica-
tions, 127:256–271.
Beniwal, R., Gupta, V., Rawat, M., and Aggarwal, R.
(2018). Data mining with linked data: Past, present,
and future. In 2018 Second International Confer-
ence on Computing Methodologies and Communica-
tion (ICCMC), pages 1031–1035. IEEE.
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent
dirichlet allocation. the Journal of machine Learning
research, 3:993–1022.
Chen, X., Gururaj, A. E., Ozyurt, B., Liu, R., Soysal,
E., Cohen, T., Tiryaki, F., Li, Y., Zong, N., Jiang,
M., et al. (2018). Datamed–an open source dis-
covery index for finding biomedical datasets. Jour-
nal of the American Medical Informatics Association,
25(3):300–308.
Dawes, S. S., Vidiasova, L., and Parkhimovich, O. (2016).
Planning and designing open government data pro-
grams: An ecosystem approach. Government Infor-
mation Quarterly, 33(1):15–27.
Di Noia, T., Mirizzi, R., Ostuni, V. C., Romito, D., and
Zanker, M. (2012). Linked open data to support
content-based recommender systems. In Proceedings
of the 8th international conference on semantic sys-
tems.
Djiroun, R., Boukhalfa, K., and Alimazighi, Z. (2019). De-
signing data cubes in olap systems: a decision mak-
ers’ requirements-based approach. Cluster Comput-
ing, 22(3).
Fernandez, R. C., Abedjan, Z., Koko, F., Yuan, G., Madden,
S., and Stonebraker, M. (2018). Aurum: A data dis-
covery system. In 2018 IEEE 34th International Con-
ference on Data Engineering (ICDE), pages 1001–
1012. IEEE.
Goldberg, Y. and Levy, O. (2014). word2vec explained: deriving Mikolov et al.’s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.
Gomaa, W. H., Fahmy, A. A., et al. (2013). A survey of
text similarity approaches. International Journal of
Computer Applications, 68(13):13–18.
Gottfried, A., Hartmann, C., and Yates, D. (2021). Mining
open government data for business intelligence using
data visualization: A two-industry case study. Jour-
nal of Theoretical and Applied Electronic Commerce
Research, 16(4):1042–1065.
Gunawan, D., Sembiring, C., and Budiman, M. (2018).
The implementation of cosine similarity to calculate
text relevance between two documents. In Journal
of Physics: Conference Series, volume 978, page
012120. IOP Publishing.
Hamadou, H. B., Ghozzi, F., Péninou, A., and Teste, O.
(2018). Querying heterogeneous document stores. In
20th International Conference on Enterprise Informa-
tion Systems (ICEIS 2018), volume 1, pages 58–68.
Helal, A., Helali, M., Ammar, K., and Mansour, E. (2021).
A demonstration of kglac: a data discovery and en-
richment platform for data science. Proceedings of
the VLDB Endowment, 14(12):2675–2678.
Jedrzejowicz, J. and Zakrzewska, M. (2020). Text clas-
sification using lda-w2v hybrid algorithm. In Intel-
ligent Decision Technologies 2019, pages 227–237.
Springer.
Korenius, T., Laurikkala, J., and Juhola, M. (2007). On
principal component analysis, cosine and euclidean
measures in information retrieval. Information Sci-
ences, 177(22):4893–4905.
Kotkov, D., Konstan, J. A., Zhao, Q., and Veijalainen, J.
(2018). Investigating serendipity in recommender sys-
tems based on real user feedback. In Proceedings of
the 33rd Annual ACM Symposium on Applied Com-
puting, pages 1341–1350.
Liu, Z., Li, M., Liu, Y., and Ponraj, M. (2011). Performance
evaluation of latent dirichlet allocation in text mining.
In 2011 Eighth International Conference on Fuzzy
Systems and Knowledge Discovery (FSKD), volume 4,
pages 2695–2698. IEEE.
Nargesian, F., Zhu, E., Pu, K. Q., and Miller, R. J. (2018).
Table union search on open data. Proceedings of the
VLDB Endowment, 11(7):813–825.
Nikiforova, A. and McBride, K. (2021). Open government
data portal usability: A user-centred usability analysis
of 41 open government data portals. Telematics and
Informatics, 58:101539.
Orkphol, K. and Yang, W. (2019). Word sense disam-
biguation using cosine similarity collaborates with
word2vec and wordnet. Future Internet, 11(5):114.
Osagie, E., Waqar, M., Adebayo, S., Stasiewicz, A., Por-
wol, L., and Ojo, A. (2017). Usability evaluation of
an open data platform. In Proceedings of the 18th An-
nual International Conference on Digital Government
Research, pages 495–504.
Porreca, S., Leotta, F., Mecella, M., Vassos, S., and Catarci,
T. (2017). Accessing government open data through
chatbots. In International Conference on Web Engi-
neering, pages 156–165. Springer.
Rafiq, M., Ashraf, S., Abdullah, S., Mahmood, T., and
Muhammad, S. (2019). The cosine similarity mea-
sures of spherical fuzzy sets and their applications in
decision making. Journal of Intelligent & Fuzzy Sys-
tems, 36(6):6059–6073.
Ristoski, P., Mencía, E. L., and Paulheim, H. (2014). A hy-
brid multi-strategy recommender system using linked
open data. In Semantic Web Evaluation Challenge,
pages 150–156. Springer.
Sowe, S. K. and Zettsu, K. (2015). Towards an open
data development model for linking heterogeneous
data sources. In Knowledge and Systems Engineer-
ing (KSE), 2015 Seventh International Conference on,
pages 344–347. IEEE.
Xu, P., Lu, J., et al. (2019). Towards a unified framework
for string similarity joins. Proceedings of the VLDB
Endowment.
Zhang, C. and Yue, P. (2016). Spatial grid based open
government data mining. In Geoscience and Remote
Sensing Symposium (IGARSS), 2016 IEEE Interna-
tional, pages 192–193. IEEE.