When owl:sameAs is the Same: Experimenting Online Resolution of Identity with SPARQL Queries to Linked Open Data Sources

Raphaël Gazzotti and Fabien Gandon

Université Côte d’Azur, Inria, CNRS, I3S, Sophia-Antipolis, France

Keywords:

Equivalence Links, Coreference Resolution, SPARQL, Linked Data, Data Curation, sameAs.

Abstract:

Equivalence links are the cornerstone of Linked Data and their integration. However, it is not easy to es-

tablish and manipulate them, since the Web is always evolving with datasets emerging and disappearing.

Inconsistencies may also be present on the Web, leading to erroneous assertions and inferences. We propose

a method to identify owl:sameAs relationships of a resource relying on online SPARQL querying of dis-

tributed datasets and to correct results using declarative curation rules. We also exploit and inspect the quality

of owl:InverseFunctionalProperty and owl:FunctionalProperty relationships, using the deﬁnitions

given by their schemata, endpoints and a voting approach. We evaluate our method on an existing bench-

mark and compare to state of the art baselines. We show that a heuristic approach can retrieve high quality

equivalence links without requiring the extraction of all the alleged existing equivalence relations.

1 INTRODUCTION

The ability to establish links is key to weaving the

Web in general and the semantic Web in particu-

lar (Gandon, 2018). But the Web is in constant evo-

lution with resources added and deleted all the time.

Linked data, in particular, rely on our ability to estab-

lish links between the different datasets on the Web

and as such, the detection of equivalence links is a

central task. And as the Web evolves, this linking has

to evolve with it. Establishing equivalences between

resources is also key to data integration use cases

to join knowledge graphs with different provenances.

More recently this ability to combine such graphs also

proved important in machine learning approaches re-

lying on embeddings based on a set of linked graphs

to capture the semantics surrounding a concept more

accurately and more richly.

https://orcid.org/0000-0002-5618-9776

https://orcid.org/0000-0003-0543-1232

Approaches discovering equivalence links that rely on a snapshot of the Web run the risk of capturing relationships that already belong to the past. Equivalence links can also be retrieved by exploiting OWL semantics, e.g., properties of type (inverse) functional can indicate the uniqueness of a resource, leading to the inference of owl:sameAs relationships when different URIs are used as their subject or object (Allemang et al., 2020). Moreover, data on the Web is of variable quality, which requires caution in using it. Therefore, there is a need for on-demand online search and reasoning for equivalence relations. To tackle this problem, we explored the research question: Can valid owl:sameAs relationships for a given URI be detected automatically and online? In this article, we answer the following sub-questions:

• Where to ﬁnd and how to retrieve SPARQL end-

point information to be explored for equivalence

detection?

• How to identify and correct misinformation about

owl:sameAs statements?

• How to detect properties wrongly asserted as owl:InverseFunctionalProperty or owl:FunctionalProperty?

The paper is structured as follows. We survey

the related work in Section 2 and position our con-

tribution. In Section 3 we introduce the deﬁnition of

the equivalence detection task and the vocabulary we

used to solve it. Then, in Section 4 we describe how

we proceed to obtain information on datasets and their

SPARQL endpoints. Section 5 details how we pro-

ceed to collect and curate equivalence links. We eval-

uate the quality of the retrieved equivalent links on a

public benchmark, comparing to state of the art base-

lines, and discuss our results in Section 6. We con-

clude with some perspectives in Section 7.

Gazzotti, R. and Gandon, F.

When owl:sameAs is the Same: Experimenting Online Resolution of Identity with SPARQL Queries to Linked Open Data Sources.

DOI: 10.5220/0010654400003058

In Proceedings of the 17th International Conference on Web Information Systems and Technologies (WEBIST 2021), pages 41-52

ISBN: 978-989-758-536-4; ISSN: 2184-3252

2 RELATED WORK

Surprisingly, there are not so many available services

to establish owl:sameAs relationships for a given

URI.

The sameAs.org service (Jaffri et al., 2008) is one of the pioneers in offering URI coreference resolution.

The equivalences are collected through different RDF

ﬁles and SPARQL endpoints chosen by the authors.

Equivalent URIs are stored with different iden-

tiﬁers depending on the context. Different corefer-

ence contexts are captured by different coreference

resolution services, because depending on the us-

age and context, an equivalence either holds or not.

A context is represented as a bundle attached to a

URI. However, the approach mixes different predicates that can be far from owl:sameAs, e.g., ov:similarTo is a property for things that are not linked by owl:sameAs but that are similar to a certain extent. The system also does not ensure that resources are actually equivalent and proposes only one concept of equivalence. Therefore, it does not make it possible to distinguish the different notions of equivalence indicated by (Halpin et al., 2010) and defined in the Similarity Ontology. In addition, with the online portal, it is unfortunately not possible to distinguish the different contexts considered for a given resource. Unlike the other works, sameAs.org does not use a curation algorithm for owl:sameAs relationships.

The LODsyndesis (Mountantonakis and Tzitzikas, 2018) platform, in addition to providing various services and metrics related to many datasets, also

performs coreference resolution. The algorithm in-

troduced by the authors incrementally uses the same

identiﬁer for pairs of URIs (linked by the owl:sameAs

property) deﬁning the same resource. The authors

show that by leveraging content or graph metrics, er-

roneous equivalence links can be detected. Their ap-

proach relies on the data provided by (Schmachten-

berg et al., 2014) that are crawled from the LOD

Cloud with LDSpider,

on various datasets: Yago,

datahub.io, DBpedia v3.9, Wikidata, Freebase and

LinkLion. Unfortunately, the number of coreferences

that this service can offer is sometimes limited.

PREFIX ov: http://open.vocab.org/terms/

https://web.archive.org/web/20170510073633/http://kakapo.dcs.qmul.ac.uk/ontology/musim/0.2/musim.html

https://demos.isl.ics.forth.gr/lodsyndesis/

https://github.com/ldspider/ldspider

http://sage.univ-nantes.fr/see/sameAs

The sameAs.cc dataset, accessible through the SPARQL endpoint (Raad et al., 2020), exploits the LOD-a-Lot dataset (Fernández et al., 2017) and the Louvain algorithm (Blondel et al., 2008) for detecting communities, an approach that leads them to identify errors between owl:sameAs relations. This approach succeeds in extracting more equivalence links

than in the previous work, LODsyndesis (Mountan-

tonakis and Tzitzikas, 2018). Different human an-

notators evaluated a subset of 200 owl:sameAs rela-

tions based on their descriptions to assess the rele-

vance of these relations based on different thresholds

of degrees of error. They apply their findings on these thresholds to cluster equivalence links that relate to the former U.S. president Barack Obama, providing a valuable benchmark. One of the disadvantages of using the Louvain algorithm is that it must be run several times on the complete set of alleged equivalences to get the “best clusters”, with no assurance of reaching the global maximum of modularity, as it is a greedy and non-deterministic method.

Our approach differs from previous work in that

it is not performed on a locally stored dataset but dy-

namically and online, on the SPARQL endpoints of

many datasets. The list of considered endpoints is

open and can be extended at will with new public

and private endpoints as well as their description rel-

evant to the application. Moreover, it uses other rela-

tions than owl:sameAs to deﬁne equivalence relations

between different resources since we also rely on

the properties owl:InverseFunctionalProperty

and owl:FunctionalProperty. We also pro-

pose several techniques to curate the equiva-

lences found on the ﬂy and the properties de-

clared as owl:InverseFunctionalProperty and

owl:FunctionalProperty. We did this after notic-

ing their deﬁnitions on datasets may differ from what

was intended by the creators of the ontologies (see

Section 5). Our algorithm also relies almost exclu-

sively on SPARQL queries, ensuring a high compat-

ibility with different engines and a declarativity that

brings ﬂexibility and extensibility in the sources and

rules considered. The idea is to provide and evaluate

a mechanism that could be implemented on top of any

SPARQL engine and customized to any application.

3 SAMELIVE APPROACH: ALGORITHM AND TASK DEFINITION

Let $S = \{s_1, \ldots, s_k\}$ be the set of $k$ seed URIs $s_i$ for

which we want to obtain equivalent URIs. Some of

these seed URIs may be equivalent, i.e., they share a

owl:sameAs relationship, which will result in faster

https://github.com/raadjoe/obama-lod-identity-analysis

WEBIST 2021 - 17th International Conference on Web Information Systems and Technologies

convergence of the algorithm. We show this in the

case study described in sub-section 6.3. First we need

to identify a set of endpoints $N = \{E_1, \ldots, E_{|N|}\}$ that we

will query to discover equivalent URIs. The process

to calculate the closure of the equivalent relations is a

greedy incremental algorithm that iterates on a grow-

ing collection of URIs linked by equivalence state-

ments until no more new statement can be obtained

from the set of endpoints N. At each iteration we also

test an extensible set of declarative heuristics to detect

if an equivalence is an error and avoid its transitive

propagation.

Moreover, to ensure portability, distributability,

and federability, the algorithm we have written is pri-

marily in SPARQL 1.1, with calls in SPARQL 1.0

when we contact online SPARQL endpoints. Many

endpoints on the Linked Open Data still do not sup-

port SPARQL 1.1, but the functionalities of this lan-

guage allow us on a local endpoint to process proper-

ties of type (inverse) functional as well as to perform

curation of owl:sameAs relationships and to provide

statistics on the extracted data. We also trace all the

steps of the algorithm by the generation of several

named graphs and with the help of a small dedicated

vocabulary introduced in this section and summarized

in Table 1. We use a set of classes and properties to

represent the input and output of each iteration. To-

gether with the named graphs they capture the state of

each iteration and the provenance of our results.

In the rest of this section, anchors in footnotes re-

fer to labelled functions in the source code. The ini-

tialization of the algorithm consists of:

1. identifying datasets and their corresponding

SPARQL endpoints and populating the named

graph same:N with their description and check-

ing their availability (see Section 4, done with 8

SPARQL queries, 3 of them being used for data

cleaning);

2. populating the named graph $Q_0$ with the set $S$ of seed URIs for which we want to obtain equivalent URIs. These seed URIs are typed as same:Target (done with 1 SPARQL query);

3. identifying and storing the definitions of owl:InverseFunctionalProperty and owl:FunctionalProperty in the named graph kg:default: in a preselection step, the (inverse) functional properties identified from N are placed in the named graph same:Properties along with their namespaces in order to dereference them. If they cannot be dereferenced, they are stored in the named graph same:NotDeferencedProperties to be put to a vote (see sub-section 5.1; done with 12 SPARQL queries, 2 of them being run in a loop).

This step refers to the block corresponding to the first “if” in Algorithm 1.

https://www.w3.org/TR/rdf-sparql-query/

Functions N1 to N3 load information about datasets, CN1 is responsible for data cleaning, A1 is responsible for checking endpoint availability.

Function P1 populates $Q_0$ with seed URIs.

The core of the algorithm (mainly inside the “while” loop of Algorithm 1) consists of iterating on a growing collection of named graphs $Q_i$ for which, at each iteration $i$:

1. we query $Q_{i-1}$ from the previous iteration for instances of same:Target that are not instances of same:Rotten (see Figure 1), and if the result is empty, we stop (done with 1 SPARQL query);

2. for each target URI we query the available end-

points in same:N for owl:sameAs relationships

(done with 1 SPARQL query);

3. for each target URI we query the

available endpoints in same:N for

owl:InverseFunctionalProperty and

owl:FunctionalProperty concerning them

and infer owl:sameAs relations (done with 2

SPARQL queries);

4. the obtained owl:sameAs are added in a speciﬁc

named graph with a name based on the endpoint

and the iteration in which the owl:sameAs rela-

tionship was identiﬁed;

5. we check that the owl:sameAs relationships conform to some rules (described in sub-section 5.2.1); if this is not the case, the identified instances of same:Target become of type same:Rotten and are placed in the named graph $Q_{-1}$ before their relationships are removed: incoming and outgoing owl:sameAs relationships from a same:Rotten are deleted. This also applies to its owl:InverseFunctionalProperty and owl:FunctionalProperty, and the outgoing same:Target from a same:Rotten are deleted if it has no other same:Target incoming relationship (done with 5 SPARQL queries, 3 of them being used to remove relationships);

6. the new resources are added to the named graph $Q_i$.

Function G-(I)FP1 retrieves the properties known as (inverse) functional, LDD-(I)FP1 attempts to dereference and load the RDF document of properties from their namespaces, LDS-(I)FP1 stores as information that an endpoint has the schema of a property, V-(I)FP1 performs the voting on the type of the property.

Function T1 gets the same:Target resources in $Q_{i-1}$.

Function S1 retrieves owl:sameAs relationships.

Function (I)FP1 retrieves the actual instances of (inverse) functional properties and (I)FP2 infers owl:sameAs relationships from them.

The stop condition is that the resources to be explored are exhausted, i.e., there are no more resources to explore for the current iteration $i$ and $Q_i$ is empty. A last check with rule #2 (cf. sub-section 5.2.1) is necessary to ensure that there is no relationship between two resources sharing the same authority (resources obtained at $i$). The pseudo-code of SameLive is given in Algorithm 1.

Algorithm 1: Online resolution of identity with the SameLive algorithm.

    Initialize Vocabulary            // Also initialize Q_{-1}
    Initialize N                     // Availability is checked
    Initialize Q_0
    // Variable to consider or not (inverse) functional properties
    properties_condition = {True, False}
    if properties_condition then
        Search properties typed as (inverse) functional over N
        Dereferencing of these properties
        Search schema of properties not dereferenced over N
        Voting for the properties not dereferenced
    i = 1
    L = length(Q_0)
    while L != 0 do
        Initialize Q_i
        Q_i += Retrieve owl:sameAs relations of Q_{i-1} over N
        if properties_condition then
            Retrieve triples involving Q_{i-1} that have a property typed as (inverse) functional over N
            Q_i += Infer owl:sameAs relations with the triples previously extracted over N
        // Curation rules have an impact on all Q
        Q_{-1} += Apply curation rules
        i += 1
        L = length(Q_{i-1})
    Q_{-1} += Apply curation rule #2
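The loop of Algorithm 1 can be illustrated with a minimal in-memory sketch. It is not the authors' SPARQL implementation: each endpoint is replaced by a hypothetical Python callable returning the URIs it declares equivalent to a target, the curation rules are abstracted into a hypothetical `is_rotten` predicate, and the handling of (inverse) functional properties is left out.

```python
from typing import Callable, Dict, Iterable, Set

# Hypothetical stand-in for a SPARQL endpoint: maps a URI to the URIs
# that the endpoint declares owl:sameAs to it.
Endpoint = Callable[[str], Set[str]]


def same_live(seeds: Iterable[str],
              endpoints: Iterable[Endpoint],
              is_rotten: Callable[[str, Dict[int, Set[str]]], bool]
              ) -> Dict[int, Set[str]]:
    """Iterate until no new target is found; Q[-1] collects rotten URIs."""
    endpoints = list(endpoints)
    Q: Dict[int, Set[str]] = {-1: set(), 0: set(seeds)}
    i = 1
    while Q[i - 1]:
        # URIs already explored or already rejected are never queried again.
        seen = set().union(*Q.values())
        Q[i] = set()
        for target in Q[i - 1]:
            for endpoint in endpoints:
                Q[i] |= endpoint(target) - seen
        # Simplified curation: rejected URIs move to Q[-1] and are not explored.
        rotten = {uri for uri in Q[i] if is_rotten(uri, Q)}
        Q[-1] |= rotten
        Q[i] -= rotten
        i += 1
    return Q
```

The dictionary `Q` plays the role of the named graphs same:Q-1, same:Q0, same:Q1, and so on; the loop stops when the most recent graph is empty.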

4 EXTRACTION AND INTEGRATION OF ENDPOINTS’ INFORMATION

The first step is to identify the datasets that can contribute to solving this problem. Various strategies exist for this purpose, such as relying on search engines, using previously crawled RDF data (e.g., using RFC 8615) or even catalogs referencing these datasets. To this end, we rely on well-known catalogs listing datasets, using their metadata as well as the URLs of their SPARQL endpoints. These catalogs are regularly updated, so we can keep up with the latest updates of the Web of Data. Moreover, this approach supports the addition of new catalogs at any time, including private endpoints.

Function R1 refers to the rule #1 used to detect same:Rotten, respectively R2 refers to the rule #2, RC1 is used to remove relationships; details in Section 5.2.

However, just because a dataset is referenced in a catalog does not mean it is available. To determine whether a SPARQL endpoint is available, we rely on the EndS ontology (Endpoint Status Ontology), which is an extension of VoiD. We use in particular the property ends:statusIsAvailable. This ontology is involved in the description of the different datasets retrieved from the catalogs detailed in the following sub-sections. For each source we create a distinct named graph and then combine all deduplicated results (i.e., distinct SPARQL endpoint URLs) into one initial named graph same:N containing all our sources. Among the catalogs we present, we only rely on the asserted availability of endpoints from YummyData. The other catalogs listed here do not regularly update this property, so we perform the availability check ourselves.
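The merge of per-catalog results into a single set of sources can be sketched as follows. This is only an illustration under assumed data shapes: the dictionary keys `endpoint`, `title` and `source` are hypothetical, and the paper performs this step with SPARQL over named graphs rather than in Python.

```python
from typing import Dict, Iterable, List


def merge_catalogs(catalogs: Dict[str, Iterable[dict]]) -> List[dict]:
    """Merge dataset descriptions, keeping one entry per endpoint URL."""
    merged: Dict[str, dict] = {}
    for source, datasets in catalogs.items():
        for dataset in datasets:
            url = dataset["endpoint"].strip()
            # The first catalog that mentions an endpoint wins; later
            # duplicates of the same URL are ignored.
            merged.setdefault(url, {**dataset, "endpoint": url, "source": source})
    return list(merged.values())
```

Deduplicating on the endpoint URL mirrors the criterion stated above (distinct SPARQL endpoint URLs); an availability flag would then be attached to each merged entry after probing the endpoint.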

4.1 voiD Store

We rely on the voiD store and query it for instances of void:Dataset with their void:sparqlEndpoint to access them. To avoid duplicate entries for the same dataset (datasets can be referred to with several types in the voiD store) or incomplete results, we limit ourselves to the extraction of resources of type void:Dataset and we make sure not to retrieve blank nodes. We also check that these resources have a title (property dcterms:title) and, of course, a SPARQL endpoint (property void:sparqlEndpoint). Any duplicate is eliminated (i.e., datasets using the same SPARQL endpoint), since datasets can be represented several times in the voiD store. The retrieval of the information about endpoints and the removal of the duplicates is carried out in two steps to overcome the limitations of

https://tools.ietf.org/html/rfc8615

PREFIX ends: https://labs.mondeca.com/vocab/endpointStatus/, archive link: https://web.archive.org/web/20210302021149/https://labs.mondeca.com/vocab/endpointStatus/

https://www.w3.org/TR/void/

http://void.rkbexplorer.com

PREFIX dcterms: http://purl.org/dc/terms/


Figure 1: Workflow diagram of the steps followed to retrieve equivalent links from the starting resources in $Q_0$ with the available endpoints in same:N.

Table 1: Main vocabulary introduced for the equivalence discovering algorithm.

| Element | Main type | Role |
|---|---|---|
| same:Rotten | owl:Class | Resource that is source of potentially erroneous relationships. |
| same:Target | owl:Class | Targeted resource for discovering equivalence relationships. |
| same:hasAuthority | owl:DatatypeProperty | Authority of the resource. |
| same:hasIteration | owl:DatatypeProperty | Iteration of the equivalence relationships algorithm during which the named graph was generated. |
| same:hasNamespace | owl:DatatypeProperty | Namespace of the resource. |
| same:statementInDataset | rdf:ObjectProperty | Points to the void:Dataset that a statement is a part of. |
| same:votingType | owl:ObjectProperty | rdf:type of the resource determined after voting. |
| same:N | rdfg:Graph | Named graph that contains information about the void:Dataset resources. |
| same:Properties | rdfg:Graph | Named graph that contains asserted owl:InverseFunctionalProperty and owl:FunctionalProperty. |
| same:PropertiesNotDeferenced | rdfg:Graph | Named graph that contains not dereferenced properties after the LOAD clause. |
| same:Q-1 | rdfg:Graph | Named graph that contains the same:Rotten resources. |
| same:Q0 | rdfg:Graph | Named graph that contains the starting same:Target. |
| same:Qi | rdfg:Graph | Named graph that contains the same:Target retrieved at the i iteration of the equivalence relationships algorithm. |

the voiD store SPARQL endpoint. The endpoint of the voiD store has recently been closed following a decision of the service maintainers; however, our approach still works with the other catalogs and we provide code to integrate its data if someone decides to continue maintaining this service. Our results were obtained while this service was still in operation. A large majority of the core datasets of the Linked Open Data are also referenced by the LODCloud.

4.2 LODCloud

The main purpose of the LODCloud website is to provide a diagram of the LOD cloud. The JSON data used to build it are available, and we translate selected parts of them into RDF, as different data are represented within this JSON document. We first check whether a dataset has an entry related to a SPARQL endpoint (“sparql” field), in which case we include it. If we find information about a voiD page, we store this data as well. However, some of the information contained in the LOD cloud is not always up to date: an endpoint marked as unavailable may in fact be available at the time of our query and vice versa. Therefore, before executing our approach we check the availability of the endpoints to avoid waiting unnecessarily for a response from an unavailable endpoint.

http://void.rkbexplorer.com/sparql

https://lod-cloud.net/

https://lod-cloud.net/lod-data.json

4.3 YummyData

YummyData (Yamamoto et al., 2018) is a site referencing and monitoring various SPARQL endpoints in the biomedical domain. We first get the URL leading to the SPARQL endpoint of the datasets from the JSON data obtained from their API. Then we check whether the datasets have a VoiD annotation (“void” field equal to True), and in such case we remove the suffix “virtuoso/sparql” or “sparql” from the URLs of the SPARQL endpoints and add the new suffix “.well-known/void” to refer to the VoiD page describing the datasets. For instance, to obtain the VoiD page of the Protein Ontology, the URL https://sparql.proconsortium.org/virtuoso/sparql becomes https://sparql.proconsortium.org/.well-known/void.

https://yummydata.org/
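The URL rewriting described above can be sketched in a few lines. The function name `void_url` is hypothetical, and the handling of endpoints that end with neither suffix is an assumption, since the text does not cover that case.

```python
def void_url(endpoint_url: str) -> str:
    """Derive the .well-known/void URL from a SPARQL endpoint URL."""
    # Try the longer suffix first so that "virtuoso/sparql" is removed whole.
    for suffix in ("virtuoso/sparql", "sparql"):
        if endpoint_url.endswith(suffix):
            endpoint_url = endpoint_url[: -len(suffix)]
            break
    # Assumption: URLs without a known suffix simply get the path appended.
    if not endpoint_url.endswith("/"):
        endpoint_url += "/"
    return endpoint_url + ".well-known/void"
```

Applied to the Protein Ontology endpoint quoted in the text, this yields the `.well-known/void` URL given there.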

Now that we have described how we retrieve in-

formation about SPARQL endpoints, we will get to

the heart of the matter by describing the resources

needed by the algorithm and some of its key steps.

5 EQUIVALENCE LINKS RETRIEVAL AND DATA CURATION

5.1 Collecting True Instances of (Inverse) Functional Properties

To establish identity, we collect relevant owl:InverseFunctionalProperty and owl:FunctionalProperty that we store in the named graph same:Properties.

According to the OWL specifications, if $a$, $b$, $c$ are three resources, (i) if there exist inverse functional property relations $ip(a, b)$ and $ip(c, b)$, there also exists an equivalence relation $\text{owl:sameAs}(a, c)$, and (ii) similarly, if there exist functional property relations $fp(a, b)$ and $fp(a, c)$, there exists an $\text{owl:sameAs}(b, c)$ equivalence relation.
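The two inference patterns can be made concrete with a small sketch over in-memory triples. The function name and the tuple representation are illustrative assumptions; the paper performs this inference with SPARQL queries, and the sketch ignores the case where the objects of a functional property are literals.

```python
from collections import defaultdict
from itertools import combinations
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def infer_same_as(triples: Iterable[Triple],
                  inverse_functional: Set[str],
                  functional: Set[str]) -> Set[frozenset]:
    """Return unordered pairs of resources inferred to be owl:sameAs."""
    subjects_by_po = defaultdict(set)  # (p, o) -> subjects, for pattern (i)
    objects_by_sp = defaultdict(set)   # (s, p) -> objects, for pattern (ii)
    for s, p, o in triples:
        if p in inverse_functional:
            subjects_by_po[(p, o)].add(s)
        if p in functional:
            objects_by_sp[(s, p)].add(o)
    pairs: Set[frozenset] = set()
    groups = list(subjects_by_po.values()) + list(objects_by_sp.values())
    for group in groups:
        # Every two members of a group denote the same resource.
        for x, y in combinations(sorted(group), 2):
            pairs.add(frozenset((x, y)))
    return pairs
```

Because the inference is only as reliable as the property typing, a property wrongly declared (inverse) functional would generate spurious pairs here, which is what the curation steps of this section aim to prevent.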

However, depending on the SPARQL endpoints, the properties’ definitions may vary and contain improper usage of these types, or at least usage that should be limited to a local closed world. For instance, rdfs:label is defined as an owl:InverseFunctionalProperty in the National Digital Data Archive of Hungary, and agront:isPartOfSubvocabulary is defined as an owl:FunctionalProperty by the LusTRE endpoint. Table 2 provides some statistics we computed about this problem. The following steps (sections 5.1.1 to 5.1.3) refer to the block corresponding to the first “if” in Algorithm 1 and to item 3 in the first list in Section 3.

See the VoiD documentation on “discovering”: https://www.w3.org/TR/void/#well-known.

https://www.w3.org/TR/2012/REC-owl2-syntax-20121211/#Inverse-Functional_Object_Properties

https://www.w3.org/TR/2012/REC-owl2-syntax-20121211/#Functional_Object_Properties

http://lod.sztaki.hu/sparql

http://aims.fao.org/aos/agrontology#isPartOfSubvocabulary

(the endpoint http://linkeddata.ge.imati.cnr.it:8890/sparql includes the EARTh -Environmental Applications Reference THesaurus- and ThIST -Italian Thesaurus of Earth Sciences- datasets)

5.1.1 Initialization

We first query all available endpoints (property ends:statusIsAvailable set to true) in the named graph same:N to return their properties of type owl:InverseFunctionalProperty and owl:FunctionalProperty. Properties defined as blank nodes are filtered out. Then, these data are stored as RDF* triples (Hartig and Thompson, 2014) in the named graph same:Properties, and we keep the dataset from which they come with the same:statementInDataset property. These triples are stored in RDF* to record the statements typing properties together with the provenance dataset in which they were found, e.g.:

<<foaf:firstName a owl:FunctionalProperty>>

same:statementInDataset <http://uriburner.com/>

We then perform two types of curation: curation by schema and curation by voting. Properties for which the definition could neither be dereferenced nor found in the SPARQL endpoints are excluded.

5.1.2 Curation by Schema

To ensure that the use of these properties corresponds to what was intended by the creators of the ontologies, we extract the schemas of these properties by dereferencing them. To do so, we use the namespaces of these properties, which we have inserted into our local triplestore in the named graph same:Properties using the property same:hasNamespace, and execute the SPARQL update:

LOAD SILENT <namespace> INTO GRAPH kg:default;

Using namespaces to retrieve RDF documents is

more efﬁcient since it avoids downloading multi-

ple times the same document where multiple prop-

erties of type owl:InverseFunctionalProperty

and owl:FunctionalProperty are deﬁned. These

documents are then loaded into the named graph

kg:default. This ﬁrst step already corrects some as-

sertions about properties (see Table 2).
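A rough sketch of the namespace grouping is given below. The splitting heuristic (cut after the last `#`, otherwise after the last `/`) is an assumption, as the paper does not state how same:hasNamespace is computed, and both function names are hypothetical.

```python
from typing import Iterable, List


def namespace_of(property_uri: str) -> str:
    """Heuristic namespace: up to the last '#', else up to the last '/'."""
    if "#" in property_uri:
        return property_uri.rsplit("#", 1)[0] + "#"
    return property_uri.rsplit("/", 1)[0] + "/"


def load_statements(property_uris: Iterable[str]) -> List[str]:
    """One LOAD per distinct namespace, so each document is fetched once."""
    namespaces = sorted({namespace_of(uri) for uri in property_uris})
    return [f"LOAD SILENT <{ns}> INTO GRAPH kg:default" for ns in namespaces]
```

Grouping by namespace is what avoids downloading the same RDF document several times when it defines several (inverse) functional properties.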

5.1.3 Curation by Voting

In a second step, we handle properties for which no schema could be dereferenced. We have placed these properties in the named graph same:NotDeferencedProperties to distinguish them from the others. We perform queries on the available endpoints in the set N.


After identifying whether the endpoints have

the schemas of a property, we count the number of

endpoints including the schemas against the property

types so that we can perform a vote on the most

accepted deﬁnition of a property. The condition for

assigning the type (property same:votingType)

owl:InverseFunctionalProperty and/or

owl:FunctionalProperty is that a property

must be deﬁned as such, by at least half of the

SPARQL endpoints that store a schema for the

given property. The statistics to compute the type

of these properties are stored in the named graph

same:PropertiesNotDeferencedStatistics. For

this purpose we declared several properties such as same:nbOfTimesDefinedAsFunctionalProperty and same:inNbOfDatasetWithSchema. By this means, we are able to identify, for example, the property sparql:endpoint as an owl:InverseFunctionalProperty and the property semsci:CHEMINF_000009 as an owl:FunctionalProperty.

From the set of curated properties, we can infer

owl:sameAs relationships of the same:Target of the

current Qi with the available endpoints of same:N.
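The voting condition of Section 5.1.3 (a type is assigned when at least half of the endpoints storing a schema for the property declare it) can be sketched as follows. The input structure is a hypothetical simplification of the statistics stored in same:PropertiesNotDeferencedStatistics.

```python
from typing import Dict, Set

IFP = "owl:InverseFunctionalProperty"
FP = "owl:FunctionalProperty"


def vote(types_by_endpoint: Dict[str, Set[str]]) -> Set[str]:
    """Types voted for one property.

    types_by_endpoint maps each endpoint that stores a schema for the
    property to the set of rdf:type values it asserts for that property.
    """
    total = len(types_by_endpoint)
    if total == 0:
        return set()
    voted: Set[str] = set()
    for candidate in (IFP, FP):
        count = sum(1 for types in types_by_endpoint.values() if candidate in types)
        # "At least half" of the endpoints that have a schema.
        if 2 * count >= total:
            voted.add(candidate)
    return voted
```

Both types can be returned for the same property, consistent with the "and/or" wording of the condition.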

5.2 Extraction of owl:sameAs Properties

Now that we have inspected and corrected the

type of the owl:InverseFunctionalProperty and

owl:FunctionalProperty properties, it is possi-

ble to infer owl:sameAs relationships. To do this,

we extract the owl:sameAs relationships for the

same:Target resources and curate them by following an iterative mechanism. The goal is not to query the endpoints twice about the same same:Target resource for its owl:sameAs relationships. For

this purpose, these resources are stored in different

named graphs same:Qi at each iteration i. We query

the available endpoints in N with the same:Target

of the current iteration (in same:Qi) and we store the

resulting equivalent URIs in the next named graph

same:Qi+1. A ﬁlter clause ensures that we do not

store an already existing same:Target in the named

graph same:Qi+1. The stop condition of the algo-

rithm is to exhaust all instances of same:Target.

This step refers to the block corresponding to the

“while” loop of algorithm 1.

Working with live endpoints of the World Wild Web, we pay attention to technical details such as the fact that we have to query some endpoints only with URIs made of ASCII characters, since we have identified some SPARQL endpoints that do not support non-ASCII characters. We actually extract and store URIs containing non-ASCII characters, but we filter them out when querying these endpoints using the following SPARQL clause: FILTER(!REGEX(str(?URITarget), "[^\x00-\x7F]", "i")). This point can be improved in future work with a more detailed description of SPARQL access points. We also extract the authority component of the URIs and store them with the same:hasAuthority property. This will be used to curate owl:sameAs relationships.

http://www.w3.org/ns/sparql-service-description#endpoint

https://semanticscience.org/resource/CHEMINF_000009
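Both technical details have direct counterparts in the Python standard library, shown here as a small sketch. The function names are hypothetical, and the paper performs these operations in SPARQL rather than in Python.

```python
from urllib.parse import urlsplit


def is_ascii_uri(uri: str) -> bool:
    """True when the URI contains only ASCII characters."""
    return uri.isascii()


def authority(uri: str) -> str:
    """Authority component of a URI in the sense of RFC 3986."""
    return urlsplit(uri).netloc
```

Note that `netloc` includes the port when one is present, so two endpoints on the same host but different ports are treated as distinct authorities in this sketch.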

5.2.1 Curation of the owl:sameAs Links

We carry out the curation of owl:sameAs relation-

ships assuming that equivalent resources with URIs

deﬁned within the same authority must explicitly be

asserted as equivalent by that authority. These links were obtained through a direct owl:sameAs relationship or inferred from an owl:InverseFunctionalProperty or an owl:FunctionalProperty. Thus, we ensure that the resources linked by owl:sameAs relationships comply with the rules defined below.

We identiﬁed two different patterns for erroneous

relationships linked through resources that we call

same:Rotten. More patterns may be added to fur-

ther extend the constraints that equivalent resources

must meet.

For complexity reasons we split the curation be-

tween two rules: one operating on the results from the

same authority obtained at distinct iterations (rule #1)

and one operating on the results of the same authority

at the same iteration (rule #2):

• Rule #1: Let a: be the prefix of an authority and b: the prefix of another authority. If there is a sameAs(a:1,b:1) relationship and a sameAs(b:1,a:2) relationship and no sameAs(a:1,a:2) relationship, this rule states that b:1 is of type same:Rotten. This rule is applied regardless of the length of the path of owl:sameAs between a:1 and a:2.

• Rule #2: Let a: be the prefix of an authority and b: the prefix of another authority. If there is a sameAs(a:1,b:1) relationship and a sameAs(a:1,b:2) relationship and no sameAs(b:1,b:2) relationship, this rule states that b:1 and b:2 are both of type same:Rotten.

e.g. http://linkedlifedata.com/sparql

URI = scheme:[//authority]path[?query][#fragment] in RFC 3986


Figure 2 displays two examples on which these rules are applied. Once resources are identified as same:Rotten, we delete the owl:sameAs relations involving them and the same:Target resources resulting from them. Before the application of the two rules above, all resources (a:1, a:2, b:1, b:2) are of type same:Target. Contrary to other methods, this approach does not require the extraction of all the alleged existing equivalence relations to assess their quality, and it also prunes as early as possible the exploration of low-quality equivalences and their transitive closure.

6 EXPERIMENTAL PROTOCOL AND EVALUATION

6.1 Quantitative Evaluation on Linked Open Data

Table 2 shows statistics about the properties of type owl:InverseFunctionalProperty and owl:FunctionalProperty found online and on which we applied the process described in Section 5.1 to collect true instances of these types. Only 22% of the queried RDF documents that a priori contained properties of type owl:InverseFunctionalProperty or owl:FunctionalProperty could be loaded. Approximately 17% of the properties claimed to be of type owl:FunctionalProperty were verified as such by voting or dereferencing, while the properties of type owl:InverseFunctionalProperty were verified as such in 60% of the cases. The large majority of the properties of type owl:InverseFunctionalProperty were dereferenced, while about half of the properties of type owl:FunctionalProperty were identified by voting.
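The verification rates quoted above can be recomputed from the counts of Table 2. The short sketch below assumes that each rate is the final number of properties of a type divided by the number of properties asserted as that type; it also checks that the votes for and against each type add up to the number of properties that took part in the vote. The variable names are illustrative.

```python
# Counts copied from Table 2.
IFP = "owl:InverseFunctionalProperty"
FP = "owl:FunctionalProperty"

asserted = {IFP: 3784, FP: 13298}
final = {IFP: 2288, FP: 2261}
voted_yes = {IFP: 123, FP: 1111}
voted_no = {IFP: 1097, FP: 109}
voters = 1220

# Assumption: verification rate = final count / asserted count.
shares = {p: round(100 * final[p] / asserted[p], 1) for p in asserted}
for prop, share in shares.items():
    print(f"{prop}: {share}% verified")
    # Votes for and against must cover every property that was voted on.
    assert voted_yes[prop] + voted_no[prop] == voters
# owl:InverseFunctionalProperty: 60.5% verified
# owl:FunctionalProperty: 17.0% verified
```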

6.1.1 Protocol, Dataset and Baselines

The experiments were conducted on an HP EliteBook 840 G2 (2.6 GHz, 16 GB RAM) with a virtual environment under Python 3.8.5 and the software Corese Semantic Web Factory (https://project.inria.fr/corese/) (Corby and Zucker, 2010), version 4.1.6d, deployed locally. Corese is used as a local triplestore on which we mainly use SPARQL 1.1 Query and Update features. We evaluated our approach on the Barack Obama identity links knowledge graph developed by (Raad et al., 2020).

Regarding the initialization, we declared for our algorithm the target URI (dbr:Barack_Obama, of type same:Target) in the named graph same:Q0. The closures on the Barack Obama entity against which we compare (sameAs.cc (Raad et al., 2020), LODsyndesis (Mountantonakis and Tzitzikas, 2018) and sameAs.org (Jaffri et al., 2008)) and our approach, SameLive, are the following:

• Ground truth: The manual annotation of the

closure of the owl:sameAs extracted from the

LOD-a-lot dataset distributed into 8 identity sets

(Barack Obama, Obama’s Presidency, Obama’s

Presidential Transition, Obama’s Senate Career,

Obama’s Presidential Centre, Obama’s Biogra-

phy, Obama’s Photos, Black President). The

undetermined URIs included in the identity

cluster about Barack Obama are essentially URIs

for which we do not have enough semantics to

conﬁdently annotate them.

• sameAs 0.99: Results after removing the relations with an error degree greater than 0.99 with the method used by sameAs.cc, where the error degree is based on the communities resulting from the Louvain algorithm. This approach leads to two identity sets, B1 and B2, and enabled the separation of the URIs referring to Obama's presidency and his presidential transition from the other identity sets. However, these two sets are still inconsistent since they do not make it possible to perform a closure on a single real-world entity.

• sameAs 0.4: Results after removing the relations with an error degree greater than 0.4 in the method of sameAs.cc. This approach leads to 219 identity sets (C1 to C219), only one of which contains more than one URI.

• sameAs.org: Results were obtained through the

API of sameAs.org,

and we corrected URI en-

coding issues to avoid counting them as new ones.

• LODsyndesis: To detect errors, both clustering

with similarity function (content based detection)

and shortest path between a pair of instances

(graph based detection) are used. Results were

obtained through the API of LODsyndesis,

and

we corrected their encoding issues too.

https://github.com/raadjoe/obama-lod-identity-analysis

PREFIX dbr: http://dbpedia.org/resource/

http://sameas.org/rdf?uri=http://dbpedia.org/resource/Barack_Obama

https://demos.isl.ics.forth.gr/lodsyndesis/rest-api/objectCoreference?uri=http://dbpedia.org/resource/Barack_Obama

WEBIST 2021 - 17th International Conference on Web Information Systems and Technologies

Figure 2: Examples of resources identiﬁed as same:Rotten (green dots) in a path of owl:sameAs relations with the rules

mentioned in the sub-section 5.2.1. The red pause icon indicates that there is no owl:sameAs relationship between two

resources.

Table 2: Statistics on the properties of type owl:InverseFunctionalProperty and owl:FunctionalProperty. The incorrect properties are stated as such after loading the RDF document defining them.

Category                                                  Number of elements
Properties asserted as owl:InverseFunctionalProperty      3784
Properties asserted as owl:FunctionalProperty             13298
Not dereferenced properties among those extracted         14192
Incorrect owl:InverseFunctionalProperty properties        10
Incorrect owl:FunctionalProperty properties               35
Properties participating in the voting process            1220
Properties voted as owl:InverseFunctionalProperty         123
Properties voted as owl:FunctionalProperty                1111
Properties voted as not owl:InverseFunctionalProperty     1097
Properties voted as not owl:FunctionalProperty            109
Final number of owl:InverseFunctionalProperty             2288
Final number of owl:FunctionalProperty                    2261
Number of loaded RDF documents                            168
Number of missing RDF documents                           781

• SameLive: Our closure performed on the data extracted from the SPARQL endpoints indexed by the voiD Store, the LODCloud and the YummyData websites. O1 only includes resources of type same:Target, while O2 contains resources of both types, same:Target and same:Rotten.
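The initialization step described before this list (declaring the seed URI as same:Target in the named graph same:Q0) could be expressed as a SPARQL 1.1 Update sent to the local triplestore. The sketch below only builds the update string; the namespace bound to the same: prefix is an assumption derived from the vocabulary URL, and the helper name is hypothetical.

```python
# Assumed namespace for the same: prefix (only the vocabulary URL is given in the paper).
SAME_NS = "https://ns.inria.fr/same/same.owl#"
DBR_NS = "http://dbpedia.org/resource/"


def build_init_update(seed_uri, graph="Q0"):
    """Build a SPARQL 1.1 Update typing the seed as same:Target in same:<graph>."""
    return (
        f"PREFIX same: <{SAME_NS}>\n"
        "INSERT DATA {\n"
        f"  GRAPH same:{graph} {{ <{seed_uri}> a same:Target . }}\n"
        "}"
    )


update = build_init_update(DBR_NS + "Barack_Obama")
print(update)
```

The resulting string can then be submitted through the update interface of whichever triplestore is used locally.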

6.1.2 Closure Evaluation

Table 3, based on the work of (Raad et al., 2020),

reports the results of our approach compared to

sameAs.cc, LODsyndesis and sameAs.org on the

Barack Obama identity links knowledge graph.

6.1.3 Discussion

After three iterations of the algorithm, we reached the closure of our solution. Computing the closure on dbr:Barack_Obama takes 40 minutes when considering only owl:sameAs relationships, and 21 and a half hours when also considering the owl:InverseFunctionalProperty and owl:FunctionalProperty properties, on the regular laptop setup described in Section 6.1.1. The execution time comes mainly from the endpoint querying. We intend to look at approaches to improve this with timeouts and further parallel querying. We are currently working on a cluster version of SameLive for this purpose. In terms of statistics, this represents 1 starting URI of type same:Target (in Q0), 130 in Q1, 9 in Q2, 1 in Q3 and 53 same:Rotten in Q−1.
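As a quick consistency check, the per-graph counts above can be added up and compared with the totals reported for SameLive in Table 3 (141 URIs of type same:Target, and 194 URIs once the same:Rotten ones are included). The variable names are illustrative.

```python
# Number of same:Target resources per named graph, as reported in the text.
targets_per_graph = {"Q0": 1, "Q1": 130, "Q2": 9, "Q3": 1}
rotten = 53  # same:Rotten resources collected in Q-1

total_targets = sum(targets_per_graph.values())
total_resources = total_targets + rotten

print(total_targets)    # 141
print(total_resources)  # 194
```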

Assessing our results on this graph puts our approach at a clear disadvantage because the graph is based on a 2015 snapshot of the Linked Open Data (LOD). As a result, links that have since disappeared are not included in our approach. However, these are the best baselines as far as we know, and our crawling-free online approach obtains results that are slightly lower than, but comparable to, those of the best approach, sameAs 0.4, proposed by (Raad et al., 2020). Moreover, thanks to its online nature, SameLive is able to identify a total of 141 equivalent resources of type same:Target and 53 of type same:Rotten (difference between sets O2 and O1 on the total of URIs), and 67 new URIs about Barack Obama compared to the LOD-a-lot dataset. As an example, we identified as same:Rotten URIs coming from URIBurner that include resources of dubious quality such as "The Irishman" on Netflix. 11 resources of type same:Rotten are considered as belonging to the identity set of Barack Obama (difference between sets O2 and O1 on the entity set about Barack Obama).

Table 3: Comparison of the owl:sameAs closures on dbr:Barack_Obama.

Real World Entity A         Ground   sameAs 0.99   sameAs  sameAs.org  LODsyndesis       SameLive
                             truth     B1     B2      0.4                               O1     O2
Barack Obama                   260    260      0      120         240           19     105    116
Other Real World Entity         78     10     68        0          22            0       0      0
New URIs outside A               0      0      0        0         413           14      27     67
Undetermined URIs              102     92     10        1          32            4       9     11
Identity Sets                    1          2         219           1            1       1      1
Total URIs in Identity Set     440    362     78      121         707           37     141    194

The SPARQL query we use to detect potential errors eliminates redirect links in a dataset if they are not declared as owl:sameAs in it (or, by extension, if there is no owl:InverseFunctionalProperty or owl:FunctionalProperty relationship in the same dataset). To increase this coverage with our approach, we would have to include properties specific to redirects (such as the property dbo:wikiPageRedirects with DBpedia).
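To make the possible extension to redirects more concrete, the following sketch builds a SPARQL query that lists the DBpedia resources redirecting to a given target through dbo:wikiPageRedirects. It is not part of SameLive: the helper name is hypothetical, and how such redirects would then be typed (same:Target or not) is left open.

```python
DBO_NS = "http://dbpedia.org/ontology/"


def build_redirect_query(target_uri):
    """Build a SELECT query listing resources that redirect to target_uri."""
    return (
        f"PREFIX dbo: <{DBO_NS}>\n"
        "SELECT DISTINCT ?redirect WHERE {\n"
        f"  ?redirect dbo:wikiPageRedirects <{target_uri}> .\n"
        "}"
    )


query = build_redirect_query("http://dbpedia.org/resource/Barack_Obama")
print(query)
```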

6.2 Qualitative Evaluation on Specific Examples

e.g., http://linkeddata.uriburner.com/about/id/entity/https/www.nytimes.com/2019/12/06/business/media/irishman-scorsese-netflix-ratings.html?smid=tw-nytimes&smtyp=cur#entity_534366

PREFIX dbo: http://dbpedia.org/ontology/

From the Barack Obama identity links knowledge graph, one can notice that some of the resources evaluated as being of the same nature as dbr:Barack_Obama are no longer valid (e.g., http://www.ontosearch.com/2008/01/identification/EID-3b6e3fb1eb4bef8e669277e73d2e7d56, which is not visible on web.archive.org and is now an advertising website), that URLs redirecting to valid resources are considered valid (e.g., http://dbpedia.org/resource/44th_president_of_the_united_states_of_america), and that other, more questionable resources are also considered valid.

Depending on the application, a high degree of confidence in the equivalence links may be unnecessary, in which case the links proposed by sameAs 0.99 may suffice. But when one really wants to obtain owl:sameAs relations, only sameAs 0.4, LODsyndesis and our approach provide an answer. In the graph of identity links about Barack Obama, all entities are more or less related to Obama. However, this is not the case for all the graphs obtained by extracting owl:sameAs relationships (e.g., previous extractions present on the English DBpedia endpoint linked dbr:Berlin to dbr:Tirana, dbr:Gaspé_Peninsula, dbr:Jersey_City,_New_Jersey, dbr:Point_Reyes_National_Seashore and dbr:Flint,_Michigan, and this is still the case on sameAs.org). Thus, even if we lower our similarity expectations, the sameAs 0.99 approach does not guarantee that its results on such graphs are free of wrong relationships.

Our system, unlike sameAs.cc, rejects a redirect to a resource if it does not have a direct owl:sameAs relationship with that resource. This choice may be an advantage for error detection. However, the public benchmark on which we evaluate penalizes it.

Another important point is that two resources belonging to the same identity cluster may result in the generation of different equivalence links depending on the starting resources. The reason is that we do not process the whole graph of equivalences: the greedy algorithm starts from the set S of seeds and stops as soon as possible to avoid propagating errors or querying the endpoints more than necessary.

6.3 Application Evaluation on Specific Use Case

e.g., http://dbpedia.org/resource/B-Rock_%22The_Islamic_Shock%22_Hussein_Superallah_Obama

The source code of the project and its instructions are available at https://github.com/Wimmics/SameLive and the algorithm vocabulary at https://ns.inria.fr/same/same.owl. The algorithm has two modes, which consider or ignore the owl:InverseFunctionalProperty and owl:FunctionalProperty properties. We also performed an evaluation on a real application: the integration of the knowledge graphs obtained from different named entity recognition and linking methods applied to the same corpus. In this type of use case, to consolidate the knowledge graph and support further processing and visualization, it is important to detect that two URIs extracted by two different annotators in fact identify the same resource. We tested our approach on the public dataset CovidOnTheWeb (Michel et al., 2020), where the authors extract named entities but do not propose equivalence links between the URIs produced by different annotators. As an example, we report here the results obtained on the article "COVID-19: what has been learned and to be learned about the novel coronavirus disease" (Yi et al., 2020), and more precisely on the 312 distinct entities from DBpedia and Wikidata extracted from it with semantic annotators.

By exploring only the owl:sameAs relationships on these 312 starting URIs (in Q0), the algorithm ends after 6 iterations with 1526 same:Target in Q1, 3762 in Q2, 1797 in Q3, 78 in Q4, 71 in Q5 and 2 in Q6. We computed a total of 6001 same:Rotten in Q−1, correcting errors such as, for instance, the confusion between URIs coming from the French chapter of DBpedia that identify the kidney and a specific portion of it called the "distal convoluted tubule". Among the 312 seed URIs, 32 are determined as equivalent. Some of these 32 resources considered as equivalent are not available in sameAs.cc, sameAs.org and LODsyndesis. This is the case, for example, for the respiratory syndrome caused by SARS coronavirus 2 in DBpedia and Wikidata, and for the strain of COVID-19 itself in these two sources. Interestingly, these resources about COVID-19 do not so far have any owl:sameAs relationship in the English chapter of DBpedia.
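Once curated owl:sameAs links are available, finding which seed URIs are equivalent amounts to grouping the seeds that end up in the same connected component. The sketch below does this with a small union-find structure. It is a simplification of SameLive, which explores links greedily from the seeds rather than over a complete link set; the intermediate URI `http://example.org/covid19` is purely hypothetical, and the other URIs are taken from the examples of this section.

```python
def group_equivalent_seeds(seeds, same_as_pairs):
    """Group the seeds connected through the given owl:sameAs pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, o in same_as_pairs:
        parent[find(s)] = find(o)  # merge the two components

    groups = {}
    for seed in seeds:
        groups.setdefault(find(seed), []).append(seed)
    # Keep only the groups holding at least two seeds.
    return [sorted(g) for g in groups.values() if len(g) > 1]


DBP_COVID = "http://dbpedia.org/resource/Coronavirus_disease_2019"
WD_COVID = "http://www.wikidata.org/entity/Q84263196"
FR_KIDNEY = "http://fr.dbpedia.org/resource/Rein"
BRIDGE = "http://example.org/covid19"  # hypothetical intermediate resource

links = [(DBP_COVID, BRIDGE), (BRIDGE, WD_COVID)]
groups = group_equivalent_seeds([DBP_COVID, WD_COVID, FR_KIDNEY], links)
print(groups)
# [['http://dbpedia.org/resource/Coronavirus_disease_2019',
#   'http://www.wikidata.org/entity/Q84263196']]
```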

http://ns.inria.fr/covid19/0eadf5a901c0d89fad2c202990056556be103e12

e.g., http://fr.dbpedia.org/resource/Tubule_distal and http://fr.dbpedia.org/resource/Rein

e.g., http://dbpedia.org/resource/Coronavirus_disease_2019 and http://www.wikidata.org/entity/Q84263196

e.g., http://dbpedia.org/resource/Severe_acute_respiratory_syndrome_coronavirus_2 and http://www.wikidata.org/entity/Q82069695

7 CONCLUSION

We proposed a method to identify owl:sameAs relationships relying on the online SPARQL querying of distributed datasets and using heuristic rules to correct results. We also exploit and inspect the quality of owl:InverseFunctionalProperty and owl:FunctionalProperty relationships, using the definitions given by endpoints and a voting approach. We show that a heuristic approach can retrieve high quality equivalence links without requiring the extraction of all the alleged existing equivalence relations. Moreover, other algorithms (community detection, similarity function, graph-based metric, etc.) can be used in addition to, or instead of, the curation rules we have implemented.

Because our algorithm works online, it may return different results for the same input, for example if a SPARQL endpoint does not answer. Conversely, our method gives the user the possibility to include or exclude endpoints (public or private) on the fly, and thus to include locally stored datasets as long as a SPARQL endpoint is deployed.

It may be interesting to include other dataset catalogs, such as the ones using the CKAN API. Integrating SPARQL Micro-Services (Michel et al., 2018) or other mapping approaches on top of these catalogs would allow us to query them directly with SPARQL instead of first manipulating their data with another language. The use of such techniques could also help us extend our method to include non-RDF datasets.

https://docs.ckan.org/

PREFIX wikibase: http://wikiba.se/ontology#

Extracting owl:InverseFunctionalProperty, owl:FunctionalProperty and owl:sameAs relationships is not enough for some datasets. For instance, Wikidata formalizes equivalence relationships with identifiers (typed as external identifiers, wikibase:ExternalId, using wikibase:propertyType) leveraging URI patterns. Also, although they seem to be rarely used, we intend to study the inclusion of owl:hasKey to have a modular and extensive set in the analysis of equivalence links. In addition, we plan to consider redirects, including those between URIs that differ in their scheme component, such as HTTP and HTTPS. All these extensions could be addressed by adding rules, some of which could be the result of mining and learning approaches. Dereferencing resources deemed as same:Rotten and comparing them to same:Target resources with a similarity metric is also a direction to obtain more results. Exploiting the underlying semantics (e.g., owl:differentFrom, owl:AllDifferent...) of the resources is also worth exploring.
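As one possible reading of the planned handling of scheme variants, two URIs that differ only in their HTTP or HTTPS scheme could be mapped to the same comparison key. The sketch below is an assumption about how such a rule might look and is not part of the current implementation; the function name is hypothetical.

```python
from urllib.parse import urlsplit


def scheme_insensitive_key(uri):
    """Comparison key that ignores the difference between http and https."""
    parts = urlsplit(uri)
    if parts.scheme in ("http", "https"):
        # The scheme is dropped; the host name is compared case-insensitively.
        return (parts.netloc.lower(), parts.path, parts.query, parts.fragment)
    return (uri,)


http_uri = "http://www.wikidata.org/entity/Q84263196"
https_uri = "https://www.wikidata.org/entity/Q84263196"
other_uri = "http://www.wikidata.org/entity/Q82069695"

print(scheme_insensitive_key(http_uri) == scheme_insensitive_key(https_uri))  # True
print(scheme_insensitive_key(http_uri) == scheme_insensitive_key(other_uri))  # False
```

Whether two such URIs should then be declared owl:sameAs, or only treated as redirect candidates, remains a design decision for the curation rules.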

Finally, we intend to follow different leads to improve the performance of our approach in terms of speed, from query optimization (in particular for the owl:InverseFunctionalProperty and owl:FunctionalProperty properties) to further parallel querying. We will also study the software and hardware architecture needed to provide a web service with a caching system. We also plan to further exploit the monitoring capabilities of SPARQL, for example by using the PROV-O ontology to better track the provenance of results. This could be used in particular for owl:sameAs relationships stored in specific named graphs (see Section 3, item 4 in the second list). We also want to exploit timestamps to, among other things, re-run in a timely manner queries executed a long time ago, or to query an endpoint that was previously unavailable.

REFERENCES

Allemang, D., Hendler, J., and Gandon, F. (2020). Semantic

Web for the Working Ontologist. Number 3. ACM.

Blondel, V. D., Guillaume, J.-L., Lambiotte, R., and Lefeb-

vre, E. (2008). Fast unfolding of communities in large

networks. Journal of statistical mechanics: theory

and experiment, 2008(10):P10008.

Corby, O. and Zucker, C. F. (2010). The kgram abstract

machine for knowledge graph querying. In Web In-

telligence and Intelligent Agent Technology (WI-IAT),

volume 1, pages 338–341. IEEE.

Fernández, J. D., Beek, W., Martínez-Prieto, M. A., and Arias, M. (2017). Lod-a-lot. In International semantic web conference, pages 75–83. Springer.

Gandon, F. (2018). A Survey of the First 20 Years of Research on Semantic Web and Linked Data. Revue des Sciences et Technologies de l'Information - Série ISI : Ingénierie des Systèmes d'Information.

Halpin, H., Hayes, P. J., McCusker, J. P., McGuinness,

D. L., and Thompson, H. S. (2010). When owl:

sameas isn’t the same: An analysis of identity in

linked data. In International semantic web conference,

pages 305–320. Springer.

Hartig, O. and Thompson, B. (2014). Foundations of an al-

ternative approach to reiﬁcation in rdf. arXiv preprint

arXiv:1406.3399.

Jaffri, A., Glaser, H., and Millard, I. (2008). Managing URI

synonymity to enable consistent reference on the se-

mantic web. In Proceedings of the 1st IRSW2008 In-

ternational Workshop on Identity and Reference on the

Semantic Web.

Michel, F., Faron-Zucker, C., and Gandon, F. (2018). SPARQL micro-services: Lightweight integration of web apis and linked data. In Workshop on Linked Data on the Web co-located with The Web Conference 2018, LDOW@WWW 2018.

https://www.w3.org/TR/prov-o/

Michel, F., Gandon, F., Ah-Kane, V., Bobasheva, A.,

Cabrio, E., Corby, O., Gazzotti, R., Giboin, A.,

Marro, S., Mayer, T., et al. (2020). Covid-on-the-web:

Knowledge graph and services to advance covid-19

research. In International Semantic Web Conference,

pages 294–310. Springer.

Mountantonakis, M. and Tzitzikas, Y. (2018). Scalable

methods for measuring the connectivity and quality of

large numbers of linked datasets. Journal of Data and

Information Quality (JDIQ), 9(3):1–49.

Raad, J., Beek, W., van Harmelen, F., Wielemaker, J., Pernelle, N., and Saïs, F. (2020). Constructing and cleaning identity graphs in the lod cloud. Data Intelligence, 2(3):323–352.

Schmachtenberg, M., Bizer, C., and Paulheim, H. (2014).

Adoption of the linked data best practices in different

topical domains. In International Semantic Web Con-

ference, pages 245–260. Springer.

Yamamoto, Y., Yamaguchi, A., and Splendiani, A. (2018).

YummyData: providing high-quality open life science

data. Database, 2018.

Yi, Y., Lagniton, P. N., Ye, S., Li, E., and Xu, R.-H. (2020).

Covid-19: what has been learned and to be learned

about the novel coronavirus disease. International

journal of biological sciences, 16(10):1753.
