A Linked Data Approach for Querying Heterogeneous Sources

Assisting Researchers in Finding Answers to Complex Clinical Questions

Nikolaos Matskanis

, Vassiliki Andronikou

, Philippe Massonet

Kostas Mourtzoukos

and Joseph Roumier

Centre d’ Excellence en Technologies de l’ Information et de la Communication (CETIC),

Rue des Frères Wright 29/3, Chareroi, B-6041, Belgium

Telecommunications Laboratory Department of Electrical & Computer Engineering,

National Technical University of Athens, 9 Heroon Polytechniou, 15773, Athens, Greece

Keywords: Semantic Search, Heterogeneous Data Sources Querying, Semantic Aggregation of Data.

Abstract: Clinical trials for drug repositioning aim at evaluating the effectiveness and safety of existing drugs as new

treatments. This involves managing and semantically correlating many interdependent parameters and

details in order to clearly identify the research question of the clinical trial. This work, which is carried out

within the PONTE (Efficient Patient Recruitment for Innovative Clinical Trials of Existing Drugs) project,

aims to improve the trial design process, by not only offering access to a variety of relevant data sources –

including, but not limited to, drug profiles, diseases and their mechanisms, genes and past trial results – but

also providing the ability to navigate through these sources, perform queries on them and intelligently fuse

the available information through semantic reasoning. This article describes our intention to consume and

aggregate information from Linked Data sources in order to produce answers for the clinical researcher’s

questions.

1 INTRODUCTION

Clinical research, including investigation of the test

of hypothesis, clinical trial design and study

conduction, comprises an expensive and very

important part of the R&D activities of the

pharmaceutical companies. Trials for drug

repositioning, which aim at testing established

compounds to new medical conditions, appear to

gain ground as their compounds have already been

found safe in clinical trials and sometimes are

already present in the market. Although the

uncertainty of safety is lower in such trials, still

problematic situations – for example an adverse

effect of a drug compound - present important risks

(Ashburn, 2004). In many cases evidence of such

issues already exists in published scientific literature

and other clinical information sources but is often

difficult to discover and/or correlate, given the

volume, variety and distributed nature of the

information required.

Scientific results from medical research,

including those from clinical trials, are gradually

being published as Linked Open Data (LOD) (Bizer,

2010). Although not many of the available clinical

data sources are available as Linked Data, there is a

good and growing representation that can be used

for the purposes of the PONTE project

(Chondrogiannis, 2011). This rapidly growing

collection of well-structured (using RDF and OWL)

and easily accessible (using HTTP) data (Berners-

Lee, 2006) contains clinical information about drugs

and their side effects, clinical trials, diseases and

their mechanisms (pathophysiology) and other

aspects of the domain. These data sources comprise

a valuable source for clinical research but more

importantly, because of their structure and links,

they can be used by search tools to produce search

results that are otherwise hard or impossible to come

across.

The idea of deploying semantic web technologies

for querying and creating links between data sources

for enabling automatic knowledge aggregation is an

area of scientific research that several projects are

exploring and developing technologies and tools for.

The Bio2RDF project has created a framework that

provides on demand data for mash-ups in the

bioinformatics domain (Belleau, 2008). There are

411

Matskanis N., Andronikou V., Massonet P., Mourtzoukos K. and Roumier J..

A Linked Data Approach for Querying Heterogeneous Sources - Assisting Researchers in Finding Answers to Complex Clinical Questions.

DOI: 10.5220/0004173004110414

In Proceedings of the International Conference on Knowledge Engineering and Ontology Development (KEOD-2012), pages 411-414

ISBN: 978-989-8565-30-3

 2012 SCITEPRESS (Science and Technology Publications, Lda.)

projects that have investigated several techniques on

querying the Linked Data cloud (Bouquet, 2009),

(Hartig, 2012). The FedBench project (Schmidt,

2011) has produced a benchmark framework for

analysing the efficiency and performance of

different strategies for federated query processing on

semantic data. In this paper we will explain the

benefits of our semantic search approach that aims at

assisting the design of clinical trials on existing

drugs by discovering semantic associations within

the Linked Data sources.

2 INCORPORATION AND

AGGREGATION OF LINKED

DATA SOURCES

During clinical research different parameters about

drugs, diseases, trial targets (genes, proteins) are

assessed and carefully examined. This information is

dispersed in different data sources created

autonomously by often-unassociated institutional

bodies, then – some of it - republished as Linked

Data (Samwald, 2011). In order to effectively use

the knowledge existing within these sources we need

to combine the information available. For instance,

the information about the possible disease targets of

a given drug can be found in the DrugBank Linked

Data source. However, the information about the

genes associated with a specific disorder is found in

Diseasome Linked Data source.

Thanks to the

SPARQL endpoints we are able to collect the

information from all these sources. This would

otherwise be a quite difficult task with only the

upstream data sources, since we would have to

overcome a variety of problems, including the

database technology, the various structure logics and

the way information is represented.

Additionally we have chosen to use the

ontologies of some of the above data sources for

describing the query term, defining its domain (by

using a taxonomy graph) and the data aggregation

options for presenting the information. The domain

taxonomy graph is composed using Construct

queries on the ontology of the term, while the related

concepts graph is constructed by using the RDF

links and relations between the query term and

concepts or instances found in the other domain

ontologies. For reasons of availability and

performance of the system the ontologies are hosted

in an RDF repository, which is part of the Ontology

Management component (see figure 1) and are made

available to PONTE components through a

SPARQL endpoint inside a common and closed

network.

2.1 The Linked Data Application and

Query Engine

The Linked Data Application (LDApp) is part of a

software framework created in the context of

PONTE project (Chondrogiannis, 2011). Its primary

function is to assist the investigation on a drug, its

target, and study disorder. The LDApp is a web

application that dynamically generates a data

aggregation of information retrieved from the

PONTE platform – more specifically from the

Ontology Management component - and the Linked

Open Data Cloud. The LDApp uses the SPARQL

query language for RDF to retrieve RDF Graphs

from the Ontology Management component and to

query the Linked Data sources. In terms of the

LDApp interfacing with the other components of the

platform, the queries that are generated from these

components contain information that associates the

search term with the domain of the question. Using

this information, the main term or concept of the

query is linked with a concept from the respective

Ontology. The interactions between the LDApp, the

PONTE components and the Linked Data sources

are presented in the sequence diagram in Figure 1.

Figure 1: Component Interactions for Question

Answering.

The LDApp requests from the RDF repository of

the Ontology Management component the related

concepts of the domain ontologies. It then constructs

a graph that is composed of concepts and instances

representing the taxonomy and relevant information

of the query’s main term. The graph concepts and

instances additionally contain relations and RDF

links to the web of Linked Data sources, which can

be used to execute queries across multiple data

sources. The graph is visualized by the LDApp and

can be used to assist the answering of research

KEOD2012-InternationalConferenceonKnowledgeEngineeringandOntologyDevelopment

412

questions by providing options to expand or filter

the query results.

The LDApp can perform additional queries to

retrieve data for constructing a Linked Data mash-

up. These queries to Linked Data sources are

performed via the PONTE Query Engine. The Query

Engine is a SPARQL endpoint that is able to execute

queries across multiple data sources and for the

purposes of this prototype implementation, its

developments are based on the SQUIN library that

supports link traversal based query execution

(Hartig, 2009).

Figure 2: The Linked Data Application (LDApp) interface.

Figure 2 presents the LDApp user interface. The

left frame contains the domain graph that is

compiled from ontologies covering three domains:

disease, drug and drug target. The right frame

mainly contains the query results and mechanisms to

assist the user’s search and navigation through them.

In the example depicted in this figure, the term that a

clinician is interested in is the “Myocardial

Infarction” and the domain is a graph based on the

Disease concept of the Diseasome ontology. The

results of DESCRIBE queries are presented to the

users of LDApp at their original form (as pictured in

figure 2); SELECT queries are presented using an

RSS style (Figure 3) with the title, link and

description of each data item.

2.2 Answering Hypothesis Questions

In order to demonstrate the benefits of using the

LDApp, we will use an example that is based on one

of the PONTE use cases. We assume that the

clinician’s question is about clinical trials on the

disease “Myocardial Infarction”. Initially results for

this disease are retrieved from the resource

“Myocardial Infarction” of LD source Diseasome

(see Figure 2).

The Diseasome resource has a list of

diseasome:possibleDrug properties pointing to drug

resources in DrugBank and DailyMed LD sources.

Other links include a list of genes responsible for the

syndrome and owl:sameAs pointers to other data

sources. By selecting one of these properties (RDF

links) the clinician is effectively constructing queries

to Linked Data sources. For example s/he is

interested in possible drugs against the disease

(diseasome:possibleDrugs), so s/he selects

“possibleDrugs” from the expansion list (of figure

2), which triggers a dynamically created query,

which in this example is the following:

SELECT DISTINCT ?drug ?description

WHERE{

?x a diseasome:diseases.

?x diseasome:name ?name.

?x diseasome:possibleDrug ?drug.

OPTIONAL {?drug

drugbank:description ?description.}

FILTER regex (?name,"Myocardial",i)}

Figure 3 shows how the results of this query are

presented in the LDApp. For example, the drug

Etanercept is among the results of the above query

along with data linked to it such as its description,

indications, other diseases for which it can be used

as treatment, clinical trials, side effects and much

more. This additional information on each of the

resulting items is presented to the clinician on the

right frame of LDApp when s/he clicks on its title of

the item.

Additionally, the left frame presents the

“Etarnecept” instance in the DrugBank ontology

again with some expansion options. The clinician

can continue her/his search and choose an

owl:sameAs expansion; results from several Linked

Data sources are returned that can be of great

interest to the specification of study parameters:

such as from SIDER, a source for side effects and

from ClinicalTrial.gov (LinkedCT Linked Data

source) with descriptions on past clinical trials of

this drug. Hence, this process can facilitate the

clinician’s work in finding/extracting the

information that s/he needs.

3 CONCLUSIONS AND FUTURE

WORK

In this article, we presented a prototype application

that consumes Linked Data for assisting the design

process of drug repositioning clinical trials. We have

explained our approach for aggregating information

from heterogeneous data sources by exploiting the

links between them, which allows navigation and

ALinkedDataApproachforQueryingHeterogeneousSources-AssistingResearchersinFindingAnswerstoComplex

ClinicalQuestions

413

Figure 3: Querying Diseasome and exploiting the possibleDrug link.

performing queries across different data sources, and

by assisting users to navigate from one topic (data

source) to another using the ontology based

navigation frame. In order to demonstrate the

capability of answering research questions for the

test of hypothesis and study parameters specification

we have developed a scenario in which the Linked

Data Application (LDApp) provides initial results

and then extends clinical researcher’s search

requests using the Web of Linked Data.

We have plans to extend the use of Linked Data

Application and Query Engine to other processes of

the PONTE platform such as the validation of the

clinical trial protocol and the proposing of eligibility

criteria. We are also looking at further improving the

query engine and especially developing a “query

expansion” mechanism. This mechanism aims at

reformulating a seed query to improve retrieval

performance in information retrieval operations

(Hartig, 2012) and, in the context of semantic data, it

will involve performing additional searches in

semantically associated data sources.

REFERENCES

Ashburn, T., Thor, K., 2004. Drug repositioning:

identifying and developing new uses for existing

drugs. Nature reviews. Drug discovery.

Belleau, F., Nolin, M. A., Tourigny, N., Rigault, P.,

Morissette, J., 2008. Bio2RDF: towards a mashup to

build bioinformatics knowledge systems, Journal of

biomedical informatics, S.706-716,

Berners-Lee, T. 2006. Linked Data - Design Issues. http://

www.w3.org/DesignIssues/LinkedData.html

Bizer C., T., Heath, T., Berners-Lee, T., 2010. Linked

data-the story so far, sbc, S.9

Bouquet, P., Ghidini, C., Serafini, L., 2009. Querying the

Web of Data: A formal approach. In Proc of the 4th

Asian Semantic Web Conference (ASWC).

Chondrogiannis, E., Matskanis, N., Roumier, J., Massonet,

P., Andronikou, V., 2011, Enabling semantic

interlinking of medical data sources and EHRs for

clinical research purposes, eChallenges conference

Hartig, O., Bizer, C, Freytag, J. C., 2009. Executing

SPARQL Queries over the Web of Linked Data, In

Proceedings of the 8th International Semantic Web

Conference (ISWC), Washington, DC, USA.

Hartig, O., Freytag, J. C., 2012: Foundations of Traversal

Based Query Execution over Linked Data. In

Proceedings of the 23rd ACM Conference on

Hypertext and Social Media (HT), Semantic Data

Track, Milwaukee, WI, USA

PONTE Project, Efficient Patient Recruitment for

Innovative Clinical Trials of Existing Drugs to other

Indications, http://www.ponte-project.eu/

Samwald, M., Jentzsch, A., Bouton, C., Kallesøe, C. S.,

Willighagen, E., Hajagos, J., Marshall, M. S.,

Prud'hommeaux, E., Hassanzadeh, O., Pichler, E.,

Stephens, S., 2011. Linked open drug data for

pharmaceutical research and development, Journal of

Cheminformatics, 3:19 doi:10.1186/1758-2946-3-19

Schmidt, M., et al., 2011. FedBench: A Benchmark Suite

for Federated Semantic Data Query Processing, ISWC

KEOD2012-InternationalConferenceonKnowledgeEngineeringandOntologyDevelopment

414