Bringing Scientiﬁc Blogs to Digital Libraries

Fidan Limani

1,2

, Atif Latif

and Klaus Tochtermann

ZBW - Leibniz Information Center for Economics, Kiel, Germany

South East European University, Faculty of Contemporary Sciences and Technologies, Tetovo, Republic of Macedonia

Keywords:

Digital Libraries, Scientiﬁc Blogs, Semantic Web.

Abstract:

Research publication via scientiﬁc blogging is gaining momentum, with an ever-increasing number of re-

searchers accepting it as their main or complementary research dissemination channel. This development has

prompted both scientiﬁc bloggers and Digital Libraries (DL) to explore the potential of streamlining these

resources along DL collections for increased and complementary user selection. In this paper we explore a

methodology for achieving the integration of DL and a blog post collections, together with some use case

scenarios that demonstrate the values and capabilities of this integration.

1 INTRODUCTION

Web 2.0, or the ”read-write” Web, has enabled richer

levels/channels of engagements/expression with the

audiences. This, in turn, has stirred publication initia-

tives that ﬁnd them practical in terms of networking,

collaboration, ease of publication, (unofﬁcial, peer)

reviewing from the community, etc. In the case of

scientiﬁc blogging, for example, authors can easily

share their research work with the community, even

as the research develops, whereas readers are able to

provide continuous feedback to it. (Burgelman et al.,

2010) report on ”Science 2.0” development as en-

abled by tools and changing research behavior prac-

tices, featuring increased number of authors, publica-

tions, and data available to consume, reuse, and com-

ment by the community. In another study, (Mahrt and

Puschmann, 2014) ﬁnd ”a dual role of blogs as chan-

nels of internal scholarly communication as well as

public debate” among the motivations for scientiﬁc

blogging. Furthermore, the same study ﬁnds that sci-

ence bloggers especially value the community feed-

back on their posts – an additional explanation for the

development and acceptance of ”Science 2.0” from

the research community.

Traditional publication repositories have already

moved on to embrace the beneﬁts of Semantic Web

technologies. Projects that structure and represent

Digital Library (DL) repositories as machine-readable

and link them up and make them available to the

Linked Open Data (LOD) cloud are pretty common.

According to the ”State of the LOD Cloud 2014 re-

port”

publications take the second largest data set

on the LOD. The Library of Congress Linked Data

Service

offering standards and vocabularies used by

the library; the British Library’s LOD initiative

; the

Swedish National Library Open Data project

includ-

ing bibliography and authority data; German National

Library of Economics – EconStor LOD project

; are

just some of the LOD projects from the domain of DL

repositories.

As scientiﬁc blogging is getting more contribu-

tions and prominence in the research community, we

see major beneﬁts from putting its publication con-

tributions to use in different scenarios and environ-

ments. In this work we focus on (i) Integrating them

with the more traditional DL publication archives, and

(ii) ”Porting” them on the Web of Data for supporting

current and future applications scenarios.

2 MOTIVATION AND USE CASES

2.1 Motivation

After many requests from the scientiﬁc blogging com-

munity, the German National Library of Economics

http://linkeddatacatalog.dws.informatik.uni-

mannheim.de/state/#toc1

http://id.loc.gov

http://bnb.data.bl.uk

http://libris.kb.se

http://linkeddata.econstor.eu/beta

284

Limani, F., Latif, A. and Tochtermann, K.

Bringing Scientiﬁc Blogs to Digital Libraries.

DOI: 10.5220/0006295702840290

In Proceedings of the 13th International Conference on Web Information Systems and Technologies (WEBIST 2017), pages 284-290

ISBN: 978-989-758-246-2

(ZBW) is considering the opportunity of extending

its repository by including scientiﬁc blog posts from

the domain of economics and offer it to its users

alongside the standard research publications. Science

bloggers seek to make the most of DLs’ dissemina-

tion channels and reach higher audience and visibil-

ity, whereas DLs seek to complement their collections

and increase their value offer to its users. In the face

of increasing scientiﬁc blog contributions and adop-

tion in the scientiﬁc workﬂow, this is the single most

important motivation for this study.

2.2 Use Cases

Following are the main use cases that motivated our

research:

(i) Heterogeneous data integration: Blog collec-

tions do not adhere to a standardized metadata struc-

ture and often rely on different vocabularies from the

ones adopted by DLs. In a situation like this, a user

interested in resources in both DL and scientiﬁc blog

resources would have to query these collections sep-

arately, using the different vocabulary terms. Thus,

there is an opportunity to alleviate this situation and

combine these collections in a uniform ”query space”.

EconStor, our DL of choice, metadata are already

structured and represented in RDF. As a framework,

this representation is well equipped to handle combi-

nation of heterogeneous resources.

(ii) Semantic annotation of blog posts: More and

more resources are being published as LOD, beneﬁt-

ing both publishers and consumers of those resources.

Making their resources machine-understandable en-

ables publishers increase the audience reach, includ-

ing additional (re)use from software agents. On the

other hand, LOD publishing enables consumers to

(re)use these resources in new scenarios not covered

originally by publishers. A ﬁnal and important note

at this point: in case scientiﬁc blog post publishers

are interested in making their content available in this

way, they should not face or be concerned with any

technological barriers in the process.

(iii) Dataset proﬁling: Linking scientiﬁc blog re-

sources with relevant entries in external collections

and knowledge bases (KB) indexed using different

controlled vocabularies (CV) – thesauri and classiﬁ-

cation schemes, in our case – is another value-adding

step for the end user. In this way, depending on the

external resource(s) it links to, we provide different

”proﬁles” for every blog post in our collection, en-

abling a more elaborate and rich (search) experience

to the user. The user potentially beneﬁts from related

resources coming from different disciplines – eco-

nomics, social sciences, or agriculture, in our case;

retrieve additional information (and context) from a

KB such as DBpedia; include relevant resources from

a German-speciﬁc or international, multilingual col-

lection; etc.

(iv) Dataset analysis: Summarizing datasets by

offering useful statistics as exploration tips for users

is quite important. This especially holds for large

datasets that could prove challenging for users to

comprehend (e.g., identifying resources of certain

features, closer to their area of interest). Highly com-

mented/discussed blog posts (expressed via user com-

ments, shares, etc.), the most featured/covered sub-

jects in the collection, ”trending” subjects for a given

period of time (based on the number of blog posts

for a given subject), top contributing authors (or ”ex-

pert groups”) per subject/topic (based on the number

of blog posts that an author has for a given subject),

or, in the context of aligned CVs of different (linked)

datasets, relevant publications by authors in external

KBs, are just some of the available analysis options.

3 RELATED WORK

A lot of diversiﬁed but concrete research work has

been done on mapping relational data (RDB) to graph

representation, as well as publishing and integrating

heterogeneous collections, including social web data.

(Auer et al., 2010) present common motivations for

representing RDB in Resource Description Frame-

work (RDF) data model; the use cases for integrating

RDB with structured sources or existing RDF on the

web (Linked Data) correspond to a great extent with

our motivation for this work. Motivated by semantic

annotation of dynamic web pages and mass genera-

tion of LOD data, (Spanos and Mitrou, 2012) survey

the proposed approaches for mapping and integrating

RDB content to/with that published as LOD, whereas

(Powell et al., 2010) demonstrate fusing library and

non-library data from disparate resources, relying on

RDF as a common data model and using graph-based

analysis and visualization to generate useful informa-

tion on the resulting data.

Some contributions focus on reusing and en-

riching social (user generated) content from exter-

nal resources, such as LOD Cloud, for example:

(Holgersen et al., 2012) in their research consider

bibliographic-related data (in the form of comments

and ratings for books); (Hu et al., 2013) focus on

social content from publication submission and re-

view process of a journal, in the form of review-

ers’ comments, editors’ decisions, author replies, etc.;

whereas (Passant et al., 2010) reuse collaboratively-

built knowledge in the enterprise, contained in differ-

Bringing Scientiﬁc Blogs to Digital Libraries

285

ent fragmented information resources and represented

across heterogeneous data formats in an enterprise

setting.

(Yoose and Perkins, 2013) survey the LOD adop-

tion in libraries and report important and increas-

ing number of LOD projects that ”free” library re-

sources from speciﬁc library representation formats

and enrich them with relevant resources from the

LOD Cloud datasets. In a concrete implementation

example, (Latif et al., 2014) go through both concep-

tual and practical aspects of publishing an Open Ac-

cess (OA) repository as Linked Data.

In this work we integrate library and non-library

resources – a DL and a blog post collection. A blog

post is not pre-related or referring to any DL publica-

tion; bloggers write on any topic they deem important,

regardless of the DL repository publications.

4 METHODOLOGY

This section details the applied methodology, includ-

ing the dataset selection, (pre)processing, modeling

and conversion, as well as enrichment from other re-

sources. You can ﬁnd more details pertaining to sec-

tion 4.1 and 4.2 in our previous work (Limani et al.,

2016).

4.1 Dataset Selection

In order to support our use cases, the dataset selection

contains a DL repository and a blog post collection.

A ﬁnal component, a thesaurus, is also presented here

for contextual information.

(i) DL repository: EconStor

is an open access

publication platform for the domain of economics and

related ﬁelds. It supports the publication dissemina-

tion for many institutions, as well as provides its col-

lection metadata to other academic repositories. At

the time of writing, it holds more than 123 K confer-

ence papers, proceedings, research reports and work-

ing papers.

(ii) Scientiﬁc blog collection: The Wall Street

Journal

dataset holds over 40 K blog posts from the

domain of economics. In our previous work we anno-

tated this collection with STW thesaurus terms (see

section 4.2 for details) to make it terminologically up

to par with the EconStor collection.

(iii) The standard thesaurus for economics

(STW)

with about 6.000 descriptors (both in German and

http://econstor.eu

blogs.wsj.com

http://zbw.eu/stw

English) is a rich vocabulary, primarily covering the

domain of economics. STW is used for annota-

tion/indexing and retrieval operations for EconStor

publications. Moreover, it is aligned with other CVs

(thesauri, classiﬁcation systems) and has relations to

a KB (DBpedia).

4.2 Dataset (Pre)Processing

In order to bridge the ”terminology” gap between the

heterogeneous datasets and for identifying the impor-

tant metadata elements of blog post to our experiment,

we conduct:

(i) Automatic indexation: Having the DL dataset

collection annotated with STW thesaurus terms de-

termined a similar take on the blog post collection an-

notation. As described in our previous work, (Limani

et al., 2016), we used MAUI

for the automatic anno-

tation of every blog post from the WSJ collection with

the STW as a vocabulary of choice. This is an impor-

tant decision as it enables several beneﬁts: bridges

the terminology gap between DL and blog post col-

lections; generates semantic annotations for the blog

post collection without involving the author/blogger

at all; does not require complex integration rules or

application of ontology mapping techniques to termi-

nologically ”cross” from one collection to the other.

(ii) Blog post metadata of interest: Blogs tend

to differ in the vocabularies they select to anno-

tate their collection, although they usually have com-

mon (to not say standard) metadata elements in their

blog posts. In our case, besides the usual blog post

metadata, such as author, title, content, and publica-

tion date, we also retained comments for each blog

post. Some social web features such as ”shares” and

”tweets”, although present in the data set, were left

out of the current research analysis for future studies.

4.3 Dataset Modeling and Conversion

This part covers three activities that complete the

dataset modeling and conversion process, including:

(i) selected vocabularies for the blog post collection,

(ii) technical details of mapping the blog post col-

lection from a relational database to RDF; and (iii)

EconStor and blog post collections merge for a uni-

ﬁed query space, available for querying and explo-

ration via an HTTP server for RDF data.

(i) Blog post vocabulary selection: Despite the

Web 2.0 nature, blog posts reﬂect the common meta-

data that a DL publication has, such as post title,

publication date, content, terms describing the post;

whereas also having some additional aspects that

https://github.com/zelandiya/maui

WEBIST 2017 - 13th International Conference on Web Information Systems and Technologies

286

Figure 1: Classes (colored in blue) and properties modeling a blog post instance.

are inherent to them, such as user-generated feed-

back (blog post comments and other Web 2.0 ”fea-

tures” like shares, tweets, etc.). The key vocab-

ularies selected for modeling blog post collections

are SIOC

(and SIOC Types module) and the Dublin

Core Metadata Initiative

; the former covering es-

pecially well the user-generated content, whereas the

latter the typical publications metadata. Figure 1

shows an example representation of a blog post in-

stance (see below) with a single comment and subject

keyword (”Marketing”, in this case). One of the key

parts of the modeling refers to blog posts and related

comments, with SIOC Types BlogPost and Com-

ment classes used for the blog post and their com-

ments, respectively. The default D2RQ mapping for

the entity that covers the keywords used to annotate

every blog post completes the instance modeling.

(ii) Mapping relational data to RDF: The DL col-

lection is represented as RDF triples, whereas the sci-

entiﬁc blog post collection is stored in a relational

database. To bring both collections into the same

model, we used the D2RQ platform

to generate

RDF data representation of the blog post collection.

This required mapping the relational database tables

and columns to RDF classes and properties based on

the vocabularies identiﬁed beforehand. With it, both

datasets are combined via their common data model

(RDF).

(iii) Uniﬁed query space over datasets: There is

http://rdfs.org/sioc/spec/

http://dublincore.org/speciﬁcations/

http://d2rq.org

an important thing to mention at this point: since both

collections use the STW thesaurus to annotate/index

their resources, this eliminates need for any ontology

alignment between the two datasets for the use case

scenarios. Furthermore, this implies a closer connec-

tion between the datasets and renders them more in-

tegrated (at a vocabulary level) than just two datasets

sharing the same representation (RDF, in this case).

This condition enables us to address the combined

datasets as being part of a single ”information space”;

and we can solely rely on STW terms to search re-

sources in both datasets. Both collections are loaded

as a single dataset in a SPARQL endpoint, Apache

Jena Fuseki

, each being part of a named graph

within the single dataset (econstor and wsj, re-

spectively).

5 USE CASE SCENARIO

DEMONSTRATION

In this section we represent several use case scenar-

ios implemented from integrating the DL repository

and the scientiﬁc blog post collection that directly

address the motivation for this paper. Although DL

users are not expected to know SPARQL in order to

search the collection, by exploring some query sce-

narios, we want to demonstrate that this collection can

serve as a data store on top of which we can build

https://jena.apache.org/documentation/fuseki2/

index.html

Bringing Scientiﬁc Blogs to Digital Libraries

287

Figure 2: Results illustration in a SPARQL server.

a standard user interface for search that users under-

stand (keyword-based search, in the same way they

use a search engine or search a document in their com-

puter).

(i) Search across the ”uniﬁed query space:” The

user searches for publications related to the subject of

”technology transfer” in EconStor and WSJ datasets.

As mentioned earlier, there are several types of publi-

cations archived in EconStor, but, in this case, the user

is interested in research papers (i.e., swc:Paper),

published since 2014. The search returns four results

in total, with three results coming from the EconStor

dataset and one coming from the blog post collec-

tion. This just demonstrates the possibility for the

user to search across two different datasets described

with the same thesaurus term(s), and receive publica-

tion results from corresponding datasets (treating the

datasets as if they are one source of information).

(ii) Retrieve relevant blog posts from the collec-

tion: This scenario is related to the previous one: the

user initially searches the DL collection (i.e., Econ-

Stor) and selects a publication that she wants to fur-

ther examine. We search the blog post collection for

additional publications that could be of interest to her

based on the STW term(s) that describe the publica-

tion she is currently reading. The user searches for

(swc:Paper) publications from EconStor that cover

the subject of ”Human capital”, published from 2014

and onwards. The user selects the publication titled

”Labour market integration, human capital formation,

and mobility” from the result list. Using the same

STW term (”Human capital”) that describes the se-

lected publication, gives us 1 blog post from the WSJ

collection, titled ”Q&A: Golub Capital’s David Golub

on GE Capital’s Divestiture”, as well as 7 other posts

described with the ”related” STW term ”Human re-

sources” that could further complement user’ reading

experience (Figure 2 shows part of the retrieved re-

sults in Apache Jena Fuseki). This further empha-

sizes the role that the (STW) thesaurus can play in

providing alternative results for the user by using its

structure, such as via ”narrow”, ”broad”, or ”related”

terms.

(iii) Search the scientiﬁc blog post collection

alone: In another scenario, the user searches for the

newest blog posts covering a certain subject. During

this scenario, the user can decide to factor in the num-

ber of comments that a blog post has, i.e. the post that

stirred the most feedback/discussion on a given sub-

ject, or explore the most used STW terms from the

collection, in order to have a better understanding on

the variety of blog posts that constitute the collection.

Let’s see how these two search strategies work for our

blog post collection:

– Highest number of comments: This search, ﬁl-

tered by posts published from 2015 and on, lists the

following top 3 blog posts with the highest number of

comments: ”Facebook Plans a ’Dislike’ Button, but

Only for Empathy, Zuckerberg Says” with 18 com-

ments; ”Microsoft Expected to Unveil Next-Gen Win-

dows Phone and Surface Tablet” with 12 comments;

and ”Alabama Judges ’Reprehensible’ Conduct Mer-

its Impeachment, Judiciary Says”, the last post stress-

ing a judicial misconduct by a judge, with 8 user com-

ments from the blog readers.

– The most featured blog posts by STW term: This

is an attempt to mimic ”topic trending” in the blog

post collection – showing the extent to which certain

subjects are covered (via posts) in the blogging com-

munity. In our case, searching for the most used STW

terms in the blog post collection results with the top 3

most used terms ”Enterprise”, ”Personalization”, and

”Share”. This provides some hints to DL users about

the most represented/covered subjects from the blog

post collection, in case they want to use that informa-

tion to guide their exploration of this collection.

– A combination of the two: Having identiﬁed the

most used STW terms, the user can further explore the

most commented on blog post from a popular subject,

which with regards to our blog post collection results

WEBIST 2017 - 13th International Conference on Web Information Systems and Technologies

288

to combining posts on the subject of ”Enterprise”,

”Personalization”, or ”Share” (as discussed above),

and blog posts that attracted the most attention in the

blog community (the top 3 blog posts listed above).

6 RESEARCH BENEFITS

The key beneﬁts relate to the automatic indexing and

semantic annotation of the scientiﬁc blog post collec-

tion, its integration with the DL collection in a uniﬁed

(in terms of querying and resource description via the

STW thesaurus) dataset, as well potential data proﬁl-

ing and analysis operations. Following are the em-

phasis on these aspects:

(i) Semantic annotation and representation of blog

post collections: Having the DL collection published

as LOD dictates the methodology of blog post collec-

tion integration with the DL. Without any effort from

the bloggers’ side, we have modeled and represented

this collection in the same way as the DL collection

– thus making them part of the same ”model” (RDF,

in this case), and automatically indexed it (based on

the STW thesaurus) – thus bridging the terminology

gap between these different resources and integrating

them at a terminology level.

(ii) Integration of (originally) heterogeneous col-

lections for a seamless and uniﬁed ”query space”:

Meeting DL’s interest to include heterogeneous re-

sources – blog posts from the same domain, we have

integrated the latter and made it available as a re-

source collection to the former. The users of the DL

library, as shown with our queries over the result-

ing dataset collection, are able to retrieve relevant re-

sources via different scenarios.

(iii) Data proﬁling and analysis (both indirect

beneﬁts from relying on STW for indexing the blog

post collection): STW’s alignment with other the-

sauri, classiﬁcation systems, and external KBs en-

ables us to enrich the user search experience by link-

ing up scientiﬁc blog posts of interest to the user

with external, related resource collections. More-

over, we are able to provide useful information

about the dataset to the user, such as ”trending” top-

ics/subject for a given time period, the most popular

topic/subject, or the blog posts that sparred the most

debate with the users. For more details, see the imple-

mented use case scenario implementations from sec-

tion 5 of the research paper.

7 CONCLUSION AND FUTURE

WORK

In this paper we have addressed a Digital Library

requirement by integrating non-library resources – a

scientiﬁc blog post collection – and making it avail-

able to its users as a complementary content in their

search operations. In doing so, we have pre-processed

the non-library resources in order to bring them up

to par (vocabulary-wise) with the DL practices (as-

signing STW terms, in this case); modeled them ac-

cording to the DL collections representation (RDF, in

this case) by selecting a set of suitable vocabularies

(and corresponding classes and properties); and ﬁ-

nally converting them from a relational database to an

RDF representation using the D2RQ platform. Fur-

thermore, in order to support the use case scenarios,

we loaded both the library and non-library datasets

on separate named graphs of a single dataset on a

SPARQL server.

One of the future work task is to develop a pro-

totype, hence enabling evaluation scenarios with the

users. Currently, in order to query the uniﬁed dataset

implies knowledge of SPARQL, which is not a skill

that common DL users should have in exploring a DL

collection.

Another research follow up direction is that of

analysis that would bring more value to the user

(search) experience in view of the newly-added blog

post collection, such as publications similarity based

on the STW thesaurus structure and graph representa-

tion properties, to name a few.

REFERENCES

Auer, S., Feigenbaum, L., Miranker, D., Fogarolli, A., and

Sequeda, J. (2010). Use cases and requirements for

mapping relational databases to rdf.

Burgelman, J.-C., Osimo, D., and Bogdanowicz, M. (2010).

Science 2.0 (change will happen.). First Monday,

15(7). Accessed: 28 June 2016.

Holgersen, R., Preminger, M., and Massey, D. (2012).

Using semantic web technologies to collaboratively

collect and share user-generated content in order

to enrich the presentation of bibliographic records.

Code4Lib Journal, 17. Accessed: 20 June 2016.

Hu, Y., Janowicz, K., McKenzie, G., S. K., and Hitzler,

P. (2013). A linked-data-driven and semantically-

enabled journal portal for scientometrics. In 12th

International Semantic Web Conference. Springer

Berlin Heidelberg.

Latif, A., Borst, T., and Tochtermann, K. (2014). Exposing

data from an open access repository for economics as

linked data. D-Lib Magazine, 20(9/10). Accessed: 7

June 2016.

Bringing Scientiﬁc Blogs to Digital Libraries

289

Limani, F., Latif, A., and Tochtermann, K. (2016). Scien-

tiﬁc social publications for digital libraries. In 20th

International Conference on Theory and Practice of

Digital Libraries.

Mahrt, M. and Puschmann, C. (2014). Science blogging:

an exploratory study of motives, styles, and audience

reactions. Journal of Science Communication, 13(3).

Passant, A., Laublet, P., B. G., and Decker, S. (2010). Sem-

slates: Weaving enterprise 2.0 into the semantic web.

Powell, J., Collins, L., and Martinez, M. (2010). Semanti-

cally enhancing collections of library and non-library

content. D-Lib Magazine, 16(7/8). Accessed: 10 June

2016.

Spanos, D.-E., S. P. and Mitrou, N. (2012). Bringing re-

lational databases into the semantic web: A survey.

Semantic Web, 3(2):169–209.

Yoose, B. and Perkins, J. (2013). The lod landscape in

libraries and beyond. Journal of Library Metadata,

13(2-3):197–211.

WEBIST 2017 - 13th International Conference on Web Information Systems and Technologies

290