Enterprise Knowledge Graphs: A Semantic Approach for Knowledge

Management in the Next Generation of Enterprise Information Systems

Mikhail Galkin

1,2

, Sören Auer

, María-Esther Vidal

1,3

and Simon Scerri

University of Bonn & Fraunhofer IAIS, Bonn, Germany

ITMO University, Saint Petersburg, Russia

Universidad Simón Bolívar, Caracas, Venezuela

Keywords:

Enterprise Information Systems, Linked Enterprise Data, Enterprise Knowledge Graphs, Semantic Web

Technologies.

Abstract:

In enterprises, Semantic Web technologies have recently received increasing attention from both the research

and industrial side. The concept of Linked Enterprise Data (LED) describes a framework to incorporate ben-

eﬁts of Semantic Web technologies into enterprise IT environments. However, LED still remains an abstract

idea lacking a point of origin, i.e., station zero from which it comes to existence. We devise Enterprise Knowl-

edge Graphs (EKGs) as a formal model to represent and manage corporate information at a semantic level.

EKGs are presented and formally deﬁned, as well as positioned in Enterprise Information Systems (EISs)

architectures. Furthermore, according to the main features of EKGs, existing EISs are analyzed and compared

using a new uniﬁed assessment framework. We conduct an evaluation study, where cluster analysis allows for

identifying and visualizing groups of EISs that share the same EKG features. More importantly, we put our

observed results in perspective and provide evidences that existing approaches do not implement all the EKG

features, being therefore, a challenge the development of these features in the next generation of EISs.

1 INTRODUCTION

The demand for new Knowledge Management (KM)

technologies in enterprises is growing in recent

years (Hislop, 2013). The importance of KM in-

creases with the volumes of data processed by a com-

pany. Although the enterprise domain might vary

from car manufacturing to software engineering, KM

is capable of reducing costs, increasing the perfor-

mance and supporting an additional added value to

company’s products. Novel KM approaches often

suggest new data organization architectures and fos-

ter their implementation in enterprises. One of such

an architecture leverages semantic technologies, i.e.,

the technologies the Semantic Web is based on, in

order to allow machines to understand the meaning

of the data they work with. Machine understanding

and machine-readability are supported by complex

formalisms such as description logics and ontologies.

Semantic applications in the business domain com-

prise a new research trend, namely Linked Enterprise

Data (LED). However, in order to truly exploit LED

an organization needs to establish a knowledge hub

as well as a crystallization and linking point (Miao

et al., 2015). Google, for example, acquired Free-

base and evolved it into its own Enterprise Knowledge

Base (Nickel et al., 2016), whereas DBpedia assumed

a similar position for the Web of Linked Data over-

all (Bizer et al., 2009).

In this paper, we present Enterprise Knowledge

Graphs (EKGs), as formal models for the embodi-

ment of LED. An EKG refers to a semantic network

of concepts, properties, individuals, and links rep-

resenting and referencing foundational and domain

knowledge relevant for an enterprise. EKGs offer a

new data integration paradigm that combines large-

scale data processing with robust semantic technolo-

gies making ﬁrst steps towards the next generation

of Enterprise Information Systems (Romero and Ver-

nadat, 2016). The main research goal of the paper

is to formally deﬁne an EKG and provide an exten-

sive study of existing Enterprise Information Systems

(EISs), which offer certain EKG functions or can be

used as a basis for an EKG. To achieve such a goal, we

develop an independent assessment framework which

is presented in the paper as well. We report on an

unsupervised evaluation that allows for clustering ex-

isting EISs in terms of EKG features. Observed re-

Galkin, M., Auer, S., Vidal, M-E. and Scerri, S.

Enterprise Knowledge Graphs: A Semantic Approach for Knowledge Management in the Next Generation of Enterprise Information Systems.

DOI: 10.5220/0006325200880098

In Proceedings of the 19th International Conference on Enterprise Information Systems (ICEIS 2017) - Volume 2, pages 88-98

ISBN: 978-989-758-248-6

Table 1: Comparison of various Enterprise Data Integration Paradigms: P1:XML Schema Integration, P2:Data Warehouses,

P3:Data Lakes, P4:MDM, P5: PIM/PCS, P6:Enterprise Search, P7:EKG. Checkmark and cross denote existence and absence

of a particular feature, respectively.

Para- Data Integr. Conceptual/ Heterogen- Internal/ No. of Type of Domain Semantic

digm Model Strategy operational eous data ext. data sources integr. coverage repres.

P1 DOM trees LAV operational medium both medium high

P2 relational GAV operational 8 partially medium physical small medium

P3 various LAV operational large physical high medium

P4 UML GAV conceptual 8 8 small physical small medium

P5 trees GAV operational partially partially 8 physical medium medium

P6 document 8 operational partially large virtual high low

P7 RDF LAV both medium both high very high

sults give evidences that none of state-of-the-art ap-

proaches fully supports EKGs, and further study is

required in order to catch the wave of future EISs.

The remainder of the paper is structured as fol-

lows: Section 2 positions EKGs into the ecosystem of

EISs. Section 3 lays theoretical foundations of EKGs

and formally describes the concept of EKGs. Sec-

tion 4 presents the assessment framework and review

of EISs that implement to a certain extent the EKG

functionality. Further, the methodology followed to

conduct the comparison is described, as well as an

overview of current EISs and the description of the

features necessary for implementing EKGs. Section

5 visualizes observed results using clustering algo-

rithms and identiﬁes hidden insights. Section 6 dis-

cusses data integration efforts in both the technical

and enterprise dimensions, which might be consid-

ered as predecessors of EKGs. Section 7 analyzes the

observed results and outlooks our future work.

2 MOTIVATION

In the last decades a variety of different approaches

for enterprise data integration have been developed

and deployed. Table 1 shows a comparison of the

main representatives according to the following crite-

ria: 1) Data Model: Various data models are used by

data integration paradigms. 2) Integration Strategy:

The two prevalent integration strategies are Global-

As-View (GAV), where local sources are viewed in the

light of a global schema and Local-As-View (LAV),

where original sources are mapped to the medi-

ated/global schema. 3) Conceptual versus opera-

tional: Some approaches primarily target the concep-

tual or modeling level, while the majority of the ap-

proaches are aiming at supporting operational inte-

gration, e.g., by data transformation or query trans-

lation. 4) Heterogeneous data: Describes the extent

to which heterogeneous data in terms of data model

or structure is provided. 5) Internal/external data:

Describes whether the integration of external data is

supported in addition to internal data. 6) Number of

sources: Presents the typically provided number of

sources, i.e., small (less than 10), medium (less than

100), large (more than 100). 7) Type of integration:

States whether data is physically integrated and mate-

rialized in the integrated form or virtually integrated

by executing queries over the original data. 8) Do-

main coverage: Reports the typical coverage of differ-

ent domains. 9) Semantic representation: Expresses

the extent of semantic representations, which can be

low in the case of unstructured or semi-structured

documents, medium in the case of relational or tax-

onomic data and high, when comprehensive semantic

formalisms and logical axioms are supported.

The comparison according to these criteria shows,

that the EKG paradigm differs from previous integra-

tion paradigms. EKGs are based on the statement-

centric RDF data model, which can mediate between

different other data models, since, for example, re-

lational, taxonomic or tree data can be easily repre-

sented in RDF. Similar to XML-based integration and

the recently emerging Data Lake approach, EKGs fol-

low the Local-As-View integration strategy and are

thus more suited for the integration of a larger num-

ber of possibly evolving sources. EKGs are the only

approach, which bridges between conceptual and op-

erational data integration, because on the one hand

ontologies and vocabularies are used to model do-

main knowledge, but at the same time operational as-

pects, such as querying and mapping/transformation

are supported. By employing RDF with URI/IRI

identiﬁers, EKGs support the integration of heteroge-

neous data from internal and external sources either in

a virtual or physical/materialized way. EKGs support

the whole spectrum of semantic representations from

simple taxonomies to complex ontologies and logical

axioms. However, a promising strategy seems to be

to use EKGs in combination with other approaches.

For example, EKGs can be used in conjunction with a

data lake, to add rich semantics to the low level source

data representations stored in the data lake. Similarly,

EKGs can provide background knowledge for enrich-

Enterprise Knowledge Graphs: A Semantic Approach for Knowledge Management in the Next Generation of Enterprise Information

Systems

Figure 1: Position of an EKG in the Enterprise Information System architecture. An EKG assumes a mediate position between

raw enterprise data storage and numerous services whereas the coherence layer provides a uniﬁed view on the data.

ing enterprise search or EKGs can comprise vocabu-

laries and mappings from MDM.

Thus, EKGs occupy a unique niche in the ecosys-

tem of enterprise applications. In order to provide

a better understanding of such a niche, we present a

structural view that represents a static orchestration of

information systems in a company (cf. Figure 1).

The structural view shows an EKG as a consumer

of raw heterogeneous data and a supplier of knowl-

edge for enterprise applications. Numerous structured

and unstructured data sources compose a new form

of a data lake, i.e., a Semantic Data Lake (SDL). An

SDL can integrate both company’s private data as well

as remote data in the open access, e.g., Linking Open

Data Cloud. The Ontological Coherence layer con-

tains a set of ontologies, both high-level and domain

speciﬁc, which offer different semantic views on the

EKG contents. For example, an organizational ontol-

ogy deﬁnes a business department in a company, a se-

curity ontology places access restrictions on the data

available to this department, whereas some supply

chains ontology speciﬁes the role of the department

in a product lifecycle. Such a granularity increases

knowledge representation ﬂexibility and ensures that

external applications use a standardized, ontology-

based view on the required data from the EKG. The

API layer exposes communication interfaces, e.g.,

graphical user interfaces (GUI), REST, XML, JSON,

or SPARQL, to deliver knowledge encoded in EKGs

to enterprise applications. Possible beneﬁciaries of

using EKG technology include: Enterprise Resource

Planning (ERP) systems, Master Data Management

(MDM) systems, Enterprise Service Buses (ESB), E-

Commerce and a company’s Web applications. An

EKG speciﬁes a set of domain-independent features,

e.g., provenance, governance, or security, which sup-

port the operation and maintenance of the entire sys-

tem. We argue that these enterprise characteristics

distinguish an Enterprise Knowledge Graph from a

common Knowledge Graph; these features are ex-

plained in detail in the Section of EKG Technologies.

Being ’orthogonal’ to the knowledge acquisition and

representation pipelines, the features are an integral

part of the knowledge management policies. We also

elaborate on those enterprise features in the Section

of EKG Technologies.

3 ENTERPRISE KNOWLEDGE

GRAPHS

We deﬁne Enterprise Knowledge Graphs (EKGs) as a

part of an Enterprise Data Integration System (EDIS).

Formally, EDIS is deﬁned as a tuple:

EDIS = hEKG, S, Mi = hhσ, A, Ri, hOS, PS, CSi, Mi

(1)

An EKG contains ontologies and instance data, S rep-

resents data sources, and M represents the collec-

tion of mappings to translate S into anEKG. EKGs

are based on the directed graph model hN, Ei where

nodes N are entities and edges E are relations between

the nodes. An EKG is deﬁned as a tuple hσ, A, Ri,

where σ is a signature for a logical language, A is a

collection of axioms describing an ontology, and R is

a set of restrictions on top of the ontology.

A signature σ is a set of relational and constant

symbols which can be used to express logical formu-

ICEIS 2017 - 19th International Conference on Enterprise Information Systems

las. In other words, a signature contains deﬁnitions

of entities, e.g., the RDF triples :ToolX a owl:Class

and :Product a owl:Class

The axioms in the set A provide additional de-

scription of the ontology by deﬁning the relationships

between the concepts deﬁned in σ. In more detail,

axioms leverage logical capabilities of the chosen on-

tology development methodology, e.g., RDFS allows

for (but is not limited to) class hierarchy as well as do-

main and range deﬁnitions of properties. For instance,

:ToolX rdfs:subClassOf :Product.

The restrictions R impose constraints on concepts

and relationships. Although restrictions are also ax-

ioms, i.e., R @ A, we distinguish them as a sepa-

rate important criteria necessary for large-scale multi-

user enterprise environments. Restrictions can be

expressed as triples or in a rule language, e.g., Se-

mantic Web Rule Language (SWRL) or SPIN. Re-

strictions can be imposed on numerous characteris-

tics: access rights, privacy, or provenance. To illus-

trate, suppose an RDF triple :ToolX :cost 100 rep-

resents the cost of a tool ToolX. Then an RDF triple

:cost :editableBy :FinancialDept is the restric-

tion which states that only the ﬁnancial department

can change the value of the property :cost.

Data sources S are deﬁned as a tuple hOS, PS, CSi,

where OS is a set of open data sources which an en-

terprise considers appropriate to re-use or publish,

e.g., Linked Open Data Cloud or annual ﬁnancial re-

ports. PS is a set of private data sources of limited

access. Supply chain data is an ample example of a

private data. CS is a set of closed data sources with

the strongest access limitations. Closed data is often

available only to a special group of people within a

company and hardly ever shared, e.g., technical inno-

vations, business plans, or ﬁnancial indicators.

Mappings in M connect data sources S with

the EKG. The sources expose a semantic de-

scription of their contents. For instance, let

= {producedFrom(x, y)} return tuples that a

certain product x is produced from a certain ma-

terial y. Suppose EKG contains the following

RDF triples: {componentOf a rdf:Property.

material a Class. product a Class} . In order

to query the tuples, one should provide mappings

between the global ontology in the EKG and the

sources. We advocate that the Local-As-View

(LAV) paradigm is preferred in the EDIS. Ac-

cording to the LAV paradigm, sources are deﬁned

in terms of the global ontology (Ullman, 1997):

for each source S

, there is a mapping that de-

scribes S

as a conjunctive query on the concepts

We use "a" as shortcut for rdf:type as in the RDF Tur-

tle notation for readability.

in the global ontology EKG that also distinguish

input and output attributes of the source. There-

fore, M will contain a rule: producedFrom(x, y) :

−componentO f (y, x), material(y), product(x).

Although there exists a Global-As-View (GAV)

paradigm which implies presentation of the global

ontology concepts in terms of the sources, LAV

approach is designed to be used with numerous

changing sources and a stable global ontology. This

is exactly the EDIS case where an EKG remains

stable but enterprise data is highly dynamic; thus,

we underline the efﬁciency of LAV mappings for

Enterprise Knowledge Graphs.

4 EKG TECHNOLOGIES

4.1 An Assessment Framework for

Comparison

We propose an independent assessment framework

to compare features and shortcomings of each sur-

veyed technology. The benchmark categorizes high-

level EKG functionality along three dimensions:

D1) Human Interaction; D2) Machine Interaction;

and D3) Strategical Development. Each dimension

consists of a number of features. All the features

of each dimension and their possible values are pre-

sented in Table 2. These features are later utilized in

the review of EISs.

Human Interaction (HI) enables humans to inter-

act with an EKG. Modeling expresivity characterizes

the degree to support a user in modeling knowledge or

a domain of discourse behind an EKG ranging from

taxonomies and simple vocabularies to complex on-

tologies, axioms, and rules. Curation describes the

support for creating, updating, and deleting knowl-

edge from an EKG, and the availability of compre-

hensive user interfaces to perform such operations.

Linking comprises the level of support for establish-

ing coherence between knowledge structures and in-

stance data. Exploration & Visualization functions

are essential for a successful user experience. Among

lay users (here referring to non-specialists in seman-

tic technologies), an ability to explore data easily and

represent it in a desired way often determines a so-

lution’s usability and success. Enterprise Search can

be extended to allow for a semantic search over the

entire enterprise information space, and beyond.

The Machine Interaction (MI) dimension de-

scribes different levels of support for machine interac-

tion with an EKG. As information systems of the next

generation, EKGs accentuate machine-to-machine in-

Enterprise Knowledge Graphs: A Semantic Approach for Knowledge Management in the Next Generation of Enterprise Information

Systems

Table 2: Benchmark Dimensions (D) – D1: Human Interaction; D2: Machine Interaction; and D3: Strategical Development.

Benchmark Parameters (P) – P1: Modeling; P2: Curation; P3: Linking; P4: Exploration; P5: Search; P6: Data Model; P7:

APIs; P8: Governance; P9: Security; P10: Quality & Maturity; P11: Provenance; and P12:Analytics. Dash – no feature.

Dim. # Parameter 1 2 3 4 5 6

P1 M odeling Taxonomy Thesaurus

Meta-

schema

(SKOS)

Vocabularies

Ontologies Rules

P2 Curation Collaborative plain text forms GUI Excel –

P3 Linking Text Mining

LOD

Datasets

Ontology

Mapping

Manual

Mapping

Spreadsheet

Linking

Taxo-

nomy

Exploration/

Visualization

Charts Maps Web GUI

Faceted

browser

Vis Widgets Graphs

P5 Search Semantic NL Faceted

Federated

Faceted

Full text

semantic

Semantic

–

P6 Data Model RDF RDF + docs RDBMS Property Graph Taxonomy

P7 APIs SPARQL REST SQL custom APIs ESB & OSGi

Java/

Python

P8 Governance Policies Best practices SAP RDF based – –

P9 Security Spring ACL SAP Roles Token –

P10

Quality &

Maturity

SKOS based RDF based SAP – – –

P11 Provenance Context

Data Lake

based

SAP – – –

P12 Analytics Statistical Exploratory SAP NLP Big Data –

Table 3: Benchmark-based Comparison of EKG Solutions. Parameter values are taken from Table 2: P1: Modeling; P2:

Curation; P3: Linking; P4: Exploration; P5: Search; P6: Data Model; P7: APIs; P8: Governance; P9: Security; P10: Quality

& Maturity; P11: Provenance; and P12:Analytics. Dash – no feature.

System P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12

Ontorion 5,6 1,2,3 3 1,2 1 1 1 – – – – 1,4

PoolParty 1,2,5 4 1,4 3,5 2 1 1,2,3 – 1 1 – 1,2

KnowledgeStore 4 – 1,2 4 1 2 1,2 – – – 1 –

Metaphacts 5,6 1,2,3 1,2 4,5 1 1 1 – – – – 1

Semaphore 4 4 4 1,4 4,5 3 1 1,2,4 1 – – – 2,4,5

Anzo SDP 5 2,4,5 1,5 3,5 4 3 1,4,5 2 2 – – 2

RAVN ACE 3 – 1 3,5,6 4 4 2,4,6 2,5 – – 2,4

CloudSpace 5 – 2 4,6 1 1 1 – – – – 2

ETMS 1,4 3,4 6 6 – 5 3 1 4 – – –

SAP HANA – – – – – 4 3 3 3 3 3 3

EKG 5,6 1,4 1,2,3 3,5 4 1 1,2,4,5,6 4 2,4,5 2 2 1,2,5

teraction by creating a shared information space that

is understandable by automated agents (e.g., services,

software, other information systems). The Data Mod-

els feature speciﬁes the foundational model of an

EKG, i.e., how the knowledge is stored and repre-

sented in memory. We distinguish between Docu-

ment models, Relational DBs, Taxonomies (Delphi,

2004), and the RDF data model. Whereas Data Mod-

els deﬁne the ‘depth’ of machine communication,

available APIs expose the ‘breadth’ of the commu-

nication. An EKG is by deﬁnition involved in and

connected with other information systems operating

within an enterprise. APIs are a cornerstone and out-

line the difference between EKGs and previous gen-

eration of knowledge management solutions. The

Strategical Development (SD) dimension is orthogo-

nal to the HI and MI dimensions. Strategical Devel-

opment features are of equal importance for both di-

mensions. Governance indicates availability of man-

agement mechanisms within a particular EKG imple-

mentation. EKGs for large enterprises should obey

corporate policies and should be embedded into the

decision-making routine. The Security feature is of

high importance in large enterprises. EKGs should

employ complex and comprehensive access control

procedures, manage rights, and permissions on a large

scale. The Quality & Maturity (Q&M) feature en-

ables corporate management to evaluate the quality of

an EKG and track its evolution over time. Data Qual-

ity indicates a degree of compliance of EKG data to

an accepted enterprise standard level. Maturity shows

a degree of applicability of a certain technology in an

ICEIS 2017 - 19th International Conference on Enterprise Information Systems

enterprise. The Provenance feature allows for track-

ing origins from which the data has been extracted.

Being applied to EKGs provenance elaborates on ver-

sioning and history of the EKG content. The Analyt-

ics feature provides statistics and overview of basic

KPIs of an EKG. The variety of analytical services

indicates aptitude of EKGs to be involved into en-

terprise information ﬂows instead of being reluctant

’knowledge silos’.

4.2 EKG Solutions

Below, we compare 10 of the most prominent EKG

available solutions. These are selected

based on their

relevance given our own EKG deﬁnition, as well as

their maturity (in a deployment-ready state and tar-

geted for enterprise). The EKGs are evaluated against

the benchmark presented in 4.1. Below we elaborate

on the results summarised in Table 3.

Cognitum Ontorion

is a scalable distributed knowl-

edge management system with rich controlled natu-

ral language (CNL) opportunities (Wroblewska et al.,

2013). In the HI dimension, it enables sophisti-

cated semantic modeling of high expressivity employ-

ing full support of OWL2 ontologies, semantic web

rules, description logics (DL) and a reasoning ma-

chine. Collaborative ontology curation is available

via a plain text editor and various forms. The platform

supports ontology mapping out of the box although

the type of mapping is not speciﬁed. Visualization

and exploration means are presented by web forms

and tables. CNL and semantic technologies stack en-

able natural language semantic search queries. In the

MI dimension Ontorion employs RDF and SWRL to

maintain the data model. Through the OWL API and

SPARQL, machines can interact with the system. In

the SD dimension Ontorion offers statistical analytics

and Natural Language Processing (NLP) techniques.

Semantic Web Company PoolParty

is a software

suite envisioned to enrich corporate data with seman-

tic metadata in the form of a corporate thesaurus

(Mezaour et al., 2014; Schandl and Blumauer, 2010).

Knowledge can be modeled as a taxonomy, thesaurus,

or as an ontology reusing built-in vocabularies, e.g.,

SKOS, schema.org, or custom. Curation is organized

in the GUI via forms. Linking capabilities allow for

text mining of unstructured data and manual vocab-

ulary mapping. PoolParty provides web-based visu-

alizations and exploration interfaces while traversing

a taxonomy or ontology (visual browsing). SPARQL

The KMWorld Magazine served as a major source for

their identiﬁcation. http://www.kmworld.com

Web: http://www.cognitum.eu/semantics/Ontorion/

Web: https://www.poolparty.biz/

graphical shell is available to query the EKG. Seman-

tic search supports facets, multilinguality, and struc-

tured and unstructured data including Sharepoint and

Web CMS. To support MI, PoolParty uses the RDF

model and implements JSON RESTful and SPARQL

endpoints. In terms of SD features, Security relies on

the Spring security mechanisms applied to the REST

component of an EKG. Quality management is lim-

ited to SKOS vocabularies and checks potential errors

and misuses. Analytical features perform statistical

and exploratory data analysis.

KnowledgeStore

is a scalable platform (Rospocher

et al., 2016) optimized for storing structured and un-

structured data with the help of a predesigned ontol-

ogy. KnowledgeStore is an integral part of the News-

Reader

project. Modeling expressivity is limited

to the predeﬁned vocabulary so that newly acquired

data must be represented according to this vocabu-

lary. Linking functionality allows for automatic text

mining and entity extraction with the subsequent jux-

taposing the extracted data on the structured semantic

data. A SPARQL client and a faceted browser are

responsible for knowledge exploration. Knowledge-

Store might be integrated with Synerscope Marcato

for comprehensive multi-dimensional visualizations.

Semantic search over a graph might be performed us-

ing natural language queries or entities URIs. The

data model employs RDF in conjunction with un-

structured documents. One can interact with Knowl-

edgeStore via REST API or SPARQL endpoint. Data

sources might be tracked on the instance level, a lim-

ited form of versioning is implemented by the triples

context feature.

Metaphacts

is a platform to design, control, and

visualize domain-speciﬁc KGs. As to the HI,

Metaphacts supports a number of features. Model-

ing opportunities rely on OWL ontologies and rea-

soning mechanisms. Collaborative knowledge cura-

tion is available in the Web-based GUI via forms

and plain text editor. Interlinking capabilities allow

for semantic enrichment of data extracted from un-

structured documents. External public datasets might

be used for enrichment similarly to KnowledgeStore.

Metaphacts offers visualization and exploration in the

form of a faceted browser with custom visualization

widgets. Semantic search queries are executed against

structured RDF data in the EKG. For the MI di-

mension Metaphacts reuses open standards RDF and

OWL having SPARQL endpoint as a communication

interface. However, a distinctive feature of the plat-

Web: https://knowledgestore.fbk.eu/

Web: http://www.newsreader-project.eu/

Web: http://www.synerscope.com/marcato

Web: http://metaphacts.com/

Enterprise Knowledge Graphs: A Semantic Approach for Knowledge Management in the Next Generation of Enterprise Information

Systems

form is in the analytical domain as it provides ma-

chine learning and graph analysis functions enhanced

by the GUI and custom visualization engine.

Smartlogic Semaphore 4

is a content intelligence

platform which provides a broad spectrum of func-

tions for an enterprise. The modeling is performed us-

ing taxonomies, SKOS vocabularies, and ontologies.

Web GUI Editor allows for effortless curation lever-

aging Web forms, and drag & drop functions. Vocab-

ularies might be interlinked on the schema level man-

ually by a user. Data from unstructured documents

is extracted and aligned with the vocabularies. Visu-

alization and exploration means provide eloquent in-

sights on the EKG and its construction routine. Text

mining GUI guides a user from uploading a docu-

ment to facts extraction. Semaphore 4 enables seman-

tic search capable of answering federated queries and

faceted queries. Elaborating on the MI Semaphore is

oriented toward RDF data model. One of the impor-

tant advantages for an enterprise is a set of integration

interfaces with other enterprise-level solutions, e.g.,

Apache Solr, Oracle WebCenter Content. In the SD

aspect, Semaphore allows for the creation of gover-

nance policies and user workﬂows to follow a certain

policy. Powered by Big Data and NLP Semaphore of-

fers a comprehensive analytical framework with rich

visualizations.

Cambridge Semantics Anzo Smart Data Plat-

form

is a data integration platform which allows

structured and unstructured data to be applied in the

corporate information ﬂows. As regards HI, Anzo

SDP’s components allow for data curation, interlink-

ing, rich visualizations, and advanced semantic search

functionality. Modeling expressivity relies on fully-

supported RDF and OWL ontologies. Curation is per-

formed in a standalone Ontology Editor which ex-

ports the created ontology to the Smart Data Lake.

SDP provides semi-automatic means for integration

of Excel spreadsheets with ontologies with user in-

volvement. SDP offers a comprehensive web-based

GUI with a wide range of visualization widgets. Se-

mantic search is supported by the NLP engine and

is capable of operating on top of structured and un-

structured content simultaneously. In the scope of MI

Anzo SDP implements the RDF model built on top

of Apache Hadoop HDFS, Apache Spark, and Elas-

ticSearch. The platform might be integrated with ex-

isting IT solutions via numerous supported interfaces,

i.e., Enterprise Service Bus (ESB), OSGi, SPARQL,

JDBC, and custom APIs. As SD, the platform in-

Web: http://www.smartlogic.com/what-we-do/

products-overview/semaphore-4

Web: http://www.cambridgesemantics.com/

technology/anzo-smart-data-platform

cludes Governance, Security, and Analytics features.

Governance is supported via default methodologies

and best practices intended to preserve a robust work-

ﬂow on each hierarchy level. A security mechanism

is based on access control lists (ACL) and roles sepa-

ration. Data analytics opportunities offered by SDP

include exploratory analytics via interactive dash-

boards, sentiment analysis of text data, and spread-

sheet analytics.

RAVN Applied Cognitive Engine

is an enterprise

software suite for complex processing of unstructured

data, e.g., text documents. ACE consists of numerous

modules responsible for data extraction, search, visu-

alization, and analytics. Being mostly a text mining

tool, ACE modeling expressiveness adopts character-

istics of the inner abstract schema. Yet the details of

the schema remain unclear, the schema is used to store

knowledge in the built-in graph database. Linking op-

portunities are presented with a broad spectrum of text

analysis functions available in ACE. ACE provides

a comprehensive web-based GUI to maintain the ex-

traction and processing timeline. GUI presents inter-

active search, KG visualizations, provides widgets for

the analytics module. Full-text semantic search in nat-

ural language is supported by Apache Solr and graph

database. To support MI, ACE employs the property

graph model on top of MongoDB and ApacheSolr

The platform provides a range of REST, Java, and

Python APIs to interact with enterprise applications.

To support SD, ACE beneﬁts from the comprehen-

sive security subsystem and analytical functions. Se-

curity is based on token strings and access level. If

a user token is contained in the allowed list, then her

access level is determined by subparts of the token

string. ACE analytical functions employ NLP to sup-

ply a user with sentiment analysis, expert locator, and

exploratory visualizations.

SindiceTech CloudSpace Platform

provides a set

of tools to build and maintain custom EKGs. CSP

supports OWL ontologies for data modeling with

high expressivity. The platform does not provide vi-

sual tools for curation. Linking functions allow for

ontology-based integration when heterogeneous data

from different sources is transformed to RDF and

mapped with one ontology. Pivot Browser enables

relational faceted browsing for data exploration. On-

tologies can be visualized as graphs. Search function-

Web: https://www.ravn.co.uk/technology/applied-

cognitive-engine/

Hoeke, J. V. Ravn cor whitepaper: https://www.ravn.

co.uk/wp-content/uploads/2014/07/CORE-Whitepaper-

W.V1.pdf

Web: http://www.sindicetech.com/cloudspace-

platform.html

ICEIS 2017 - 19th International Conference on Enterprise Information Systems

Ontorion

PoolParty

Knowledge Store

Metaphacts

Semaphore 4

Anzo SDP

RAVN ACE

CloudSpace

Synaptica ETMS

SAP HANA

EKG

(a) 3-clusters

Ontorion

PoolParty

Knowledge Store

Metaphacts

Semaphore 4

Anzo SDP

RAVN ACE

CloudSpace

Synaptica ETMS

SAP HANA

EKG

(b) 4-clusters

Ontorion

PoolParty

Knowledge Store

Metaphacts

Semaphore 4

Anzo SDP

RAVN ACE

CloudSpace

Synaptica ETMS

SAP HANA

EKG

Figure 2: (a) Hierarchical clustering, n=3. The orange cluster offers rich SD functionality, the azure corresponds to the

taxonomy management, the blue comprises RDF based systems; (b) EM algorithm, n=4. All of the above, but the nankeen

cluster emphasizes text mining; (c) K-means algorithm, n=5. The green cluster employs link discovery tools.

ality is powered by Semantic Information Retrieval

Engine (SIREn) based on Apache Solr and Elastic-

Search. In the MI dimension, CSP utilizes the RDF

model with SPARQL as a basic API. SD features of

the platform provide exploratory analytical services,

e.g., modeling debugging and content overview.

Synaptica Enterprise Taxonomy Management

Software

is a technology to manage corporate tax-

onomies, classiﬁcations and ontologies. ETMS mod-

eling expressivity is limited to taxonomies and SKOS

vocabularies. Curation means are based on a Web

GUI which presents a tree-like interface with forms

to ﬁll in. Linking functionality is supported by taxon-

omy mappings. Mappings between taxonomies might

be established automatically whereas mappings of vo-

cabularies can be done manually via forms, and drag

& drop in the GUI. Taxonomies and vocabularies are

visualized as trees, hyperbolic graphs, hierarchies,

area charts, and tree maps. For MI support, ETMS

implements the Taxonomy model. The platform ex-

poses RDBMS APIs to connect the system to the rest

of the corporate IT environment. In the SD dimen-

sion, ETMS allows for data governance and security.

Default policies and workﬂows specify 12 possible

roles with certain access permissions. A user is cat-

egorized in one of such groups and has access to the

constrained part of the taxonomy with limited rights.

SAP HANA SPS 11

is a "one size ﬁts all" enter-

prise software suite which includes an opportunity to

create and maintain EKGs on top of the HANA in-

frastructure and a speciﬁc graph database. HANA

capabilities are overwhelmingly broad and cover all

Web: http://www.synaptica.com/products/

Web: https://hana.sap.com/capabilities/sps-releases/

sps11.html

the features in the SD dimensions which are built-

in in the default HANA distribution. However, an

approach how HANA utilizes semantic technologies

to support human interaction with an EKG remains

vague. HANA Graph Engine is based on property

graphs which naturally satisfy EKG needs. The en-

gine exposes SQL-based APIs to communicate with

versatile HANA modules and components.

EKG denotes the proposed EKG architecture with all

the necessary features in HI, MI, and SD described in

Section 3, Section 2, and Section 4.1.

5 EVALUATION OF

STATE-OF-THE-ART

We conduct an unsupervised cluster analysis to ex-

isting EIS approaches for identifying groups of ap-

proaches of similar functionality

. Table 3 is trans-

formed into a matrix with numerical values that de-

note normalized distances (in the range [0,1]) among

the features, where values close to 1.0 indicate prox-

imity to the deﬁned EKG model. Values are chosen

not to describe ’the best’ or ’the worst’ solution, but to

stress the difference in functions the surveyed systems

provide. Weka

is employed to perform the cluster-

ing into three, four, and ﬁve clusters. Varying the

clusters number, we aim at identifying groups of EISs

which share similar functions. One and two clusters

give a superﬁcial representation of the surveyed sys-

tems, and thus, are not included in the analysis. Three,

four, and ﬁve clusters are able to reveal dependen-

Source codes are available in the github repository:

https://github.com/migalkin/ekgs_clustering

http://www.cs.waikato.ac.nz/ml/weka/

Enterprise Knowledge Graphs: A Semantic Approach for Knowledge Management in the Next Generation of Enterprise Information

Systems

cies encoded in the vector representation which might

have been overlooked or hidden during a manual re-

view. Three clustering algorithms were applied, i.e.,

hierarchical algorithm, EM algorithm, K-means algo-

rithm for three, four, and ﬁve clusters, respectively.

We then visualized the results with D3.js visualiza-

tion library. The radius of each node is proportional

to the Frobenius norm of a features vector. The re-

sults are depicted in Fig. 2. The arrangement and

positions of clusters are generated randomly, i.e., no

cluster is logically closer to the nearest cluster than

other clusters. Fig. 2(a) represents the distribution in

three clusters computed by the hierarchical clustering

algorithm. The biggest blue cluster comprises sys-

tems which can be described as RDF-based systems

with rich visualizations, semantic search and analyti-

cal functions. However, the blue cluster lacks the fea-

tures of the Strategical development dimension. On

the other hand, the orange cluster contains systems

that implement features from this dimension. The

size of the envisioned EKG approach is bigger due to

the leverage of the Human Interaction features which

SAP HANA SPS lacks. The smallest azure cluster

contains only one system which distinguishes itself

as a taxonomy management tool.

Fig. 2(b) represents a distribution in four clus-

ters performed by the EM algorithm. The EM

(Expectation-maximization) algorithm involves latent

variables to maximize likelihood of the input parame-

ters. The extent of divergence in features (SD and tax-

onomies, respectively) keeps the orange and azure

clusters the same as in the previous case. However, a

new nankeen yellow cluster is marked off from the

blue cluster. The newly discovered cluster is charac-

terized by the emphasis on text mining functions of

the HI dimension. The blue cluster is still deﬁned as

semantic-based systems with rich visualizations.

Fig. 2(c) represents the distribution in ﬁve clus-

ters calculated by the K-means algorithm. The algo-

rithm partitions input data in clusters by comparing

means of an input vector and the clusters. The input

vector is attached to the nearest mean value cluster.

Firstly, the blue cluster was split further. The systems

in the blue cluster are characterized as RDF-based

knowledge management solutions which rely on on-

tologies, collaborative visual editing and wide range

of supported APIs. Secondly, a new green cluster,

derived from the blue cluster, is described as a set of

systems that is based on semantic technologies and is

capable of performing data interlinking. Additionally,

the green cluster exposes only SPARQL and OWL

API communication interfaces. The nankin yellow

cluster remained the same as a distinctive set of text

mining solutions. The orange cluster added one sys-

tem. Nevertheless, the cluster is still described as an

enterprise-friendly collection of systems with features

from the SD dimension. The azure cluster now com-

prises a different system, Anzo SDP. The cluster con-

tains a system capable of spreadsheet processing us-

ing ontologies for high-level schema deﬁnition. The

wide range of supported APIs and RDB implementa-

tion of RDF distinguish the azure cluster.

6 RELATED WORK

6.1 The Technical Implementation

Dimension

Data heterogeneity and volume are among the main

challenges when building an EKG. Being comprised

of numerous data sources, an EKG requires tools

for interlinking, integration, and fusion of such data

sources to ensure data consistency and veracity. Large

volumes of common datasets (e.g., DBpedia dump is

abput 250 GB, PubMed dump is about 1.6 TB) imply,

that the integration pipeline has to be as automatic

as possible. Prominent approaches for automatic

linked data integration and fusion are LDIF (Schultz

et al., 2012), OD CleanStore (Michelfeit et al., 2014),

LIMES (Ngonga Ngomo and Auer, 2011). All of

them maintain an integration pipeline starting from

data ingestion from various remote data sources to

a high-quality target data source. LDIF and ODCS

resort to SILK (Isele and Bizer, 2013) the tasks of

entity recognition and link discovery tool. LDIF em-

ploys Sieve (Mendes et al., 2012) as the Data Fusion

module which aims at resolving property values con-

ﬂicts by the assessment of the quality of the source

data and by application of various fusion policies,

while OD CleanStore uses a custom data fusion en-

gine. LIMES (Ngonga Ngomo and Auer, 2011) is a

stand-alone link discovery tool which utilizes math-

ematical features of metric spaces in order to com-

pute similarities between instances. Sophisticated al-

gorithms signiﬁcantly reduce the number of neces-

sary comparisons of property values. Linked Data in-

tegration tools digest data from various public RDF

datasources. Those community driven knowledge

graphs are either general-purpose bases (e.g., DBpe-

dia, Wikidata

, Freebase) which comprise facts from

different domains or domain speciﬁc bases which aim

at providing detailed insights on a particular theme.

DBpedia is one of the largest graphs containing more

than three billion facts. Wikidata is envisioned to

be a universal source of information for populating

https://www.wikidata.org/wiki/Wikidata:Main_Page

ICEIS 2017 - 19th International Conference on Enterprise Information Systems

Wikipedia articles. Freebase has been acquired by

Google as a source for its Knowledge Graph and

Knowledge Vault (Dong et al., 2014).

6.2 The Enterprise Dimension

Large enterprises have integrated Semantic Web tech-

nologies with their IT infrastructures in various forms.

Statoil used Ontology-Based Data Access (OBDA)

to integrate their relational databases with ontolo-

gies (Kharlamov et al., 2015). NXP Semiconductors

transformed product information into an RDF prod-

uct taxonomy (Meenakshy and Walker, 2014). Stolz

et al. (Stolz et al., 2014) derived OWL ontologies

from product classiﬁcation systems. All the above-

mentioned efforts have, however, only analyzed the

one speciﬁc domain that is directly relevant, and al-

though they refer to KGs profusely the authors do

not elaborate on the deﬁnition, conceptual descrip-

tion or reference architecture of such KGs. In gen-

eral, a precise deﬁnition of the EKG concept remains

to be developed. One contribution of this paper is

to address the above by clearly deﬁning the con-

cept, and positioning the potential of EKGs. EKGs

have barely been explored along the data quality di-

mension in either Business Informatics or Semantic

Web communities. The Competence Center Corpo-

rate Data Quality (CC CDQ) has developed a frame-

work (Otto and Oesterle, 2015) for Corporate Data

Quality Management (CDQM) that presents a model

for the evaluation of data quality within a company.

Being a thoroughly tailored enterprise tool, the frame-

work is, however, only applicable to conventional

data architectures, i.e., MDM, Product Information

Management (PIM) and Product Classiﬁcation Sys-

tems (PCS), and extending the framework to cover

semantic data architectures is impractical.

7 CONCLUSIONS AND FUTURE

WORK

In this article, we presented the concept of Enterprise

Knowledge Graphs, described its structure, functions,

and purposes. We devised an assessment framework

to evaluate essential EKG properties along three di-

mensions related to the human interaction, machine

interaction, and strategical development. We pro-

vided an extensive study of existing enterprise so-

lutions, which implement EKG functionality to a

certain extent. The analysis indicates a wide va-

riety of approaches based on different data, gover-

nance, and distribution models and emphasizing the

human and machine interaction domains differently.

However, the strategical development functions, i.e.,

Provenance, Governance, Security, Quality, Maturity,

which are of utmost importance for an enterprise, of-

ten still remain unaddressed by current EKG tech-

nologies. We see here a room for signiﬁcant improve-

ments and new innovative technologies empowering

companies to fully leverage the EKG concept for data

integration, analytics, and the establishment of data

value chains with partners, customers and suppliers.

In future work, we aim at enhancing the EKG model,

implement a reference EKG architecture, and develop

a prototype which implements and supports the next

generation of EISs.

REFERENCES

Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C.,

Cyganiak, R., and Hellmann, S. (2009). Dbpedia-a

crystallization point for the web of data. Web Seman-

tics: science, services and agents on the world wide

web, 7(3):154–165.

Delphi (2004). Information intelligence: Content classiﬁca-

tion and the enterprise taxonomy practice. Whitepa-

per.

Dong, X., Gabrilovich, E., Heitz, G., and Horn, W. (2014).

Knowledge vault: A web-scale approach to proba-

bilistic knowledge fusion. In 20th ACM SIGKDD Int.

Conference on Knowledge Discovery and Data Min-

ing, pages 601–610.

Hislop, D. (2013). Knowledge management in organi-

zations: A critical introduction. Oxford University

Press.

Isele, R. and Bizer, C. (2013). Active learning of expres-

sive linkage rules using genetic programming. Web

Semantics, 23:2–15.

Kharlamov, E., Hovland, D., Jiménez-Ruiz, E., Lanti, D.,

Lie, H., Pinkel, C., Rezk, M., Skjæveland, M. G.,

Thorstensen, E., Xiao, G., et al. (2015). Ontology

based access to exploration data at statoil. In The Se-

mantic Web-ISWC 2015. Springer.

Meenakshy, P. and Walker, J. (2014). Applying semantic

web technologies in product information management

at nxp semiconductors. In 13th International Semantic

Web Conference (ISWC 2014).

Mendes, P. N., Mühleisen, H., and Bizer, C. (2012). Sieve:

Linked data quality assessment and fusion. In 2012

Joint EDBT/ICDT Workshops, pages 116–123.

Mezaour, A.-D., Van Nuffelen, B., and Blaschke, C. (2014).

Building enterprise ready applications using linked

open data. In Linked Open Data–Creating Knowledge

Out of Interlinked Data, pages 155–174. Springer.

Miao, Q., Meng, Y., and Zhang, B. (2015). Chinese enter-

prise knowledge graph construction based on linked

data. In Semantic Computing (ICSC), 2015 IEEE In-

ternational Conference on, pages 153–154. IEEE.

Michelfeit, J., Knap, T., and Ne

cask

y, M. (2014). Linked

Enterprise Knowledge Graphs: A Semantic Approach for Knowledge Management in the Next Generation of Enterprise Information

Systems

data integration with conﬂicts. arXiv preprint

arXiv:1410.7990.

Ngonga Ngomo, A.-C. and Auer, S. (2011). Limes - a time-

efﬁcient approach for large-scale link discovery on the

web of data. In IJCAI.

Nickel, M., Murphy, K., Tresp, V., and Gabrilovich, E.

(2016). A review of relational machine learning

for knowledge graphs. Proceedings of the IEEE,

104(1):11–33.

Otto, B. and Oesterle, H. (2015). Corporate Data Quality.

Prerequisite for Successful Business Models. epubli.

Romero, D. and Vernadat, F. B. (2016). Future perspec-

tives on next generation enterprise information sys-

tems. Computers in Industry, 79:1–2.

Rospocher, M., van Erp, M., Vossen, P., Fokkens, A., Ald-

abe, I., Rigau, G., Soroa, A., Ploeger, T., and Bogaard,

T. (2016). Building event-centric knowledge graphs

from news. Web Semantics: Science, Services and

Agents on the World Wide Web.

Schandl, T. and Blumauer, A. (2010). Poolparty: Skos the-

saurus management utilizing linked data. In The Se-

mantic Web: Research and Applications. Springer.

Schultz, A., Matteini, A., Isele, R., Mendes, P. N., Bizer, C.,

and Becker, C. (2012). Ldif-a framework for large-

scale linked data integration. In 21st Int. World Wide

Web Conference (WWW 2012), Developers Track.

Stolz, A., Rodriguez-Castro, B., Radinger, A., and Hepp,

M. (2014). Pcs2owl: A generic approach for deriving

web ontologies from product classiﬁcation systems.

In The Semantic Web: Trends and Challenges, pages

644–658. Springer.

Ullman, J. D. (1997). Information integration using logi-

cal views. In International Conference on Database

Theory, pages 19–40. Springer.

Wroblewska, A., Kaplanski, P., Zarzycki, P., and Lu-

gowska, I. (2013). Semantic rules representation in

controlled natural language in ﬂuenteditor. In Human

System Interaction (HSI), 2013 The 6th International

Conference on, pages 90–96. IEEE.

ICEIS 2017 - 19th International Conference on Enterprise Information Systems