Industrial Big Data: From Data to Information to Actions

Andreas Kirmse

, Felix Kuschicke

and Max Hoffmann

Institute of Information Management in Mechanical Engineering (IMA), RWTH Aachen University, Aachen, Germany

Konica Minolta, Darmstadt, Germany

Keywords:

Industrial Big Data, Industrial Data Lakes, Information Integration, Data Acquisition, Cyber-physical

Systems, Industry 4.0, Smart Manufacturing, Information Systems, RDBMS, OPC UA, MQTT.

Abstract:

Technologies related to the Big Data term are increasingly focusing the industrial sector. The underlying

concepts are suited to introduce disruptive changes in the various ways information is generated, integrated

and used for optimization in modern production plants. Nevertheless, the adoption of these web-inspired

technologies in an industrial environment is connected to multiple challenges, as the manufacturing industry

has to cope with speciﬁc requirements and prerequisites that differ from common Big Data applications.

Existing architectural approaches appear to be either partially incomplete or only address individual aspects

of the challenges arising from industrial big data. This paper has the goal to thoroughly review existing

approaches for industrial big data in manufacturing and to derive a consolidated architecture that is able to deal

with all major problems of the industrial big data integration and deployment chain. Appropriate technologies

to realize the presented approach are accordingly pointed out.

1 INTRODUCTION

Increasing customer demands regarding product qual-

ity and diversity as well as fast changing markets

and strong competitors pose major challenges to the

manufacturing industry (Brecher and

Ozdemir, 2017).

While factory ﬂoor digitization, networking capabili-

ties and automation signiﬁcantly improve all areas of

manufacturing companies, it also creates a stagger-

ing amount of available production data: the so-called

Industrial Big Data (IBD) (General electric, 2012),

which characterizes the increasing amount of data that

is collected in industrial environments. The term has

been adapted from the broader Big Data (BD) term

covering all types of data and different data sources

like social media, environmental or consumer data.

The digital transformation of industrial environ-

ments and a new global availability of information

enabled by methods of IBD allows for new ways of

realizing data-driven potentials for producing compa-

nies. The main business drivers which can be ad-

dressed by utilizing IBD are: Equipment costs re-

duction, product quality assurance (also warranty and

after-sales management), and operational efﬁciency

improvement (Intelligence, 2009). In order to extract

the desired information from the (raw) industrial big

data, various transformation steps have to be pursued:

(1) data transformation (i.e. turning data into informa-

tion and ﬁnally into insights), and (2) a transformation

of the information user (i.e. the human being that uses

insights to implement improvement measures) (Ham-

mer et al., 2016). In order to close the feedback loop

from the shop ﬂoor, followed by profound decision

making through information users and ﬁnally back to

the manufacturing ﬁeld, new data-driven strategies,

road-maps and concrete IT infrastructure planning are

in need (Mourtzis et al., 2016).

In this paper, we address the entire tool chain of

transforming (raw) data into useful information. The

utilization of production data for insights and further

improvements require several major steps to realize

extraction of value from (I)BD. In principle, these ma-

jor steps are implemented by a continuous tool chain

and are furthermore independent of the application or

use case. For this purpose, various guidelines such

as reference architectures/layer structures – some of

them speciﬁc for manufacturing – have been devel-

oped in the recent years. However, these guidelines

are either partially incomplete or address only single

aspects of BD technologies. In most BD use cases,

unstructured data with low meta data footprint must

be processed, while in IBD use cases, rather struc-

tured and fast generated data with known meta infor-

mation and high variety of information is targeted.

Kirmse, A., Kuschicke, F. and Hoffmann, M.

Industrial Big Data: From Data to Information to Actions.

DOI: 10.5220/0007734501370146

In Proceedings of the 4th International Conference on Inter net of Things, Big Data and Security (IoTBDS 2019), pages 137-146

ISBN: 978-989-758-369-8

137

Accordingly, the speciﬁc requirements of indus-

trial big data in contrast to (traditional) big data have

to be pointed out to meet the challenges of modern

factories. As mentioned above, the questions to be an-

swered focus on how to accurately structure (known)

meta-data and how to systematically extract useful

information from massively generated data. Both,

the format of the raw data as well as the representa-

tional form of the generated information suitable for

the information user have to be clearly structured and

adaptable to the targeted application or use-case. The

further discussion of these requirements leads to the

research question of this paper:

• How is manufacturing data currently generated

and stored for industrial data solutions?

• Which steps are needed to realize the successful

acquisition of information from the ﬁeld level up

to information management systems?

• What are the fundamental building blocks of an

Industrial Big Data reference architecture?

• Which major technologies and/or methods can be

used to qualify for the tasks of each step?

The authors of this research publication have

been actively involved in the transformation processes

of large-scale manufacturing companies and propose

theoretical approaches as well as ﬁeld-proven meth-

ods on how to implement these concepts into real-life

applications on an industrial scale.

2 RESEARCH BACKGROUND

Similar to the traditional Big Data the Industrial Big

Data term characterizes an umbrella concept for stor-

ing and dealing with huge amounts of diverse, fast

incoming data. The technologies connected to IBD

are increasingly used in the area of industrial pro-

duction, since the hardware requirements for real-

izing such data-intensive environments are continu-

ously decreasing. Driven by embedded systems (e.g.

realized through edge-computing devices) and by a

higher pervasion of networking technologies, IBD

represent a common condition in modern factories.

The enabler technologies for IBD are strongly

connected to the acquisition and networking of in-

formation across various locations of a production

site. The most common realization of these tech-

nologies are characterized by terms such as Cyber-

Physical Systems (CPS) and more speciﬁcally for the

ﬁeld of production as Cyber-Physical Production Sys-

tems (CPPS). Another enabler technology is referred

to as the (Industrial) Internet of Things (IIoT), a term

which characterizes the global availability of informa-

tion through applications with a low footprint. After

a description of these foundations, the ﬁeld Industrial

Big Data and its according implementation through

IBD architectures is described in detail.

2.1 The Industrial Internet of Things &

Cyber-physical Production Systems

One deﬁnition of Big Data characterizes its purpose

quiet well by stating that “The world has always had

’big’ data. What makes ’big data’ the catch phrase

[. . . ] is not simply about the size of the data. ’Big

data’ also refers to the size of available data for anal-

ysis, as well as the access methods and manipula-

tion technologies to make sense of the data.” (Ebbert,

2018). Thus, as pointed out, BD is more about the

availability of information, which due to the techno-

logical advances can be easily collected by making

use by small and powerful embedded devices. The

networking of the collected data ﬁnally enable its

global availability for various applications.

The technological umbrella terms IIoT and CPPS

represent technologies that intend to implement the

described tool chain of information acquisition, col-

lection, integration and usage. For the purpose of the

present research work, we consider IIoT and CPPS

as synonyms, like it is stated in (Jeschke et al., 2017)

and (US Dept. of Commerce blog, 2014). Both con-

cepts are derivatives of the non-production related In-

ternet of Things and Cyber-Physical Systems termi-

nology and basically describe the transfer to a produc-

tion terminology (Sadiku et al., 2017). While there is

a relatively sharp distinction between IoT and CPS,

which is characterized by the capability of CPS to

perform edge computing in the ﬁeld while IoT are in-

tended to solely provide data from distributed devices,

the terms IIoT and CPPS are much closer as IIoT are

equally characterized by enabling smart applications

on the shop ﬂoor.

The term IIoT describes a concept, where data

exchange of production devices enables embedded

technologies and interconnected networks. The

IIoT thereby implies the use of sensors and actu-

ators, control systems, machine-to-machine (M2M)

communication, data analytics, and security mecha-

nisms (Mourtzis et al., 2016). According to Gartner,

there will be nearly 20 billion devices connected to

the IoT by 2020 – with the large majority of them

coming from the industrial sector. The adoption of

the IIoT results in the gradual replacement of clas-

sical network architectures like the automation pyra-

mid (VDI/VDE-Gesellschaft, 2013) and enables the

direct access to equipment data.

IoTBDS 2019 - 4th International Conference on Internet of Things, Big Data and Security

138

In an IIoT environment, data, services and func-

tions are stored and processed where they are needed,

in contrast to the traditional approach, in which the

data was manipulated to ﬁt to the systems of a grown

ecosystem of information management systems, i.e.

characterized by the different levels of the automation

pyramid. The implementation of such data-driven ap-

proach requires new design patterns in order to com-

ply with these business needs. The business require-

ments might include an application of various dif-

ferent use-cases to same information or a context-

speciﬁc visualization of information depending on the

person regarding the data. A thorough list of concerns

and challenges resulting from this fact can be found in

(Jeschke et al., 2017).

After a successful implementation of IIoT the ac-

quisition of vast amounts of data can be accomplished

that ﬁnally leads to the presence of IBD. Depending

on the protocol, which is implemented in terms of the

IIoT application, the integrated information is present

in a rather structured form. In further steps this data

will be processed to enable a deeper analysis to ex-

tract insights and valuable information.

2.2 Industrial Big Data

One of the most important outcomes of emerging IIoT

is the generation of large data volumes centrally accu-

mulated and stored, which grows at an unprecedented

rate – this volatility of generation speed in data is

one of the major characteristics about IBD (Mourtzis

et al., 2016). According to (McKinsey, 2017), in 2010

manufacturing stored more data than any other sec-

tor – estimated two exabytes. To summarize the def-

inition of IBD, the basic characteristics of IBD are

the high volume, velocity, and variety of data (Laney,

2011); although new characteristics are being contin-

uously introduced with “value” being the most impor-

tant (Yin and Kaynak, 2015). In comparison with

BD, IBD usually has a higher data quality and is

more structured, more correlated, more orderly in

time and more prepared to extract insights (for both

low and advanced methods) (Lee et al., 2015). How-

ever, IBD has higher demands in terms of ﬂexibility

and application-speciﬁc utilization of data.

According to (Kuschicke et al., 2017), the term

IBD stands not only for industrial data itself, but

also for the techniques and methods to utilize and

process the data. To underline this characterization,

the deﬁnition of Wilder-James ﬁts quiet well: “Big

data is data that exceeds the processing capacity of

conventional database systems. The data is too big,

moves too fast, or does not ﬁt the structures of your

database architectures. To gain value from this data,

you must choose an alternative way to process it. [. . . ]

To clarify matters, the three Vs of volume, velocity

and variety are commonly used to characterize dif-

ferent aspects of big data. They are a helpful lens

through which to view and understand the nature of

the data and the software platforms available to ex-

ploit them.” (Wilder-James, 2012) Thus, BD as well

as IBD solutions are build around this viewpoint on

Big Data, seeking for methods how to deal with the

characteristics of volume, velocity and variety.

In addition to this, IBD involves further methods

such as data acquisition, storage, and management

techniques. To make use of the gathered and con-

solidated data, methods originated from the domains

of data visualization, data mining, machine learn-

ing, and artiﬁcial intelligence are applied (Chen and

Zhang, 2014); (Gluchowski et al., 2007) and accord-

ingly complete the tool-set of IBD.

2.3 Industrial Big Data Architectures

The application of IBD techniques requires certain

guidance, as the process of gaining valuable knowl-

edge from industrial data involves several steps. For

optimal results (little effort, short delivery time, high

value) the different steps need to be performed in

close coordination. Therefore, different approaches

– from high level solutions up to detailed instructions

with examples – were developed and presented so far.

Unfortunately, the existing approaches do not com-

prehensively match the requirements and needs of the

manufacturing industry. The following chapter con-

tains a survey, which is not intended to be complete,

but should rather provide an overview about the main

streams in this research area. The examples provided

offer a rather high level and functional focus or target

generic solutions.

One high level architecture for BD is the so called

’big data pipeline’ published in (Bertino et al., 2011).

The authors describe a serial process of multiple

phases which are necessary steps to enable the anal-

ysis and accordingly the exposure of the hidden po-

tentials in the data. The pipeline consists of the

phases “acquisition / recording”, “extraction / clean-

ing / annotation”, “integration / aggregation / rep-

resentation”, “analysis / modeling” and “interpreta-

tion”. The pipeline gives readers a basic overview on

how to generally extract value from BD. However, the

presented approach does not describe how to link the

various phases in the pipeline, nor how to implement

the contents of the individual phases.

A more detailed approach has been developed by

the Industrial Internet Consortium (Industrial Internet

Consortium, 2017) which is referred to as the Indus-

Industrial Big Data: From Data to Information to Actions

139

trial Internet Reference Architecture (IIRA). The in-

troduced reference architecture consists of ﬁve func-

tional domains (control, operations, information, ap-

plication, and business). The last three functions rep-

resent the functionality of the big data pipeline as they

contain data generation, acquisition, transformation,

storage, access and analysis including a HMI.

A reference architecture model, which particu-

larly popular in Germany, is the Reference Architec-

tural Model Industry 4.0 (RAMI 4.0) (VDI/VDE So-

ciety Measurement and Automatic Control (GMA),

2015). This model introduces a three-dimensional

view on “Industry 4.0” based on the layer struc-

ture of the Smart Grid Architecture Model (CEN-

CENELEC-ETSI Smart Grid Coordination Group,

2012). Additional, a life cycle and value stream di-

mension plus a hierarchy level dimension complete

the model. A major focus lies on the data acquisition

step of Industrial Big Data. The model introduces a

so-called ’Administration Shell’ to collect data from

shop-ﬂoor devices. This Shell contains the features

device meta-data management (head) and manage-

ment of data transfer (body) and so, turns devices into

(smart) I4.0 objects. Lastly, the model only partially

addresses the subsequent steps such as data storage,

access and analysis.

Another German centered approach has been de-

veloped by the Fraunhofer society (Society, 2017).

The so-called “Industrial Data Space” focuses mainly

on the description of different roles within one “data

ecosystem”. Similar to RAMI 4.0 it introduces a ﬁve-

layer structure. Each role has certain functions, de-

pending on the layer. E.g., authorization of data usage

is a task for the data owner in the functional layer. As

a summary, the reference architecture is an approach

to establish roles and to assign responsibilities in the

data space on a higher level.

A simpler reference architecture is introduced and

applied in (P

akk

onen and Pakkala, 2015). The ar-

chitecture consists of the elements “data Source”,

“data extraction”, “data loading and preprocessing”,

“data processing”, “data analysis”, “data loading and

transformation”, “interfacing and visualization”, and

“data storage”. The different elements are mapped

against infrastructures of tech-giants (e.g. “Face-

book”, “Twitter”, “Netﬂix”, etc.). In course of this,

a more detailed and applicable model is generated,

which covers mostly the integration, processing and

analytics steps of traditional BD ecosystems.

A similarly detailed and applicable architecture is

presented in (Chen et al., 2014). The key elements

are “data generation”, “data acquisition”, “data stor-

age” and “big data analysis”. Each key element it-

self has sub-elements containing information that are

more detailed on a lower abstraction level, e.g. in-

cluding higher granularity data. Additional, available

technologies are mapped against the tasks of the sub-

elements, analog to (P

akk

onen and Pakkala, 2015).

Another lightweight modular based integration ar-

chitecture focuses on the issue of bringing brown-

ﬁeld devices with proprietary protocols into a har-

monized representation of information is presented

in (Kirmse et al., 2018). It thereby addresses legacy

devices and integrates generated data with existing

network zones separating the production ﬂoor from

typical ofﬁce and analytic areas as well as multiple

factory locations globally.

A variety of additional approaches has been pub-

lished such as in Constance (Hai et al., 2016), a main-

tenance approach in (O’Donovan et al., 2015) or a

high-level description of BD life cycle and infrastruc-

ture in (Demchenko et al., 2013). Further reference

architectures for Big Data can be found in (P

akk

onen

and Pakkala, 2015).

Lastly, a critical review of some reference ar-

chitectures for smart manufacturing is conducted

in (Moghaddam et al., 2018). Here the focus lies on

individual interviews with seven experts on the two

questions of the characteristics of a reference archi-

tecture for smart manufacturing and the required steps

for businesses to get there. Their ﬁndings show that a

uniﬁcation is necessary and the diverse architectures

start from different views, despite the common goal

to strive towards a (macro) service-oriented architec-

ture. All lack the deﬁnition of micro services and how

humans interact in these systems and environments

3 THE BASIC ELEMENTS OF BIG

DATA REFERENCE

ARCHITECTURES

A thorough review of the previously mentioned refer-

ence architectures leads to the deduction of the fol-

lowing basic elements (and their tasks), which are

needed to turn data into useful information and ﬁnally

into action:

1. Industrial Big Data Sources and Generation

2. Industrial Big Data Acquisition and Preparation

3. Industrial Big Data Storage and Access

4. Industrial Big Data Processing and Analysis

5. Industrial Big Data Information Presentation and

Interfaces

The following section describe each of these core

elements with respect to realize the entire integration

IoTBDS 2019 - 4th International Conference on Internet of Things, Big Data and Security

140

chain from (raw) industrial big data up to an analytics

cloud and/or human decider.

3.1 Sources and Generation

The starting point of Industrial Big Data is situated at

the very low level of the factory ﬂoor, where the raw

data is generated by means of various devices, control

units, robots, etc. Accordingly, the prior task of this

element consists in the provision of generated data.

This generated data hereby represents raw material of

the information value chain. The quality of this data

is of fundamental importance for all further steps of

the information value and integration chain. However,

despite the importance of raw data, the arising Indus-

trial Big Data itself is only the result of the digitiza-

tion of production processes and not its cause. The

driver of the mentioned generation are mainly due

to increasing equipment with more sensors that read

various amounts of different information and monitor

speciﬁc processes in high frequency.

3.2 Data Acquisition and Preparation

Data acquisition is the process that bridges the gap

between the sole existence of distributed information

and their actual collection. The capabilities of data

acquisition are strongly affected by the process of ma-

chine communication and the protocols that are used

for the internal data transfer between machine, au-

tomation devices and control units. This does not yet

apply to the data format itself, but to the general way

of accessing the source of the data generation. For the

different hierarchy levels and various systems, these

protocols and representational forms of data are quite

diverse in nature. However, due to standards of pro-

tocols there exists a common ground for communica-

tion exchanges. On the lowest ﬁeld level, where pro-

cess controlling is important and thus time-sensitive,

the demand is different from the upper levels, provid-

ing reporting of Key Performance Indicators (KPI).

3.3 Storage and Access

Data storage takes care of a persistence of all acquired

data, but also to store possible intermediate results

and (pre-)processed information. The storage system

itself has to cope with the high volume of data and

still deliver performance when someone, such as a

data analyst, requests speciﬁc data. In combination

with responsibilities connected to data protection as

well as to data governance, which are both summa-

rized under the paradigms of data security, the data

storage represents the central point of handing out ac-

cess to the gathered data after initial persistence.

3.4 Processing and Analysis

IBD processing and analysis represent the steps, in

which the actual value creation of the entire IBD

pipeline takes place. Hereby, the raw data is trans-

formed to valuable information. For this purpose, the

raw data is transformed, ﬁltered and/or modeled with

a varying degree of automation and manual efforts.

The task is to ﬁlter out uninteresting data (noise) to

obtain only valuable information. IBD processing and

analysis can be classiﬁed according to response time

into real-time and off-line analysis.

3.5 Information Presentation and

Interfaces

Providing the results of the previous analysis steps to

a user is an essential task, as it inﬂuences the pro-

cess of transforming data to actions the most. The

process of visualizing information to some higher in-

stance can target humans but also machines. Human

Machine Interfaces (HMI) includes a variety of dif-

ferent user interfaces (xUI) with the most common

are graphical (visualization) and email/signal alerting.

In comparison, machine interfacing usually involves

the information transmission through digital protocols

read by the receiving machine. The task of presenta-

tion and HMI is to provide the right information at the

right time in the right format and quantity.

4 INDUSTRIAL BIG DATA

TECHNIQUES AND METHODS

The described tasks of each step in the Industrial Big

Data value chain can be fulﬁlled using a variety of dif-

ferent methods, techniques and applications. The fol-

lowing section, therefore, gives an overview of avail-

able methods, techniques and applications, shows ba-

sic information about them and provides decision sup-

port for the selection process.

4.1 Sources and Generation

Industrial Big Data Sources

Manufacturing data and their information backbones

are usually structured according to the Purdue Model

(Williams, 1994) or for German based companies

more familiar the automation pyramid. This industry

Industrial Big Data: From Data to Information to Actions

141

adopted reference model shows the interconnections

and interdependencies of all main components of a

manufacturing control system form the bottom to the

top. Accordingly, data from manufacturing systems

are available at different levels, in different density or

granularity and in different quality.

Figure 1: Purdue Model (PurdueModelCisco, 2018).

Data on the enterprise level is well structured, has

a high data quality, and the number of different data

storages is relatively small, as different types of sys-

tems exist only one time per plant/site. Accordingly,

the data storages are very large, since they also con-

tain a predeﬁned history of data.

In the manufacturing zone, the number of data

sources increases signiﬁcantly, due to a higher variety

of systems and a division of areas into subareas, each

with identical but independent control systems. Shop

control systems (site manufacturing operations and

control), e.g. receive data from programmable logic

controllers (PLC) and/or workstations as part of an

Industrial PC (supervisory control & basic control),

which in turn receive data from production equip-

ment (process). As each system controls only a lim-

ited number of subordinate controllers, many identi-

cal systems are in use covering separated areas.

The break-up of the Purdue structure in the course

of IoT developments results in increasing direct sys-

tem connections on device level. Thus, shop ﬂoor de-

vices become a direct data source for IBD.

The variety of different formats, unlabeled data

and inconsistent naming of the data often negatively

inﬂuence the data quality in the manufacturing zone.

In addition, sensor data contains the risk of faulty data

values being sent – e.g. assets or a sensors fail.

Manufacturing data is generated (if applicable)

based on (ﬁxed) cycle times (log-ﬁles) and sensor

transmission frequencies. Therefore, data generation

frequencies of milliseconds ms are a common phe-

nomenon. Accordingly, the velocity of data genera-

tion, especially on the device level increases with the

number of devices and processes exponentially. Ad-

ditionally, data storage in production equipment itself

is usually limited and often consists only of a circular

buffer, so that data is stored transient according to the

FIFO principal (First-In-First-Out). These points of

data form a continuous ﬂow of new and updated data

points, a data stream to be handled by IBD systems.

4.2 Acquisition and Preparation

OPC UA. OPC Uniﬁed Architecture (Rinaldi,

2013) is a general machine-to-machine protocol and

that also incorporates full description of data us-

ing a meta-modeling pattern for information models.

Thereby, OPC UA models annotate the data in terms

of its format, valid value ranges, semantics and con-

text information. OPC UA allows modeling the com-

plete production process and thereby preparing data

for an exploration of possible correlations. The OPC

UA standard deﬁnes base objects, i.e. blue prints, e.g.

for sensors, devices or machines and provide these

models through an OPC UA Server. Vendors are ac-

cordingly able to extend these basic information mod-

els with their domain speciﬁc object and type deﬁ-

nition. A client accordingly connects to the server,

which typically resides in each control system and

contains its virtual representation. The client is able to

browse and read all available nodes in that server. The

client can either access data directly or register a trig-

ger to let the server notify it on changes, either peri-

odically or on deﬁned value changes and other events.

MQTT

Message Queue Telemetry Transport (MQTT) is an

open-source message protocol originated in the smart

home automation, but is due to its simplicity also

widely adopted into the industrial sector for machine

communication. It is lightweight and enables publish

subscribe to obtain and collect speciﬁc information

through a ﬂexible hierarchical structure, represented

by topics. The hierarchical topics are created inside

the queue tree of a central broker instance. The infor-

mation modeling of the production process depends

on these topics, describing e.g. the location of a de-

vice as well as on the message structure of the MQTT

message object. The message object is represented

by JavaScript Object Notation (JSON) and deﬁnes the

payload of the message, which can be freely deﬁned

according to application or use-case speciﬁc require-

ments. Publishing clients are able deﬁne queues with-

out former registration, thus there is no controlling in-

IoTBDS 2019 - 4th International Conference on Internet of Things, Big Data and Security

142

stance at the server. A client that wishes to receive

data has to know these speciﬁc queues beforehand

to successfully receive data based on subscriptions to

these topics. To organize information exchange about

available topics, queues and the data representation,

MQTT requires additional managing of used naming

conventions and message structures.

IO-Link

Another communication standard for sensors deﬁned

under the norm IEC 61131-9 (DIN EN 61131) is IO-

Link or as deﬁned in the standard SDCI. The stan-

dard speciﬁes communication between one master

and multiple slave devices over a PLC. It bridges the

link between an actual ﬁeld bus of sensors and the In-

dustrial Ethernet, where Industrial PCs reside.

(Proprietary) Message Queue

Other proprietary formats may exists which rely on

a message queue system and deﬁne its own message

format standard, often also called telegrams. Widely

used are XML based telegrams, which can be also ex-

changed using OPC UA. Different vendors favorite

this traditional mechanism to empower data exchange

from their machines, but often rely on their own pro-

prietary format, which cannot be handled directly,

see 4.2.2 Industrial Big Data Transformation.

RDBMS

Another class of source systems that provide al-

ready enriched data are reporting systems of KPIs.

They mostly rely on relational database management

systems (RDBMS) controllable and reachable us-

ing Structured Query Language (SQL). SQL is com-

monly used across all database system and allows for

querying speciﬁc information. Inside the database,

the structured data remains in large horizontal tables

containing raw values with additional new calculated

or combined information including context data.

4.2.1 Industrial Big Data Extraction

The data extraction depends on the protocol and to the

sort of data that is to be extracted. From the Business

Intelligence (BI) point-of-view, so-called facts and di-

mension are differentiated in this context. Facts repre-

sent an event in time that refers entities, and a dimen-

sion is mainly static information that deﬁnes the entity

itself. On the shop ﬂoor system with machine data,

the majority is fact data, which contain e.g. current

sensor values. One value represents one fact. To ex-

tract facts from a machine equipped with a protocol,

a subscription of the client to receive the desired in-

formation is required However, to further understand

and contents of the acquired data and ﬁnally extract

its value, meta-data and about sensors and machines

are needed. This context information is represented

by dimensions. Extracting dimension data can be a

complex task as context information is often not made

explicit in the source system and therefore only exist

implicitly due to some name or an underlying address

space. To acquire this context information domain-

speciﬁc and experience-based knowledge from hu-

man beings is typically needed. RDBMS systems

commonly provide different mechanism to extract

data using SQL. It is possible to poll for new data us-

ing a unique identiﬁer, to deﬁne handover tables, hat

only includes changed/new data points and it is possi-

ble to make use of full-ﬂedged Change-Data-Capture

systems, which completely mirror the whole database

to identify changes.

4.2.2 Industrial Big Data Transformation

Data transformation touches the modiﬁcation of the

data format itself, thus its representation, but also the

cleaning and validation of invalid information, if ap-

plicable. Especially, when the data format is pro-

prietary and does not follow open standards for data

exchange, the actual information needs to be trans-

formed into a readable format. Furthermore, annota-

tion and meta-data enrichment are also a part of data

transformation, where all available information has to

be added explicitly to the data.

4.2.3 Industrial Big Data Load/Integration

The load process partially depends on the target sys-

tem, which has to retrieve or possibly store the data

at hand. If the system is not able to store the infor-

mation as is, which is typically the case, integration

steps are required in terms of a translation of the data

format into the target systems, i.e. the data storages

requirements. This remodeling of data format is often

associated with a manipulation of the data schema,

as it is part of an Extract-Transform-Load (ETL) pro-

cess. An ETL pipeline deﬁnes this behavior for a Data

Warehouse system, which integrates the data into a

data mart speciﬁc to the domain context of the data.

4.3 Storage and Access

4.3.1 Industrial Big Data Storage

Big Data storage systems have to be able to scale up

with massive amounts of data and represent them ac-

cordingly. There are two major principles to access

Industrial Big Data: From Data to Information to Actions

143

data. These principles also affect the storage structure

as well as the complexity of the integration process:

schema-on-write and schema-on-read. As the name

indicates the distinction between these two concepts

lies in the way data schemas are generated respec-

tively accessed. While the schema-on-write pattern

requires a ﬁxed data schema in which the data is trans-

ferred, the schema-on-read methodology redeﬁnes a

new schema every time required information is read

with a speciﬁc purpose. The schema-on-read pattern

explicitly shifts the homogenization of multiple dif-

ferent data points, with a different schema, to step of

information retrieval for analytics.

One example for the schema-on-write principal is

a traditional (enterprise) Data Warehouse, which uses

a star-schema to represent fact and dimension infor-

mation as their respective tables. In general, it is also

a RDBMS applied with special notation and mecha-

nism for larger data sets, such as hot and cold storage.

These different storages inside the warehouse are rep-

resented by data marts.

The schema-on-read principle is commonly used

in the Data Lake (Ignacio et al., 2015) architecture.

The idea is to store data as-is and thereby with the

schema, it was generated in. The user that accesses

the data is required to deﬁne a schema ”on-read” that

ﬁts to all of the desired data on reading. Apache

Hadoop (Shvachko et al., 2010) is an open-source dis-

tributed ﬁle storage system on commodity hardware

that enables this behavior for big data. Its advantage

is the utilization of existing hardware not only storage

wise, but with the added Map-Reduce principle also

for the computing resources.

4.3.2 Exploration

Getting around the vast amount of data points and

ﬁnding exactly the relevant pieces is one major goal

of data exploration. Naive approach would demand

to look at each individual point, but indexes allow di-

rectly accessing a speciﬁc element, based on prop-

erties for which the index has to be created ﬁrst.

However, typical database indexes are only possible

in a ﬁxed data schema, such as it would exist in a

Data warehouse system. In the ﬂexible and dynamic

changing data lake environment, this is not possible in

the classical way, as a schema could change according

to each data point, not yielding a common ground. In

this case, a reading schema that ﬁts all data points has

to be determined in advance and applied to an auto-

matic query mechanism that abstracts this application.

This can be done automatically with tools like Spark,

more particular SparkQL, or with Apache Hive. Both

enable the use of a familiar Querying Language such

as SQL to work with the data in a set schema.

4.4 Industrial Big Data Processing and

Analysis

4.4.1 Visual Analytics

One of the most powerful and widely used methods

for gaining knowledge from data is visual analysis.

By plotting data, it is often already possible to iden-

tify anomalies in data sets. Since production data is

often generated by cyclically operating systems, the

data values should generally be stable along the pro-

duction of identical products. Accordingly, anomalies

often indicate quality or machine condition problems.

In addition to pure visual analytics, thresholds can be

used to automate anomaly detection.

4.4.2 Statistical Analysis

Statistical analysis of manufacturing data (the ap-

plication of statistical theory) has been part of the

standard repertoire in manufacturing since the ad-

vent of Six Sigma and Lean Manufacturing. These

techniques allow to process and analyze larger-scaled

data, as it is possible with visual analytics. Usu-

ally statistical methods condense datasets to key ﬁg-

ures such as process stability (CPK) or equipment

utilization (OEE), but they are also applied to struc-

ture and reduce complexity in the event of prob-

lems/optimization. Percentage, box-plot or distribu-

tion (average, median and mode) belong to this group.

4.4.3 Machine Learning

Machine learning (ML) is the next stage of data

analysis. ML allows to discover hidden patterns in

huge amounts of data independently by an algorithms.

Three main categories of ML are distinguished:

Supervised Learning (SL) targets an approxima-

tion of mapping function to predict outputs based on

input variables. Techniques are regression, decision

trees or artiﬁcial neural networks. As in manufactur-

ing use cases output and input variables are known,

this category of algorithms appears to be applicable.

Unsupervised Learning (UL) does not require out-

put variables. Therefore, unsupervised learning does

not target to predict certain behaviors but to detect

hidden patterns/clusters in unlabeled data sets. Princi-

pal component analysis (PCA) and Clustering are the

most common algorithm categories of this group.

Reinforcement Learning (RL) algorithms are the

most recent members of machine learning procedures.

They target to automatically determine ideal behav-

iors in a speciﬁc situation and context. These algo-

rithms gain knowledge about the subject behavior by

doing test and memorizing the results of experiments.

IoTBDS 2019 - 4th International Conference on Internet of Things, Big Data and Security

144

4.5 Information Presentation and

Interfaces

Human information interfaces have been designed in

various ways and combinations of techniques. In

the context of manufacturing, the major techniques

to provide data to an information consumer is visual-

ization (e.g. monitoring and reporting). Information

monitoring stands for the observation of predeﬁned

metrics/ﬁgures (partially deduce from raw data by

analytics) and can be seen in manufacturing control

rooms and/or by making use of Supervisory Control

and Data Acquisition (SCADA) systems. Information

reporting provides information in a higher condensed

form than high-granular information monitoring. KPI

reporting for example plays a major role for the man-

agement sector of an organization. Usually, reporting

is used to manage processes and to report information

for operational control. For both techniques, several

key aspects need to be considered: self-service/ad-

hoc or pre-deﬁnition, frequently updated information,

point of time or temporarily summarized information

and distribution online or via intranet.

5 CONCLUSION

In this paper, we presented an overview of the chal-

lenges and issues that emerge when dealing with In-

dustrial Big Data that arises from the Industrial In-

ternet of Things. We examined different solutions

and reference models and summarized the common

ground as well as vital approaches of the stages to-

wards the Industrial data ingestion goal. The major

key steps of a data pipeline have not changed from

the classical Big Data approach, but new Industrial

speciﬁc challenges occur. There is yet still no com-

plete universal architecture for the Industrial Internet

of Things, which is capable of fully addressing all is-

sues that arise when digitizing a factory and aim to

realize a full digital transformation. The architecture

references only the general concept of how to deal

with data on a high level, but lack important steps,

such as meta-information about the data itself or on

the other hand, the proposed solution is ﬁtted tightly

for a speciﬁc use case, which cannot be easily adopted

in a general manner. We see requirements for fur-

ther research in the area of data governance, espe-

cially in the parts of data lineage and data protection.

In particular, laws such as the General Data Protec-

tion Regulation (GDPR) of the European Commis-

sion are taken into account in industrial applications

when dealing with customer-speciﬁc products. Con-

cerning machine maintenance and the idea of predic-

tive maintenance, the question of data value is impor-

tant and needs to be strongly addressed by the man-

ufacturing sector in the close future. Another impor-

tant topic addresses business models that might arise

based on discussions around data ownership. An in-

teresting question might be, if a machine provider au-

tomatically acquires machine data of a customer free

of charge in exchange for successful predictions for

e.g. predictive maintenance.

ACKNOWLEDGEMENTS

The authors would like to thank the German Re-

search Foundation DFG for the kind support within

the Cluster of Excellence Internet of Production (IoP).

Project-ID: 390621612

REFERENCES

Bertino, E., Bernstein, P., Agrawal, D., Davidson, S., Dayal,

U., Franklin, M., Gehrke, J., Haas, L., Halevy, A.,

Han, J., and Jadadish, H. (2011). Challenges and op-

portunities with big data. Whitepaper.

Brecher, C. and

Ozdemir, D., editors (2017). Integra-

tive Production Technology: Theory and Applications.

Springer International Publishing.

CEN-CENELEC-ETSI Smart Grid Coordination Group

(2012). Smart grid reference architecture. Whitepa-

per.

Chen, C. P. and Zhang, C.-Y. (2014). Data-intensive appli-

cations, challenges, techniques and technologies: A

survey on big data. Information Sciences, 275:314–

347.

Chen, M., Mao, S., Zhang, Y., and Leung, V. C. (2014).

Big data: related technologies, challenges and future

prospects.

Demchenko, Y., Ngo, C., and Membrey, P. (2013). Archi-

tecture framework and components for the big data

ecosystem. Journal of System and Network Engineer-

ing, pages 1–31.

Ebbert, J. (2018). Deﬁne it – what is big data?

General electric (2012). The rise of industrial big data.

Whitepaper.

Gluchowski, P., Gabriel, R., and Dittmar, C. (2007). Man-

agement Support Systeme und Business Intelligence:

Computergestuetzte Informationssysteme fuer Fach-

und Fuehrungskraefte. Springer-Verlag.

Hai, R., Geisler, S., and Quix, C. (2016). Constance:

An intelligent data lake system. In Proceedings of

the 2016 International Conference on Management of

Data, pages 2097–2100. ACM.

Hammer, M., Hippe, M., Schmitz, C., Sellschopp, R., and

Somers, K. (2016). The dirty little secret about digi-

tally transforming operations.

Industrial Big Data: From Data to Information to Actions

145

Ignacio, T., Peter, S., Mary, R., and John E., C. (2015). Data

wrangling: The challenging journey from the wild to

the lake. CIDR.

Industrial Internet Consortium (2017). The industrial in-

ternet of things volum t3: Analytics framework.

Whitepaper.

Intelligence, M. (2009). Business intelligence in manufac-

turing. Whitepaper.

Jeschke, S., Brecher, C., Meisen, T.,

Ozdemir, D., and Es-

chert, T. (2017). Industrial internet of things and cy-

ber manufacturing systems. In Industrial Internet of

Things, pages 3–19. Springer.

Kirmse, A., Kraus, V., Hoffmann, M., and Meisen, T.

(2018). An Architecture for Efﬁcient Integration and

Harmonization of Heterogeneous, Distributed Data

Sources Enabling Big Data Analytics. In Proceed-

ings of the 20th International Conference on Enter-

prise Information Systems : March 21-24, 2018, in

Funchal, Madeira, Portugal. - Volume 1, pages 175–

182. 20th International Conference on Enterprise In-

formation Systems, Funchal (Portugal), 21 Mar 2018

- 24 Mar 2018, SCITEPRESS - Science and Technol-

ogy Publications.

Kuschicke, F., Thiele, T., Meisen, T., and Jeschke, S.

(2017). A data-based method for industrial big data

project prioritization. In Proceedings of the Interna-

tional Conference on Big Data and Internet of Thing,

pages 6–10. ACM.

Laney, D. (2011). 3d data management: Controlling data

volume, velocity, and variety, application delivery

strategies. Whitepaper.

Lee, J., Ardakani, H. D., Yang, S., and Bagheri, B. (2015).

Industrial big data analytics and cyber-physical sys-

tems for future maintenance & service innovation.

Procedia CIRP, 38:3–7.

McKinsey (2017). Is manufacturing ’cool’ again. Whitepa-

per.

Moghaddam, M., Cadavid, M. N., Kenley, C. R., and Desh-

mukh, A. V. (2018). Reference architectures for smart

manufacturing: A critical review. Journal of Manu-

facturing Systems, 49:215 – 225.

Mourtzis, D., Vlachou, E., and Milas, N. (2016). Industrial

big data as a result of iot adoption in manufacturing.

Procedia CIRP, 55:290–295.

O’Donovan, P., Leahy, K., Bruton, K., and O’Sullivan,

D. (2015). An industrial big data pipeline for data-

driven analytics maintenance applications in large-

scale smart manufacturing facilities. Journal of Big

Data, 2(1):25.

akk

onen, P. and Pakkala, D. (2015). Reference architec-

ture and classiﬁcation of technologies, products and

services for big data systems. Big Data Research,

2(4):166–186.

PurdueModelCisco (2018). Purdue Model/CPwE Logical

Framework.

Rinaldi, J. (2013). OPC UA - the basics: An OPC UA

overview for those who are not networking gurus.

Amazon, Great Britain.

Sadiku, M. N., Wang, Y., Cui, S., and Musa, S. M. (2017).

Industrial internet of things. IJASRE, 3.

Shvachko, K., Kuang, H., Radia, S., and Chansler, R.

(2010). The hadoop distributed ﬁle system. In 2010

IEEE 26th Symposium on Mass Storage Systems and

Technologies (MSST), pages 1–10.

Society, F. (2017). Reference architecture model for the

industrial data space. Whitepaper.

US Dept. of Commerce blog (2014). The internet’s next big

idea: Connecting people information and things.

VDI/VDE-Gesellschaft (2013). Cyber-physical systems:

Chancen und nutzen. Whitepaper.

VDI/VDE Society Measurement and Automatic Control

(GMA) (2015). Reference architecture model indus-

trie 4.0 (RAMI4.0). Whitepaper.

Wilder-James, E. (2012). What is big data? an introduction

to the big data landscape.

Williams, T. J. (1994). The purdue enterprise reference ar-

chitecture. Computers in industry, 24(2-3):141–158.

Yin, S. and Kaynak, O. (2015). Big data for modern indus-

try: challenges and trends [point of view]. Proceed-

ings of the IEEE, 103(2):143–146.

IoTBDS 2019 - 4th International Conference on Internet of Things, Big Data and Security

146