MEDIS: Analysis Methodology for Data with Multiple Complexities
Raluca Portase (https://orcid.org/0000-0002-8985-4728), Ramona Tolas (https://orcid.org/0000-0002-6236-1114) and Rodica Potolea (https://orcid.org/0000-0002-7051-3691)
Technical University of Cluj-Napoca, Cluj, Romania
Keywords: Data Analysis, Big Data, Preprocessing Methodology, Knowledge Extraction, Real Industrial Data, Metadata Extraction.
Abstract: Hidden and unexpected value can be found in the vast amounts of data generated by IoT devices and industrial sensors. Extracting this knowledge can help with more complex tasks such as predictive maintenance or remaining useful life prediction. Manually inspecting the data is a slow, expensive, and highly subjective task, which has made automated solutions very popular. However, finding the value inside Big Data is a difficult task with many complexities. We present a general preprocessing methodology (MEDIS: MEthodology for preprocessing Data with multiple complexitIeS) consisting of a set of techniques and approaches that address such complexities.
1 INTRODUCTION
The industry is now approaching the concept of Industry 4.0, considered the fourth industrial revolution (Lu, 2017). The concept refers to the intelligent networking of machines and processes in the industry with the aid of information and communication technology, and it is closely related to IoT (Internet of Things).
In order to integrate various systems into the big picture of Industry 4.0, companies equipped their modern devices with various sensors which monitor their behavior (Mourtzis et al., 2016) to achieve a high production rate with operational costs as small as possible (Khan et al., 2020). Data collected by these sensors have intrinsic hidden value, and solutions for exploring and exploiting it need to be developed to further expand it in the context of IoT and predictive maintenance. Once integrated into a predictive maintenance flow, benefits like avoiding unnecessary equipment replacement, just-in-time maintenance, saving data storage costs, improvement of process safety, and increased efficiency can be achieved (Hashemian, 2010), (Edwards et al., 1998).
Another significant reason for processing the signals of the appliance sensors is to detect anomalies in their communication flow and predict damage to the sensors themselves. Their failure may result in severe damage to critical processes (Wang et al., 2001).
Visual inspection is the base method and the oldest approach for predicting that something might break in the future (Hashemian, 2010). Unfortunately, it cannot be applied given various data-related complexities (the difficulties caused by the fact that data is unstructured and the volume of the data is not suitable for manual inspection). Common challenges are the data characteristics, together with heterogeneity and lack of structure and quality (noise or incompleteness). Other complexities (such as data dimensionality, scalability, requirements for real-time/due-time processing, visualization and interpretability, global vs. local optimal feature extraction) were also identified in similar works (Bolón-Canedo et al., 2015), (Storcheus et al., 2015), (Cirillo and Valencia, 2019).
The purpose of this paper is to propose a general preprocessing methodology (MEDIS) for Big Data analysis. We propose several strategies to handle real data complexities usually identified in unstructured data. Despite the dynamic nature of the field, this topic has been poorly explored. To the best of our knowledge, the studies in this domain are very general, following the entire process of extracting knowledge from Big Data and not tackling the specific challenges of the preprocessing phase. Compared to previous work, our methodology has a smaller granularity, for which we propose an analysis of specificity and usage.
The rest of the paper is organized as follows: Section 2 presents a brief overview of selected related work on data analysis and concrete application examples. Section 3 motivates the need for a general methodology for data preprocessing. In Section 4, the decomposition of our proposed methodology is presented. Strategies for handling specific data complexities are illustrated in Section 5, while in Section 6 we present a small case study on a snapshot of real industrial data. Finally, Section 7 concludes the paper.
2 RELATED WORK
Companies have equipped their appliances with many sensors that monitor the different properties of the appliance. These sensors produce a considerable amount of data that needs to be processed to benefit from the extraction of meaningful information (Ratner, 2017). Extracting this information significantly impacts next-level processes like predictive maintenance, as it is used as input for complex systems. To be compliant with the rest of the systems, the information extracted from the recordings of the monitoring sources (for example, sensors) must be reliable and as informative as possible. Identifying any source of failure as soon as possible can prevent the upper-level system from giving incorrect predictions.
Even if the raw data is initially recorded for a specific purpose, there is more value hidden in large data, and it can be used for purposes that go beyond the one it was initially stored for. Therefore, we propose building even more onto the existing data to maximize the exploitation of its value and to open new research directions with direct applicability in various economic fields, especially industry. This is done by creating a methodology for filtering, visualizing, and correlating the signals of various sensors.
A concrete example of the impact of our work is in the predictive and preventive maintenance area (Sipos et al., 2014), (Pech et al., 2021). Predictive maintenance is domain-specific, with limited (if any) ability to apply beyond the specific domain or even to different devices. In contrast to this, MEDIS can be easily projected onto various fields. To the best of our knowledge, existing research uses data already processed for predictive maintenance tasks. In real-world applications, the data from sensors is big unstructured data that needs several processing steps before any predictive analysis can be done. In this context, our work can be a driving vehicle towards predictive maintenance by incorporating domain-independent automated preprocessing steps. This can help multiple industries reduce maintenance costs and improve system reliability, which is directly impacted by component or complete system malfunctions, or even downtime.
Processing the massive amount of recorded data is both a big opportunity and a challenge for the industry, and it has attracted much attention from researchers and companies around the world. However, at the moment, there is no standard way of approaching Big Data projects for real industrial data. Several works tackle this topic, works that, according to the authors of (Hashemian, 2010), are divided into three categories: methods that use signals from existing process sensors, solutions that use signals from test sensors, and solutions based on injecting test signals into the equipment to measure its response.
The last category has the disadvantage of developing the solution using test data (which is not guaranteed to have the same behavior as real-world data), while the second category attaches test sensors to only a part of the appliances and uses the collected real data and scenarios further on. Having exposed the identified drawbacks of the other categories, this work belongs to the first category, exploiting the value of data collected from in-place running devices.
As emphasized by the authors of other works on this topic (Smith et al., 2006), a general model for understanding and exploring the data would be a powerful tool for both teaching and learning due to its innovative potential. Due to the massive growth of the data mining area, some efforts have been made that seek the establishment of standards such as KDD (Fayyad et al., 1996), SEMMA (Miner), and CRISP-DM (Wirth and Hipp, 2000). All of these methodologies are composed of several iterative steps covering the entire process, from domain understanding to modeling, assessment, and deployment of the data. The existing methodologies are generic and do not focus on preprocessing data and handling specific big data complexities that might arise in real-world datasets.
3 VARIETY OF DATA COMPLEXITIES: THE NEED FOR A GENERAL METHODOLOGY FOR PREPROCESSING
Real big data is characterized by several complexities known in the literature as the Seven V's of Big Data: volume, visualization, velocity, variety, veracity, value, and variability (Uddin et al., 2014).
In the context of big industrial data, the large volume is given by the large amount of recorded data from all devices and different types of sensors. The frequency of the received data differs depending on the type of the recorded property. In the case of compound records, all components must be received to compute the compound value. The variety of values and the existence of simple and composite data suggest a heterogeneous analysis approach. Some
special steps must be defined and applied to such
cases.
The performance of the processing steps is influenced by the quality and quantity of the data set used. Moreover, the performance is also affected by the frequency of noisy values, incompletely defined data, and incorrectly sent or duplicated records. The strategy should soundly discriminate between noisy data (incorrectly registered or stored) and outliers in the cleaning process. Warnings of a malfunction should be identified, preserved, and processed to prepare the ground for a predictive and preventive maintenance flow.
This paper is strongly focused on the value dimension of data: identification of useful or potentially valuable information from the massive amount of analyzed data. As far as we know, at the moment, there is no standard way to approach Big Data projects from industrial applications. For this reason, we created a methodology to fill this gap that can be used in future projects.
Given the identified complexities, we developed MEDIS, a general preprocessing methodology containing strategies for jointly tackling each complexity and aspect. Compared to the other existing methodologies (Fayyad et al., 1996), (Miner), (Wirth and Hipp, 2000), our approach has a smaller granularity and bigger specificity. In the next section, we present our proposed MEDIS methodology alongside a description of its steps.
4 METHODOLOGY FOR DATA ANALYSIS
We propose a methodology decomposition with a systematic strategy illustrated in Figure 1. Since the focus of our work is the preprocessing part, we are going to represent the source layer and processing layer as black boxes.
4.1 Data Representation as Time Series
The data from industrial data sets is mostly a collection of observations collected sequentially in time, usually referred to as time series. These observations can be collected at equally spaced time points. In this case, we use the notation f_t, where we index the set of observations by t, the time at which each observation was taken. We refer to these types of signals as periodic signals, and we abstract them as shown in Equation 1.

f_t, t = (..., -1, 0, 1, 2, ...)   (1)
If the observations were not taken at equally spaced points, then we use the notation f_{t_i}, with i = (1, 2, ...). In this case, (t_i - t_{i-1}) is not necessarily equal to one. We refer to these types of signals as non-periodic signals, and their abstraction can be seen in Equation 2.

f_{t_i}, i = (1, 2, ...)   (2)
The information about the type of signals present
in the processed data set might not be available. In
this case, an algorithm for finding the data periodicity
should be applied to infer this type of metadata about
the processed dataset.
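As a minimal illustration of this representation, the following sketch (in Python with pandas; the language, the function name, and the tolerance parameter are our own assumptions, not part of MEDIS) classifies a signal as periodic (f_t) or non-periodic (f_{t_i}) by inspecting the spacing of its timestamps.

```python
import pandas as pd

def classify_signal(timestamps, tolerance=0.01):
    """Return 'periodic' if the observations are (almost) equally spaced,
    i.e. the signal can be abstracted as f_t, and 'non-periodic' (f_{t_i}) otherwise."""
    ts = pd.to_datetime(pd.Series(timestamps)).sort_values()
    deltas = ts.diff().dropna().dt.total_seconds()
    if deltas.empty:
        return "unknown"
    # Relative spread of the sampling intervals around their median.
    spread = (deltas.max() - deltas.min()) / deltas.median()
    return "periodic" if spread <= tolerance else "non-periodic"

print(classify_signal(["2021-01-01 00:00", "2021-01-01 00:10", "2021-01-01 00:20"]))
# -> periodic
```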
4.2 Business Understanding
Even though understanding the business and the process of data collection is specific to each dataset, we identified two main steps that must be done: process understanding and problem statement. Given the specificity of this phase and the generality of our processing methodology, it remains a subject for future work.
4.3 Descriptive and Exploratory
Analysis
4.3.1 Syntactic Data Exploration
A preliminary syntactic analysis should be made to become familiar with the processed data set and understand its representation. This step helps tackle the volume complexity of the data because the output of the step is the filtered data set: all the records that are not valid from a syntactical point of view are eliminated.
A syntactic analysis should include the following
steps:
1. Noise Analysis and Removal. For data recorded
from sensors, noise is expected from both trans-
mission and recording. In a high-dimensional
classification problem, conventional classification
rules using all features perform no better than
random guesses due to noise accumulation (Hall
et al., 2008). Thus, noise analysis and removal
are crucial steps in cleaning the data.
2. Removal of Duplicates. Duplicate entries represent the same signal sent at the same time with the same values.
There are several reasons for the appearance of duplicate data. Duplicates might arise due to an error in the recording, transmission, or receiving system. After analyzing and understanding why this data arises, cleaning duplicate data has to be done as part of cleaning the initial dataset.
Figure 1: General processing model for data analysis.
3. Eliminate Syntactically Incorrect Data.
For the statistical accuracy of any processing technique, dimension reduction and variable selection play pivotal roles in analyzing high-dimensional data. The work (Chira et al., 2020) presents a data set processing technique where a significant percentage of the data was dropped using syntactic data processing.
Using the data representation introduced in Section 4.1, a syntactic data exploration on a signal f_t will eliminate those values of t for which f_t has an invalid value. Equation 3 presents the signal under processing, f_t, before syntactic processing, where the value of the signal at timestamp t_3 is invalid. After syntactic processing, the signal f_t has fewer values because the value transmitted at timestamp t_3 was eliminated, as can be seen in Equation 4.

f_t, t = (t_1, t_2, t_3, t_4, ...)   (3)

f_t, t = (t_1, t_2, t_4, ...)   (4)
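A minimal sketch of these syntactic steps is given below, assuming the records have been loaded into a pandas DataFrame; the column names (device_id, signal, timestamp, value) and the physical range used as a simple noise filter are illustrative assumptions, not prescribed by the methodology.

```python
import pandas as pd

def syntactic_exploration(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the records that are valid from a syntactic point of view."""
    # SyExpl.DR: drop entries of the same signal sent at the same time with the same value.
    df = df.drop_duplicates(subset=["device_id", "signal", "timestamp", "value"])
    # SyExpl.SIV: drop records whose value cannot be parsed as a number.
    df = df.assign(value=pd.to_numeric(df["value"], errors="coerce")).dropna(subset=["value"])
    # SyExpl.NAR: a simple noise filter based on a hypothetical plausible range.
    return df[df["value"].between(-1e6, 1e6)]
```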
4.3.2 Select Relevant Features Describing the
Signals
A set of features describes each signal from the data set. A manual or automated analysis of these features and their meaning should be made. The output of the step is a data set reduced in its horizontal dimension. This step is essential in tackling the volume complexity because the resulting data set will occupy less memory space and, at the same time, require less processing time.
The value of the processed signal f_t can be expressed as a simple value or as a set of features. Most of the time, when we process real industrial data, we find ourselves in the second situation. In that case, the processed signals f_{t_i} are composed of a set of features, as can be seen in Equation 5.
Some features might not be relevant in the context of the current problem statement (feature_3 in the case of the signal exposed in Equation 5). Some of them can also be redundant, bringing no new knowledge. Those features should be eliminated, and a smaller set will characterize the signal f_t better and faster. Identifying discriminative features is also crucial for reducing computational time. Equation 6 presents the signal f_{t_i} introduced in Equation 5 after the feature selection step was applied.

f_{t_i} = [feature_1, feature_2, feature_3, ..., feature_n]   (5)

f_{t_i} = [feature_1, feature_2, ..., feature_n]   (6)
The topic of feature selection has been in the attention of researchers for a long time, and various methods have been identified. Among the most popular strategies are filter, wrapper, and embedded methods. The filter methods evaluate each feature individually by using its general statistical properties. The properties can be correlation-based, similar to the work (Lei Yu leiyu, 2003). However, methods from this category have the disadvantage of not interacting with the classifier algorithm (V. Bolón-Canedo, 2014).
The wrapper approaches use learning algorithms to detect the most relevant set of features and are generally considered computationally expensive (G. Chandrashekar, 2014). Approaches from these categories include forward selection (H. Liu, 2012), backward elimination (Shilaskar and Ghatol, 2013), and recursive feature elimination (Ke Yana, 2015).
A review and comparison of the most recent hybrid approaches in a microarray data analysis context is presented in the literature (NADA ALMUGREN, 2019).
Embedded methods combine aspects of filter and wrapper methods (Lu, 2019), (Haoyue Liu, 2019).
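As an illustration of the filter category discussed above, the sketch below drops features that are almost perfectly correlated with an already retained feature; the 0.95 threshold and the assumption that the features are numeric columns of a pandas DataFrame are ours, and wrapper or embedded methods from the cited works can be substituted.

```python
import pandas as pd

def filter_redundant_features(features: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Correlation-based filter: keep a feature only if it is not highly
    correlated with any feature that has already been kept."""
    corr = features.corr().abs()
    kept = []
    for column in features.columns:
        if all(corr.loc[column, other] < threshold for other in kept):
            kept.append(column)
    return features[kept]
```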
4.3.3 Semantic Data Exploration and Filtering
The output of the current step is a data set that is semantically filtered. The operations proposed for this step are the following:
1. Transform the data, if needed, to represent it in the appropriate processing format. The data might not have a ready-to-process structure. In this case, the initial structure of the data should be used to obtain a form of data that is easy to process. Several operation types like composing or splitting might be needed in this respect.
2. Use the new structure of the data to filter the data further. Apply the specific processes to filter out invalid semantic entities.
3. Differentiate the anomalies from the outliers. An outlier is a legitimate data point that is far away from the mean or median of a distribution. An anomaly is an illegitimate data point generated by a different process than the one that generated the rest of the data (a minimal sketch of this distinction follows the list). There are situations where it is challenging to distinguish between outliers and anomalies, and the literature exposes multiple approaches to identifying anomalies in data. The authors of (Nedelcu et al., 2017) present a machine learning approach for detecting anomalies in an EEG data set.
4. After the anomalies are detected in the step above,
they should be excluded from the dataset since
they are noise from the perspective of future pro-
cessing.
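The sketch announced in item 3 is shown below; it separates the two notions with deliberately simple rules (a z-score cut for outliers, a physically possible range for anomalies). Both thresholds are illustrative assumptions; in practice, learned detectors such as the one in (Nedelcu et al., 2017) would replace them.

```python
import pandas as pd

def flag_outliers(values: pd.Series, k: float = 3.0) -> pd.Series:
    """Outliers: legitimate but extreme points, here farther than k standard
    deviations from the mean of the distribution."""
    z = (values - values.mean()) / values.std()
    return z.abs() > k

def flag_anomalies(values: pd.Series, low: float, high: float) -> pd.Series:
    """Anomalies: illegitimate points, here readings outside the range that the
    sensor can physically produce, i.e. generated by a different process."""
    return ~values.between(low, high)
```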
4.3.4 Signal Correlation: Eliminate Redundant
Signals
There might be different signals in a large data set that actually express the same information about the state of the observed process. Retaining both (or multiple) such signals is overhead from a memory point of view. Detecting signals that hold the same information can be done by correlating the signals from the data set. Once correlated, we can also identify transmission anomalies, such as situations when two signals are correlated overall but the correlation is absent for a short time interval; that could be a transmission error. Usually, data is not aligned, making a direct correlation not applicable. This issue can be tackled by interpolation, followed by resampling.
Interpolation is a method for smoothing data val-
ues by filling gaps of missing data with neighbors’
(averaged) values. In this case, interpolation is used
to generate values within time intervals in which no
signals are sent. Depending on the type of signal and
its behavior, multiple methods of interpolating signals
can be used, including linear, polynomial methods or
methods related to previous or future values (Powell
et al., 1981).
Resampling extracts values at an equal rate from the function generated in the interpolation step. We obtain a new signal with the same trend as the original signal. The equal sample rate makes it possible to correlate it with another type of signal for which we applied the same procedure. A common time frame must be found for all signals to take all received records into account.
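A minimal sketch of this interpolation, resampling, and correlation chain is given below for two signals indexed by timestamp; the 1-minute sampling rate and the use of time-based (linear) interpolation are assumptions, and any of the interpolation methods mentioned above could be used instead.

```python
import pandas as pd

def correlate_signals(a: pd.Series, b: pd.Series, rate: str = "1min") -> float:
    """Align two non-aligned signals on a common time frame by interpolation
    and resampling, then compute their correlation."""
    start = max(a.index.min(), b.index.min())
    end = min(a.index.max(), b.index.max())
    grid = pd.date_range(start, end, freq=rate)  # common time frame

    def align(signal: pd.Series) -> pd.Series:
        return (signal.reindex(signal.index.union(grid))  # add the grid points
                      .interpolate(method="time")         # fill the gaps between samples
                      .reindex(grid))                      # resample at an equal rate

    return align(a).corr(align(b))
```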
4.4 Advanced Data Engineering
4.4.1 Enhance with Metadata from Particular
Features of the Dataset
Visual inspection is a difficult task, given the difficulties caused by the fact that data is unstructured and the volume of the data is not suitable for manual inspection. Nevertheless, independently of how complex and large the data is, an initial visual inspection is valuable, as it allows the identification of common patterns in data. This is why visual inspection remains applicable even in the context of such data.
Particular metadata can be inferred and further utilized in the processing phase by using various visualization techniques. For example, if the data set consists of observations made on a home appliance with running cycles (oven, washing machine), detecting those cycles is an appropriate step in preprocessing the data because some aspects might be relevant only inside the running cycles. A general approach of preprocessing operations for exploring and extracting valuable knowledge from a large set of industrial data characterized by a multitude of complexities is presented in (Olariu et al., 2020), which exemplifies how identifying cycles and applying the cycle presence metadata can improve the results of the signal correlation process.
This processing step is formalized by Equations 7 and 8. Using the representation of the processed signal defined in Equation 7, the effect of metadata enhancement on the same signal can be observed in Equation 8, where a new feature is added to the processed signal.

f_{t_i} = [feature_1, feature_2, ..., feature_n]   (7)

f_{t_i} = [feature_1, feature_2, ..., feature_n, feature_{n+1}]   (8)
The number of added features can vary, and it is
particular to the characteristics of the data.
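A minimal sketch of such an enhancement is shown below for the running-cycle example: a cycle identifier is appended as the new feature_{n+1}. The power column, the on/off threshold, and the rising-edge rule are hypothetical simplifications of the cycle detection used in (Olariu et al., 2020).

```python
import pandas as pd

def add_cycle_metadata(df: pd.DataFrame, power_col: str = "power", threshold: float = 5.0) -> pd.DataFrame:
    """Metadata enhancement (MetaEnh): append a cycle identifier marking the
    records that fall inside a running cycle of the appliance."""
    running = df[power_col] > threshold
    # A new cycle starts whenever the appliance switches from idle to running.
    df["cycle_id"] = (running & ~running.shift(fill_value=False)).cumsum().where(running)
    return df
```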
4.4.2 Data Periodicity Identification
Identifying this data feature is essential because knowledge can be inferred from it. Suppose a signal is transmitted periodically, and at a particular time, the signal is not present at the expected period in the data set. In that case, we can deduce that the sensor or the data transmission system has a fault. An algorithm for data periodicity identification (Puech et al., 2019), (Elfeky et al., 2005) should be applied if there is no available information about the signal's periodicity. Identifying data periodicity is also vital in finding missing values and duplicate transmissions.
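A minimal sketch of these two related steps is given below; the median-based heuristic for the dominant period and the 90% regularity requirement are our assumptions and stand in for the dedicated algorithms of (Puech et al., 2019) and (Elfeky et al., 2005).

```python
import pandas as pd

def infer_period(timestamps, tolerance: float = 0.1):
    """PeriodIden: infer the dominant transmission period (in seconds) of a
    signal from its inter-arrival times, or None if the spacing is irregular."""
    ts = pd.to_datetime(pd.Series(timestamps)).sort_values()
    deltas = ts.diff().dropna().dt.total_seconds()
    period = deltas.median()
    regular = ((deltas - period).abs() <= tolerance * period).mean()
    return period if regular > 0.9 else None

def count_missing_transmissions(timestamps, period_seconds: float) -> int:
    """MissingVals: count the transmissions expected at the given period that
    never arrived, based on gaps larger than one period."""
    ts = pd.to_datetime(pd.Series(timestamps)).sort_values()
    gaps = ts.diff().dropna().dt.total_seconds()
    return int(((gaps / period_seconds).round() - 1).clip(lower=0).sum())
```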
5 CHOOSING THE STRATEGY ACCORDING TO THE IDENTIFIED COMPLEXITIES
A multitude of complexities can characterize the pro-
cessed data. Depending on the data’s identified com-
plexities, we should apply specific steps from the de-
fined methodology. The data complexities that we
identified in our experiments and other related works
in the field (Kaisler et al., 2013),(Katal et al., 2013),
(Uddin et al., 2014) are presented below.
Large Volume of Data: Big data is by definition an
amount of data just beyond technology’s capability to
store, manage and process efficiently.
Unstructured Data: Data can be recorded by several sensors or logs, and most of the time it is not structured. This may cause significant challenges that can lead to analytic sprawl.
Heterogeneity: Data tends to be different from all
perspectives: data types, value, recording method.
Heterogeneity is one of the biggest challenges in
modern data analysis caused by the effects of different
factors introduced during data collection procedures.
Velocity: Data velocity refers to the speed of data creation, streaming, and aggregation. This characteristic is not limited to the speed of incoming data but also covers the speed at which the data flows through the system, making traditional systems incapable of performing the analysis on data that is constantly in motion.
Variety: Data variety is a measure of the richness
of the data representation. Due to non-aligned data
structures and inconsistent data semantics, it is prob-
ably the biggest obstacle from an analytic perspective.
Value: As the volume of big data constantly increases, the value of individual data records implicitly decreases in proportion, due to several factors such as age, type, quantity, and richness of information.
Variability: Variability considers the inconsisten-
cies of the data flow. Data loads are challenging
to maintain due to peaks in data loads when certain
events occur.
Complexity: Complexity measures the degree of
interconnectedness and interdependence in big data
structures in a manner that a small change in a few
elements can create a ripple across the system and af-
fect its behavior.
For each of the identified data complexities, we propose a strategy for identifying and handling that complexity in the processed data set. The correspondence between complexities and strategies is given at the end of this section.
The business understanding steps of the methodology defined in Section 4.2 should always be applied before doing complex preprocessing. The processing steps defined in Sections 4.3 and 4.4 are designed to handle real complexities present in data. Given the complexities identified in the data, a processing strategy to bypass those complexities should be applied.
Table 1: Identifiers of the MEDIS steps.

Step | Identifier
Syntactic data exploration - Noise analysis and removal | SyExpl.NAR
Syntactic data exploration - Duplicates removal | SyExpl.DR
Syntactic data exploration - Eliminate incorrect values | SyExpl.SIV
Feature selection | FeatSel
Semantic data exploration | SemExpl
Signal correlation | SignCorr
Metadata enhancing | MetaEnh
Periodicity identification | PeriodIden
Missing values | MissingVals
We use an identifier for better visualization and understanding of each defined step. The identifiers can be seen in Table 1. Table 2 contains the complexities presented above and maps each complexity to the steps from MEDIS that are suitable for that complexity. Given our mapping of solutions for each possible complexity, the preprocessing and data analysis process can be automated (a minimal sketch follows Table 2).
Table 2: Mapping MEDIS steps to data complexities.

Complexity | Compatible steps
Large volume of data | SyExpl, FeatSel, SemExpl, SignCorr
Unstructured data | SemExpl, MissingVals, MetaEnh
Heterogeneity | MetaEnh, SemExpl, PeriodIden
Velocity | MissingVals, PeriodIden
Variety | MetaEnh, SemExpl
Value | SemExpl, MissingVals
Variability | FeatSel, MissingVals, SyExpl.SIV
Complexity | SignCorr, MetaEnh
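Because the mapping in Table 2 is fixed, selecting the MEDIS steps to apply can be automated once the complexities of a data set have been identified; the sketch below encodes Table 2 as a dictionary (the function name and the ordering policy are our assumptions).

```python
# Mapping of data complexities to MEDIS step identifiers (Table 2).
STEPS_FOR_COMPLEXITY = {
    "large volume of data": ["SyExpl", "FeatSel", "SemExpl", "SignCorr"],
    "unstructured data":    ["SemExpl", "MissingVals", "MetaEnh"],
    "heterogeneity":        ["MetaEnh", "SemExpl", "PeriodIden"],
    "velocity":             ["MissingVals", "PeriodIden"],
    "variety":              ["MetaEnh", "SemExpl"],
    "value":                ["SemExpl", "MissingVals"],
    "variability":          ["FeatSel", "MissingVals", "SyExpl.SIV"],
    "complexity":           ["SignCorr", "MetaEnh"],
}

def select_steps(identified_complexities):
    """Return the ordered, de-duplicated list of MEDIS steps to apply."""
    plan = []
    for complexity in identified_complexities:
        for step in STEPS_FOR_COMPLEXITY[complexity]:
            if step not in plan:
                plan.append(step)
    return plan

print(select_steps(["large volume of data", "velocity"]))
# -> ['SyExpl', 'FeatSel', 'SemExpl', 'SignCorr', 'MissingVals', 'PeriodIden']
```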
6 CASE STUDY
For validating the methodology, we used samples of real industrial data. The data represents the recorded logs from three different types of modern smart home appliances and is described in Table 3. When describing the data, we replaced the real timestamps with a timestamp identifier for confidentiality reasons.
Table 3: Datasets used for methodology validation.

Id | Start time | End time
App1 | year1-month1-day22 | year1-month1-day30
App2 | year1-month1-day1 | year1-month1-day11
App3 | year1-month2-day18 | year1-month3-day16
When we applied the SyExpl.DR step of the methodology, we eliminated on average 27.75% of the data because it consisted of duplicates. When we applied the FeatSel step, we reduced the dimensionality of the data by selecting 11 features from a total of 27. By applying the PeriodIden step, we found that 4 out of 35 sensors transmit data periodically for one of the appliances studied. The MissingVals step determined a decrease in the size of the dataset by 2% for App1. On the same appliance, we reduced the dataset by 2.44% by using the SyExpl.SIV step of the methodology.
The concrete algorithms for periodicity detection, duplicate signals, and missing values identification used in these experiments are presented in (Tolas et al., 2021).
7 CONCLUSIONS
Big data is all around us and comes from multiple sources. Before applying complex processing steps to data, it needs to be understood and preprocessed. In this paper, we presented a methodology (MEDIS) for preprocessing big data, which consists of several approaches that can be used to handle real data complexities. The general methodology is organized into three main steps: business understanding and data representation, descriptive and exploratory analysis, and data engineering, in which we both filter the existing data and enhance it by adding metadata information.
Current research in this area is concentrated either on processing structured data, such as logs from systems, or on knowledge extraction. Our work's innovation consists of a new unified approach to processing hybrid data. We do this by considering at the same time the processing of Big Data and knowledge extraction from unstructured data, thus combining even more complexity dimensions. Moreover, our methodology creates the base for future automation of data analysis and preprocessing.
REFERENCES
Bolón-Canedo, V., Sánchez-Maroño, N., and Alonso-Betanzos, A. (2015). Recent advances and emerging challenges of feature selection in the context of big data. Knowledge-Based Systems, 86:33–45.
Chira, C., Portase, R., Tolas, R., Lemnare, C., and Potolea,
R. (2020). A system for managing and processing in-
dustrial sensor data: Sms. In 2020 16th IEEE Interna-
tional Conference on Intelligent Computer Communi-
cation and Processing (ICCP). IEEE.
Cirillo, D. and Valencia, A. (2019). Big data analytics for
personalized medicine. Current opinion in biotech-
nology, 58:161–167.
Edwards, D. J., Holt, G. D., and Harris, F. (1998). Pre-
dictive maintenance techniques and their relevance to
construction plant. Journal of Quality in Maintenance
Engineering.
Elfeky, M. G., Aref, W. G., and Elmagarmid, A. K. (2005).
Periodicity detection in time series databases. IEEE
Transactions on Knowledge and Data Engineering,
17(7):875–887.
Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. (1996). The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM.
G. Chandrashekar, F. S. (2014). A survey on feature selec-
tion methods. Computers & Electrical Engineering,
41(1):16–28.
H. Liu, H. M. (2012). Feature selection for knowledge dis-
covery and data mining. Springer Science & Business
Media, 454.
Hall, P., Pittelkow, Y., and Ghosh, M. (2008). Theoreti-
cal measures of relative performance of classifiers for
high dimensional data with small sample sizes. Jour-
nal of the Royal Statistical Society: Series B (Statisti-
cal Methodology), 70(1):159–173.
Haoyue Liu, Meng Chu Zhou, Q. L. (2019). An embedded feature selection method for imbalanced data classification. IEEE/CAA Journal of Automatica Sinica, 6.
Hashemian, H. M. (2010). State-of-the-art predictive main-
tenance techniques. IEEE Transactions on Instrumen-
tation and measurement, 60(1):226–236.
Kaisler, S., Armour, F., Espinosa, J. A., and Money, W.
(2013). Big data: Issues and challenges moving for-
ward. In 2013 46th Hawaii International Conference
on System Sciences, pages 995–1004. IEEE.
Katal, A., Wazid, M., and Goudar, R. H. (2013). Big data:
issues, challenges, tools and good practices. In 2013
Sixth international conference on contemporary com-
puting (IC3), pages 404–409. IEEE.
Ke Yana, D. Z. (2015). Feature selection and analysis on
correlated gas sensor data with recursive feature elim-
ination. Elsevier, 212.
Khan, W. Z., Rehman, M., Zangoti, H. M., Afzal, M. K.,
Armi, N., and Salah, K. (2020). Industrial internet of
things: Recent advances, enabling technologies and
open challenges. Computers & Electrical Engineer-
ing, 81:106522.
Lei Yu leiyu, H. L. (2003). Feature selection for high-
dimensional data: A fast correlation-based filter solu-
tion. Proc. Int. Conf. Machine Learning, 3:856–863.
Lu, M. (2019). Embedded feature selection accounting for
unknown data heterogeneity. Elsevier, Expert Systems
with Applications, 119.
Lu, Y. (2017). Industry 4.0: A survey on technologies, ap-
plications and open research issues. Journal of indus-
trial information integration, 6:1–10.
Miner, S. E. SEMMA methodology, SAS Institute.
Mourtzis, D., Vlachou, E., and Milas, N. (2016). Industrial
big data as a result of iot adoption in manufacturing.
Procedia cirp, 55:290–295.
NADA ALMUGREN, H. A. (2019). A survey on hy-
brid feature selection methods in microarray gene ex-
pression data for cancer classification. IEEE Access,
7:78533–78548.
Nedelcu, E., Portase, R., Tolas, R., Muresan, R., Dinsore-
anu, M., and Potolea, R. (2017). Artifact detection in
eeg using machine learning. In 2017 13th IEEE Inter-
national Conference on Intelligent Computer Commu-
nication and Processing (ICCP), pages 77–83. IEEE.
Olariu, E., Tolas, R., Portase, R., Dinsoreanu, M., and Po-
tolea, R. (2020). Modern approaches to preprocessing
industrial data. In 2020 16th IEEE International Con-
ference on Intelligent Computer Communication and
Processing (ICCP). IEEE.
Pech, M., Vrchota, J., and Bednář, J. (2021). Predictive maintenance and intelligent sensors in smart factory. Sensors, 21(4):1470.
Powell, M. J. D. et al. (1981). Approximation theory and
methods. Cambridge university press.
Puech, T., Boussard, M., D’Amato, A., and Millerand, G.
(2019). A fully automated periodicity detection in
time series. In International Workshop on Advanced
Analysis and Learning on Temporal Data, pages 43–
54. Springer.
Ratner, B. (2017). Statistical and Machine-Learning Data
Mining:: Techniques for Better Predictive Modeling
and Analysis of Big Data. CRC Press.
Shilaskar, S. and Ghatol, A. (2013). Feature selection for
medical diagnosis: Evaluation for cardiovascular dis-
eases. Expert Syst. with Applications, 40.
Sipos, R., Fradkin, D., Moerchen, F., and Wang, Z. (2014).
Log-based predictive maintenance. In Proceedings of
the 20th ACM SIGKDD international conference on
knowledge discovery and data mining, pages 1867–
1876.
Smith, P., Apple, D., Beyerlein, S., and Holmes, C. (2006).
Methodology for creating methodologies. Faculty
Guidebook. Lisle, Illinois: Pacific Crest, pages 371–
4.
Storcheus, D., Rostamizadeh, A., and Kumar, S. (2015). A
survey of modern questions and challenges in feature
extraction. In Feature Extraction: Modern Questions
and Challenges, pages 1–18.
Tolas, R., Portase, R., Iosif, A., and Potolea, R. (2021). Pe-
riodicity detection algorithm and applications on iot
data. In 20th International Symposium on Parallel and
Distributed Computing (ISPDC 2021).
Uddin, M. F., Gupta, N., et al. (2014). Seven V's of big data: understanding big data to extract value. In Proceedings of the 2014 Zone 1 Conference of the American Society for Engineering Education, pages 1–5. IEEE.
V. Bolón-Canedo, N. Sánchez-Maroño, A. A.-B. J. M. B. F. H. (2014). A review of microarray datasets and applied feature selection methods. Information Sciences, 282:111–135.
Wang, P., Vachtsevanos, G., et al. (2001). Fault prognostics
using dynamic wavelet neural networks. AI EDAM,
15(4):349–365.
Wirth, R. and Hipp, J. (2000). Crisp-dm: Towards a stan-
dard process model for data mining. Proceedings of
the 4th International Conference on the Practical Ap-
plications of Knowledge Discovery and Data Mining.