A Real-time and Energy-aware Framework for Data Stream Processing in the Internet of Things

Egberto A. R. de Oliveira, Flavia C. Delicato, Atslands R. da Rocha and Marta Mattoso

PESC/COPPE, Universidade Federal do Rio de Janeiro, Rio de Janeiro, RJ, Brazil

Instituto de Computação, Universidade Federal Fluminense, Niterói, RJ, Brazil

Universidade Federal do Ceará, Fortaleza, CE, Brazil

Keywords:

IoT, Internet of Things, Data Streams, Data Stream Processing, Edge Computing, Adaptive Sampling.

Abstract:

The Internet of Things (IoT) has transformed the internet, enabling communication between every kind of object (things). The growing number of sensors and smart devices has increased the possibilities for data generation and collection. This has led to an explosion of data streams that are challenging to process in real time. Regarding the nature of the data, the huge volume, heterogeneity, continuity, disordering, noise and unpredictable rate are some of the challenging aspects to tackle. Regarding the data processing, the core activities from data acquisition to the production of high-level knowledge also pose challenges related to limited computational and energy resources and high network latency. In this context, we propose a framework to support the activities of a data stream processing workflow for IoT. It aims to allow real-time data processing with low power consumption. Edge computing is used to bring the data processing closer to the data sources and allow actions to be triggered quickly. An adaptive sampling strategy combined with a data prediction model is adopted to reduce the network traffic, thus decreasing the power consumption of the network devices. Experiments show that the proposed framework achieves up to 60.58% average energy savings for sensor nodes and still meets a strict execution time threshold of 1 s without compromising the accuracy of the output data on different scales of input streams.

1 INTRODUCTION

The Internet of Things (IoT) is transforming the internet, enabling communication between every kind of object (things) and creating a vision of "anytime, anywhere, any media, anything" communications (Atzori et al., 2010). Initially, IoT was mainly realized through the use of RFIDs; nowadays the concept has evolved to a broader view that refers to the interconnection of sensors, actuators, smart objects, and wireless sensor networks (WSN) (Akyildiz et al., 2002). The growing number of sensors and smart devices has led to an explosion in the volume, variety and velocity of generated data, empowering a new way of value creation for people and corporations (Dias de Assunção et al., 2018). The processing of these "firehoses" of data from existing and emerging applications poses several challenges and brings novel research opportunities.

The challenges involved in IoT data stream processing may be analyzed along at least two dimensions: (i) the data itself (generated by heterogeneous, distributed and often constrained devices), and (ii) the data processing, i.e., the core activities from data acquisition to the production of high-level knowledge.

Regarding the nature of the data, IoT de-

vices/sensors generate, possibly in a continuous way,

a huge amount of data, typically consisting of time-

series values, which are sampled over a speciﬁc time

period, thus characterizing a data stream (Karkouch

et al., 2016). Often, there is no control over the or-

der or frequency of streamed data, which is transient

or non-persisted. The input rate of a data stream is

unpredictable and bursty in nature, ranging from a

few bytes to several gigabits per second. In addition, the data is highly heterogeneous: it is generated by multiple types of devices, in different formats, and feeds a wide range of equally heterogeneous applications. Besides the potentially massive volume of data, an IoT environment is also characterized by high dynamism and volatility. In many IoT

applications, such as trafﬁc accident monitoring or

river ﬂooding prediction, the potential value of data

depends on its timely processing, under strict time

R. de Oliveira, E., Delicato, F., R. da Rocha, A. and Mattoso, M.

A Real-time and Energy-aware Framework for Data Stream Processing in the Internet of Things.

DOI: 10.5220/0010370100170028

In Proceedings of the 6th International Conference on Internet of Things, Big Data and Security (IoTBDS 2021), pages 17-28

ISBN: 978-989-758-504-3

requirements. Otherwise, the processing results and actions become less valuable or even worthless. Finally, quality-related data features also need to be considered. According to Qin et al. (2016), data quality in IoT involves the following features: (i) uncertainty, (ii) ambiguity and inconsistency, and (iii) incompleteness; most of them are a direct consequence of the data being produced by sensors. Sensors are failure-prone devices, so information and decisions derived from raw sensor data are also subject to failure (Klein and Lehner, 2010). Therefore, identifying errors and inconsistencies in a sensor-generated data stream is crucial to improve the accuracy of the data being processed. These errors and inconsistencies are called outliers, which are readings considered outside the regular state of the data being collected. Data points that differ significantly from others in a data set can represent either errors or events of importance to the application (Karkouch et al., 2016).

Regarding the data processing, the demand for

computational resources capable of processing large

volumes of data has historically been an obstacle for

creating high volume and/or high speed data process-

ing solutions (Dautov et al., 2018). Because of the

huge availability of resources offered by cloud com-

puting platforms, cloud-based approaches are widely

adopted in IoT systems. The data is pushed to the

cloud to be processed and the outcome is delivered

back to the local system. However, the internet backbone is not always able to meet the real-time requirements when transporting a huge amount of data arriving at high speed. This creates a communication bottleneck, which motivates non-cloud-based alternatives for handling and processing IoT-generated data streams (Janjua et al., 2019).

A promising approach that has recently emerged is Edge Computing (Dautov et al., 2018). It consists of bringing the data processing activities physically closer to the data sources. Edge computing is potentially useful and has been adopted in several domains such

as smart buildings, healthcare, autonomous vehicles,

and environmental monitoring. In these applications,

data is processed by an edge device such as a smart

gateway, to extract meaningful information from it

and take necessary actions immediately (Janjua et al.,

2019), thus preserving the usability of time-sensitive

data. Therefore, with the support of edge comput-

ing, some of the activities of a data stream process-

ing workﬂow can be performed by devices at the edge

of the network. Other activities, more demanding in

terms of processing, may continue to be carried out in

the cloud. Still others can be performed on the sen-

sor device itself, that is, on the data sources since, al-

though restricted in terms of CPU and memory, such

sensors are capable of performing less complex pro-

cessing. In this context, another challenge arises re-

lated to data stream processing in IoT, which consists

of using the available resources in a rational way. In

addition to the restricted processing and communica-

tion capabilities, several sensor devices are powered

by non-rechargeable batteries. Keeping the sensors

working as long as possible is a major challenge in

all sensing-based systems, and it has been extensively

investigated in the WSN community.

In general, a basic strategy for preserving energy

in WSNs consists of: (i) keeping the nodes in hiber-

nation as long as possible and (ii) reducing the data

trafﬁc in the network as much as possible. One strat-

egy called Adaptive Sampling (Anastasi et al., 2009)

simultaneously tries to do both by varying the inter-

val between samplings according to the behaviour of

the sensed data. Adaptive sampling methods might

not be suitable to sample physical phenomena with

sudden variations, which is a common characteristic

of data streams. However, this problem can be min-

imized when Adaptive Sampling is combined with

a data prediction model to compute estimated future

sensor readings (Monteiro et al., 2017).

In this work, we propose a framework which sup-

ports activities of a data stream processing workﬂow.

The framework aims at addressing the challenges of

real-time response, power consumption and data accuracy. We

adopt the edge computing paradigm to deal with the

network bandwidth vs. data production bottleneck,

allowing for applications with real-time requirements.

We use adaptive sampling to reduce the network traf-

ﬁc, and, as a consequence, the power consumption

of the sensor nodes. A data prediction model identifies and removes outliers, producing an accurate aggregate output. The main contributions of the proposed framework are:

• To provide an energy-aware data gathering com-

ponent with adaptive sampling to reduce the net-

work trafﬁc and, as a consequence, the power con-

sumption of the sensor nodes;

• To develop a data prediction model which takes

readings from multiple sensor nodes over a short

predeﬁned window as inputs, applies a density-

based clustering algorithm to identify and remove

outliers and produces an accurate aggregated out-

put.

The major benefit expected from adopting the proposed framework is the ability to deploy long-running real-time processing systems in remote outdoor environments such as forests, open fields and watercraft. In

such environments, there is no access to continuous

IoTBDS 2021 - 6th International Conference on Internet of Things, Big Data and Security

sources of electricity thus requiring the use of bat-

teries, solar panels or other types of limited power

sources.

The rest of this paper is organized as follows.

Section 2 presents relevant works which individually

tackle the issues we aim to solve together. Section 3

describes the proposed framework in detail. Experiments to evaluate the proposal are presented and discussed in Section 4. Finally, Section 5 concludes the

paper and provides additional information about the

ongoing and future work on this research.

2 RELATED WORK

Real-time/low latency response is an important and

very frequent requirement in IoT and Data Stream

Processing (DSP) applications (Li et al., 2015). A

recent survey on solutions for real-time processing

in big data streams (Mehmood and Anees, 2020)

points out there has been a growing number of pub-

lications on this topic during the last 10 years. The

study identiﬁes in-memory computing, support to

non-structured or semi-structured data, low latency

and the usage of machine learning algorithms, among

others, as key challenges in this field. The authors

conclude that there is a lack of ﬂexibility in the avail-

able solutions, since they are too speciﬁc for the use

cases they were designed to tackle. In addition, the

survey states that cloud-based approaches still rep-

resent the majority of current solutions analyzed. It

is important to mention that the quality of the output data is always a major concern, while energy consumption does not appear among the challenges or objectives of the evaluated solutions for real-time data stream processing.

Another recent survey on IoT architecture chal-

lenges (Samizadeh Nikoui et al., 2021) highlights en-

ergy efﬁciency as one of the major concerns when

designing IoT systems. Time efﬁciency (real-time

or near real-time responses) is also mentioned as an

important and frequent requirement. Quality aware-

ness can also be inferred as a key aspect for IoT ar-

chitectures since data integrity and accuracy concerns

are present in most of the described approaches. The

study, however, does not explicitly correlate energy and time efficiency in the analyzed solutions. To find this correlation, we cross-checked the lists of publications addressing energy and time efficiency. Of the 29 works listed in this survey, only 4 aim at tackling both energy and time efficiency

together: the solutions proposed in (Xu and Helal,

2016) and (Catarinucci et al., 2015) are cloud-based

data stream processing solutions that rely on fully of-

ﬂoading the collected data to the cloud. As already

mentioned in section 1, this kind of approach leads to

a communication bottleneck which makes it impossi-

ble to meet strict low latency requirements. The ar-

chitecture described in (Loria et al., 2017) is an effec-

tive real-time stream processing solution which relies

on a robust infrastructure of servers on the edge. It

is not designed for constrained gateway devices and,

in fact, it does not address energy efﬁciency of gate-

ways or sensor nodes. OSCAR (Vučinić et al., 2014)

is not related to data stream processing. It is an en-

ergy efﬁcient architecture for real-time communica-

tion (machine-to-machine and multicast) focusing on

security.

Although relevant proposals that separately address the issues of real-time responses, data accuracy or power consumption can be found in the field of data stream processing for IoT, to the best of our knowledge no solution tackling these three concerns together has been proposed so far. This makes it difficult to

deploy solutions that can efﬁciently respond to real-

time events in power-constrained environments, such

as a forest ﬁre suppression system, a malfunction de-

tection system on small ships, etc.

IRESE (Janjua et al., 2019) presents an outlier (so-called "rare-event") detection system that applies unsupervised machine learning techniques at the edge to quickly identify events in audio data streams. Despite the significant results achieved in terms of data accuracy and real-time response, no concern regarding energy consumption is mentioned. Therefore, this solution might not be feasible in environments with limited power sources, which is a major concern in

our work. In addition, IRESE was designed to han-

dle a very speciﬁc type of data: audio streams. Our

work is more agnostic and not restricted to a single

data type or use case.

Dual prediction techniques are presented in (Mon-

teiro et al., 2017), (Al-Hoqani and Yang, 2015) and

(Gupta et al., 2011). They are combinations of adap-

tive sampling with data prediction models based on

exponential time series. The core idea is based on a

set of lightweight calculations performed at the WSN

nodes. These computations allow the sensor nodes, instead of delivering only a single sensor reading to the sink, to deliver a function that predicts sensor readings in the time interval between the current and the next reading. Data prediction models are used in this context to avoid jeopardizing the quality of the generated data when the sampling frequency decreases. The main goal of such a combined approach is to find a good balance between the energy consumption and the quality of the data being produced by a WSN. A common feature of the works that employ dual prediction schemes is that their prediction models are applied individually by each sensor. The framework proposed in this paper differs from these works

since it considers the readings of a group of sensors.

It not only uses an aggregate function to predict data,

but it also identiﬁes and eliminates incorrect readings

from its computations.

The works presented in this section propose ef-

fective solutions to address requirements of real-time

responses, data accuracy or power consumption, but

none of them tackles all three concerns together.

Combining these three requirements in the same so-

lution is complex because the approach used to solve

one problem can negatively impact the solution of an-

other. For example, statistical methods based on in-

tensive computing can efﬁciently solve the problem

of lack of data accuracy but they demand a high en-

ergy consumption from the devices. The contribution

of our work consists of combining approaches such as

those described, promoting the necessary adaptations

so that the three requirements are jointly met.

3 A FRAMEWORK FOR DATA STREAM PROCESSING IN IoT

There are many tools and platforms for ingesting, pro-

cessing, storing, and managing data streams, making

it a difﬁcult task for professionals to select the right

combination to perform their analysis. The authors in

(Isah et al., 2019) identiﬁed the main components of a

modern data stream processing system (DSPS), which

can be integrated into a framework. A data stream

processing framework can be considered as a corner-

stone for guiding the building of DSPS, addressing all

the activities involved in the stream processing work-

ﬂow (Isah et al., 2019). In our work, we consider

the general model for a DSPS framework proposed

by (Isah et al., 2019). It includes (i) a data stream in-

gestion layer, responsible for accepting data into the

DSPS; (ii) a data stream processing layer, which pre-

processes and analyzes data in one or more steps; (iii)

a persistence layer that stores, indexes and manages

the data and the generated knowledge; (iv) a resource

management layer, which coordinates the functions

of provisioning resources for data processing and its

communication with external applications; and (v) an

output layer that directs the output data to services and

applications.

3.1 Framework Components

In this work, we consider that IoT systems are orga-

nized in a three-tier architecture as described in (Li

et al., 2017). In such architecture, the bottom is the

Things tier and it comprises physical sensors, WSNs

and embedded devices, responsible for collecting data

from the monitored environment and feeding them to

the IoT system. Next comes the edge/fog tier com-

posed of devices located physically close to the things

and responsible for less compute intensive tasks such

as preprocessing the incoming data. The upper tier

is the cloud, encompassing robust devices (data cen-

ters) capable of handling more compute intensive pro-

cessing tasks and/or permanently storing relevant data

(archiving).

Considering such an organization for IoT systems and the model for DSPS described in (Isah et al., 2019), in this work we propose a DSP framework (depicted in Fig. 1) to be deployed at the things and edge tiers. Its goal is to provide real-time and energy-aware data processing for IoT streamed data. In the figure, each physical entity (PE) denotes a physical quantity monitored by a group of physical sensing units (sensors - S) at the things tier. The Orchestrator is a software component placed at the edge tier. It instantiates logical representations of each physical entity, called logical entities (LE), and also provides a Representational State Transfer Application Program Interface (REST API) (Richards, 2006) to allow communication between LEs and applications. The sensor nodes at the Things tier push readings to LEs via Constrained Application Protocol (CoAP) (Shelby et al., 2014). Applications running at the cloud tier, or even at the edge tier, consume data from LEs via Hypertext Transfer Protocol (HTTP) (Fielding and Reschke, 2014). The proposed framework supports both synchronous queries and asynchronous patterns such as publish/subscribe (Eugster et al., 2003).

Figure 1: Schematic view of the proposed framework for real-time and energy-aware data stream processing.

Figures 2 and 3 illustrate a use case of a single

entity monitored by three physical sensors and serv-

ing three different applications. We use a simpli-

ﬁed scenario to allow the graphic representation of

the components and their interactions. The adoption

of decoupled components and the possibility of host-

ing them using lightweight virtualization techniques

(such as containers) make the proposed framework

potentially scalable in terms of physical entities, sen-

sors and applications. The software components that

compose the proposed framework are described as

follows, along with their correspondence to the layers

for data stream processing systems proposed by (Isah

et al., 2019):

• Sampler: it is a CoAP client that runs on the

sensor nodes and it represents the data ingestion

layer. Its function is to sample the physical entity,

send the data to the corresponding LE and receive

a time interval as a response from the LE. This

time interval is used to put the sensor node in a

sleep state. When the sensor node goes back to

the active state, this process is repeated.

• Orchestrator: it is a service that runs on the edge

node to coordinate the instantiation of LEs and the

communication between LEs, sensors and appli-

cations. It represents the resource management

layer.

• Logical Entity (LE): it consists of a set of decou-

pled modules (described below) that run on the

edge node and work together to provide an abstract view of a monitored entity or phenomenon:

– Gatherer: a CoAP server that listens to requests

on a speciﬁc UDP port. Its function is to re-

ceive data sampled from sensor nodes and re-

spond back with a time interval until the next

sampling. Section 3.2 describes how these time

intervals are calculated.

– Buffer: to meet strict real-time requirements for

processing requests within milliseconds, an in-

memory data store that keeps the data in the

random access memory (RAM) is necessary

(Zhang et al., 2015). This is used to persist sen-

sor readings for a short predeﬁned time. It rep-

resents the persistence layer.

– Predictor: a component responsible for retrieving data from the buffer, identifying and discarding incorrect readings, and outputting a value calculated with a predefined aggregation function. It represents the DSP layer. Section 3.3 describes how these operations are carried out.

• REST API: it is an HTTP endpoint (Richards,

2006) provided by the Orchestrator to standard-

ize the communication between LEs and applica-

tions. It represents the output layer.
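As an illustration of the Sampler/Gatherer exchange described in the list above, the following minimal Python sketch replaces the CoAP transport with a direct method call. The names `Gatherer`, `handle`, `sampler_cycle` and the `next_interval` hook are assumptions made for this example only; they are not taken from the framework's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Gatherer:
    """Edge-side stand-in: stores readings and answers with a sleep interval."""
    next_interval: Callable[[float], float]          # hypothetical policy hook
    buffer: List[Tuple[str, float]] = field(default_factory=list)

    def handle(self, sensor_id: str, reading: float) -> float:
        # Keep the reading (the Buffer component in the text) ...
        self.buffer.append((sensor_id, reading))
        # ... and reply with the time until the next sampling.
        return self.next_interval(reading)


def sampler_cycle(sensor_id, read_sensor, gatherer, sleep):
    """One sample -> send -> sleep cycle of a sensor node."""
    reading = read_sensor()
    # In the paper this call is a CoAP request; here it is a plain method call.
    interval = gatherer.handle(sensor_id, reading)
    sleep(interval)
    return interval
```

Passing `sleep` as a parameter keeps the sketch testable; on a real node it would put the device into its low-power state for the returned interval.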

Figure 2: Nodes and components involved in a hypothetical scenario of a single entity monitored by three physical sensors and serving different applications.

Figure 3: Interactions between software components.

3.2 Adaptive Sampling Strategy

An adaptive sampling strategy consists of dynamically varying the time interval between samplings so that it follows the variability of the sampled physical entity. Whenever there is little or no variation, the sample interval can be increased. Whenever there is a significant variation, the sample interval must be reduced. The purpose of an adaptive sampling method is to reduce as much as possible the number of samples per unit of time, aiming at reducing the energy consumption of the sensor node. However, this strategy needs to be applied carefully so as not to compromise the accuracy of the sampled series (Anastasi et al., 2009).

When combined with a data prediction model, it is

possible to calculate the difference between the sam-

pled and predicted values. Thus, a domain special-

ist can conﬁgure a tolerance threshold which shall

be used to drive the adaptive sampling strategy. A

so-called greedy adaptive sampling strategy increases

the sampling interval at each sampling until the pre-

diction exceeds this tolerance threshold. When the

threshold is exceeded, the sampling interval is re-

duced so that the predicted and sampled values are

closer again. Also, minimum and maximum thresh-

olds can be added to satisfy application requirements

(Monteiro et al., 2017).
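The greedy behaviour described above can be sketched in a few lines of Python. This is only an illustration: the additive growth step, the multiplicative shrink factor and all default values are assumptions chosen for the example, not parameters given in the text.

```python
def greedy_next_interval(current, sampled, predicted, tolerance,
                         step=1.0, shrink=0.5, minimum=1.0, maximum=60.0):
    """Grow the interval while the prediction stays within tolerance, else shrink it.

    `step`, `shrink`, `minimum` and `maximum` are hypothetical tuning knobs.
    """
    if abs(sampled - predicted) > tolerance:
        candidate = current * shrink      # prediction drifted: sample more often
    else:
        candidate = current + step        # prediction is good: sleep a bit longer
    # Respect the minimum and maximum thresholds required by the application.
    return max(minimum, min(maximum, candidate))
```

For instance, with a tolerance of 0.5, a sampled value of 20.0 and a prediction of 20.1 lengthen a 10-second interval, whereas a prediction of 22.0 shortens it.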

The adaptive sampling strategy adopted in the proposed framework is an adaptation of the one described in DPCAS (Monteiro et al., 2017), also proposed by our research group, for data stream processing in WSNs. DPCAS uses concepts from TCP congestion control algorithms to adjust the sampling interval of each sensor node dynamically. The strategy is based on the TCP CUBIC protocol (Ha et al., 2008), where the size of the window varies according to a cubic function. The equations of the adaptive sampling method are as follows:

$$W_i = C(t - K)^3 + W_{max} \qquad (1)$$

$$K = (\beta W_{max} / C)^{1/3}, \quad 0 < \beta \le 1 \qquad (2)$$

where $W_i$ represents the sample interval calculated at the i-th sampling, which will be used as the sensor hibernation time until the next sampling, $C$ is a scale factor known as a CUBIC parameter (typically 0.4), $t$ is the elapsed time since the last reduction of the sample interval, $W_{max}$ is the sample interval immediately before the last reduction of the sample interval and $\beta$ is a multiplicative reduction factor (typically 0.2). The factor $K$, described in Equation 2, is updated only when an event of reduction of the sample interval occurs. An event of reduction of the sampling interval occurs whenever the difference ($\delta$) between the sampled value ($Y_i$) and its respective prediction ($F_i$) exceeds the tolerance limit of the application ($\varepsilon$), that is, whenever:

$$|Y_i - F_i| = \delta > \varepsilon \qquad (3)$$

In addition, the application can also impose a minimum ($S_{min}$) and maximum ($S_{max}$) limit for sampling interval variation, that is:

$$S_{min} \le W_i \le S_{max} \qquad (4)$$
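Equations 1 to 4 can be exercised with a short Python sketch. The function names and the default limits of 1 and 60 time units are assumptions for illustration; the defaults for C and β are the typical values quoted in the text. A useful sanity check is that substituting t = 0 into Equations 1 and 2 gives (1 − β) times the previous maximum, i.e. the interval right after a reduction.

```python
def cubic_k(w_max, beta=0.2, c=0.4):
    """Equation 2: time the cubic curve needs to climb back to w_max."""
    if not 0 < beta <= 1:
        raise ValueError("beta must satisfy 0 < beta <= 1")
    return (beta * w_max / c) ** (1.0 / 3.0)


def cubic_interval(t, w_max, beta=0.2, c=0.4, s_min=1.0, s_max=60.0):
    """Equation 1, clamped to the application limits of Equation 4.

    `s_min` and `s_max` defaults are hypothetical values for this example.
    """
    k = cubic_k(w_max, beta, c)
    w = c * (t - k) ** 3 + w_max
    return max(s_min, min(s_max, w))


def reduction_needed(sampled, predicted, epsilon):
    """Equation 3: True when the prediction error exceeds the tolerance."""
    return abs(sampled - predicted) > epsilon
```

With a previous maximum of 10, the interval restarts near 8 immediately after a reduction, returns to 10 when t equals K, and keeps growing afterwards until it reaches the upper limit.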

In our proposal, we assume the presence of multiple sensors sampling the same physical entity, a so-called multi-sensed entity scenario. In the adaptive sampling model proposed by DPCAS, each sensor node acts in a completely autonomous way. It samples the physical entity and calculates the time window until the next sampling based on its own samples. However, as the cubic function used to compute the time intervals is the same for all sensor nodes, the sensing rate is very close or the same for all nodes. Therefore, the sensor node activities become synchronized to some extent, since the time windows increase and decrease almost simultaneously. All the sensors would then sample at nearly the same time and, similarly, would be in a sleep state at the same time. This creates a gap that we call a "blind window": if a sudden variation occurs when all sensors are sleeping, this variation will only be identified when the sensors wake up.

In such a multi-sensed entity scenario, all the sen-

sors monitoring the same entity must not be in a sleep

state simultaneously. This opens up an opportunity

that we explore by proposing a collaborative strategy:

to distribute different sampling intervals among sensor nodes. It aims at desynchronizing sensor node activities to reduce these blind windows and make the

adaptive sampling model more responsive to sudden

changes in the data stream.
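The effect of desynchronization on the blind window can be illustrated with a small Python sketch. The even staggering policy used here is purely an assumption for illustration; the text does not prescribe how the different intervals are distributed among the sensors.

```python
def staggered_wake_times(start, interval, n_sensors):
    """Hypothetical policy: spread the next wake-ups evenly over one interval."""
    return [start + interval * (i + 1) / n_sensors for i in range(n_sensors)]


def largest_blind_window(start, wake_times):
    """Longest period after `start` during which no sensor wakes up to sample."""
    moments = sorted([start, *wake_times])
    return max(later - earlier for earlier, later in zip(moments, moments[1:]))


# Three sensors that all sleep for 12 time units leave a 12-unit blind window.
synchronized = [12.0, 12.0, 12.0]
# Staggering the same sensors leaves gaps of only 4 time units.
staggered = staggered_wake_times(0.0, 12.0, 3)
```

Under this assumed policy two of the three sensors sleep for less than the full interval, so a real distribution strategy would have to balance responsiveness against the energy spent on the extra wake-ups.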

To enable this collaborative and desynchronized

approach, the calculations of time intervals between

samplings must be carried out at the edge node and

not at the sensor nodes. The edge is the only node that communicates with all sensor nodes and it is also responsible for computing data predictions. Only the edge node has all the information required to calculate sampling intervals. Thus, the Gatherer component is responsible for these computations by performing the following steps:

1. It receives a sensor reading from the sampler;

2. It stores the received value in the buffer;

3. It uses the cubic function, the last value computed

by the predictor and the last computed sampling

window to compute the next sampling window;

4. It sends the sampling window back to the Sam-

pler.
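The four steps above can be sketched as a single handler. This is a simplified illustration for one logical entity, not the framework's code: the class name, the `initial_window` parameter, the in-memory deque standing in for the Buffer, and the way `last_prediction` is supplied are all assumptions, and the per-sensor desynchronization is left out.

```python
from collections import deque


class GathererSketch:
    """Illustrative edge-side handler for one logical entity (hypothetical names)."""

    def __init__(self, epsilon, initial_window=10.0, c=0.4, beta=0.2,
                 s_min=1.0, s_max=60.0):
        self.epsilon = epsilon
        self.c, self.beta = c, beta
        self.s_min, self.s_max = s_min, s_max
        self.buffer = deque()            # stand-in for the Buffer component
        self.last_prediction = None      # would be written by the Predictor
        self.last_window = initial_window
        self.w_max = initial_window
        self.last_reduction = 0.0

    def on_reading(self, now, value):
        # Step 1: a reading arrives from the Sampler.
        # Step 2: store it in the buffer.
        self.buffer.append((now, value))
        # Step 3: compare with the last prediction and apply the cubic function.
        predicted = self.last_prediction
        if predicted is not None and abs(value - predicted) > self.epsilon:
            self.w_max = self.last_window    # reduction event: remember the peak
            self.last_reduction = now
        k = (self.beta * self.w_max / self.c) ** (1.0 / 3.0)
        t = now - self.last_reduction
        window = self.c * (t - k) ** 3 + self.w_max
        self.last_window = max(self.s_min, min(self.s_max, window))
        # Step 4: the window is returned to the Sampler as its sleep time.
        return self.last_window
```

While readings stay within the tolerance of the prediction, the returned window grows along the cubic curve; as soon as a reading exceeds the tolerance, the window drops to (1 − β) times its previous value.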

3.3 Data Prediction

A prediction model aims at computing future sen-

sor readings. Simple Exponential Smoothing (SES)

(Ha et al., 2008) and Double Exponential Smooth-

ing (DES), also known as Holt Method (Hyndman

and Athanasopoulos, 2014) (Wright, 1986) are good

examples of data prediction models. They are com-

putationally economical and thus interesting choices

for WSNs and IoT (Monteiro et al., 2017). However,

both were designed to predict readings of an individual sensor based on its own past readings. As already mentioned in Section 3.2, our framework assumes a multi-sensed entity scenario. Thus, our data prediction model must consider readings from different sensors, physically close to each other, in its computations. From a high-level perspective, the data prediction model implemented by our Predictor component

consists of (i) cleaning noisy/incorrect readings and

(ii) computing an output value based on an aggrega-

tion function.
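For reference, the single-sensor Double Exponential Smoothing mentioned above can be written in a few lines. This is a generic sketch of Holt's linear method, not the framework's Predictor; the smoothing constants and the initialization (level from the first reading, trend from the first difference) are common choices assumed here for illustration.

```python
def holt_forecast(readings, level_smoothing=0.5, trend_smoothing=0.3, horizon=1):
    """Forecast `horizon` steps ahead with double exponential smoothing."""
    if len(readings) < 2:
        raise ValueError("at least two readings are needed to estimate a trend")
    level = readings[0]
    trend = readings[1] - readings[0]
    for y in readings[1:]:
        previous_level = level
        # Blend the new reading with the previous one-step-ahead estimate.
        level = level_smoothing * y + (1 - level_smoothing) * (level + trend)
        # Blend the newly observed slope with the previous trend.
        trend = (trend_smoothing * (level - previous_level)
                 + (1 - trend_smoothing) * trend)
    return level + horizon * trend
```

On a perfectly linear series such as 1, 2, 3, 4 the forecast continues the line, which makes the method cheap to sanity-check; it uses only the history of one sensor, which is the limitation the multi-sensor Predictor is meant to address.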

Data streams are continuous flows of isolated data points. In an IoT use case with sensor-generated data, these data points are represented by sensor readings. Sensors are failure-prone devices. Incorrect sensor readings need to be identified and discarded as well as possible to improve the quality of the information and decisions based on the acquired data. No data point can be considered an outlier when analyzed in isolation. Thus, defining a way to group and analyze these data points together is a major concern when designing an outlier detection task for IoT. Buffering incoming data over a predefined length or time interval to create windows is a common approach to group data points and perform operations on data streams (Tsai et al., 2014).
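A time-based window of this kind can be sketched with a double-ended queue. The class below is a minimal in-memory illustration with hypothetical names; it assumes timestamps arrive in non-decreasing order and is not a substitute for the data store used in the actual implementation.

```python
from collections import deque


class TimeWindowBuffer:
    """Keeps only the readings received during the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._items = deque()            # (timestamp, value), oldest first

    def add(self, timestamp, value):
        self._items.append((timestamp, value))
        self._evict(timestamp)

    def _evict(self, now):
        # Drop readings that fell out of the window.
        while self._items and now - self._items[0][0] > self.window_seconds:
            self._items.popleft()

    def values(self, now):
        """Return the readings that are still inside the window at time `now`."""
        self._evict(now)
        return [value for _, value in self._items]
```

Grouping readings this way gives the outlier detection step a bounded set of points to compare against each other.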

Clustering is a problem widely studied in data

mining and AI literature. However, it is difﬁcult to

adapt arbitrary clustering algorithms to the context

of data stream processing. The data stream feature

of being potentially unbounded in size makes such

adaptation especially complex (Aggarwal, 2013). K-means is one of the best-known clustering algorithms and also the starting point for a number of variations tailored for

stream processing (Tsai et al., 2014). However, since

each outlier can represent a different cluster and the

number of clusters is an expected input for K-means

and its variations, such algorithms are not suitable for

outlier detection. Density-based techniques, in turn,

are able to determine the number of clusters as an out-

put. Thus, they are more effective and versatile than

K-means for the purpose of outlier detection in data

streams in the IoT (Aggarwal, 2013). Although it is not tied to a specific algorithm, our proposed framework requires a density-based approach to be chosen. A set

of good candidates is presented in (Campello et al.,

2013). We chose Density-based spatial clustering of

applications with noise (DBSCAN) in our implemen-

tation of the predictor component.

Choosing an aggregation function can become a

complex task depending on the application’s use case.

In the same way as for the clustering algorithm used

in the outlier detection task, this is a feature where the

proposed framework gives some autonomy regarding

its choice. In the context of real-time data stream

processing for IoT, it can be assumed that the input

data are sensor readings sampled in a short time win-

dow. We must also consider that these data points

have been cleaned by an outlier detection task and

are trustworthy. Therefore, central tendency statisti-

cal measures, such as mean and median, are reason-

able choices to represent the value measured in that

time interval. For the sake of simplicity, we use the

arithmetic average function in our implementation.

The current version of the Predictor component

adopts a naive approach which can be summarized as

a three-step procedure:

1. It takes all the samples from the buffer as the input;

2. It runs DBSCAN to identify and remove outliers;

3. It computes and outputs the arithmetic average of the remaining samples.
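The three steps above can be sketched as follows. This is not the authors' implementation: it is a minimal, standard-library illustration in which a point is discarded when DBSCAN would label it as noise (it is neither a core point nor within `eps` of a core point). The names `dbscan_noise_mask` and `predict`, and the values of `eps` and `min_samples`, are assumptions chosen for the example.

```python
from math import dist
from statistics import fmean
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


def dbscan_noise_mask(points: Sequence[Point], eps: float, min_samples: int) -> List[bool]:
    """Return True for every point that DBSCAN would label as noise."""
    n = len(points)
    # Neighborhoods include the point itself; this brute-force search is O(n^2).
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps] for i in range(n)
    ]
    core = [len(nb) >= min_samples for nb in neighbors]
    # Noise: not a core point and not reachable from any core point.
    return [not core[i] and not any(core[j] for j in neighbors[i]) for i in range(n)]


def predict(buffer: Sequence[Point], eps: float = 1.5, min_samples: int = 3) -> Optional[Point]:
    """Step 1: take all buffered samples; step 2: drop noise; step 3: average the rest."""
    noise = dbscan_noise_mask(buffer, eps, min_samples)
    kept = [p for p, is_noise in zip(buffer, noise) if not is_noise]
    if not kept:
        return None  # assumption: no output when every sample is rejected
    return tuple(fmean(dimension) for dimension in zip(*kept))


# Hypothetical (temperature, humidity) samples with one obvious outlier.
samples = [(25.0, 40.0), (25.2, 40.3), (24.9, 39.8), (25.1, 40.1), (60.0, 5.0)]
output = predict(samples)
```

With these illustrative parameters, the isolated reading `(60.0, 5.0)` has no neighbors within `eps` and is excluded before averaging.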

4 EVALUATION

In this section, we describe the experiments per-

formed with the proposed framework in order to eval-

uate (i) the impact, in terms of accuracy of the output

data, of reducing the number of data samples sent by

the sensors; (ii) how efficient the proposed adaptive sampling strategy is in terms of energy consumption; and (iii) how fast the proposed data prediction model processes buffered data.

4.1 Environment

The components of the framework were developed in Python 3. The CoAP features were implemented with CoAPthon (Tanganelli et al., 2015). Redis (Sanfilippo and Noordhuis, 2009) was used to implement the Buffer component. The REST API was implemented with Flask (Ronacher, 2020). In terms of infrastructure, both the development of the software components and a preliminary proof of concept (PoC) were carried out on a Raspberry Pi 3 Model B+, a device widely used in IoT applications. We created an isolated virtual network on

a public cloud environment to simulate the experi-

ments on ﬁve different input data volumes and allo-

cated computing resources for processing. To opti-

mize the costs incurred by running these experiments

on a public cloud with a pay-as-you-go model, each

sensor node runs 30 instances of the sampler compo-

nent to simulate 30 physical sensors running simul-

taneously. A single edge node gets more allocated

resources since the proposed solution is not yet pre-

pared to run in a distributed manner. Table 1 de-

scribes the number of sensors ingesting data and also

the computing resources of the edge node (vCPUs and

RAM) for each round of simulation. All sensor and

edge nodes are powered by 64-bit Arm-based proces-

sors, with 10 Gbps network bandwidth and Ubuntu

Linux 20.04 operating system. This conﬁguration for

the virtual machines is the closest setup to a Rasp-

berry Pi 3 B+ we can achieve on a public cloud.

A preliminary data set of real sensor readings was

generated from six Adafruit DHT11 sensors (Indus-

tries, 2020) continuously collecting temperature and

humidity data for one hour. Then, we used this data

set to generate synthetic data sets of 480 sensors by

adding to these measured values a random value rang-

ing from -1 to 1. Finally, the synthetic data set was

used as the input on each round of simulation for two

different sampling modes:

1. Fixed Sampling Mode: the complete synthetic

data set is sent to the edge node according to its

timestamp and used by the predictor component

to compute the output. This scenario represents a

benchmark use case where all the sensors are ac-

tive all the time.

2. Adaptive Sampling Mode: each instance of the sampler component running on a sensor node communicates with the edge node to send data when active and sleeps according to the time windows received. Data points from the synthetic data set in which the timestamp corresponds to the sensor’s sleep time are discarded.

Table 1: Infrastructure setup per round of simulation.

Round  Sensors  Edge vCPUs  Edge RAM
1      30       1           4 GiB
2      60       2           8 GiB
3      120      4           16 GiB
4      240      8           32 GiB
5      480      16          64 GiB
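The data set preparation and the adaptive-mode filtering described above can be sketched as follows. This is an illustration under stated assumptions: the text does not say how the six real sensors are mapped onto the synthetic ones, so the sketch simply cycles over them, and the function names `make_synthetic` and `drop_sleeping` as well as the half-open `[start, end)` sleep windows are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Series = List[Tuple[float, float]]  # (timestamp in seconds, measured value)


def make_synthetic(real: Sequence[Series], n_sensors: int, rng: random.Random) -> List[Series]:
    """Derive n_sensors synthetic series by adding a random value in [-1, 1] to real readings."""
    synthetic = []
    for sensor_id in range(n_sensors):
        base = real[sensor_id % len(real)]  # assumption: cycle over the real sensors
        synthetic.append([(t, value + rng.uniform(-1.0, 1.0)) for t, value in base])
    return synthetic


def drop_sleeping(series: Series, sleep_windows: Sequence[Tuple[float, float]]) -> Series:
    """Adaptive mode: discard points whose timestamp falls inside a sleep window."""
    return [
        (t, v)
        for t, v in series
        if not any(start <= t < end for start, end in sleep_windows)
    ]


rng = random.Random(42)  # fixed seed so the example is reproducible
real_readings = [[(0.0, 25.0), (1.0, 25.5)], [(0.0, 26.0), (1.0, 26.2)]]
synthetic = make_synthetic(real_readings, 4, rng)
awake_only = drop_sleeping([(0, 1.0), (1, 2.0), (2, 3.0), (3, 4.0)], [(1, 3)])
```

In the fixed sampling mode, the synthetic series would be forwarded unchanged; only the adaptive mode applies the filter.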

In all rounds of simulations, the data prediction task is

executed once every ﬁve seconds to get the data used

to generate the line and scatter plots in this section.

Before evaluating the impacts and beneﬁts of the

proposed solution, it is important to clarify the core

difference between the two sampling modes, which will be compared in the following subsections. As shown in Fig. 4, while the fixed sampling mode keeps a high

number of samples at the buffer all the time, in the

adaptive sampling mode, this number varies accord-

ing to the value of the data being gathered. When the

temperature and humidity values are stable, the num-

ber of buffered samples is low. When the tempera-

ture and humidity values start to change, the number

of buffered samples increases quickly. This behav-

ior is observed in all simulation rounds with different

scales. As the data volume being gathered increases

on each simulation round, the difference in the num-

ber of buffered samples over time is even greater. Fig.

4 also presents the difference between rounds 1 and 5.

4.2 Data Accuracy

The reduced number of samples sent to the edge

node using the adaptive sampling mode implies that

less data is available for the data prediction model.

Figure 4: Buffered load over time by sampling strategy.

IoTBDS 2021 - 6th International Conference on Internet of Things, Big Data and Security

Figure 5: Temperature and humidity readings processed with different sampling strategies.

Thus, there is a concern that data accuracy might be

affected. To evaluate how this reduced amount of

data available impacts the data prediction output, we

use the Mean Absolute Error (MAE) metric (Chai and Draxler, 2014). We take the outputs from the executions using fixed sampling as the real/observed values and calculate their differences from the outputs obtained with the adaptive sampling mode in each simulation round. Tables 2 and 3 present the calculated MAE and data range per round of simulation for humidity and temperature. The MAE stays below 0.6 for humidity and below 0.35 for temperature in every round, with no clear trend across rounds, which indicates that scaling up sensors and computing resources does not impact the MAE.
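As a reminder of how the metric is obtained, the MAE is the average of the absolute differences between paired outputs. The sketch below uses the fixed-sampling outputs as the reference; the function name and the sample values are hypothetical and are not taken from the experiments.

```python
from typing import Sequence


def mean_absolute_error(reference: Sequence[float], estimate: Sequence[float]) -> float:
    """MAE = (1/n) * sum(|reference_i - estimate_i|) over paired outputs."""
    if not reference or len(reference) != len(estimate):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(r - e) for r, e in zip(reference, estimate)) / len(reference)


# Hypothetical prediction outputs (e.g. temperature in °C) for four executions.
fixed_outputs = [25.0, 25.5, 26.0, 27.0]     # treated as the real/observed values
adaptive_outputs = [25.2, 25.4, 26.5, 26.8]  # outputs under adaptive sampling
mae = mean_absolute_error(fixed_outputs, adaptive_outputs)
```

Here the absolute differences are 0.2, 0.1, 0.5 and 0.2, so the MAE is 0.25 in the unit of the measured variable.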

Despite the considerable difference in the data

available at the buffer, the fixed and adaptive sampling modes produce very similar data prediction outputs. Fig. 5 shows how close the outputs of the two sampling modes are. The red lines represent temperature readings, while the blue lines represent humidity readings. The darker lines correspond to the fixed sampling mode, while the lighter ones correspond to the adaptive sampling mode. Comparing rounds 1 and 5, it is also possible to note that the greater the amount of data, the closer the outputs of the two sampling approaches are.

Table 2: MAE for humidity per round of simulation.

Round  MAE     Range
1      0.5365  (19.91 - 51.01 %)
2      0.5720  (17.15 - 50.44 %)
3      0.4110  (16.84 - 52.23 %)
4      0.3656  (19.06 - 51.23 %)
5      0.4738  (16.45 - 50.46 %)

4.3 Energy Consumption

Having shown that the reduced number of samples does not significantly affect the accuracy of the output data, we need to verify the benefit of the adaptive sampling strategy in terms of energy consumption. To verify how efficient the proposed framework is at reducing the energy consumption of the sensor nodes when compared to a traditional fixed sampling approach, we use PowerPi (Kaup et al., 2014). PowerPi

is a power consumption model to calculate the energy

consumed by an application running on a Raspberry

Pi device (RPi).

To calculate the energy consumption of an appli-

cation with PowerPi, only the application to be mea-

sured must be running on the device, along with es-

sential operating system tasks. Our experiments rely

on 30 concurrent processes running the sampler com-

ponent on a shared virtual sensor node (section 4.1).

To allow this calculation, we had to make an assump-

tion: the energy consumption was measured individ-

ually on a real Raspberry Pi 3 Model B+ device for

each of the four main actions performed by a sensor

node (described below). The total energy consump-

tion of a sensor node is computed as a weighted sum

of each of these individual measures plus a constant

value for idle time (Kaup et al., 2014). The weights

are based on the application logs, where every action of the sensor node is registered.

Table 3: MAE for temperature per round of simulation.

Round  MAE     Range
1      0.1894  (25.44 - 51.01 °C)
2      0.2517  (25.46 - 57.53 °C)
3      0.3369  (25.59 - 50.65 °C)
4      0.2609  (25.31 - 58.59 °C)
5      0.2384  (25.40 - 58.09 °C)

The four actions performed by a sensor node which were individually measured are:

1. Sampling: sampling the physical entity to obtain

the temperature and humidity values.

2. Sending: sending sampled data to the edge node.

3. Getting Window: getting the sleep time until the

next sampling from the edge node.

4. Sleeping: sensor inactive for 1s.
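The weighted sum over these four actions can be sketched as below. All numbers are invented placeholders in arbitrary energy units, not the values measured on the Raspberry Pi or produced by PowerPi; the function name `sensor_energy` and the dictionary keys are likewise assumptions for illustration.

```python
from typing import Mapping


def sensor_energy(
    action_counts: Mapping[str, int],
    energy_per_action: Mapping[str, float],
    idle_energy: float,
) -> float:
    """Total energy = idle constant + sum(count of each action * energy of that action)."""
    return idle_energy + sum(
        action_counts.get(action, 0) * cost for action, cost in energy_per_action.items()
    )


# Placeholder per-action costs (arbitrary units); real values must be measured.
costs = {"sampling": 2, "sending": 10, "getting_window": 3, "sleeping": 1}
idle = 100

# Action counts as they could be extracted from hypothetical application logs.
fixed_counts = {"sampling": 100, "sending": 100}
adaptive_counts = {"sampling": 40, "sending": 40, "getting_window": 40, "sleeping": 60}

fixed_total = sensor_energy(fixed_counts, costs, idle)        # 100 + 200 + 1000
adaptive_total = sensor_energy(adaptive_counts, costs, idle)  # 100 + 80 + 400 + 120 + 60
savings = 1 - adaptive_total / fixed_total
```

With these placeholder inputs, the fixed mode totals 1300 units and the adaptive mode 760 units, which only illustrates how a relative saving is derived from the logs.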

The assumption made can lead to calculated values

that might not accurately represent the real energy

consumption of the devices. However, for a strictly comparative analysis between the scenarios, the calculations performed are sufficient. Fig. 6 shows that

the adaptive sampling strategy leads to average energy

consumption savings ranging from 42.93% to 60.58%

on sensor nodes, which is a very significant result. Fig. 6 also shows that actions involving communication between the sensor and edge nodes (especially "Sending") are the most expensive in terms of energy consumption. The adaptive sampling strategy essentially replaces the "Sending" activity with less expensive actions: "Sleeping", which is the least expensive activity, and "Getting Window", which is also a communication activity but with much smaller payloads.

The experiments also revealed an expected side effect of the proposed adaptive sampling strategy: poor load balancing. Although the total energy consumption of the sensor nodes is significantly reduced overall compared to the fixed sampling strategy, some individual sensors presented a much higher energy consumption than others in the same group. Load balancing tends to get worse as we increase the number of sensors. Fig. 7 shows the high contrast between the energy savings of two sensors in the same group in simulation round 5. For smaller scales (rounds 1 to 3), this anomaly is tiny, almost irrelevant.

Figure 6: Average energy consumption per sensor node.

Figure 7: Energy consumption of sensors 347 and 384.

4.4 Real-time Data Prediction

Earlier research on computer response times suggests that (Nah, 2004):

• 0.1s: is the limit for having the users feel that the

system is reacting instantaneously, meaning that

no special feedback is necessary except to display

the result;

• 1.0s: is about the limit for the user’s ﬂow of

thought to stay uninterrupted, even though the

user will notice the delay. Normally, no special

feedback is necessary during delays of more than

0.1s but less than 1.0s;

• 2.0s: is the limit where a response to simple com-

mands becomes unacceptable to users.

According to the deﬁnitions above, to meet the real-

time requirement, we consider 1.0 s as a threshold for

the data prediction task’s execution time. To achieve this goal in this multi-sensor environment, it is necessary to choose fast and non-compute-intensive algorithms when implementing the Predictor component to run on an edge device. DBSCAN was chosen for data cleaning/pre-processing due to its ability to find arbitrarily shaped clusters, which makes it robust for outlier detection (Campello et al., 2013).

The performed experiments involve continuous vari-

ables, which are real values over a non-empty range.

Thus, we believe that a simple average can fairly rep-

resent the temperature and humidity values given a

set of samples over a given short time. Therefore, the

Mean statistical function was chosen as the aggrega-

tion function for this implementation.

The time elapsed in each data prediction execution was also registered as an attribute of the output data. This information was used to evaluate how fast the data is processed, considering the data cleaning and aggregation tasks. Fig. 8 shows how the execution time of the data prediction increases as the volume of data processed (number of buffered samples) grows. The graph presents a linear progression, suggesting good scalability of the proposed solution using the chosen algorithms. Only a very small number of data points (9) were above the threshold (1.0 s); they represent less than 0.3% of the total and can therefore be considered outliers. These outliers lie outside the plotting area to provide a better view of the relevant part of the data. It is important to highlight that even with a number of samples on the order of 7 thousand, the processing time remains below 0.6 s.

Figure 8: Data prediction execution time per amount of buffered samples.
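Registering the elapsed time as an attribute of the output can be sketched as follows. The wrapper `timed_prediction`, the field names of the returned dictionary, and the stand-in predictor are hypothetical; only the 1.0 s threshold comes from the text.

```python
import time
from typing import Any, Callable, Dict, Sequence

REAL_TIME_THRESHOLD_S = 1.0  # execution time threshold adopted in this section


def timed_prediction(
    predict: Callable[[Sequence[float]], Any], samples: Sequence[float]
) -> Dict[str, Any]:
    """Run one data prediction and attach its execution time to the output record."""
    start = time.perf_counter()  # monotonic clock suited to measuring short durations
    value = predict(samples)
    elapsed = time.perf_counter() - start
    return {
        "value": value,
        "n_samples": len(samples),
        "elapsed_s": elapsed,
        "within_threshold": elapsed <= REAL_TIME_THRESHOLD_S,
    }


# Stand-in predictor: a plain arithmetic mean over the buffered samples.
record = timed_prediction(lambda s: sum(s) / len(s), [1.0, 2.0, 3.0])
```

Plotting `elapsed_s` against `n_samples` for every execution gives the kind of scatter plot described for Fig. 8.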

5 FINAL REMARKS

This work presented a real-time and energy-aware data stream processing framework for IoT. Experiments show that, thanks to the combination of an adaptive sampling strategy with a data prediction model, the reduced number of samples does not compromise data accuracy. The framework reduced the average energy consumption of sensor nodes by up to 60.58%. The results described in Section 4.4 indicate that the edge data prediction model successfully addresses real-time requirements by meeting the execution time threshold of 1 s for the data prediction activity. The main contribution of the proposed framework is its capability of tackling real-time processing, energy consumption and data accuracy requirements all together. Thus, it might be used to enable the development of long-running real-time data stream processing IoT systems in remote outdoor environments, where energy sources are scarce and replacing batteries frequently is undesirable or infeasible.

ACKNOWLEDGEMENTS

This work has been partially funded by Fundação de Amparo à Pesquisa do Estado de São Paulo - FAPESP (grant 2015/24144-7), Fundação de Amparo à Pesquisa do Estado do Rio de Janeiro - FAPERJ (grant 2017/233868) and Conselho Nacional de Desenvolvimento Científico e Tecnológico - CNPq (grant 434874/2018-3). Marta Mattoso and Flavia Delicato are CNPq Fellows.

REFERENCES

Aggarwal, C. C. (2013). Mining Sensor Data Streams. In

Managing and Mining Sensor Data, pages 143–171.

Springer US, Boston, MA.

Akyildiz, I., Su, W., Sankarasubramaniam, Y., and Cayirci,

E. (2002). Wireless sensor networks: a survey. Com-

puter Networks, 38(4).

Al-Hoqani, N. and Yang, S.-H. (2015). Adaptive sam-

pling for wireless household water consumption

monitoring. Procedia Engineering, 119:1356 –

1365. Computing and Control for the Water Industry

(CCWI2015) Sharing the best practice in water man-

agement.

Anastasi, G., Conti, M., Di Francesco, M., and Passarella, A. (2009). Energy conservation in wireless sensor networks: A survey. Ad Hoc Networks, 7(3):537–568.

Atzori, L., Iera, A., and Morabito, G. (2010). The In-

ternet of Things: A survey. Computer Networks,

54(15):2787–2805.

Campello, R. J. G. B., Moulavi, D., and Sander, J. (2013).

Density-based clustering based on hierarchical den-

sity estimates. In Pei, J., Tseng, V. S., Cao, L., Mo-

toda, H., and Xu, G., editors, Advances in Knowledge

Discovery and Data Mining, pages 160–172, Berlin,

Heidelberg. Springer Berlin Heidelberg.

Catarinucci, L., de Donno, D., Mainetti, L., Palano, L., Pa-

trono, L., Stefanizzi, M. L., and Tarricone, L. (2015).

An iot-aware architecture for smart healthcare sys-

tems. IEEE Internet of Things Journal, 2(6):515–526.

Chai, T. and Draxler, R. R. (2014). Root mean square error

(rmse) or mean absolute error (mae)? Geoscientiﬁc

Model Development Discussions, 7(1):1525–1534.

Dautov, R., Distefano, S., Bruneo, D., Longo, F., Merlino,

G., and Puliaﬁto, A. (2018). Pushing intelligence to

the edge with a stream processing architecture. In Pro-

ceedings - 2017 IEEE International Conference on In-

ternet of Things, IEEE Green Computing and Commu-

nications, IEEE Cyber, Physical and Social Comput-

ing, IEEE Smart Data, iThings-GreenCom-CPSCom-

SmartData 2017.

Dias de Assunção, M., da Silva Veith, A., and Buyya, R.

(2018). Distributed data stream processing and edge

computing: A survey on resource elasticity and future

directions. Journal of Network and Computer Appli-

cations.

Eugster, P. T., Felber, P. A., Guerraoui, R., and Kermarrec,

A.-M. (2003). The many faces of publish/subscribe.

ACM Comput. Surv., 35(2):114–131.

Fielding, R. T. and Reschke, J. (2014). Hypertext Transfer

Protocol (HTTP/1.1): Semantics and Content. RFC

7231.

Gupta, M., Shum, L. V., Bodanese, E., and Hailes, S.

(2011). Design and evaluation of an adaptive sampling

strategy for a wireless air pollution sensor network. In

2011 IEEE 36th Conference on Local Computer Net-

works, pages 1003–1010.

Ha, S., Rhee, I., and Xu, L. (2008). Cubic: A new tcp-

friendly high-speed tcp variant. SIGOPS Oper. Syst.

Rev., 42(5):64–74.

Hyndman, R. and Athanasopoulos, G. (2014). Forecasting:

principles and practice. OTexts.

Industries, A. (2020). DHT11 basic temperature-humidity

sensor.

Isah, H., Abughofa, T., Mahfuz, S., Ajerla, D., Zulker-

nine, F., and Khan, S. (2019). A survey of distributed

data stream processing frameworks. IEEE Access,

7:154300–154316.

Janjua, Z. H., Vecchio, M., Antonini, M., and Antonelli,

F. (2019). IRESE: An intelligent rare-event detec-

tion system using unsupervised learning on the IoT

edge. Engineering Applications of Artiﬁcial Intelli-

gence, 84:41–50.

Karkouch, A., Mousannif, H., Al Moatassime, H., and

Noel, T. (2016). Data quality in internet of things:

A state-of-the-art survey.

Kaup, F., Gottschling, P., and Hausheer, D. (2014). Pow-

erpi: Measuring and modeling the power consumption

of the raspberry pi. In 39th Annual IEEE Conference

on Local Computer Networks, pages 236–243.

Klein, A. and Lehner, W. (2010). Quality and Performance

Optimization of Sensor Data Stream Processing. In-

ternational Journal on Advances in Networks and Ser-

vices.

Li, S., Xu, L. D., and Zhao, S. (2015). The internet

of things: a survey. Information Systems Frontiers,

17(2):243–259.

Li, W., Santos, I., Delicato, F. C., Pires, P. F., Pirmez, L.,

Wei, W., Song, H., Zomaya, A., and Khan, S. (2017).

System modelling and performance evaluation of a

three-tier Cloud of Things. Future Generation Com-

puter Systems.

Loria, M. P., Toja, M., Carchiolo, V., and Malgeri, M.

(2017). An efﬁcient real-time architecture for col-

lecting iot data. In 2017 Federated Conference on

Computer Science and Information Systems (FedC-

SIS), pages 1157–1166.

Mehmood, E. and Anees, T. (2020). Challenges and so-

lutions for processing real-time big data stream: A

systematic literature review. IEEE Access, 8:119123–

119143.

Monteiro, L. C., Delicato, F. C., Pirmez, L., Pires, P. F., and

Miceli, C. (2017). Dpcas: Data prediction with cu-

bic adaptive sampling for wireless sensor networks.

In Au, M. H. A., Castiglione, A., Choo, K.-K. R.,

Palmieri, F., and Li, K.-C., editors, Green, Perva-

sive, and Cloud Computing, pages 353–368, Cham.

Springer International Publishing.

Nah, F. F.-H. (2004). A study on tolerable waiting time:

how long are web users willing to wait? Behaviour &

Information Technology, 23(3):153–163.

Qin, Y., Sheng, Q. Z., Falkner, N. J., Dustdar, S., Wang, H.,

and Vasilakos, A. V. (2016). When things matter: A

survey on data-centric internet of things. Journal of

Network and Computer Applications, 64.

Richards, R. (2006). Representational State Transfer

(REST), pages 633–672. Apress, Berkeley, CA.

Ronacher, A. (2010 (accessed August 5, 2020)). Flask web

development, one drop at a time.

Samizadeh Nikoui, T., Rahmani, A. M., Balador, A., and

Haj Seyyed Javadi, H. (2021). Internet of things archi-

tecture challenges: A systematic review. International

Journal of Communication Systems, 34(4):e4678.

Sanﬁlippo, S. and Noordhuis, P. (2009). Redis.

Shelby, Z., Hartke, K., and Bormann, C. (2014). The Con-

strained Application Protocol (CoAP). RFC 7252.

Tanganelli, G., Vallati, C., and Mingozzi, E. (2015).

Coapthon: Easy development of coap-based iot appli-

cations with python. In 2015 IEEE 2nd World Forum

on Internet of Things (WF-IoT), pages 63–68.

Tsai, C.-W., Lai, C.-F., Chiang, M.-C., and Yang, L. T.

(2014). Data Mining for Internet of Things: A Sur-

vey. IEEE Communications Surveys & Tutorials,

16(1):77–97.

Vučinić, M., Tourancheau, B., Rousseau, F., Duda, A., Damon, L., and Guizzetti, R. (2014). Oscar: Object security architecture for the internet of things. In Proceeding of IEEE International Symposium on a World of Wireless, Mobile and Multimedia Networks 2014, pages 1–10.

Wright, D. J. (1986). Forecasting data published at irregu-

lar time intervals using an extension of holt’s method.

Management Science, 32(4):499–510.

Xu, Y. and Helal, A. (2016). Scalable cloud–sensor archi-

tecture for the internet of things. IEEE Internet of

Things Journal, 3(3):285–298.

Zhang, H., Chen, G., Ooi, B. C., Tan, K., and Zhang, M.

(2015). In-memory big data management and process-

ing: A survey. IEEE Transactions on Knowledge and

Data Engineering, 27(7):1920–1948.
