Exploring Big Data Clustering Algorithms for Internet of Things Applications

Hind Bangui 1,2,3, Mouzhi Ge 1,2 and Barbora Buhnova 1,2

1 Institute of Computer Science, Masaryk University, Brno, Czech Republic

2 Faculty of Informatics, Masaryk University, Brno, Czech Republic

3 FSTG, Cadi Ayyad University, Marrakesh, Morocco

Keywords: Big Data, Internet of Things, Clustering Algorithm, Machine Learning, Mobile Networks.

Abstract: With the rapid development of Big Data and the Internet of Things (IoT), Big Data technologies have emerged as a key data analytics tool in IoT, in which data clustering algorithms are considered an essential component of data analysis. However, there has been limited research that addresses the challenges across Big Data and IoT, and thus proposing a research agenda is important to clarify the research challenges for clustering Big Data in the context of IoT. By tackling this specific aspect, clustering algorithms in Big Data, this paper examines Big Data technologies, related data clustering algorithms and possible usages in IoT. Based on our review, this paper identifies a set of research challenges that can be used as a research agenda for Big Data clustering research. This research agenda aims at identifying and bridging the research gaps between Big Data clustering algorithms and IoT.

1 INTRODUCTION

Internet of Things (IoT) is one of the most promising

technologies in the current epoch. IoT is

characterized by using smart and self-configuring

objects that can interact with each other via global

network infrastructure. Therefore, these interactions

between large amounts of heterogeneous objects

represent IoT as a disruptive technology that enables

ubiquitous and pervasive computing applications

(Van Kranenburg 2008). Accordingly, a wide range

of industrial IoT applications (Da et al., 2014) have

been developed and deployed in different domains

such as transportation systems, agriculture, food

processing industry, health monitoring systems,

environmental monitoring, and security surveillance.

Since IoT connects the sensors and other devices

to the Internet, it plays an important role in supporting

the development of smart services. In other words, the

dynamic things collect different kinds of information

from the real-world environment. Thus, the extracted

relevant information from IoT data can be used to

improve and enrich our daily life with context-aware

applications, which can for example display contents

related to the current situation of users (Dey, 2001).

It can thus identify the data relevant to an object based

on the object’s contextual information. IoT is an

important source of contextual data with a large

volume and fast velocity, which are considered typical characteristics of Big Data.

Big Data is defined according to three

fundamental elements, which are volume (size of

data), variety (different types of data from several

sources) and velocity (data collected in real time).

Moreover, other research work has added further characteristics to the 3V’s model. For example, (Manogaran et al., 2017) presented these aspects: value (benefits to various industrial and academic fields), veracity (uncertainty of data), validity (correct processing of the data), variability (context of data), viscosity (latency of data transmission between the source and destination), virality (speed at which data are sent and received from various sources) and visualization (interpretation of data and identification of the most relevant information for the users). Despite the existence of

additional characteristics of Big Data, the 3V’s model

sets the basis of the Big Data concept (Kitchin, 2017).

The fusion of Big Data and IoT technologies has

created opportunities for the development of services

for smart environments like smart cities. Several Big Data technologies are thus available to support the processing of large volumes of IoT data

Bangui, H., Ge, M. and Buhnova, B.

Exploring Big Data Clustering Algorithms for Internet of Things Applications.

DOI: 10.5220/0006773402690276

In Proceedings of the 3rd International Conference on Internet of Things, Big Data and Security (IoTBDS 2018), pages 269-276

ISBN: 978-989-758-296-7


such as Big Data analytics (Chen et al., 2016), which

have emerged as a need to process the data collected

from different sources in the smart environment.

However, the advancement of IoT is increasingly producing vast amounts and different types of data, especially with the emergence of 5G (Mavromoustakis et al., 2016). At the same time, Big Data and its technologies have opened new opportunities for industry and academia to develop new IoT solutions. Therefore, the fusion of Big Data

and IoT, as well as the highly dynamic evolution of

the two domains, creates new research challenges,

which have so far not been recognized and addressed

by the research community.

This paper tackles a specific and important aspect of Big Data, namely clustering algorithms, as clustering is a critical operation for Big Data processing and analytics. We have reviewed the advantages and disadvantages of clustering algorithms, which indicate that clustering is one of the key factors supporting the fusion of Big Data, cloud computing, mobile environment and IoT

technologies. The contributions of the paper are two-

fold: we have reviewed the clustering algorithms in

Big Data and illustrated how clustering algorithms in

Big Data can be used in IoT. Based on the review, we

have proposed a set of research challenges to clarify

the research gaps between Big Data and IoT.

The remainder of the paper is organized as

follows. Section 2 carries out a literature review on

clustering algorithms in Big Data, which includes

algorithm characteristics and classification. Based on

the review, Section 3 presents the major challenges

and opportunities related to the fusion of IoT, cloud

computing, mobile environment and Big Data

technologies in 5G networks. Finally, Section 4

concludes the work and outlines future research.

2 CLUSTERING ALGORITHMS IN BIG DATA

Clustering algorithms have emerged as a pre-processing tool to learn from and analyze Big Data (Fahad et al., 2014). The goal of clustering algorithms is to group similar data into the same cluster based on certain similarity metrics. There already exists a

number of clustering algorithms, as well as studies

that discuss their advantages and drawbacks

(Shirkhorshidi et al., 2014). As (Fahad et al., 2014)

indicated, clustering algorithms are currently

evolving to meet different Big Data challenges. This

section therefore reviews different clustering

algorithms for Big Data, which can possibly be used

in IoT. Although some studies also proposed

promising flexible parallel programming models that

can support parallel clustering algorithms for

handling Big Data (Mohebi et al., 2016), reviewing

the parallel clustering algorithms based on

MapReduce is not in the scope of this paper.

Clustering is an essential data mining technique used as a Big Data analytics method. The principle of this technique is to create groups or subsets that contain objects with similar characteristic features (Havens et al., 2013). Consequently, cluster analysis simplifies data manipulation by finding structure in data and classifying each object according to its nature. Clustering techniques are divided into two categories: single machine clustering techniques, which use the resources of just one machine, and multiple machine clustering techniques, which run on several machines and have access to more resources. In this section we try to categorize the majority of available clustering algorithms according to their applicability in Big Data as follows:

Hierarchical algorithm: The goal of hierarchical clustering is to build a hierarchical tree that shows the relation of clusters in two different manners, the agglomerative method and the divisive method (Pandove and Goel, 2015). The agglomerative method starts with one-point (singleton) clusters and recursively merges two or more appropriate clusters until it reaches K clusters. The divisive method, on the other hand, starts with a single cluster that contains all data objects and recursively divides it into smaller appropriate clusters until a stopping criterion is met.

Partitional algorithm: Unlike the hierarchical

clustering algorithms that impose a hierarchical

structure, the partitional algorithms find all the

clusters simultaneously as an initial partition of the

data. Then the objects are assigned to the most similar

cluster center based on specific criteria (Nguyen et al.

2013, Lin et al. 2011, Al-Madi et al. 2014).

Density-based algorithm: The main purpose of these techniques is to discover clusters of different shapes and sizes in large datasets, where each cluster is represented by a maximal set of density-connected objects, which are split based on the region of density, connectivity and boundary (Amini et al. 2014, Guo et al. 2015). Because of their high computational complexity, the communication cost of this kind of method can still be improved further.

Centroid-based clustering algorithm: The general

idea of this technique is that each cluster is

represented by an object (medoid), which is the most

centrally located in a cluster (Srirama et al. 2012, Ng


and Han 2002). Moreover, the centroid-based

algorithms reduce all comparisons between objects

and clusters into simple comparisons between objects

and the medoids of the clusters.

Single-linkage hierarchical algorithm: In general, the single-linkage algorithm is one of several methods of hierarchical clustering. It aims to reduce computational complexity by combining two different clusters based on the distance between their two closest objects (Goyal et al. 2016, Rafailidis and Manolopoulos 2017). However, it can produce a chaining effect, in which objects that are far apart end up in the same cluster because a chain of close intermediate objects links them.

Grid-based algorithm: The data space is

partitioned into a finite number of grids and a cluster

is represented by a region that has a maximal set of

density points (Wang et al., 1997). The number of

grids is smaller than the number of instances.

Consequently, in the partitioning stage of this kind of

methods, the grids could produce good results in

terms of clustering time.

Similarity-based clustering algorithm: The main

idea of this practical technique is to measure the

similarity of two objects and determine if they are

similar or dissimilar (Fahad et al. 2014, Shirkhorshidi

et al. 2014). Based on the degree of similarity, similar

objects are stored in the same cluster and dissimilar

objects are in different clusters. However, this

algorithm is incapable of dealing with massive data

instances.

Co-clustering algorithm: Unlike the traditional clustering algorithms that group similar subsets of the rows across all columns, the co-clustering algorithm correlates subsets of rows with only a subset of the columns (Liu et al. 2014). However, it is not practical to apply to large data sets.
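The agglomerative method with single linkage from the categories above can be sketched in a few lines of Python. This is a minimal illustration and not an implementation taken from any of the cited works: the function names, the Euclidean distance and the sample points are assumptions made for the example, and the exhaustive pair search is only suitable for very small inputs.

```python
from itertools import combinations
from math import dist


def single_linkage_distance(cluster_a, cluster_b):
    """Distance between the two closest objects of two clusters."""
    return min(dist(p, q) for p in cluster_a for q in cluster_b)


def agglomerative_clustering(points, k):
    """Merge singleton clusters pairwise until only k clusters remain."""
    clusters = [[tuple(p)] for p in points]  # start with one-point clusters
    while len(clusters) > k:
        # Find the pair of clusters with the smallest single-linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: single_linkage_distance(
                clusters[pair[0]], clusters[pair[1]]
            ),
        )
        clusters[i].extend(clusters[j])  # merge cluster j into cluster i
        del clusters[j]                  # j > i, so index i stays valid
    return clusters


# Hypothetical sample data: two visually separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
clusters = agglomerative_clustering(points, k=2)
```

On the sample data the three points near the origin end up in one cluster and the two points near (10, 10) in the other. Every merge step recomputes distances between all cluster pairs, which hints at why such methods are hard to scale on a single machine.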

Within the state of the art, several works survey clustering algorithms to determine the best performing ones for Big Data (Fahad et al., 2014, Berkhin 2006). K-means is one of the most used clustering algorithms in Big Data (Nguyen et al., 2013). It is a partitional clustering algorithm that takes the number of clusters K as an input parameter and starts from K initial cluster centers. It partitions a set of n objects into K clusters, where the cluster center is the mean value of the objects in the cluster. Based on the distance between the object and the cluster center, it assigns each object to the cluster to which it is the most similar. Finally, it calculates the new mean for each cluster, and the assignment and update steps are repeated until the clusters no longer change. Cop-k-means (Lin et al., 2011) is a modified version of k-means, where two types of pairwise constraints, namely Must-link (ML) and Cannot-link (CL), are added to avoid computational dependencies between Mappers. Consequently, the assignment of objects to clusters is order-sensitive. PSO (Al-Madi et al. 2014) addresses the sensitivity of k-means to the initial cluster centers by executing three MapReduce jobs: the first job generates new particle centroids, the second job uses the fitness function to evaluate the new particle centroids generated by the first job, and the third job merges the outputs of the first and second jobs.
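The k-means procedure described in this paragraph (assign each object to the nearest center, then recompute each center as the mean of its cluster) can be illustrated with a short single-machine sketch. The function names, the Euclidean distance, the stopping rule and the sample data are assumptions chosen for the example; the cited MapReduce variants distribute these steps differently.

```python
from math import dist


def mean_point(cluster):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def k_means(points, initial_centers, max_iterations=100):
    """Alternate assignment and mean-update steps until the centers stop moving."""
    centers = [tuple(c) for c in initial_centers]
    for _ in range(max_iterations):
        # Assignment step: each object goes to its nearest (most similar) center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: recompute each center as the mean of its cluster;
        # an empty cluster keeps its previous center.
        new_centers = [
            mean_point(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


# Hypothetical sample data and initial centers (K = 2).
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centers, clusters = k_means(points, initial_centers=[(0.0, 0.0), (10.0, 10.0)])
```

With these inputs the centers settle at (1.0, 1.5) and (8.5, 8.0). Choosing different initial centers can change the result, which is the sensitivity that the PSO-based approach mentioned above tries to reduce.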

PAM (Partitioning Around Medoids) is another clustering approach that belongs to centroid-based clustering (Srirama et al., 2012). It chooses k random objects as the initial medoids. Then it calculates the distance between each non-medoid object and the k medoids in order to assign each object to the cluster with the closest medoid. In contrast to PAM, CLARA (Clustering LARge Applications) (Ng and Han 2002, Srirama et al. 2012), which is an improvement of the PAM algorithm, focuses on clustering small random subsets of the dataset. The whole iteration is thus reduced to two MapReduce jobs: the first job calculates random subsets and the second measures the quality. As a result, it achieves minimal job latency because the input data is only loaded twice.

Thanks to the similarity measure that captures the coherence of the objects and automatically selects similar subsets, the co-clustering algorithm based on MapReduce has proved its efficiency and reliability in many domains, such as the improvement of cancer subtype identification (Liu et al., 2014).

Many works, such as (Amini et al., 2014), focus on running the DBSCAN (density-based spatial clustering of applications with noise) algorithm in MapReduce. The general idea of DBSCAN is to overcome the effect of noise and discover clusters of arbitrary shape. To do that, the objects are split based on the region of density, connectivity and boundary. A cluster is then formed by a maximal set of density-connected objects. As a pioneering density-based clustering algorithm, DBSCAN can thus detect arbitrarily shaped clusters. However, a lot of I/O overhead is produced by the need to examine each object to determine whether it is a core object. Besides, it performs poorly if the clusters have different densities. SCAN (Structural Clustering Algorithm for Networks) (Guo et al., 2015) is an extension of the DBSCAN approach for large networks. The advantage of this algorithm is that it identifies the activated vertices as new members of the cluster in order to handle big networks with millions of vertices.
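The core-object test and the expansion over density-connected objects in DBSCAN can be sketched as follows. This is a simplified single-machine illustration, not the MapReduce implementation discussed in the cited works. The parameter names `eps` and `min_pts`, the `NOISE` label and the sample points are assumptions introduced for the example.

```python
from math import dist

NOISE = -1  # label used here for objects that belong to no cluster


def region_query(points, index, eps):
    """Indices of all points within distance eps of points[index] (itself included)."""
    return [j for j, q in enumerate(points) if dist(points[index], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # not a core object; may become a border object later
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border object: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core object: keep expanding
    return labels


# Hypothetical sample data: two dense squares and one isolated point.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (50, 50),
]
labels = dbscan(points, eps=1.5, min_pts=3)
```

Each call to `region_query` scans the whole dataset, which mirrors the overhead mentioned above: every object has to be examined to decide whether it is a core object.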

STING (Statistical Information Grid-based method) (Wang et al., 1997) is one of the representative grid-based clustering algorithms, and it clusters spatial data. Similar to the clustering properties of index structures, the spatial area is divided into rectangular grid cells at different levels, and each cell at one level is partitioned into k cells at the next level, which forms a hierarchical structure that processes the statistical information stored in the grid cells. To achieve better performance in a distributed environment, the STING algorithm has been implemented using MapReduce and Hadoop (Li et al., 2017). However, grid-based clustering algorithms are highly sensitive to the granularity of the grid, which can decrease the quality and accuracy of the clustering. Due to the advantages of the grid paradigm, the popular single-linkage hierarchical clustering algorithm (SLINK) has been combined with the grid to produce GridSLINK (Goyal et al., 2016), which aims to reduce the number of distance calculations required by SLINK.
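The idea of storing statistics in hierarchical grid cells can be illustrated with a small two-level sketch. It is not the STING algorithm itself: the statistics kept per cell (count and coordinate sums), the 2 x 2 grouping of child cells (that is, k = 4), the density threshold and all names are assumptions made for the example.

```python
from collections import defaultdict


def build_grid_level(points, cell_size):
    """Bottom level: map each cell index (ix, iy) to its count and coordinate sums."""
    cells = defaultdict(lambda: {"count": 0, "sum_x": 0.0, "sum_y": 0.0})
    for x, y in points:
        key = (int(x // cell_size), int(y // cell_size))
        cell = cells[key]
        cell["count"] += 1
        cell["sum_x"] += x
        cell["sum_y"] += y
    return dict(cells)


def build_parent_level(child_cells):
    """Upper level: each parent cell aggregates the statistics of its 2 x 2 children."""
    parents = defaultdict(lambda: {"count": 0, "sum_x": 0.0, "sum_y": 0.0})
    for (ix, iy), stats in child_cells.items():
        parent = parents[(ix // 2, iy // 2)]
        for name, value in stats.items():
            parent[name] += value
    return dict(parents)


def dense_cells(cells, min_count):
    """Cells holding at least min_count points, the candidates for cluster regions."""
    return sorted(key for key, stats in cells.items() if stats["count"] >= min_count)


# Hypothetical sample data in a two-dimensional space.
points = [(0.2, 0.3), (0.8, 0.1), (0.5, 0.9), (1.5, 0.5), (7.1, 7.2), (7.9, 7.5)]
bottom = build_grid_level(points, cell_size=1.0)
parent = build_parent_level(bottom)
```

Once the statistics are built, a query such as `dense_cells(bottom, 2)` only touches cells rather than individual points, which is where the time savings of grid-based methods come from. The choice of `cell_size` illustrates the sensitivity to grid granularity noted above.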

Unlike the traditional methods that consider the

similarity values from instances to k centers, spectral

algorithm (Rafailidis and Manolopoulos, 2017) is

able to detect complex nonlinear structures, and select

clusters based on pairwise similarities of data

instances. However, it requires considerable memory and computational time when the number of data instances is large.

Another clustering algorithm is CURE (Clustering Using REpresentatives) (Pandove and Goel, 2015), which is a hierarchical clustering algorithm. The general idea of this algorithm is to represent each cluster by a fixed number of well-scattered data points in order to capture general shapes. Moreover, it aims at reducing the dimension of the input data matrix by transforming it into a lower dimensional space. As a result, the effect of outliers and noise in this reduced space is reduced. To further improve its efficiency for Big Data, the algorithm is implemented in a distributed environment using MapReduce and Hadoop.
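How a cluster can be summarized by a fixed number of well-scattered points may be easier to see in code. The sketch below selects scattered points with a farthest-point heuristic and then moves them toward the cluster centroid by a shrink factor. The shrinking step comes from the original CURE formulation rather than from the description above, and the function names, the parameter names `num_reps` and `shrink`, and the sample cluster are assumptions for illustration only.

```python
from math import dist


def centroid(cluster):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def representatives(cluster, num_reps, shrink):
    """Pick well-scattered points, then shrink them toward the centroid."""
    center = centroid(cluster)
    # Farthest-point heuristic: start with the point farthest from the centroid,
    # then repeatedly add the point farthest from the ones already chosen.
    chosen = [max(cluster, key=lambda p: dist(p, center))]
    while len(chosen) < min(num_reps, len(cluster)):
        chosen.append(
            max(cluster, key=lambda p: min(dist(p, c) for c in chosen))
        )
    # Shrinking dampens the influence of outliers on the cluster's shape.
    return [
        tuple(x + shrink * (m - x) for x, m in zip(p, center)) for p in chosen
    ]


# Hypothetical cluster: four corners of a square plus its middle point.
cluster = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0), (2.0, 2.0)]
reps = representatives(cluster, num_reps=2, shrink=0.5)
```

For the sample cluster the two representatives are the opposite corners (0, 0) and (4, 4), moved halfway toward the centroid (2, 2). Distances between clusters can then be computed on these few representatives instead of on all objects.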

Despite the advantages of the clustering algorithms mentioned above, extracting useful information from massive amounts of data and processing them in less time has become a crucial challenge for the existing clustering algorithms, which use non-iterative parallel programming models that require re-clustering each time a new object is generated. We also need to face the challenges associated with the map-and-reduce programming paradigm, notably MapReduce, which is the programming model most widely adopted by clustering algorithms (Table 1) and generally used in Big Data architectures.
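To make the map-and-reduce paradigm concrete, one k-means iteration can be expressed as a map phase, a shuffle and a reduce phase. The sketch below simulates this in plain Python on in-memory partitions. It does not use Hadoop or any real MapReduce API, and the function names, partition layout and sample data are assumptions for illustration.

```python
from collections import defaultdict
from math import dist


def map_phase(points, centers):
    """Mapper: emit (nearest center index, point) for every object."""
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        yield nearest, p


def shuffle(pairs):
    """Group the emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, centers):
    """Reducer: the new center of each key is the mean of its grouped points."""
    new_centers = list(centers)
    for key, members in groups.items():
        new_centers[key] = tuple(sum(c) / len(members) for c in zip(*members))
    return new_centers


def k_means_iteration(partitions, centers):
    """One job: map every partition independently, then shuffle and reduce."""
    pairs = [pair for part in partitions for pair in map_phase(part, centers)]
    return reduce_phase(shuffle(pairs), centers)


# Hypothetical data split into two partitions, as if stored on two nodes.
partitions = [[(0.0, 0.0), (6.0, 6.0)], [(2.0, 0.0), (8.0, 6.0)]]
new_centers = k_means_iteration(partitions, centers=[(1.0, 1.0), (5.0, 5.0)])
```

Each iteration corresponds to a separate job that reads all partitions again, and adding a new object means rerunning the jobs. This is the kind of re-clustering cost the paragraph above refers to.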

Clustering is one of the important Big Data analytics methods used to extract useful information from data (Sreenivasulu et al., 2017). Moreover, the extraction of information from data is the most crucial process in Big Data. The main goal of Big Data is to produce efficient knowledge for real-world applications, which use the extracted information to learn more from previous experiences. As a result, smart technologies, like IoT devices, have the opportunity to know or predict the real needs of users. However, we need to better understand the integration of Big Data with other promising technologies, such as IoT applications, smart cities, and so on.

Table 1: Clustering algorithms based on MapReduce.

Paper                                  Clustering algorithm   Category
(Nguyen et al., 2013)                  K-means                Partitional algorithm
(Lin et al., 2011)                     Cop-k-means            Partitional algorithm
(Al-Madi et al., 2014)                 PSO                    Partitional algorithm
(Srirama et al., 2012)                 PAM                    Centroid-based clustering
(Ng and Han, 2002)                     CLARA                  Centroid-based clustering
(Amini et al., 2014)                   DBSCAN                 Density-based algorithm
(Guo et al., 2015)                     SCAN                   Density-based algorithm
(Goyal et al., 2016)                   GridSLINK              Single-linkage hierarchical clustering
(Rafailidis and Manolopoulos, 2017)    Spectral               Single-linkage hierarchical clustering
(Liu et al., 2014)                     Co-clustering          Co-clustering
(Pandove and Goel, 2015)               CURE                   Hierarchical algorithm
(Wang et al., 1997)                    STING                  Grid-based algorithm

3 RESEARCH CHALLENGES AND OPPORTUNITIES IN BIG DATA AND IoT

Based on the review of Big Data technologies and related clustering algorithms, we found a set of research challenges that are worth investigating further when linking Big Data clustering and IoT. It is hence valuable to develop a research agenda that contains possible research topics and questions to be addressed in the future. In order to connect IoT and clustering algorithms in Big Data, we have initiated the investigation from the direction of IoT data features. IoT data have specific features indicating that certain clustering algorithms can be expected to yield higher analytical effectiveness and efficiency. The key common characteristics of IoT data are as follows.

• Homogeneity in heterogeneity: Although there is an enormously large number of data sources in a typical IoT network (sensor devices) with different types of IoT devices, there are many instances of each type (for instance many unified temperature sensors, smart utility meters).
• Size of the data records: The data records produced by IoT devices are typically very small and well structured.
• Time series format: The data records are typically produced in fixed time intervals and delivered in time series, which is a characteristic that can ease the analysis.
• Low data quality given by device reliability: There is often low pressure on the reliability of the devices, because the number of them is very high (it is beneficial to have many cheap sensors instead of a few expensive sensors), but high pressure on elaborate techniques to combine the data from different devices that in a way substitute each other (e.g. missing information is deduced from neighbouring devices).
• High security risks: Each device might become a security vulnerability, hence all analytical methods should be robust against data injected by attackers (suspicious data should be continuously evaluated and not be considered in the analysis).
• Dynamism: It is a common characteristic of IoT networks that IoT devices dynamically join and leave the network.

The critical research question in this respect is how to

optimize data clustering and analysis to take

advantage of the characteristics of IoT data.

Furthermore, IoT technologies have been incorporated into various important domains in our life. IoT domains thus refer to the IoT techniques that are applied in a certain context, such as healthcare IoT or transportation IoT. Different IoT domains share a set of common features but at the same time possess domain-specific variations. For example, most IoT domains emphasize data collection, monitoring, sharing, automation, control and collaboration. Also, their datasets usually consist of relatively homogeneous data records, e.g. from sensors and other IoT devices, which are often in a time series. However, healthcare or military IoT may require more precision or security, whereas transportation or smart city IoT may have a relatively loose quality standard for the data.

3.1 Clustering Algorithms in Big IoT Data

Most IoT applications are based on the vision of connecting different objects to each other and analyzing the generated IoT data in real time. In this dynamic IoT context, clustering algorithms are used to ensure the reliability of distributed IoT applications. For example, in (Fredj et al., 2013), a hierarchical agglomerative clustering technique has been described to provision a scalable search for constructing routing tables that perform request matching and forwarding.

The data from IoT devices are possibly generated from different sources such as sensors or mobile devices (Hossain et al., 2017). These data, termed Big IoT Data in this paper, are large in volume, fast-moving and usually unstructured, e.g. image data or stream data. Big Data analytics mainly aims to first classify the Big Data, then mine the patterns and finally produce predictions. For example, in a city, there can be real-time traffic image data from various IoT devices such as road surveillance cameras, satellite photos and traffic sensors. In order to analyze the real-time density of cars, the traffic image data from different sources need to be clustered first and then processed for further analysis. However, for data clustering, the road surveillance image data and the stream traffic sensor data are considered as one IoT data input but with different data structures. The combination of clustering with other methods, such as fuzzy approaches, might further improve the functionality of clustering methods (D’Urso et al., 2017). However, the treatment of multi-structured data is still a major gap in clustering algorithms, which are based on an unsupervised process of classifying data into clusters.

Due to the large volume, complexity, variety, and rapid generation of IoT data, which create important opportunities in our daily life, tremendous new challenges have been raised for researchers to design new scalable and efficient clustering methods that are able to supervise or semi-supervise input data, classify multi-structured data, detect noisy data, and extract valuable information from different IoT data sources. We therefore tackle this critical problem by first reviewing the clustering algorithms in Big Data. Given the nature of Big IoT Data, we propose the following research questions.

• How to effectively select the clustering algorithms for Big IoT Data?
• How to dynamically select the most suitable algorithm to cluster the Big IoT Data in a timely manner?
• How to cluster the different types of Big IoT Data that represent the same or similar entity/event?
• How to choose the proper Big Data technologies such as MapReduce frameworks to perform the clustering algorithm for Big IoT Data?

3.2 Dynamics of IoT Systems

IoT systems are highly dynamic, especially mobile IoT systems, where the mobile devices can be considered highly dynamic IoT end points. For example, IoT devices may spontaneously join or leave the networks due to mobility. In fact, mobile devices are sensor devices that are able to sense, process and generate large amounts of real-world data. The collected contextual information is then analyzed, interpreted and utilized to make decisions in different areas, e.g. business intelligence [46]. Nowadays, most people are surrounded by multiple mobile devices that can provide several services to the end user at any time and place. Due to the increasing impact of mobile devices on people’s habits, mobile devices have become a major part of the IoT paradigm (Ahmed and Rehmani, 2017).

Furthermore, the characteristics of mobile devices, notably mobility, further enhance the integration of mobile devices in IoT by offering a wide range of promising innovations, which will make the data dramatically more valuable for several domains in the forthcoming years, such as education, healthcare, smart homes, smart cities, and so on. However, mobile Big Data require dynamic analysis techniques to ensure the usability of the data. Clustering algorithms could be a key to solving this challenge, as in the application of K-Medoids clustering in (Dash et al., 2012), where the algorithm has been used as a way of facilitating privacy for organized data in mobile cloud computing.

Our review found that there is an essential need

for new techniques related to different aspects of

testing and quality evaluation of mobile Big Data.

Accordingly, the success of mobile Big Data leads to

these relevant research questions.

• How to use context awareness, location awareness and user experience to test and evaluate the compatibility and adaptability of data?
• How could mobile structured and unstructured data be used effectively?
• What are the issues facing organizations trying to benefit from mobile Big Data?
• What are the limitations of the existing analytical methods, especially clustering models, in processing mobile Big Data? And what are the new methods in response to these issues?
• What are the best practices and strategies that organizations need to adopt for Big Data projects?

3.3 The Role of Networking in IoT

The increasing number of IoT devices, including mobile devices, has increased the amount of data exponentially. As a result, the 5th generation mobile network (5G), which is expected to be operational by 2020, has become a key success factor to support various types of emerging IoT applications with strengthened quality of service (Jiang and Liu, 2017). Clustering techniques have been used to achieve more efficient communication between IoT sensors in 5G (Xu et al., 2017). However, 5G must be smarter and more flexible to guarantee the quality of its services for the end users as well as for smart environments. In other words, 5G has to learn from known and unknown data in order to achieve its goal, which is to share data everywhere, at any time, by everyone and everything for the benefit of several domains such as healthcare, business, and so on. The advancement of 5G networks thus leads to these relevant research questions.

• How could 5G and other modern networks handle real-time and online mobile data processing?
• What are the most important factors that need to be taken into consideration when designing and evaluating solutions for Big Data and IoT technologies in 5G and other modern mobile communication systems?
• What is the impact of machine learning in 5G mobile information systems? And how could data mining support the advancement of 5G mobile information systems?

3.4 Machine Learning Applications to Support IoT

Since machine learning techniques, such as supervised and unsupervised learning, tend to classify and cluster the data, different machine learning models may bring insightful results for IoT data analytics. In general, machine learning is an interdisciplinary approach to building mathematical methodologies that are ideally suited to the task of extracting knowledge from Big Data and making data-driven predictions and decisions. It can be defined as “a field of study that gives computers the ability to learn without being explicitly programmed” (El Naqa and Murphy 2015). In other words, the goal of machine learning is to develop algorithms that can be self-programmed to solve new problems by using previously known data rather than directly programming new solutions. In the IoT context, machine learning techniques have been linked to clustering algorithms for deep learning because clustering is an unsupervised process of classifying data into groups. For example, in (Akbar et al., 2015), machine learning methods have been used to automatically analyze IoT data based on real-time rules. Meanwhile, the integration between Big Data and mobile Internet can produce positive impacts in the machine learning field by providing more real-world examples to forecast future activities and investments.

We found that it is valuable to develop new technologies, methods and algorithms for this explosive increase of mobile IoT data, which need more real-time analysis that challenges the existing traditional analytic tools as well as the existing machine learning algorithms. Therefore, we identified the following research questions associated with machine learning applications with the support of mobility patterns.

• How could machine learning applications improve the existing clustering models for self-mobile-IoT applications?
• How could machine learning optimize mobile Big Data as a service and analytics as a service for the IoT?
• What are the best strategies to process the mobile Big Data and extract the most useful information?

4 CONCLUSIONS

In this paper, we have conducted a survey on Big Data

technologies and clustering algorithms, in which we

have specified the pros and cons of each algorithm in

the Big Data context. We have further related our

review to the research of IoT and discussed the

relations between Big Data, clustering algorithms and

IoT. Based on the review, we have proposed a set of

research challenges that address the emerging

research topics and research questions on the data

clustering in IoT, dynamics of Big Data application in

IoT, the role of networking in IoT, as well as the

machine learning applications.

These research challenges can be considered as a

research agenda to guide future research across the

Big Data and IoT communities. Specifically, this

paper has emphasized the importance of clustering

algorithms in Big IoT Data and has drawn attention to

possible applications of Big Data clustering

algorithms for IoT. As future work, we plan to further

investigate the relations between the communities of

Big Data and IoT. We will analyze which Big Data

technologies can be effectively used in which IoT

context. Also, we plan to investigate data features

from IoT and connect them with features in Big Data,

which can facilitate Big Data applications in IoT.

ACKNOWLEDGEMENTS

This work was supported by the European Regional

Development Fund Project "CERIT Scientific Cloud"

(No. CZ.02.1.01/0.0/0.0/16_013/0001802).

REFERENCES

Al-Madi, Nailah, Ibrahim Aljarah, and Simone A. Ludwig.

"Parallel glowworm swarm optimization clustering

algorithm based on MapReduce." IEEE Symposium on

Swarm Intelligence 2014.

Amini, Amineh, Teh Ying Wah, and Hadi Saboohi. "On

density-based data streams clustering algorithms: a

survey." Journal of Computer Science and Technology

29.1 (2014): 116-141.

Akbar, A., F. Carrez, K. Moessner, J. Sancho, and J. Rico.

"Context-aware stream processing for distributed IoT

applications." 2015 IEEE 2nd World Forum on Internet

of Things (WF-IoT), pp. 663-668. IEEE, 2015.

Ahmed, Ejaz, and Mubashir Husain Rehmani. "Mobile

edge computing: opportunities, solutions, and

challenges." (2017): 59-63.

Berkhin, Pavel. "A survey of clustering data mining

techniques." Grouping multidimensional data. Springer

Berlin Heidelberg, 2006. 25-71.

Chen, Yong, Hong Chen, Anjee Gorkhali, Yang Lu, Yiqian

Ma, and Ling Li. "Big Data analytics and Big Data

science: a survey." Journal of Management Analytics 3,

no. 1 (2016): 1-42.

Da Xu, Li, Wu He, and Shancang Li. "Internet of things in

industries: A survey." IEEE Transactions on industrial

informatics 10.4 (2014): 2233-2243.

D’Urso, Pierpaolo, Riccardo Massari, Livia De Giovanni,

and Carmela Cappelli. "Exponential distance-based

fuzzy clustering for interval-valued data." Fuzzy

Optimization and Decision Making 16(1), 2017, 51-70.

Dash, Sanjit Kumar, Debi Pr Mishra, Ranjita Mishra, and

Sweta Dash. "Privacy preserving K-Medoids

clustering: an approach towards securing data in Mobile

cloud architecture." 2nd International Conference on

Computational Science, Engineering and Information

Technology, pp. 439-443. ACM, 2012.

Dey, Anind K. "Understanding and using context."

Personal and ubiquitous computing 5.1 (2001): 4-7.

El Naqa, Issam, and Martin J. Murphy. "What Is Machine

Learning?." Machine Learning in Radiation Oncology.

Springer International Publishing, 2015. 3-11.

Fahad, Adil, Najlaa Alshatri, Zahir Tari, Abdullah Alamri,

Ibrahim Khalil, Albert Y. Zomaya, Sebti Foufou, and

Abdelaziz Bouras. "A survey of clustering algorithms

for Big Data: Taxonomy and empirical analysis." IEEE

transactions on emerging topics in computing 2(3)

(2014): 267-279.

Fredj, Sameh Ben, Mathieu Boussard, Daniel Kofman, and

Ludovic Noirie. "A scalable IoT service search based

on clustering and aggregation." In Green Computing

and Communications (GreenCom), 2013 IEEE and

Internet of Things pp. 403-410, 2013.

Guo, Kun, Wenzhong Guo, Yuzhong Chen, Qirong Qiu,

and Qishan Zhang. "Community discovery by

propagating local and global information based on the

MapReduce model." Information Sciences 323 (2015):

73-93.

Goyal, Poonam, Sonal Kumari, Sumit Sharma, Dhruv

Kumar, Vivek Kishore, Sundar Balasubramaniam, and

Navneet Goyal. "A Fast, Scalable SLINK Algorithm for

Commodity Cluster Computing Exploiting Spatial

Locality." In High Performance Computing and

Communications; IEEE 14th International Conference

on Smart City, 2016.

Havens, Timothy C., James C. Bezdek, and Marimuthu

Palaniswami. "Scalable single linkage hierarchical

clustering for Big Data." Intelligent Sensors, Sensor

Networks and Information Processing, 2013 IEEE

Eighth International Conference on. IEEE, 2013.

Hossain, M. Shamim, Changsheng Xu, Ying Li, Al-Sakib

Khan Pathan, Josu Bilbao, Wenjun Zeng, and

Abdulmotaleb El Saddik. "Impact of Next-Generation

Mobile Technologies on IoT-Cloud Convergence."

IEEE Communications Magazine 55(1), 2017, 18-19.

Jiang, Dajie, and Guangyi Liu. "An Overview of 5G

Requirements." 5G Mobile Communications. Springer

International Publishing, 2017. 3-26.

Kitchin, Rob. "Big Data—Hype or revolution." The SAGE

handbook of social media research methods 2017.

Liu, Yiyi, Quanquan Gu, Jack P. Hou, Jiawei Han, and Jian

Ma. "A network-assisted co-clustering algorithm to

discover cancer subtypes based on gene expression."

BMC bioinformatics 15, no. 1 (2014): 37.

Lin, Chao, Yan Yang, and Tonny Rutayisire. "A parallel

Cop-Kmeans clustering algorithm based on

MapReduce framework." Knowledge Engineering and

Management. 2011. 93-102.

Li, Yan, Hong Liu, Guang-peng Liu, Liang Li, Philip

Moore, and Bin Hu. "A grouping method based on grid

density and relationship for crowd evacuation

simulation." Physica A: Statistical Mechanics and its

Applications (2017).

Manogaran, Gunasekaran, Chandu Thota, Daphne Lopez,

V. Vijayakumar, Kaja M. Abbas, and Revathi

Sundarsekar. "Big Data Knowledge System in

Healthcare." In Internet of Things and Big Data

Technologies for Next Generation Healthcare, pp. 133-

157. Springer International Publishing, 2017.

Mavromoustakis, Constandinos X., George Mastorakis,

and Jordi Mongay Batalla. "Internet of Things (IoT) in

5G Mobile Technologies." Modeling and Optimization

in Science and Technologies, 2016.

Mohebi, Amin, Saeed Aghabozorgi, Teh Ying Wah, Tutut

Herawan, and Ramin Yahyapour. "Iterative Big Data

clustering algorithms: a review." Software: Practice and

Experience 46, no. 1 (2016): 107-129.

Nguyen, Cuong Duc, Dung Tien Nguyen, and Van-Hau

Pham. "Parallel two-phase K-means." International

Conference on Computational Science and Its

Applications. Springer Berlin Heidelberg, 2013.

Ng, Raymond T., and Jiawei Han. "CLARANS: A method

for clustering objects for spatial data mining." IEEE

transactions on knowledge and data engineering 14.5

(2002): 1003-1016.

Pandove, Divya, and Shivani Goel. "A comprehensive

study on clustering approaches for Big Data mining."

Electronics and Communication Systems (ICECS),

2015 2nd International Conference on. IEEE, 2015.

Rafailidis, D., E. Constantinou, and Y. Manolopoulos.

"Landmark selection for spectral clustering based on

Weighted PageRank." Future Generation Computer

Systems 68 (2017): 465-472.

Shirkhorshidi, Ali Seyed, Saeed Aghabozorgi, Teh Ying

Wah, and Tutut Herawan. "Big Data clustering: a

review." In International Conference on Computational

Science and Its Applications, pp. 707-720, 2014.

Srirama, Satish Narayana, Pelle Jakovits, and Eero

Vainikko. "Adapting scientific computing problems to

clouds using MapReduce." Future Generation

Computer Systems 28.1 (2012): 184-192.

Sreenivasulu, G., S. Viswanadha Raju, and N. Sambasiva

Rao. "Review of Clustering Techniques." International

Conference on Data Engineering and Communication

Technology. Springer Singapore, 2017.

Van Kranenburg, Rob. The Internet of Things: A critique

of ambient technology and the all-seeing network of

RFID. Institute of Network Cultures, 2008.

Wang, Wei, Jiong Yang, and Richard Muntz. "STING: A

statistical information grid approach to spatial data

mining." VLDB. Vol. 97. 1997.

Xu, Lina, Rem Collier, and Gregory MP O’Hare. "A survey

of clustering techniques in WSNs and consideration of

the challenges of applying such to 5G IoT scenarios."

IEEE Internet of Things Journal 4, no. 5 (2017): 1229-

1249.

IoTBDS 2018 - 3rd International Conference on Internet of Things, Big Data and Security
