Big Data Streaming Platforms to Support Real-time Analytics

Eliana Fernandes, Ana Carolina Salgado and Jorge Bernardino

Polytechnic of Coimbra – ISEC, Rua Pedro Nunes, Quinta da Nora, 3030-199 Coimbra, Portugal

Centre for Informatics, Universidade Federal de Pernambuco, Recife, Brazil

Centre for Informatics and Systems of the University of Coimbra (CISUC), Portugal

Keywords: Streaming, Real-time Analytics, Big Data, Fault-Tolerance.

Abstract: In recent years data has grown exponentially due to the evolution of technology. The data flow circulates in

a very fast and continuous way, so it must be processed in real time. Therefore, several big data streaming

platforms have emerged for processing large amounts of data. Nowadays, companies have difficulties in

choosing the platform that best suits their needs. In addition, the information about the platforms is scattered

and sometimes omitted, making it difficult for the company to choose the right platform. This work focuses

on helping companies or organizations to choose a big data streaming platform to analyze and process their

data flow. We provide a description of the most popular platforms, such as: Apache Flink, Apache Kafka,

Apache Samza, Apache Spark and Apache Storm. To deepen the understanding of these platforms, we
also discuss their architectures, advantages and limitations. Finally, we compare the big data
streaming platforms, using as attributes the characteristics that companies most often need.

1 INTRODUCTION

The explosive growth of the Internet has caused large amounts of data to be generated. Companies try to keep pace with this growth, which requires processing data efficiently and at the speed at which it arrives (Safaei, 2017).

Big data is a generic term for organizing, processing, and aggregating large amounts of data. Data that changes quickly and continuously is called streaming data (Behera et al., 2018). It needs to be analyzed in a short period of time. Traditional Business Intelligence tools aren’t suitable for analyzing streaming data in real time, because they process data in batches (Behera et al., 2018). A large number of big data streaming platforms have been developed (Imanuel, 2019).

Choosing a big data streaming platform is a major challenge for most companies, because their requirements sometimes differ from the features that these platforms offer. The objective of this work is to assist in choosing a big data streaming platform, taking into account the characteristics that matter to companies. To that end, we describe and compare the most popular open-source big data streaming platforms: Flink, Kafka, Samza, Spark and Storm (Imanuel, 2019).

https://orcid.org/0000-0003-4036-8064
https://orcid.org/0000-0001-9660-2011

The rest of this paper is structured as follows.

Section 2 provides an overview of the big data

streaming platforms, their architecture, advantages

and limitations. Section 3 presents a comparative

study of these platforms. The conclusions and future

work are presented in Section 4.

2 STREAMING PLATFORMS

Processing data means manipulating and aggregating it in order to transform it into useful information.

Big data stream processing is always up-to-date: as soon as data becomes available, it is processed and transformed into information.

To ensure continuous and stable operation of the entire system, the platform needs a suitable architecture design. Big data streaming platforms can follow one of two architectures: symmetrical or master-slave.

In the symmetrical architecture, all nodes perform the same functions, which gives good scalability.


Fernandes, E., Salgado, A. and Bernardino, J.

Big Data Streaming Platforms to Support Real-time Analytics.

DOI: 10.5220/0009817304260433

In Proceedings of the 15th International Conference on Software Technologies (ICSOFT 2020), pages 426-433

ISBN: 978-989-758-443-5

However, as there is no central node, the system itself must handle resource scheduling, fault tolerance and data balancing (Sun et al., 2019).

The master-slave architecture has one master node and several slave nodes. The master node manages system resources, coordinates tasks, handles system fault tolerance and balances data. The slave nodes receive tasks from the master node. Throughout the process, there is no data exchange between slave nodes, and system-wide operations depend completely on the control of the master node (Sun et al., 2019).

To solve the problem of large-scale real-time processing, big data streaming platforms such as Flink, Kafka, Samza, Spark Streaming and Storm have emerged. These platforms adopt the master-slave architecture.

There are many big data platforms. However, many of them, such as Hadoop, are only used for batch processing. The platforms chosen here can handle data in real time and perform stream processing. Another important criterion was availability, which is why we analyzed only open-source platforms. In addition, these platforms have a huge community of developers and users (Neves and Bernardino, 2015).

Finally, the selection was also based on the popularity of each platform, its wealth of resources and its usefulness. We took into account characteristics such as ease of use and number of features, among others (Imanuel, 2019).

2.1 Apache Flink (flink.apache.org)

Flink is an open source platform for distributed stream and batch data processing (Stratosphere and Markl, 2018). It provides data distribution and fault-tolerance for computations over data streams (Stratosphere and Markl, 2018). It processes user-defined function code through the system stack and is able to compute common operations (Nasiri, Nahesi and Goudarzi, 2019).

2.1.1 Flink Architecture

The platform offers software developers various application programming interfaces (APIs) for creating new applications to be executed on the Flink engine. Examples of these APIs are shown in Figure 1 (Stratosphere and Markl, 2018).

Figure 1: APIs and Libraries of Flink (Shahverdi, 2018).

The main Flink APIs are the batch DataSet API and the streaming DataStream API. Here we focus on the DataStream API, because it makes it possible to handle large amounts of data in real time. This API performs filtering, updates, window definitions, joins, etc. It can receive any kind of data from message queues, sockets and file systems (Shahverdi, 2018).

Flink's stream processing model handles incoming data item by item, as a true stream. Flink provides its DataStream API to work with unbounded streams of data. The basic components that Flink works with are (Gurusamy, Kannan and Nandhini, 2017):

▪ Streams are immutable, unbounded datasets that flow through the system;
▪ Operators are functions that operate on data streams to produce other streams;
▪ Sources are the entry points for streams entering the system;
▪ Sinks are the places where streams flow out of the Flink system. They might represent a database or a connector to another system.
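The four components listed above can be illustrated with a minimal sketch in plain Python. It does not use the Flink DataStream API: the function names (`source`, `map_operator`, `filter_operator`, `sink`) are hypothetical, and generators merely stand in for streams that are processed item by item.

```python
from typing import Callable, Iterable, Iterator, List


def source(records: Iterable[int]) -> Iterator[int]:
    """Source: entry point that feeds records into the pipeline one by one."""
    for record in records:
        yield record


def map_operator(stream: Iterator[int], fn: Callable[[int], int]) -> Iterator[int]:
    """Operator: consumes one stream and produces another stream."""
    for item in stream:
        yield fn(item)


def filter_operator(stream: Iterator[int], predicate: Callable[[int], bool]) -> Iterator[int]:
    """Operator: forwards only the items that satisfy the predicate."""
    for item in stream:
        if predicate(item):
            yield item


def sink(stream: Iterator[int], store: List[int]) -> None:
    """Sink: the place where the stream leaves the pipeline (here, a list)."""
    for item in stream:
        store.append(item)


# Wire the components together: source -> map -> filter -> sink.
collected: List[int] = []
readings = source([3, 8, 15, 4, 23])
doubled = map_operator(readings, lambda x: x * 2)
large = filter_operator(doubled, lambda x: x > 10)
sink(large, collected)
print(collected)
```

Because generators are lazy, each record travels through the whole chain before the next one is read, which mirrors the item-by-item model described above.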

The APIs present a logical representation and are

converted to a directed acyclic task graph that is sent

to the cluster for execution. A Flink cluster, shown in

Figure 2, comprises three types of processes: the

client, the job manager, and at least one task manager.

The client takes the program code, transforms it
to a dataflow graph, and submits it to the job manager.

This transformation phase also examines the data

types of the data exchanged between operators and

creates serializers and other type/schema specific

code (Katsifodimos and Schelter, 2016).

The job manager coordinates the distributed execution of the dataflow. It tracks the status and progress of each operator and stream, schedules new operators, and coordinates checkpoints and recovery.

Actual data processing takes place in the task managers. A task manager runs one or more operators that produce streams and reports their status to the job manager. Task managers maintain buffer pools for buffering or materializing streams, and network connections for exchanging data streams between operators (Katsifodimos and Schelter, 2016).


Figure 2: Flink Processing Flow (Nasiri, Nahesi and

Goudarzi, 2019).

2.1.2 Flink Advantages and Limitations

Flink has some advantages, such as (Levy, 2019):

▪ It doesn’t require manual optimization and adjustment to the data it processes;
▪ It dynamically analyzes and optimizes tasks.

Flink also has some shortcomings (Sun et al., 2019):

▪ When external events change heavily, it cannot determine how many resources are needed, which creates workload issues;
▪ Some scaling limitations.

2.2 Apache Kafka (kafka.apache.org)

Kafka is an open source distributed streaming platform that is highly available, fault-tolerant and scalable. It can be used to store and process data streams, and is intended to provide a unified, high-throughput, low-latency platform for handling real-time data feeds (Freiknecht et al., 2018). Kafka was introduced by LinkedIn in 2011 and is written in Scala and Java (Shaheen, 2017). Kafka is a publish-subscribe messaging system. A messaging system is responsible for transferring data from one application to another, so that the applications can focus on the data. Distributed messaging is based on the concept of reliable message queuing. There are two types of messaging patterns available (Team, 2019):

▪ Point to Point Messaging System – messages remain in a queue. More than one consumer can read from the queue, but each message is consumed by only one of them;
▪ Publish-Subscribe Messaging System – messages remain in a topic. Consumers can subscribe to one or more topics and consume every message in those topics.
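The difference between the two messaging patterns can be illustrated with a small in-memory sketch. The classes `PointToPointQueue` and `PubSubTopic` and their methods are hypothetical and are not part of the Kafka API; they only model who receives which message.

```python
from collections import deque
from typing import Any, Deque, Dict, List, Optional


class PointToPointQueue:
    """Each message is handed to exactly one consumer, then removed."""

    def __init__(self) -> None:
        self._messages: Deque[Any] = deque()

    def send(self, message: Any) -> None:
        self._messages.append(message)

    def receive(self) -> Optional[Any]:
        # Whichever consumer calls first gets the message; others never see it.
        return self._messages.popleft() if self._messages else None


class PubSubTopic:
    """Messages stay in the topic; every subscriber reads all of them."""

    def __init__(self) -> None:
        self._log: List[Any] = []
        self._positions: Dict[str, int] = {}

    def subscribe(self, subscriber: str) -> None:
        self._positions[subscriber] = 0

    def publish(self, message: Any) -> None:
        self._log.append(message)

    def poll(self, subscriber: str) -> List[Any]:
        # Return everything this subscriber has not read yet.
        start = self._positions[subscriber]
        self._positions[subscriber] = len(self._log)
        return self._log[start:]


queue = PointToPointQueue()
queue.send("a")
queue.send("b")
first = queue.receive()   # taken by one consumer
second = queue.receive()  # taken by another consumer

topic = PubSubTopic()
topic.subscribe("billing")
topic.subscribe("audit")
topic.publish("a")
topic.publish("b")
billing = topic.poll("billing")
audit = topic.poll("audit")
```

In the queue, the two messages are split between the consumers, whereas both subscribers of the topic receive the complete sequence.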

2.2.1 Kafka Architecture

Kafka is deployed as a cluster on multiple servers and handles its entire publish and subscribe messaging system with the help of four APIs: producer, consumer, streams and connector.

▪ Producer API: Allows clients to connect to Kafka servers and publish a stream of records to one or more Kafka topics.
▪ Consumer API: Allows clients to connect to Kafka servers running in the cluster and consume streams of records from one or more Kafka topics.
▪ Streams API: Allows clients to act as stream processors by consuming streams from one or more topics and producing streams to other output topics, transforming input streams into output streams.
▪ Connector API: Allows writing reusable producer and consumer code. We can create reusable source and sink connector components for various data sources.

Figure 3 gives a short illustration of the Kafka ecosystem. It shows how producers send messages to the cluster and how consumers retrieve these messages from the brokers. It also shows Zookeeper, which is used to manage and coordinate the Kafka cluster. Zookeeper notifies producers and consumers of the presence of any new broker in the system or of a broker failure (Shaheen, 2017).

A Kafka cluster is made up of connectors that record changes to records in a relational database, data producers, data consumers and data processors (TutorialKart, 2019). The main components of its architecture are topics, records and brokers.

Figure 3: Kafka Ecosystem (JavaTpoint, 2020).

Topics consist of a stream of records containing different information (Shaheen, 2017). The messages of each topic are divided into partitions. Within a partition, messages are indexed and stored together with a date/time stamp. Consumers can read messages from these partitions.
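As an illustration of how a topic stores indexed, timestamped messages in partitions, the following sketch models a topic as a list of append-only logs. The `Topic` and `Record` classes are hypothetical and heavily simplified; they do not reflect Kafka's storage format or client API, and the index of a record within its partition is called an offset here.

```python
import time
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Record:
    offset: int       # index of the record within its partition
    timestamp: float  # date/time stamp stored with the record
    value: Any


class Topic:
    """A topic modelled as a fixed number of append-only partitions."""

    def __init__(self, name: str, num_partitions: int) -> None:
        self.name = name
        self.partitions: List[List[Record]] = [[] for _ in range(num_partitions)]

    def append(self, partition: int, value: Any, timestamp: Optional[float] = None) -> int:
        """Append a value to one partition and return its offset."""
        log = self.partitions[partition]
        stamp = time.time() if timestamp is None else timestamp
        record = Record(offset=len(log), timestamp=stamp, value=value)
        log.append(record)
        return record.offset

    def read(self, partition: int, from_offset: int = 0) -> List[Record]:
        """Return all records of a partition starting at the given offset."""
        return self.partitions[partition][from_offset:]


clicks = Topic("clicks", num_partitions=2)
clicks.append(0, "a", timestamp=1.0)
clicks.append(0, "b", timestamp=2.0)
clicks.append(1, "c", timestamp=3.0)
tail = clicks.read(0, from_offset=1)  # a consumer resuming after offset 0
```

Offsets are assigned per partition, so the first record of each partition has offset 0 and a consumer can resume reading from any position it remembers.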

2.2.2 Kafka Advantages and Limitations

Kafka has some advantages (Instaclustr, 2019):

▪ Load balancing and data replication;
▪ Can handle high-velocity data.

There are also some limitations (JavaTpoint, 2019):

▪ Kafka's performance is reduced when messages need to be tweaked in the broker before delivery;
▪ Compressing and decompressing the data flow in brokers and consumers affects performance and throughput.

2.3 Apache Samza (samza.apache.org)

Samza was developed by LinkedIn. It is an open source distributed stream processing platform, built on the Kafka message queue, for processing streaming data in real time (Sun et al., 2019). It combines Kafka and YARN to perform computations over data streams (Nasiri, Nahesi and Goudarzi, 2019). Samza is designed to take advantage of Kafka's unique architecture and guarantees, although Kafka can also be used by other stream processing systems. That is why Samza uses Kafka to provide fault-tolerance and state storage (Gurusamy, Kannan and Nandhini, 2017).

This platform supports high throughput for a wide variety of processing patterns while providing operational robustness at the massive scale required. To achieve this, it uses a small number of carefully designed abstractions: partitioned message logs, fault-tolerant local state, and cluster-based scheduling (Kleppmann, 2018). The core of Samza consists of several fairly low-level abstractions, on top of which high-level operators have been built.

2.3.1 Samza Architecture

The Samza architecture consists of the streaming data layer (Kafka), the execution layer (YARN), and the processing layer (Samza API) (Sun et al., 2019). It is used for consuming streams, processing messages, and producing derived output streams. A Samza job consists of a Kafka consumer, an event loop that calls the application code to process incoming messages, and a Kafka producer that sends outgoing messages back to Kafka. YARN is used to restart failed processes automatically and to provide metrics and monitoring. It also plays the role of resource manager and cluster manager. For processing messages, Samza provides a Java StreamTask interface that the application implements (Kleppmann and Kreps, 2015).
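The structure of a job described above (consumer, event loop calling application code, producer) can be sketched as follows. This is plain Python rather than the Java StreamTask interface: the class `UppercaseTask`, the function `run_job` and the use of in-memory lists in place of Kafka topics are all assumptions made for illustration.

```python
from collections import deque
from typing import Deque, List


class UppercaseTask:
    """Application code: hypothetical stand-in for a StreamTask implementation."""

    def process(self, message: str, collector: List[str]) -> None:
        # Emit zero or more output messages for each input message.
        collector.append(message.upper())


def run_job(task: UppercaseTask, input_stream: Deque[str], output_stream: List[str]) -> None:
    """Event loop: consume a message, call the task, produce its output."""
    while input_stream:
        message = input_stream.popleft()   # consumer side
        collector: List[str] = []
        task.process(message, collector)   # application code
        output_stream.extend(collector)    # producer side


incoming: Deque[str] = deque(["page_view", "click"])
outgoing: List[str] = []
run_job(UppercaseTask(), incoming, outgoing)
print(outgoing)
```

The application only supplies the `process` method; consuming, looping and producing are handled around it, which is the division of work the paragraph describes.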

A node manager daemon runs on each node in the cluster and is responsible for scheduling the processes on that node (Behera et al., 2018). A resource manager is responsible for coordinating the tasks executed at each node in the cluster. Work progress or resource failure at a slave node is reported periodically by its node manager. Node managers might communicate among themselves. The resource manager and the node managers communicate through a mechanism known as a “heartbeat” (Behera et al., 2018).
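A heartbeat mechanism of this kind can be sketched as a table of last-seen timestamps. The class below is a hypothetical simplification, not YARN code: the timeout value, the node names and the rule that a node counts as failed once its heartbeat is older than the timeout are assumptions.

```python
from typing import Dict, List


class ResourceManager:
    """Tracks the last heartbeat of each node manager (simplified model)."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat: Dict[str, float] = {}

    def heartbeat(self, node_id: str, now: float) -> None:
        """Called periodically by a node manager to report that it is alive."""
        self.last_heartbeat[node_id] = now

    def failed_nodes(self, now: float) -> List[str]:
        """Nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_heartbeat.items()
            if now - seen > self.timeout_seconds
        )


manager = ResourceManager(timeout_seconds=10.0)
manager.heartbeat("node-a", now=0.0)
manager.heartbeat("node-b", now=0.0)
manager.heartbeat("node-a", now=8.0)    # node-a keeps reporting, node-b goes silent
failed = manager.failed_nodes(now=12.0)
```

Timestamps are passed in explicitly so that the example is deterministic; a real system would read a clock instead.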

Kafka works at the streaming layer and acts as a distributed message queuing system that provides an at-least-once message delivery guarantee. Each data stream is known as a topic, which is partitioned and replicated across multiple nodes. When a producer sends a message to a topic, it provides a key that determines the partition to which the message is sent (Behera et al., 2018).
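Key-based partitioning can be illustrated by hashing the key and taking the result modulo the number of partitions. The functions `partition_for` and `route` are hypothetical, and CRC32 is used only because it is available in the Python standard library; Kafka's own partitioner uses a different hash function.

```python
import zlib
from typing import List, Tuple


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition; the same key always gets the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(messages: List[Tuple[str, str]], num_partitions: int) -> List[List[Tuple[str, str]]]:
    """Distribute (key, value) messages over the partitions, keeping arrival order."""
    partitions: List[List[Tuple[str, str]]] = [[] for _ in range(num_partitions)]
    for key, value in messages:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


messages = [("user-1", "login"), ("user-2", "click"), ("user-1", "logout")]
routed = route(messages, num_partitions=4)
```

Because all messages with the same key land in the same partition, their relative order is preserved for whichever consumer reads that partition.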

Kafka provides Samza with some features that are difficult to implement, or should not be implemented, in other streaming platforms. A Kafka cluster consists of several brokers. In the cluster, each message type is defined as a topic. Messages of the same topic are partitioned and stored on different brokers, according to a given key and algorithm.

2.3.2 Samza Advantages and Limitations

Samza has some advantages, such as (Levy, 2019):

▪ Provides reliable persistence with low latency, offering replicated storage;
▪ Can eliminate backpressure, allowing data to be persisted and processed later.

Although Samza has many advantages, it also has some limitations, such as (Sun et al., 2019):

▪ Fault-tolerance is not complete: state information held in the memory of the source node is lost when the node fails;
▪ Only supports JVM languages;
▪ Doesn’t support very low latency.

2.4 Apache Spark (spark.apache.org)

Spark is an open source big data streaming platform, developed in 2009 by Matei Zaharia (Vaidya, 2019). It is designed to support iterative algorithms, interactive queries and streaming. It offers high scalability, high fault-tolerance, high performance and low latency (Apache Spark - Introduction, 2019). Spark makes it easy to develop large-scale applications, although it has some scalability issues (Ghasemi and Chow, 2019). It supports various programming languages, such as Java, Python and Scala (Behera et al., 2018). The platform provides a large number of tools, as shown in Figure 4, for example the stream processing engine called Spark Streaming (Shoro and Soomro, 2015).


Figure 4: Tools of Spark (Shoro and Soomro, 2015).

Spark can handle real-time data with an extension called Spark Streaming. The advantage of using Spark Streaming is that it can handle both batch and streaming data. It relies on Spark's core scheduling capability to perform streaming analysis on real-time data.

2.4.1 Spark Architecture

Spark has a well-defined layered architecture integrated with many extensions (Vaidya, 2019). The architecture of Spark is illustrated in Figure 5:

Figure 5: Structure of Apache Spark (Kirillov, 2016).

▪ Resilient Distributed Dataset (RDD) – a partitioned collection of elements that can be operated on in parallel. The partitions of each dataset are processed on different nodes of a cluster;
▪ Directed Acyclic Graph (DAG) Scheduler – operates on a DAG, a graph whose edges are directed (they only go one way) and which contains no cycles.

Spark creates an operator graph, and when an action is performed, the graph is sent to the DAG Scheduler. The DAG Scheduler divides the graph into stages. A stage is made up of tasks based on partitions of the input data. At the end, the stages are sent to the Task Scheduler, which launches the tasks via the cluster manager.
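The idea that transformations only build an operator graph, and that work starts when an action is performed, can be sketched in plain Python. `MiniRDD` is a hypothetical class, not the Spark API, and the sketch is simplified: the recorded operators form a single stage, and each partition is processed by one task run sequentially instead of on a cluster.

```python
from typing import Callable, List, Optional, Tuple

Operator = Tuple[str, Callable]


class MiniRDD:
    """A partitioned collection whose transformations are recorded lazily."""

    def __init__(self, partitions: List[List[int]], operators: Optional[List[Operator]] = None) -> None:
        self.partitions = partitions
        self.operators: List[Operator] = operators or []

    def map(self, fn: Callable) -> "MiniRDD":
        # Transformation: only extends the operator list, no data is touched.
        return MiniRDD(self.partitions, self.operators + [("map", fn)])

    def filter(self, predicate: Callable) -> "MiniRDD":
        return MiniRDD(self.partitions, self.operators + [("filter", predicate)])

    def _run_task(self, partition: List[int]) -> List[int]:
        """One task: apply the whole operator chain to a single partition."""
        data = list(partition)
        for kind, fn in self.operators:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

    def collect(self) -> List[int]:
        """Action: triggers one task per partition and gathers the results."""
        results: List[int] = []
        for partition in self.partitions:
            results.extend(self._run_task(partition))
        return results


numbers = MiniRDD([[1, 2, 3], [4, 5, 6]])
pipeline = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
result = pipeline.collect()
print(result)
```

Until `collect` is called, `pipeline` holds only the two recorded operators; the number of tasks equals the number of partitions, as in the stage description above.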

Figure 6 shows the cluster view of Spark. In this cluster, the master ensures normal operation of the entire Spark system. The worker is the compute node, mainly used to accept tasks from the master node (Sun et al., 2019). System processing can be divided into three parts: executor, cluster manager and driver.

The master node converts the application into a set of tasks to be performed by a set of executors. The tasks are then passed to the cluster manager, whose purpose is to distribute them to the most appropriate server in the cluster. Each server has an executor that receives tasks from the cluster manager, executes them, and then returns the results (Nasiri, Nahesi and Goudarzi, 2019).

Figure 6: Architecture of Apache Spark (Sun et al., 2019).

2.4.2 Spark Advantages and Limitations

Spark has several beneficial features. Its main advantages are:

▪ Efficient iterative queries and built-in support for SQL queries (Levy, 2019);
▪ High-level Machine Learning algorithms.

Although Spark provides solutions for streaming data regarding latency, fault-tolerance and throughput, it also has some unsolved shortcomings (Sun et al., 2019):

▪ It can be complex to configure and deploy;
▪ The system tends to be unstable and can only be used for calculations;
▪ Task scheduling efficiency is very low.

2.5 Apache Storm (storm.apache.org)

Storm is an open source big data streaming platform that can handle large amounts of data. Storm pioneered the wave of fault-tolerant distributed stream processing platforms (Shahverdi, Awad and Sakr, 2019). It was originally created by Nathan Marz in 2011 (Point, 2019).

Storm focuses on extremely low latency (Gurusamy, Kannan and Nandhini, 2017), and it is scalable and easy to set up and operate (Foundation, 2019 b).

Storm has many use cases: real-time analytics, online machine learning, continuous computation and ETL (Extract, Transform, Load), among others (Foundation, 2019 b). Storm is written in Java and Clojure.

2.5.1 Storm Architecture

In Storm, topologies are composed of multiple components arranged in a directed acyclic graph (DAG) of real-time computation. In the DAG, the vertices represent the components and the edges represent the data flow between them.


A Storm topology consists of several components that transfer one data stream to another in a reliable and distributed way. Storm data streams are unbounded sequences of tuples, where a tuple is a data structure that can represent standard data types or user-defined types with some additional serialization code (Hoseiny Farahabady et al., 2016).

Spouts are the sources of data streams. They allow a topology to retrieve data from external data generators and transform it into standard tuples (Sun et al., 2019). As a topology is fed by input tuples, Spouts emit streams along the edges of the directed graph (Hoseiny Farahabady et al., 2016).

Bolts are the processing nodes that receive tuples from Spouts, consume any number of input streams, perform some processing, and emit new streams (Shahverdi, Awad and Sakr, 2019). Bolts represent the logical components that implement the various stream processing operations.
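A word-count topology is a common way to picture how Spouts and Bolts cooperate. The sketch below is plain Python, not the Storm API: the names `sentence_spout`, `split_bolt` and `count_bolt` are hypothetical, the input sentences are invented, and Python tuples passed between generators stand in for Storm tuples flowing along the edges of the topology.

```python
from collections import Counter
from typing import Iterator, Tuple


def sentence_spout() -> Iterator[Tuple[str]]:
    """Spout: turns data from an external source into tuples."""
    for sentence in ["storm processes streams", "streams of tuples"]:
        yield (sentence,)


def split_bolt(stream: Iterator[Tuple[str]]) -> Iterator[Tuple[str]]:
    """Bolt: consumes sentence tuples and emits one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream: Iterator[Tuple[str]]) -> Iterator[Tuple[str, int]]:
    """Bolt: keeps a running count and emits (word, count) tuples."""
    counts: Counter = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


# Topology: sentence_spout -> split_bolt -> count_bolt
emitted = list(count_bolt(split_bolt(sentence_spout())))
```

Each bolt consumes the stream produced by the previous component and emits a new one, so the topology forms a simple linear DAG with three vertices and two edges.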

In Storm, the process of a topology is always sent to the Zookeeper cluster. For running topologies, there are three types of entities (Shahverdi, 2018):

▪ Worker Process: runs executors for its topology. A topology can contain more than one worker process;
▪ Executor: a thread spawned by a worker process. Executors run the tasks of Spouts and Bolts;
▪ Task: the entity that processes the data. In a topology, the number of tasks is always equal to or greater than the number of executors.

The topology is then supported by the Zookeeper cluster, where the master node distributes code among worker nodes for execution (Amakobe, 2016). The Storm architecture is shown in Figure 7. It consists of a master node called Nimbus, a number of slave supervisors, and a Zookeeper cluster.

The master node of the cluster is Nimbus, responsible for executing the topology and monitoring the execution of all processes and the Zookeeper cluster. It analyzes the topology and the tasks to be performed, and then distributes each task to an available supervisor (Point, 2019). It is responsible for distributing data among all the worker nodes, assigning tasks and monitoring failures. Nimbus and the supervisors communicate with each other through the Zookeeper cluster.

A Zookeeper cluster is used to coordinate the

work between the master node and the slave nodes

(Nasiri, Nahesi and Goudarzi, 2019). It is responsible

for managing all message communication, with the

help of message acknowledgments, processing status,

among others (Shahverdi, Awad and Sakr, 2019).

The cluster is capable of storing job topology

information, slave supervisor status, cluster-wide

state and configuration information (Sun et al., 2019).

Figure 7: Storm Architecture (Sun et al., 2019).

A worker process creates executors and asks them to perform a particular task; a worker process can have several executors. Each worker node runs a daemon called Supervisor, which can run one or more worker processes (Nasiri, Nahesi and Goudarzi, 2019).

2.5.2 Storm Advantages and Limitations

Storm has several advantages (Point, 2019):

▪ Storm is very fast because it has enormous data processing power;
▪ Storm provides operational intelligence (real-time, dynamic business analytics that deliver visibility into data, streaming events and business operations);
▪ It guarantees data processing even if any of the connected nodes in the cluster dies or messages are lost.

Storm also has some drawbacks (Sun et al., 2019):

▪ Resource allocation doesn’t take the structural features of the task topology into account and cannot adapt to dynamic changes in the data load;
▪ The scalability of the system is limited.

3 COMPARISON OF BIG DATA STREAMING PLATFORMS

In this section, we analyze some features in order to compare the presented platforms. To choose the main characteristics of big data streaming platforms, we analyzed the problems that companies have already faced.

According to Nasiri, Nahesi and Goudarzi (2019), the characteristics that many companies had difficulty with are scalability, privacy, load balancing, fault tolerance, integration, consistency, timeliness and accuracy, among others. We therefore concluded that these are the main characteristics in the analysis of big data streaming. Another conclusion drawn from analyzing platform resources is that any platform needs to be robust, i.e., it must contain the main characteristics of a big data streaming platform. It should also have a simple dashboard and be accessible anywhere.

The following features were selected for comparative analysis (Kolajo, Daramola and Adebiyi, 2019; Imanuel, 2019):

▪ Fault-tolerance: allows an application to continue working without interruption when a failure occurs;
▪ Scalability: research efforts should focus on developing scalable structures that accommodate the data flow computation mode, an effective resource allocation strategy, and parallelization, to address the increasing size and complexity of data;
▪ Robustness: the ability of a computer system to handle errors during execution;
▪ Dashboards: make it possible to visualize data in the form of graphs or images that show the most important information;
▪ Integration: enables efficient operations on different data sets;
▪ Consistency: achieving high consistency (i.e. stability) in big data stream computing environments is non-trivial, as it is difficult to determine which data is needed and which nodes should be consistent;
▪ Security: techniques for protecting a dataset before its analysis;
▪ Time handling: it is desirable to process data using the event time (the time when the event occurred) instead of the processing time (the machine time when the data is processed);
▪ Stream SQL: a query language that extends SQL to process real-time data streams;
▪ ETL Optimization: the process by which data is extracted from optimized data sources;
▪ Machine Learning: a data analysis method that automates the construction of analytical models;
▪ Elasticity: the degree to which a system is able to adapt to workload changes.
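The distinction made under time handling, between event time and processing time, can be made concrete with a small windowing sketch. The function `tumbling_window_counts`, the 10-second window size and the sample events are assumptions chosen for illustration; the code counts events per fixed window, once by event time and once by processing time.

```python
from collections import defaultdict
from typing import Dict, List


def tumbling_window_counts(events: List[dict], window_seconds: int, time_field: str) -> Dict[int, int]:
    """Count events per fixed-size window, keyed by the window start time."""
    counts: Dict[int, int] = defaultdict(int)
    for event in events:
        window_start = (event[time_field] // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)


events = [
    {"id": "a", "event_time": 5, "processing_time": 6},
    {"id": "b", "event_time": 8, "processing_time": 12},   # delayed in transit
    {"id": "c", "event_time": 11, "processing_time": 13},
]

by_event_time = tumbling_window_counts(events, 10, "event_time")
by_processing_time = tumbling_window_counts(events, 10, "processing_time")
print(by_event_time)       # {0: 2, 10: 1}
print(by_processing_time)  # {0: 1, 10: 2}
```

Event `b` occurred in the first window but arrived during the second one, so only the event-time result reflects when things actually happened.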

After choosing the attributes we will proceed to

the comparison of the five big data streaming

platforms, shown in Table 1.

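Table 1: Platforms comparison based on the presented features.

Features         | Flink | Kafka | Samza | Spark | Storm
Fault-tolerance  | ✔     | ✔     | ✔     | ✔     | ✔
Scalability      | ✔     | ✔     | ✔     | ✔     | ✔
Robustness       | ✔     | ✔     | ✔     |       | ✔
Dashboards       | ✔     | ✔     | ✔     | ✔     | ✔
Integration      | ✔     | ✔     | ✔     | ✔     | ✔
Consistency      | ✔     | ✔     | ✔     | ✔     |
Security         | ✔     | ✔     | ✔     | ✔     | ✔
Time handling    |       |       |       |       | ✔
Stream SQL       | ✔     | ✔     | ✔     | ✔     | ✔
ETL Optimization | ✔ ✔ (two platforms; the columns cannot be recovered from the source layout)
Machine Learning | ✔     | ✔     | ✔     | ✔     | ✔
Elasticity       | ✔     | ✔     | ✔     | ✔     | ✔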

Flink, Kafka, Samza, Spark and Storm are open-source big data streaming platforms used for real-time data analysis. All of them offer fault-tolerance, scalability, dashboards, integration, security, stream SQL, machine learning and elasticity, and have a simple implementation methodology.

Not all platforms offer robustness, which is a relatively important feature: Spark is the only platform that lacks it. Another feature that differentiates the platforms is consistency, which four of the five platforms provide, namely Flink, Kafka, Samza and Spark.

Finally, Storm stands out slightly, as it is the only tool that provides time handling.

In general, three platforms stand out, Flink, Kafka and Storm, because they obtained the greatest number of the features chosen for comparison. These platforms lack only one feature, as already mentioned, whereas Samza and Spark lack two, placing them just behind the other platforms.

4 CONCLUSIONS

As the amount of data generated by different devices worldwide grows, stream processing becomes an essential requirement for big data streaming platforms. The main objective of this work was to describe and compare the most popular open-source big data streaming platforms: Flink, Kafka, Samza, Spark and Storm. We described these platforms, their architectures, and their advantages and limitations. The comparison was made using features chosen through brainstorming and research, taking into account the needs that companies have when using these big data streaming platforms. Flink, Kafka and Storm achieved the best coverage, as they provide more of the features that we analyzed.

As future work, we intend to choose three of the compared platforms and evaluate them with a benchmark application. We will survey existing benchmarks and choose the one best suited to evaluating the platforms. The evaluation will take into account the features that have been compared. We intend to choose the best platform and use it in a real environment. An extensive quantitative (performance) assessment of these systems would also be valuable.

REFERENCES

Amakobe, M. (2016) ‘A comparison between Apache

Samza and Storm’, Colorado Tech University.

Behera, R. K, Das, S., Jena, M., Rath, S. K. & Sahoo, B.

(2017). ‘A Comparative Study of Distributed Tools for

Analyzing Streaming Data’, 2017 Int. Conference on

Information Technology (ICIT), pp. 79–84.

D'Silva, G. M., Khan, A., Gaurav & Bari, S. (2018) ‘Real-

time processing of IoT events with historic data using

Apache Kafka and Apache Spark with dashing

framework’, 2017 2nd IEEE Int. Conference on Recent

Trends in Electronics, Information & Communication

Technology (RTEICT), pp. 1804–1809.

Foundation, A. S. (2019 a) Apache Kafka. Available at:

https://kafka.apache.org/.

Foundation, A. S. (2019 b) Apache Storm. Available at:

https://storm.apache.org/.

Freiknecht, J., Papp, S, Freiknecht, J. & Papp, S. (2018)

‘Apache Kafka’, Encyclopedia of Big Data

Technologies. Springer, Cham, p. 8.

Ghasemi, E. & Chow, P. (2019) ‘Accelerating Apache

Spark with FPGAs’, 2016, Wiley Online Library,

Concurrency and Computation: Practice and

Experience, v31, Issue 2.

Gurusamy, V., Kannan, S. and Nandhini, K. (2017) ‘The

Real Time Big Data Processing Framework

Advantages and Limitations’, Int. Journal of Computer

Sciences and Eng., 5(12): pp 305-312.

Hoseiny Farahabady, M. R., Dehghani Samani, H. R.,

Wang, Y., Zomaya, A. Y. & Tari, Z, (2016) ‘A QoS-

aware controller for Apache Storm’, 2016 IEEE 15th

Int. Symposium on Network Computing and

Applications (NCA), pp. 334–342.

Imanuel (2019) Top 20, free open source and premium

stream analytics platforms. Available at:

https://www.predictiveanalyticstoday.com/top-open-

source-commercial-stream-analytics-platforms.

Instaclustr (2019) Apache Kafka. Available at:

https://www.instaclustr.com/apache-kafka/#apache-

kafka-advantages.

Katsifodimos, A. and Schelter, S. (2016) ‘Apache Flink:

Stream Analytics at Scale’, 2016 IEEE Int. Conference

on Cloud Eng. Workshop (IC2EW), pp. 193–193.

Kirillov, A. (2016) Apache Spark. Available at:

http://datastrophic.io/tag/spark/.

Kleppmann, M. (2018) ‘Apache Samza’, Encyclopedia of

Big Data Technologies. SpringerLink, p. 8.

Kleppmann, M. and Kreps, J. (2015) ‘Kafka, Samza and the

Unix Philosophy of Distributed Data’, IEEE Data

Engineering Bulletin, December 2015, 38(4), pp.4–14.

Kolajo, T., Daramola, O. and Adebiyi, A. (2019) ‘Big data

stream analysis: a systematic literature review’, Journal

of Big Data volume 6, Article number: 47 (2019).

Levy, E. (2019) 7 Popular Stream Processing Frameworks

Compared. Available at: https://www.upsolver.com/

blog/popular-stream-processing-frameworks-compared.

Nasiri, H., Nahesi, S. and Goudarzi, M. (2019) ‘Evaluation

of Distributed Stream Processing Frameworks for IoT

Applications in Smart Cities’, Journal of Big Data

volume 6, Article number: 52 (2019).

Neves, P., Bernardino, J. (2015) ‘Big Data Issues’, In

Proceedings of the 19th International Database

Engineering & Applications Symposium (IDEAS ’15),

ACM, New York, USA, pp. 200–201.

Point, T. (2019) Apache Storm. Available at:

https://www.tutorialspoint.com/apache_storm.

Safaei, A. A. (2017) ‘Real-time processing of streaming big
data’, Real-Time Systems, v. 53, pp. 1–44.

Shaheen, J. A. (2017) ‘Apache Kafka: Real time implementation
with Kafka architecture review’, Int. Journal of
Advanced Science and Technology, pp.35-42.

Shahverdi, E. (2018) ‘Comparative Evaluation for the

Performance of Big Stream Processing Systems’, Int.

Journal of Pure and Applied Mathematics, V. 119 No.

16, pp.937-948.

Shahverdi, E., Awad, A. and Sakr, S. (2019) ‘Big Stream

Processing Systems: An Experimental Evaluation’,

2019 IEEE 35th Int. Conference on Data Eng.

Workshops (ICDEW), pp.53-60.

Shoro, A. G. and Soomro, T. R. (2015) ‘Big Data Analysis:

Apache Spark Perspective’, Int. Journal of Technical

Innovation in Modern Engineering & Science

(IJTIMES), V.4, Issue 5.

Stratosphere, A. F. and Markl, B. V. (2018) ‘Mosaics in big

data’, DEBS ’18: The 12th ACM Int. Conference on

Distributed and Event-based Systems, pp. 7–13.

Sun, G., Song, Y., Gong, Z., Zhou, X. & Bi, Y. (2019)

‘Survey on streaming data computing system’, ACM

TURC 2019: ACM Turing Celebration Conf., pp. 1–8.

Team, D. (2019) Apache Kafka Tutorial. Available at:

https://data-flair.training/blogs/apache-kafka-tutorial/.

Vaidya, N. (2019) Apache Spark Architecture – Spark

Cluster Architecture Explained. Available at:

https://www.edureka.co/blog/spark-architecture/.
