Big Data Processing Tools Navigation Diagram

Martin Macak¹,², Hind Bangui¹,², Barbora Buhnova¹,², András J. Molnár³,⁴ and Csaba István Sidló⁴

¹ Institute of Computer Science, Masaryk University, Brno, Czech Republic
² Faculty of Informatics, Masaryk University, Brno, Czech Republic
³ Christian-Albrechts-University Kiel, Computer Science Institute, Kiel, Germany
⁴ Institute for Computer Science and Control (SZTAKI), Budapest, Hungary
Keywords: Big Data, Processing Tools, Tool Selection, Navigation Diagram.
Abstract: Big Data processing has become crucial in many domains because the amount of produced data has increased enormously almost everywhere. The effective selection of the right Big Data processing tool is hard due to the high number and large variety of available state-of-the-art tools. Many research results agree that there is no single best Big Data solution for all needs and requirements. It is therefore essential to be able to navigate more efficiently in the world of Big Data processing tools. In this paper, we present a map of current Big Data processing tools, recommended according to their capabilities and advantageous properties identified in previously published academic benchmarks. This map, designed as a navigation diagram, is aimed at helping researchers and practitioners filter the large number of available Big Data processing tools according to the requirements and properties of their tasks. Additionally, we provide recommendations for future experiments comparing Big Data processing tools, to improve the navigation diagram.
1 INTRODUCTION
Several domains, such as the Internet of Things (IoT), smart grids, e-health, and transportation (Oussous et al., 2018a), have to deal with the phenomenon of Big Data, with its large data volume, great variety, and the speed with which the data is being generated (Fang et al., 2015a). Big Data tools are used, for example, in data mining, machine learning, predictive analytics, and statistics, supporting numerous different software tasks (Oussous et al., 2018b). Software engineers integrate Big Data tools and techniques into their systems increasingly often. However, the proper selection of the right Big Data processing tool for a given problem is a tedious task, due to the number and variety of the available solutions.
Practitioners, as well as researchers, would greatly benefit from aid in navigating among the tools, summarizing the knowledge about which processing tool is better in which situation. Based on the properties of their problem, practitioners would be able to navigate to a suitable solution more easily, using a visual, easily readable diagram.
Although some help for such a diagram can be found in Big Data surveys, these do not focus primarily on tools and technologies, but rather on algorithms and approaches used to process Big Data. When Big Data processing tools are included in the comparison (Gökalp et al., 2017a; Oussous et al., 2018b), the comparison is still on the level of tool features rather than tool effectiveness and efficiency on realistic problems. In (Gessert et al., 2017), a decision tree mapping requirements to NoSQL databases is provided for the context of Big Data storage, based on theoretical knowledge. Similar attempts towards a decision tree of Big Data processing tools are missing so far, and practitioners hence need to study various benchmarks to gain insight.
In this paper, we aim at creating a navigation diagram for Big Data processing tools, whose purpose is to visualize the findings of existing benchmarks of Big Data processing tools published in previous comparative research papers. The main contribution of this diagram is that it aids researchers and practitioners in navigating among many different Big Data processing tools and helps them find a candidate that best fits their requirements.
The remainder of the paper is structured as follows. Section 2 discusses the state of the art and related work. Then, in Section 3, we describe the methodology used for this research. Section 4 contains and describes the Big Data processing tool navigation diagram. In Section 5, we discuss the results and provide recommendations for future experiments, aiming at enhancing the quality of the diagram. Section 6 describes threats to validity of the presented diagram. Finally, Section 7 concludes the paper by summarizing the results and outlining future work.
2 RELATED WORK
Big Data tools are currently being developed so rapidly that maintaining a list of available tools and choosing the best option for a sophisticated Big Data problem is a very complicated and lengthy task. We can see this, for example, in an application-oriented landscape of current solutions (Turck, 2019). To genuinely explore how current research results support this process, we have selected the surveys that focus on reviewing and comparing Big Data tools.
Based on examining the related surveys (Table 1), we found that the Big Data ecosystem is plentiful and that there is no single solution that would meet all needs, which calls for robust interoperability between the existing Big Data solutions. In other words, beyond the choice of a particular Big Data tool for a given problem, it is essential to be able to navigate towards the best-fitting Big Data tools. For example, in (Gessert et al., 2017), the authors created a decision tree that might help with navigation among Big Data storage options, specifically NoSQL databases. Practitioners can filter the significant number of storage tools based on their requirements on the database. However, this comparison covers only the context of Big Data storage, and the results are solely based on theoretical knowledge, not supported by experiments.
On the other hand, based on our review, we have found that the existing works do not provide clear guidelines for recommending Big Data tools. For example, in (Gökalp et al., 2017b), the authors have provided a comprehensive review of open-source Big Data tools, with an attempt to propose a strategy for choosing suitable Big Data tools based on a list of criteria: computation time, data size, interoperability, and data storage model. However, the authors found that it is crucial to decide which tool is most suitable for the inherent characteristics and requirements of a given Big Data problem, since most of the open-source Big Data tools are implemented based on their breakthrough publications. Likewise, in (Ulusar et al., 2020), the authors have considered the trade-offs that exist between usability, performance, and algorithm selection when reviewing different Big Data solutions. They have noticed that there is no single framework that covers all or even the majority of Big Data processing tasks. Furthermore, the majority of tools focus on solving the data processing requirements without considering scalability issues. The setting of guidelines hence remains problematic.
Table 1: Big Data tools survey papers.

Category of survey | Papers
General            | (Ramadan, 2017); (Gökalp et al., 2017a); (Oussous et al., 2018b); (Fang et al., 2015b); (Acharjya and Ahmed, 2016)
Storage            | (Mazumdar et al., 2019); (Corbellini et al., 2017); (Gessert et al., 2017); (Almassabi et al., 2018); (Siddiqa et al., 2017); (Lourenço et al., 2015); (Płuciennik and Zgorzałek, 2017); (Makris et al., 2016); (Chen et al., 2014)
Processing         | (Oussous et al., 2018a); (Ulusar et al., 2020); (Landset et al., 2015); (Habeeb et al., 2019); (Inoubli et al., 2018); (Liu et al., 2014)
Visualization      | (Raghav et al., 2016)
In this paper, we take a different approach, focusing on the visualization of the findings of existing benchmarks of Big Data processing tools, given various problems and properties. Despite the great academic efforts (Table 1), we have not found any research paper summarizing and visualizing the Big Data processing tool benchmarks. The selected papers (Table 1) have focused on providing a comprehensive review of available Big Data tools and analyzing the general advantages and drawbacks of each tool. They do not give clear guidance for Big Data tool selection that would help developers understand how to use Big Data tools together to build a robust architecture capable of processing Big Data efficiently and effectively.
3 METHODOLOGY
In this section, we describe the step-by-step process of building the visual representation of the identified benchmarks into a so-called navigation diagram. The process is illustrated in Figure 1.
Figure 1: Methodology of this work. (Identify benchmark papers → Design categories → Assign papers into categories → Extract tools from papers → Extract results from papers → Build navigation diagram.)
First, we identified the relevant benchmark papers by searching academic databases and well-known publishers, such as ScienceDirect, Google Scholar, ACM Digital Library, IEEE Xplore Digital Library, and Springer, as well as a general Google search, with these keywords: Big Data (processing OR machine learning) tools (benchmark OR performance comparison OR performance evaluation OR case study comparison). We limited our search to papers published in the last five years, i.e., from 2014 to 2019. The search resulted in papers that compare two or more Big Data processing tools by experiments; a detailed description of the experiments and results had to be present in each paper. We paid special attention to papers that compare several Big Data processing tools. In total, 24 papers ended up in the selection.
In the next step, we grouped the papers into categories reflecting their purpose (the domain of Big Data processing). As the basis for the categories, we used the approach by Sakr (Sakr, 2016), which suggests grouping Big Data processing tools into general-purpose, SQL, graph processing, and stream processing categories. General-purpose processing systems were designed for multiple types of data processing scenarios (e.g., batch, stream, and graph processing). SQL systems provide a high-level declarative language for easier querying of structured data. Graph processing systems were designed for large-scale graph processing, and stream processing systems can process streams of Big Data.

In our classification, we adopted three categories from (Sakr, 2016): stream processing, graph processing, and SQL systems. Moreover, we added the machine learning category because of its popularity in Big Data processing and its occurrence in comparison benchmarks. As each identified benchmark could be assigned to a single purpose (category), there was no need for the general-purpose category; instead, each general-purpose tool might be included in multiple categories. Moreover, as some papers compare tools on large static data, we added the batch processing category. Therefore, we ended up with five categories: batch processing, stream processing, graph processing, SQL, and machine learning.
Then we assigned the found papers to the designed categories and extracted the tools that were compared in each paper. These tools are Hadoop MapReduce, Spark, Flink, Storm, Giraph, Hama, GraphChi, GraphLab, GraphX, Pregel, Pregel+, GPS, Mizan, Impala, Hive, Spark SQL, HAWQ, Drill, Presto, Phoenix, Mahout, MLlib, TensorFlow, H2O, SAMOA, Theano, Torch, Caffe, CNTK, and Deeplearning4j. The list of extracted tools for each categorized paper can be found in Table 2. Then we extracted the results of those papers, specifically a list of triplets containing the tool that outperformed the other, the tool that fell behind, and the feature driving the experiment. One paper could result in multiple such triplets. For example, in (Chintapalli et al., 2016), Flink had better latency than Spark, Spark had better throughput than Storm, etc. From this information, we were able to build the navigation diagram presented in Section 4. The complete data used for the construction of this diagram can be found in the Appendix.
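To make the extracted data model concrete, the following is a minimal sketch (in Python; the names and structure are our illustrative assumptions, not an implementation used in this work) of how each benchmark result can be captured as such a triplet and grouped by feature:

```python
from collections import defaultdict
from typing import NamedTuple

class BenchmarkResult(NamedTuple):
    winner: str   # the tool that outperformed the other
    loser: str    # the tool that fell behind
    feature: str  # the feature driving the experiment
    paper: int    # paper number as listed in Table 2

# Example triplets extracted from paper #1 (Chintapalli et al., 2016):
results = [
    BenchmarkResult("Flink", "Spark", "latency", 1),
    BenchmarkResult("Storm", "Spark", "latency", 1),
    BenchmarkResult("Spark", "Storm", "throughput", 1),
]

# Group the evidence by feature; each entry backs one labeled arrow.
by_feature = defaultdict(list)
for r in results:
    by_feature[r.feature].append((r.winner, r.loser, r.paper))

print(by_feature["latency"])
# [('Flink', 'Spark', 1), ('Storm', 'Spark', 1)]
```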
4 PROCESSING TOOLS NAVIGATION DIAGRAM
This section contains the map of the results, based on the current knowledge from benchmarks of Big Data processing tools, in the form of a diagram. This diagram (Figure 2) can be used to filter the possible Big Data processing tools for a given Big Data problem. Practitioners can navigate through it and, based on the features of a given problem, see which tools might be relevant.
The diagram is read as follows. The black dot is the initial node, diamonds are decision nodes, and arrows represent the control flow. The label of an arrow navigates towards the tool that outperforms another in that characteristic, according to a benchmark cited within the label. Each labeled arrow is supported by a published benchmark; the citation number within the label corresponds to the number of the paper in Table 2. Note that papers number 8 and 9 are not included in the diagram because their results are not strong enough to be relevant for the diagram. Unlabeled arrows navigate directly to the linked tools (meaning that at that point, the tool is the best solution according to the literature). If benchmarks represented by arrows from the same decision node contradict each other, we put the papers with the opposite results on these arrows in parentheses (as visible in the stream processing part).
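As a sketch of these semantics, one decision node can be encoded as an adjacency list whose arrow labels carry the winning features and the supporting paper numbers. The encoding below is our illustrative assumption, populated from the stream-processing rows of Table 4 in the Appendix; it is not an artifact of the diagram itself:

```python
# One decision node (stream processing); keys are target tools,
# values map the features on the incoming arrow to the numbers of
# the supporting benchmark papers from Table 2.
STREAM_NODE = {
    "Storm": {"latency": [1, 4], "throughput": [3]},
    "Spark": {"throughput": [1, 4], "fault tolerance": [3, 4]},
    "Flink": {"latency": [1]},
}

def follow(node: dict, feature: str) -> list:
    """Tools reachable via an arrow whose label mentions the feature."""
    return [tool for tool, label in node.items() if feature in label]

print(follow(STREAM_NODE, "latency"))     # ['Storm', 'Flink']
print(follow(STREAM_NODE, "throughput"))  # ['Storm', 'Spark']
```

The contradicting throughput evidence for Storm and Spark surfaces on both arrows, mirroring the parenthesized labels in Figure 2.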
The user can traverse this diagram and end up with multiple candidates for the solution of a given problem.
Figure 2: Big Data Processing Tools Navigation Diagram. (The diagram branches into five areas: batch processing, stream processing, graph processing, SQL, and machine learning. Its labeled arrows, such as "latency [1, 4]", "throughput [1, 3, 4]", or "performance [15, 16, 18]", lead to the tools Spark, Flink, Storm, Giraph, GraphX, Hama, GPS, GraphLab, Pregel+, Impala, HAWQ, Presto, Mahout, MLlib, H2O, SAMOA, TensorFlow, and Theano.)
Table 2: The benchmarks of processing tools.
# Paper Category List of compared tools
1 (Chintapalli et al., 2016) stream Spark, Flink, Storm
2 (Veiga et al., 2016) stream Hadoop MapReduce, Spark, Flink
3 (Lopez et al., 2016) stream Spark, Storm, Flink
4 (Qian et al., 2016) stream Spark, Storm
5 (Verma and Patel, 2016) batch Hadoop MapReduce, Spark
6 (Samadi et al., 2016) batch Hadoop MapReduce, Spark
7 (Marcu et al., 2016) batch / graph Spark, GraphX, Flink
8 (Koschel et al., 2016) graph Giraph, Hadoop MapReduce
9 (Lu and Thomo, 2016) graph Giraph, GraphChi
10 (Siddique et al., 2016) graph Giraph, Hama
11 (Batarfi et al., 2015) graph Giraph, GraphChi, GraphLab, GraphX, GPS
12 (Han et al., 2014) graph Giraph, GraphLab, Pregel, GPS, Mizan
13 (Lu et al., 2014) graph Giraph, GraphChi, GraphLab, Pregel+, GPS
14 (Wei et al., 2016) graph GraphLab, Spark
15 (Rodrigues et al., 2019) SQL Impala, Hive, Spark, HAWQ, Drill, Presto
16 (Qin et al., 2017) SQL Impala, Hive, Spark SQL
17 (Santos et al., 2017) SQL Hive, Spark, Drill, Presto
18 (Tapdiya and Fabbri, 2017) SQL Impala, Spark SQL, Drill, Phoenix
19 (Richter et al., 2015) ML Mahout, MLlib, H2O, SAMOA
20 (Aziz et al., 2018) ML Mahout, MLlib
21 (Landset et al., 2015) ML Mahout, MLlib, H2O, SAMOA
22 (Kochura et al., 2017) ML TensorFlow, H2O, Deeplearning4j
23 (Kovalev et al., 2016) ML TensorFlow, Theano, Torch, Caffe, Deeplearning4j
24 (Shatnawi et al., 2018) ML TensorFlow, Theano, CNTK
To make the diagram easier to use, we provide the list of the used features and their descriptions in Table 3.
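As an illustration of such feature-driven filtering, a small sketch (again with hypothetical helper names of our own) can rank tools by counting, over the triplets in the Appendix, how often each tool outperformed another on the features a user requires:

```python
from collections import Counter

# A small excerpt of the (winner, loser, feature, paper) data
# from Table 4 in the Appendix.
TRIPLETS = [
    ("GPS", "Giraph", "memory efficiency", 12),
    ("Giraph", "GraphLab", "latency", 12),
    ("GraphLab", "Giraph", "speed", 12),
    ("Pregel+", "Giraph", "performance", 13),
]

def shortlist(required_features: set) -> list:
    """Rank tools by their number of wins on the required features."""
    wins = Counter(w for w, _, f, _ in TRIPLETS if f in required_features)
    return wins.most_common()

print(shortlist({"speed", "latency"}))
# [('Giraph', 1), ('GraphLab', 1)]
```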
5 DISCUSSION
The most commonly used Big Data tools in the benchmarks are Spark for batch and stream processing, Giraph for graph processing, Impala for querying, and Mahout, MLlib, and TensorFlow for machine learning. That might indicate that these tools are popular among scholars, or that they are easy to run and operate.

The category of graph processing has the most comparative papers published, while batch processing and SQL systems are covered less frequently. Regarding the years, the majority of the benchmarks were performed in 2016. After that, only four benchmarks were published in 2017, two in 2018, and one in 2019.
The diagram consists of five areas corresponding to the categories discussed in Section 3: batch processing, stream processing, graph processing, SQL, and machine learning. The smallest and least complicated parts are batch processing and SQL. The reason might be the lower number of tools tested (in the case of batch processing tools) or that a few tools prevail against the many others (in the case of SQL systems). The biggest and most complex categories are graph processing and machine learning, containing six tools and many choices.
Another observation from the diagram is that the number of leaves in the diagram is smaller than the number of Big Data processing tools tested in the benchmark papers. The reason is that the missing tools did not emerge as the recommended option in any of the experiments. Also, the benchmark papers do not cover all of the available Big Data processing tools. For example, HPCC, Samza, Gearpump, Beam, and GraphJet were not covered by any of these papers.
As visible in Table 2, there are not many research papers comparing Big Data processing tools based on experiments; therefore, some aspects of our diagram might have a higher level of uncertainty. We would like to encourage researchers to perform more experiments in this area, so that the knowledge about the most suitable Big Data processing tools for each specific situation can be improved.
We propose the following list of experiments as future work:
- to compare tools like Hadoop MapReduce, Spark, and Flink in batch processing with more detailed experiments,
- to compare the throughput of Spark and Storm, because currently there are papers that contradict each other,
- to compare the general-purpose tools with the specialized variants (e.g., Flink with graph processing tools),
- to compare the current tools that were not used in any of the mentioned benchmarks with other variants (e.g., HPCC in batch processing or Samza in stream processing),
- to compare the graph processing tools against each other, specifically Hama, GraphLab, GraphX, and Pregel+, which were each compared to the others in only one paper or in none,
- to add more qualitative measures of the tools (e.g., is it easier to work with Impala or Presto?).
Table 3: Features and their descriptions.

Feature           | Description
Performance       | The ability of a tool to accomplish its functionality within a time interval. The lower the completion time (or the lower it is below a certain threshold), the higher the performance of the tool.
Scalability       | The ability of a tool to adapt to a given problem size and use its resources effectively as data size grows.
Fault tolerance   | The ability of a tool to cope with failures that cause an adverse effect on the entire workflow.
Flexibility       | The capability of a tool to deal with the changing requirements of data processing.
Accuracy          | The ability of a tool to produce realistic data values close to the true values modeled.
Complexity        | Measures how well the features of a tool are divided into different modules and how they can be implemented with other programming interfaces.
Extensibility     | The ability of a tool to integrate with other frameworks, in its totality or partially.
Usability         | Describes how the usage of a tool satisfies users' requirements, such as ease of use, availability of documentation, and programming language interfaces.
Coverage          | The range of modules contained in a tool and the variety of features of each module.
Maturity          | The number of deployments that the tool has obtained.
Speed             | The execution time needed to accomplish tasks.
Latency           | The amount of time between starting a task and getting the related outcome.
Throughput        | The amount of tasks done over a given time period.
Memory efficiency | The ability of a tool to handle memory economically as data size grows.
6 THREATS TO VALIDITY
Before concluding the paper, we would like to discuss the construct validity threats of our diagram. The benchmarks that we found might not necessarily use the best configuration of the compared tools, which could influence the results. In addition, there might be other factors, like the cluster size or hardware specification, which could have an impact on the results of the benchmarks. Moreover, experimental comparisons between some tools are missing, so some parts of the diagram might not be fully representative (e.g., the navigation between GraphX and Pregel+). Furthermore, some transitions in the diagram contradict each other, so further investigation in this direction is needed (e.g., the navigation between Spark and Storm).

The fact that Big Data processing tools are evolving is problematic, as comparisons might use different versions of the same tool. This might lead to more contradictions. Similarly, if the tools are tested in different scenarios or on different testbeds, the results might contradict each other. Although this clearly indicates that more benchmarks and research are needed in this direction, to further extend the diagram, we find it valuable that this work has shed light on these gaps and discrepancies.
Development is so fast in this area that some of the tools might have changed to a great extent in the last few years: e.g., the development of Apache Hama slowed down¹, GraphLab has become Turi², and the GraphChi project seems to be abandoned³. These fundamental changes represent another threat, as research papers tend to react to such changes more slowly than industrial resources. As this paper's focus is only on academic papers, in the future, it might be beneficial to extend the diagram by other sources that can be considered reliable, like technical reports or weblogs. This extension might increase the accuracy of the proposed diagram.

Nevertheless, given all this, we believe that the presented visualization can be highly beneficial for researchers as well as practitioners, and can stimulate further steps and experiments extending the state of the art.
¹ Latest news and releases on http://hama.apache.org are from 2016 and 2018 at the time of writing.
² See https://turi.com.
³ See https://github.com/GraphChi.
7 CONCLUSION
In this paper, we have constructed a Big Data Processing Tools Navigation Diagram, which summarizes and illustrates the results of Big Data processing tool benchmarks from 2014 to 2019. We believe that this first attempt to create such a visual knowledge summary and tool navigation can help researchers and practitioners to filter the large number of possible Big Data processing tools for their problems. We created this diagram by first identifying proper benchmark papers, then designing five categories and assigning the papers to them. After that, we extracted tools and results from those papers and, based on them, constructed the diagram.

We have identified and recommend further possible comparative experiments still missing from the literature, which would improve this diagram significantly. We also mentioned several issues in the current version of the diagram, which should be addressed in the future. Furthermore, we believe that it would be beneficial to merge the practical knowledge in this paper with the theoretical knowledge of the selected tools that could be derived from the survey papers referenced in Section 2.
ACKNOWLEDGEMENTS
The work was supported by the European Regional Development Fund Project CERIT Scientific Cloud (No. CZ.02.1.01/0.0/0.0/16_013/0001802).
REFERENCES
Acharjya, D. P. and Ahmed, K. (2016). A survey on big data
analytics: challenges, open research issues and tools.
International Journal of Advanced Computer Science
and Applications, 7(2):511–518.
Almassabi, A., Bawazeer, O., and Adam, S. (2018). Top
newsql databases and features classification. Inter-
national Journal of Database Management Systems
(IJDMS), 10(2):11–31.
Aziz, K., Zaidouni, D., and Bellafkih, M. (2018). Big data
processing using machine learning algorithms: Mllib
and mahout use case. In Proceedings of the 12th Inter-
national Conference on Intelligent Systems: Theories
and Applications, page 25. ACM.
Batarfi, O., Shawi, R. E., Fayoumi, A. G., Nouri, R., Be-
heshti, S.-M.-R., Barnawi, A., and Sakr, S. (2015).
Large scale graph processing systems: survey and
an experimental evaluation. Cluster Computing,
18(3):1189–1213.
Chen, M., Mao, S., and Liu, Y. (2014). Big data: A survey.
Mobile Networks and Applications, 19(2):171–209.
Chintapalli, S., Dagit, D., Evans, B., Farivar, R., Graves,
T., Holderbaugh, M., Liu, Z., Nusbaum, K., Patil, K.,
Peng, B. J., and Poulosky, P. (2016). Benchmark-
ing streaming computation engines: Storm, flink and
spark streaming. In 2016 IEEE International Paral-
lel and Distributed Processing Symposium Workshops
(IPDPSW), pages 1789–1792.
Corbellini, A., Mateos, C., Zunino, A., Godoy, D., and
Schiaffino, S. (2017). Persisting big-data: The nosql
landscape. Information Systems, 63:1 – 23.
Fang, H., Zhang, Z., Wang, C. J., Daneshmand, M., Wang,
C., and Wang, H. (2015a). A survey of big data re-
search. IEEE Network, 29(5):6–9.
Fang, H., Zhang, Z., Wang, C. J., Daneshmand, M., Wang,
C., and Wang, H. (2015b). A survey of big data re-
search. IEEE Network, 29(5):6–9.
Gessert, F., Wingerath, W., Friedrich, S., and Ritter, N.
(2017). Nosql database systems: a survey and deci-
sion guidance. Computer Science - Research and De-
velopment, 32(3):353–365.
Gökalp, M. O., Kayabay, K., Zaki, M., Koçyiğit, A., Eren, P. E., and Neely, A. (2017a). Big-data analytics architecture for businesses: a comprehensive review on new open-source big-data tools. Cambridge Service Alliance: Cambridge, UK.
Gökalp, M. O., Kayabay, K., Zaki, M., Koçyiğit, A., Eren, P. E., and Neely, A. (2017b). Big-data analytics architecture for businesses: a comprehensive review on new open-source big-data tools. Cambridge Service Alliance: Cambridge, UK.
Habeeb, R. A. A., Nasaruddin, F., Gani, A., Hashem, I.
A. T., Ahmed, E., and Imran, M. (2019). Real-time
big data processing for anomaly detection: A sur-
vey. International Journal of Information Manage-
ment, 45:289 – 307.
Han, M., Daudjee, K., Ammar, K., Özsu, M. T., Wang, X., and Jin, T. (2014). An experimental comparison of pregel-like graph processing systems. PVLDB, 7:1047–1058.
Inoubli, W., Aridhi, S., Mezni, H., Maddouri, M., and
Nguifo, E. M. (2018). An experimental survey on big
data frameworks. Future Generation Computer Sys-
tems, 86:546 – 564.
Kochura, Y., Stirenko, S., Alienin, O., Novotarskiy, M.,
and Gordienko, Y. (2017). Comparative analysis of
open source frameworks for machine learning with
use case in single-threaded and multi-threaded modes.
In 2017 12th International Scientific and Technical
Conference on Computer Sciences and Information
Technologies (CSIT), volume 1, pages 373–376.
Koschel, A., Heine, F., Astrova, I., Korte, F., Rossow,
T., and Stipkovic, S. (2016). Efficiency experi-
ments on hadoop and giraph with pagerank. In 2016
24th Euromicro International Conference on Parallel,
Distributed, and Network-Based Processing (PDP),
pages 328–331.
Kovalev, V., Kalinovsky, A., and Kovalev, S. (2016). Deep
learning with theano, torch, caffe, tensorflow, and
deeplearning4j: Which one is the best in speed and
accuracy?
Landset, S., Khoshgoftaar, T. M., Richter, A. N., and Hasanin, T. (2015). A survey of open source tools for machine learning with big data in the hadoop ecosystem. Journal of Big Data, 2:1–36.
Liu, X., Iftikhar, N., and Xie, X. (2014). Survey of real-time processing systems for big data. In Proceedings of the 18th International Database Engineering & Applications Symposium, IDEAS ’14, pages 356–361, New York, NY, USA. ACM.
Lopez, M. A., Lobato, A. G. P., and Duarte, O. C. M. B.
(2016). A performance comparison of open-source
stream processing platforms. In 2016 IEEE Global
Communications Conference (GLOBECOM), pages
1–6.
Lourenço, J. R., Cabral, B., Carreiro, P., Vieira, M., and Bernardino, J. (2015). Choosing the right nosql database for the job: a quality attribute evaluation. Journal of Big Data, 2(1):18.
Lu, J. and Thomo, A. (2016). An experimental evaluation
of giraph and graphchi. In 2016 IEEE/ACM Inter-
national Conference on Advances in Social Networks
Analysis and Mining (ASONAM), pages 993–996.
Lu, Y., Cheng, J., Yan, D., and Wu, H. (2014). Large-scale
distributed graph computing systems: An experimen-
tal evaluation. Proceedings of the VLDB Endowment,
8(3):281–292.
Makris, A., Tserpes, K., Andronikou, V., and Anagnostopoulos, D. (2016). A classification of nosql data stores based on key design characteristics. Procedia Computer Science, 97:94–103. 2nd International Conference on Cloud Forward: From Distributed to Complete Computing.
Marcu, O., Costan, A., Antoniu, G., and Pérez-Hernández, M. S. (2016). Spark versus flink: Understanding performance in big data analytics frameworks. In 2016 IEEE International Conference on Cluster Computing (CLUSTER), pages 433–442.
Mazumdar, S., Seybold, D., Kritikos, K., and Verginadis,
Y. (2019). A survey on data storage and placement
methodologies for cloud-big data ecosystem. Journal
of Big Data, 6(1):15.
Oussous, A., Benjelloun, F.-Z., Lahcen, A. A., and Belfkih,
S. (2018a). Big data technologies: A survey. Journal
of King Saud University - Computer and Information
Sciences, 30(4):431 – 448.
Oussous, A., Benjelloun, F.-Z., Lahcen, A. A., and Belfkih,
S. (2018b). Big data technologies: A survey. Journal
of King Saud University - Computer and Information
Sciences, 30(4):431 – 448.
Płuciennik, E. and Zgorzałek, K. (2017). The multi-model databases – a review. In Kozielski, S., Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., and Kostrzewa, D., editors, Beyond Databases, Architectures and Structures. Towards Efficient Solutions for Data Analysis and Knowledge Representation, pages 141–152, Cham. Springer International Publishing.
Qian, S., Wu, G., Huang, J., and Das, T. (2016). Bench-
marking modern distributed streaming platforms. In
2016 IEEE International Conference on Industrial
Technology (ICIT), pages 592–598.
Qin, X., Chen, Y., Chen, J., Li, S., Liu, J., and Zhang,
H. (2017). The performance of sql-on-hadoop sys-
tems - an experimental study. In 2017 IEEE Inter-
national Congress on Big Data (BigData Congress),
pages 464–471.
Raghav, R. S., Pothula, S., Vengattaraman, T., and Ponnu-
rangam, D. (2016). A survey of data visualization
tools for analyzing large volume of data in big data
platform. In 2016 International Conference on Com-
munication and Electronics Systems (ICCES), pages
1–6.
Ramadan, R. (2017). Big data tools – an overview. International Journal of Computer Science and Software Engineering, 2.
Richter, A. N., Khoshgoftaar, T. M., Landset, S., and
Hasanin, T. (2015). A multi-dimensional comparison
of toolkits for machine learning with big data. In 2015
IEEE International Conference on Information Reuse
and Integration, pages 1–8. IEEE.
Rodrigues, M., Santos, M. Y., and Bernardino, J. (2019).
Experimental evaluation of big data analytical tools.
In Themistocleous, M. and Rupino da Cunha, P., ed-
itors, Information Systems, pages 121–127, Cham.
Springer International Publishing.
Sakr, S. (2016). Big data 2.0 processing systems: a survey.
Springer.
Samadi, Y., Zbakh, M., and Tadonki, C. (2016). Compara-
tive study between hadoop and spark based on hibench
benchmarks. In 2016 2nd International Conference
on Cloud Computing Technologies and Applications
(CloudTech), pages 267–275.
Santos, M. Y., Costa, C., Galvão, J., Andrade, C., Martinho, B. A., Lima, F. V., and Costa, E. (2017). Evaluating sql-on-hadoop for big data warehousing on not-so-good hardware. In Proceedings of the 21st International Database Engineering & Applications Symposium, IDEAS 2017, pages 242–252, New York, NY, USA. Association for Computing Machinery.
Shatnawi, A., Al-Bdour, G., Al-Qurran, R., and Al-Ayyoub,
M. (2018). A comparative study of open source deep
learning frameworks. In 2018 9th International Con-
ference on Information and Communication Systems
(ICICS), pages 72–77.
Siddiqa, A., Karim, A., and Gani, A. (2017). Big data stor-
age technologies: a survey. Frontiers of Information
Technology & Electronic Engineering, 18(8):1040–
1070.
Siddique, K., Akhtar, Z., Yoon, E. J., Jeong, Y., Dasgupta,
D., and Kim, Y. (2016). Apache hama: An emerging
bulk synchronous parallel computing framework for
big data applications. IEEE Access, 4:8879–8887.
Tapdiya, A. and Fabbri, D. (2017). A comparative analysis
of state-of-the-art sql-on-hadoop systems for interac-
tive analytics. In 2017 IEEE International Conference
on Big Data (Big Data), pages 1349–1356.
Turck, M. (2019). A turbulent year: The 2019 data &
ai landscape. https://mattturck.com/Data2019/. Ac-
cessed: 2019-12-20.
Ulusar, U. D., Ozcan, D. G., and Al-Turjman, F. (2020).
Open source tools for machine learning with big data
in smart cities. In Smart Cities Performability, Cogni-
tion, & Security, pages 153–168. Springer.
Veiga, J., Expósito, R. R., Pardo, X. C., Taboada, G. L., and Touriño, J. (2016). Performance evaluation of big data frameworks for large-scale data analytics. In 2016 IEEE International Conference on Big Data (Big Data), pages 424–431.
Verma, J. P. and Patel, A. (2016). Comparison of mapre-
duce and spark programming frameworks for big data
analytics on hdfs. International Journal of Computer
Science & Communication, 7(2).
Wei, J., Chen, K., Zhou, Y., Zhou, Q., and He, J. (2016).
Benchmarking of distributed computing engines spark
and graphlab for big data analytics. In 2016 IEEE Sec-
ond International Conference on Big Data Computing
Service and Applications (BigDataService), pages 10–
13.
APPENDIX
Table 4: Results from the benchmarks.

Tool that performed better | Tool that fell behind | Mentioned factors of the comparison | # of paper
Storm      | Spark          | latency | 1
Flink      | Spark          | latency | 1
Spark      | Storm          | throughput | 1
Spark      | Flink          | throughput | 1
Spark      | Flink          | performance, scalability | 2
Storm      | Flink          | throughput | 3
Storm      | Spark          | throughput | 3
Spark      | Storm          | fault tolerance | 3
Spark      | Flink          | fault tolerance | 3
Spark      | Storm          | fault tolerance, throughput | 4
Storm      | Spark          | latency | 4
Spark      | Hadoop         | performance | 5
Spark      | Hadoop         | latency, throughput | 6
GraphX     | Flink          | performance | 7
Flink      | Spark          | performance | 7
Giraph     | Hadoop         | performance | 8
Giraph     | GraphChi       | performance | 9
Hama       | Giraph         | performance, scalability, speed | 10
GraphX     | GraphChi       | performance | 11
GraphX     | Giraph         | performance | 11
GraphX     | GPS            | performance | 11
GraphX     | GraphLab       | performance | 11
GPS        | Giraph         | memory efficiency | 12
GPS        | GraphLab       | memory efficiency | 12
GPS        | Mizan          | memory efficiency | 12
Giraph     | GraphLab       | latency | 12
Giraph     | GPS            | latency | 12
Giraph     | Mizan          | latency | 12
GraphLab   | Giraph         | speed | 12
GraphLab   | GPS            | speed | 12
GraphLab   | Mizan          | speed | 12
Pregel+    | Giraph         | performance | 13
GPS        | GraphLab       | performance | 13
Pregel+    | GraphLab       | performance | 13
GPS        | Giraph         | performance | 13
GraphLab   | Spark          | performance | 14
Impala     | HAWQ           | performance | 15
Impala     | Hive           | performance | 15
HAWQ       | Impala         | performance on 30 GB dataset | 15
Impala     | Hive           | speed | 16
Impala     | Spark SQL      | speed | 16
Presto     | Hive           | performance | 17
Presto     | Spark SQL      | performance | 17
Presto     | Drill          | performance | 17
Impala     | Drill          | performance | 18
Impala     | Spark SQL      | performance | 18
Impala     | Phoenix        | performance | 18
Mahout     | MLlib          | extensibility, scalability, usability, fault tolerance, speed | 19
H2O        | SAMOA          | extensibility, scalability, usability, fault tolerance, speed | 19
MLlib      | SAMOA          | extensibility, scalability, usability, fault tolerance, speed | 19
Mahout     | SAMOA          | extensibility, scalability, usability, fault tolerance, speed | 19
Mahout     | H2O            | extensibility, scalability, usability, fault tolerance, speed | 19
MLlib      | Mahout         | latency | 20
Mahout     | MLlib          | stability, maturity | 20
H2O        | MLlib          | speed, usability, scalability, coverage, extensibility | 21
SAMOA      | Mahout         | speed, usability, scalability, extensibility, coverage | 21
H2O        | Deeplearning4j | performance, maturity | 22
TensorFlow | Deeplearning4j | flexibility | 22
Theano     | Deeplearning4j | speed, accuracy, complexity | 23
TensorFlow | CNTK           | performance | 24
Theano     | CNTK           | performance | 24