tasks, processes pairs from mappers and aggregates
values with the same key. The output of the reducer
can be written to HDFS or sent to another Mapper in
the case of a MapReduce workflow.
A MapReduce workflow processes terabytes of
data through MapReduce jobs connected by
producer-consumer relationships. Each job consists
of one Map phase and one Reduce phase, and each
phase comprises multiple parallel Map or Reduce tasks.
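The phases described above can be illustrated with the canonical word-count example. The following is a minimal, framework-free Python sketch (the function names are illustrative and are not the Hadoop API); the shuffle step, performed by the framework in a real deployment, is simulated with an in-memory grouping:

```python
from collections import defaultdict

def map_phase(document):
    """Mapper: emit a (key, value) pair for each word in the input split."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Group intermediate values by key (handled by the framework in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: aggregate all values that share the same key."""
    return (key, sum(values))

# Simulate two parallel Map tasks, each processing one input split.
splits = ["map reduce map", "reduce reduce"]
intermediate = [pair for split in splits for pair in map_phase(split)]
result = dict(reduce_phase(k, vs) for k, vs in shuffle(intermediate).items())
# result == {"map": 2, "reduce": 3}
```

In a workflow, `result` would be written to HDFS or fed as input pairs to the Map phase of the next job.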
Although the Hadoop MapReduce framework is
easy to grasp, the development of complex
MapReduce workflows can be a tedious task and
requires the collaboration of many developers. The
intervention of different developers raises the
possibility of mistakes and bugs in the map and
reduce programs that can interrupt the MapReduce
execution or produce inaccurate output. Moreover, as
with any other application, a set of requirements is
envisaged from the design phase and needs to be
verified at an early stage to reduce the possibility of
failure or inconsistency.
paper an approach driven by MapReduce design
patterns, for the modeling and formal verification of
MapReduce workflow using both the standard
BPMN2(Correia and Abreu 2012) and the Event B
method(Abrial 2010).
The objective of the proposed approach is twofold.
First, the approach helps designers
easily model their MapReduce workflows using the
graphical tool of a well-known and rich standard,
BPMN, and based on a set of predefined
MapReduce design patterns.
Second, by automatically transforming the
MapReduce workflow into a formal notation, Event B,
the designer can perform further analysis and
consequently detect errors at an early stage.
The rest of the paper is organized as follows:
Section 2 presents related works and our main
contributions. Section 3 briefly presents BPMN and
the Event B method. Section 4 describes the proposed
approach for the specification and verification of
MapReduce workflows. Then a case study is
presented. Finally, Section 6 concludes the paper.
2 RELATED WORKS
Despite the widespread adoption of big data
applications, only 13% of organizations have achieved
full-scale production for their Big Data
implementations, as reported in the Capgemini
Consulting study (Colas et al. 2014). The main reason
behind this problem is the technical difficulty of
designing and developing effective big data
applications. Hence, in
order to increase productivity in the development
of Big Data applications, new languages,
methodologies, and tools need to be created to
assist and guide developers. In this context, a
number of research works have been carried out
(Perez-Palacin et al. 2019; Bersani et al. 2019;
Lim, Herodotou, and Babu 2012).
In (Chiang et al. 2021), Chiang et al. adopt Petri
nets to visually model the MapReduce framework in
order to verify its reachability property. The study acts
as a guideline for developers to avoid common
errors, such as when the system cannot find an input or
output file.
Another variation of Petri nets, Prioritized-Timed
Coloured Petri Nets, is used in (Ruiz, Calleja, and
Cazorla 2015) to evaluate the performance of the
application "SentiStrength" for the Hadoop module in
cloud environments. Simulations are performed with
CPN Tools to determine the best performance-cost
trade-offs in advance.
Zeliu et al. analyse the rationality of MapReduce
workflows using Object Petri Nets (Hong and
Bae 2000) in (Zeliu et al. 2019). The rationality
criteria are: the absence of a straggler task,
the absence of conflicting maps, and a reasonable
execution time.
In the works cited above, Petri nets and their
variants are adopted at the design phase of
MapReduce applications. Although they constitute a
mature and widely used formalism, a Petri net model
can become very complex (Bessifi, Younes, and Ayed
2022). In addition, the use of model checking
as a verification technique always poses the problem
of the growing number of explored states, and thus
long verification times for data-intensive applications
such as MapReduce applications.
In (Zhang et al. 2020), a runtime verification
approach at the code level is presented. Both map and
reduce programs are written in the MSVL (Wang et al.
2020) language, the properties to verify are expressed
as PPTL (Duan et al. 2019) formulas, and are then
verified using the MSVL model checker. Two case
studies are presented to verify several data properties.
In this paper, we propose a model-driven
framework for the specification and verification of
MapReduce workflows to help developers create
correct-by-construction MapReduce applications. This
approach combines the power of two different
languages, Event B and BPMN, to provide a
prototype tool that can be used to easily create high-quality
applications.