Authors:
Marco Cavallo; Giuseppe Di Modica; Carmelo Polito and Orazio Tomarchio
Affiliation:
University of Catania, Italy
Keyword(s):
Big Data, MapReduce, Hierarchical Hadoop, Context Awareness, Partition Number.
Related Ontology Subjects/Areas/Topics:
Big Data Cloud Services; Cloud Applications Performance and Monitoring; Cloud Computing Platforms and Applications
Abstract:
MapReduce is an effective distributed programming model used in cloud computing for large-scale data analysis
applications. Hadoop, the best-known and most widely used open-source implementation of the MapReduce model,
assumes that every node in a cluster has the same computing capacity and that data are local to tasks. However,
in many real big data applications, where data may be located in many datacenters distributed across the
planet, these assumptions no longer hold, degrading Hadoop's performance. This paper addresses
this point by proposing a hierarchical MapReduce programming model in which a top-level scheduling system
is aware of the heterogeneity of the underlying computing contexts. The main idea of the approach is to improve
job processing time by partitioning and redistributing the workload among geo-distributed workers: this is
done by adequately monitoring the bottom-level computing and networking context.
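The partition-and-redistribute idea can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's actual scheduler: the effective-capacity metric, its weighting, the site names, and all parameters are assumptions introduced here for clarity.

```python
# Hypothetical sketch: split a job's input blocks among geo-distributed sites
# in proportion to an "effective capacity" score combining two monitored
# quantities, compute power and network throughput (the blend is an assumption).

def effective_capacity(compute_units, bandwidth_mbps, alpha=0.5):
    """Blend compute and network metrics into one score; alpha is a tunable weight."""
    return alpha * compute_units + (1 - alpha) * bandwidth_mbps

def partition_workload(total_blocks, sites):
    """Assign each site a share of input blocks proportional to its score.

    sites: dict mapping site name -> (compute_units, bandwidth_mbps)
    Returns a dict mapping site name -> block count, summing to total_blocks.
    """
    scores = {name: effective_capacity(c, b) for name, (c, b) in sites.items()}
    total_score = sum(scores.values())
    shares = {name: int(total_blocks * s / total_score)
              for name, s in scores.items()}
    # Give any rounding remainder to the highest-scoring site.
    remainder = total_blocks - sum(shares.values())
    shares[max(scores, key=scores.get)] += remainder
    return shares

# Example with three fictitious datacenters of unequal capacity:
sites = {"eu-dc": (100, 800), "us-dc": (60, 400), "asia-dc": (40, 200)}
print(partition_workload(1000, sites))
```

Under this sketch, a faster or better-connected site simply receives proportionally more input blocks; the paper's top-level scheduler would refresh such monitored metrics continuously rather than use static values.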