MapReduce (MR) (Dean and Ghemawat, 2004; Duda, 2012; Hejmalíček, 2015) became the next step in the evolution of the Data Warehouse.  There are many MR-based technologies that allow implementation of a Data Warehouse (Li et al., 2014).  Hive, one of the major components of the Facebook platform, is an example; it is intended for implementation of a Data Warehouse on Hadoop (Huai et al., 2014).  Here a SQL query is translated into a series of MR tasks.  Some experiments show that Hive is significantly slower than other methods (Duda, 2012; Zhou and Wang, 2013).  To solve the low performance problem of a Data Warehouse in MR systems, methods were developed that access the Data Warehouse directly in MR (i.e. without additional components).  Four such methods (MFRJ, MRIJ, MRIJ on RCFile, MRIJ with big dimension tables) are described in (Zhou and Wang, 2013).  They are all based on caching the dimension tables in the RAM of each node.
Along with MR Hadoop, the MapReduce-like system Spark and other similar systems are also used; however, their discussion is outside the scope of this paper.
This paper discusses the Multi-Fragment-Replication Join (MFRJ) method (Zhou and Wang, 2013), which, unlike the other methods, accesses an n-dimensional Data Warehouse in a single MR task and avoids extra transfers of the fact table foreign keys (Section 3).  It is also simple to implement.  The effectiveness of this method in comparison with the other methods is shown in (Zhou and Wang, 2013).
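To illustrate the dimension-table caching on which MFRJ relies, below is a minimal Hadoop sketch, assuming a hypothetical star schema: fact records of the form storeId,productId,amount arrive as input splits, while the dimension files store.csv and product.csv are replicated to every node through the distributed cache (job.addCacheFile(...)).  With job.setNumReduceTasks(0) the join completes as a single map-only MR task, so no fact-table foreign keys are transferred.  This is a sketch of the general map-side replicated-join technique, not the exact MFRJ implementation of (Zhou and Wang, 2013).

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/* Map-only star join: every map task caches the replicated dimension
 * tables in RAM and joins fact records against them locally, so the
 * whole join finishes in one MR task and no fact-table foreign keys
 * pass through the Shuffle phase.  The schema is hypothetical. */
public class StarJoinMapper
    extends Mapper<LongWritable, Text, NullWritable, Text> {

  private final Map<String, String> store = new HashMap<>();
  private final Map<String, String> product = new HashMap<>();

  @Override
  protected void setup(Context context) throws IOException {
    // Dimension files are shipped to every node via the distributed
    // cache (job.addCacheFile(...)) and appear as local files here.
    load("store.csv", store);
    load("product.csv", product);
  }

  private static void load(String file, Map<String, String> table)
      throws IOException {
    try (BufferedReader in = new BufferedReader(new FileReader(file))) {
      String line;
      while ((line = in.readLine()) != null) {
        int comma = line.indexOf(',');          // layout: key,attributes
        table.put(line.substring(0, comma), line.substring(comma + 1));
      }
    }
  }

  @Override
  protected void map(LongWritable offset, Text record, Context context)
      throws IOException, InterruptedException {
    // Hypothetical fact record layout: storeId,productId,amount
    String[] f = record.toString().split(",");
    String s = store.get(f[0]);                 // probe cached dimensions
    String p = product.get(f[1]);
    if (s != null && p != null) {               // inner-join semantics
      context.write(NullWritable.get(), new Text(s + "," + p + "," + f[2]));
    }
  }
}

Because every map task holds complete copies of the dimension tables, the approach assumes that the dimensions fit in the RAM of a node, which is the caching premise shared by the four methods above.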
The motivation for this work is the need to forecast Data Warehouse access times, driven by the intense growth in the number of MR applications.  Examples of such applications are provided below:
o  Internet application log processing for large online shops and social networks (Duda, 2012) for service demand analysis.
o  Processing of the large data volumes collected by credit organizations for market behavior forecasts.
o  Statistics calculation in large-scale weather forecast processing.
The problem is that large-volume data processing is time-consuming, which may become unacceptable.  Discovering this problem at the operational stage leads to a costly resolution.  First, there are many processing tasks.  Second, if the tasks are complex, tuning does not help; in this case the algorithms have to be changed and the Map and Reduce functions recoded.  This means redoing already completed work, wasting time and resources.  Thus, estimating the processing time for peak load at the design stage, i.e. before the MR tasks are implemented, is beneficial.
The importance of modeling can be demonstrated with the following example.  Two RDBMSes (column- and row-based) and MR Hadoop were compared in an experiment in (Pavlo et al., 2009).  The conclusion was that Hadoop loses in the test tasks.
A detailed analysis in (Burdakov et al., 2014) showed that the experiments with the RDBMSes were executed with fewer than 100 nodes, with low data selectivity in the queries, and without sorting or exchange of fragmented-table records between the nodes.  Modeling was performed with calibrated models and different input parameters (Burdakov et al., 2014).  The results showed that Hadoop outperforms the RDBMSes under high selectivity and sorting starting from 300 nodes (6 TB of stored data).
Obviously, implementing a test bed and running live experiments on a large number of nodes is much more expensive than developing an adequate mathematical model and applying it.
This paper discusses the MFRJ access method to a Data Warehouse (Section 3), analyzes the MR workflow (Section 4), develops an analytical model for the evaluation of Data Warehouse query execution time (Section 5), and calibrates the model and evaluates its adequacy based on experiments (Section 6).
2 RELATED WORK 
The analytical model developed here for query execution time evaluation is a cost model (Simhadri, 2013).
Below we provide an overview of the existing models and point out their disadvantages.
Burdakov et al. (2014) and Palla (2009) model only two-table joins in MR.  Palla (2009) evaluates the input/output cost but disregards the processing part.  However, as the measurements indicate (e.g. see Section 6), the processing load cannot be disregarded.  The processing time is considered in (Burdakov et al., 2014); however, the Shuffle algorithm there is simplified.
Afrati and Ullman (2010) propose the following access method for n-dimensional Data Warehouses.  In each node the Map phase reads dimension and fact table records (n+1 tables).  The Map function calculates hash values h(b_i) for the attributes b_i that participate in the join.  Each Reduce task is associated with n values {h(b_i)}.  Each record is sent to multiple Reduce tasks according to the calculated hash values.  The Reduce task joins the received records.  The problem of minimizing the number of transferred records is solved for a fixed number of Reduce tasks.
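To make the routing rule concrete, below is a minimal sketch of how a record's target Reduce tasks could be computed for n = 2 dimensions.  The grid layout, the names ShareJoinRouter, routeFact, and routeDim1, and the shares m1 and m2 are illustrative assumptions, not the authors' implementation; the sketch only follows the share-based replication idea described above.

import java.util.ArrayList;
import java.util.List;

/* Reduce tasks form an m1 x m2 grid of hash buckets (m1 * m2 equals
 * the fixed number of Reduce tasks).  All names and the choice of
 * n = 2 are hypothetical, for illustration only. */
public class ShareJoinRouter {
  private final int m1, m2;                 // shares per join attribute

  public ShareJoinRouter(int m1, int m2) { this.m1 = m1; this.m2 = m2; }

  private int h1(Object b1) { return Math.floorMod(b1.hashCode(), m1); }
  private int h2(Object b2) { return Math.floorMod(b2.hashCode(), m2); }

  // A fact record carries both join attributes b1 and b2, so it goes
  // to exactly one Reduce task: bucket (h1(b1), h2(b2)).
  public List<Integer> routeFact(Object b1, Object b2) {
    List<Integer> targets = new ArrayList<>();
    targets.add(h1(b1) * m2 + h2(b2));
    return targets;
  }

  // A record of dimension 1 carries only b1, so it must be replicated
  // to every bucket of the other dimension: m2 copies in total.
  public List<Integer> routeDim1(Object b1) {
    List<Integer> targets = new ArrayList<>();
    int row = h1(b1);
    for (int j = 0; j < m2; j++) {
      targets.add(row * m2 + j);
    }
    return targets;
  }
}

Thus a fact record reaches exactly one Reduce task, whereas each dimension-1 record is sent m2 times (and each dimension-2 record m1 times); choosing the shares to minimize this replication is the optimization mentioned above.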
the following disadvantages: