AN OPTIMAL EVALUATION OF GROUPBY-JOIN QUERIES IN
DISTRIBUTED ARCHITECTURES
M. Al Hajj Hassan and M. Bamha
LIFO, Université d'Orléans, B.P. 6759, 45067 Orléans Cedex 2, France
Keywords:
Parallel DataBase Management Systems (PDBMS), Parallel joins, Data skew, Join product skew, GroupBy-
Join queries, BSP cost model.
Abstract:
SQL queries involving join and group-by operations are fairly common in many decision support applications, where the size of the input relations is usually very large; the parallelization of these queries is therefore highly desirable in order to obtain an acceptable response time. The most significant drawbacks of the algorithms presented in the literature for treating such queries are that they are very sensitive to data skew and involve expensive communication and Input/Output costs in the evaluation of the join operation. In this paper, we present an algorithm that overcomes these drawbacks: it evaluates the "GroupBy-Join" query without directly evaluating the costly join operation, thus reducing its Input/Output and communication costs. Furthermore, the performance of this algorithm is analyzed using the scalable and portable BSP (Bulk Synchronous Parallel) cost model, which predicts a linear speedup even for highly skewed data.
1 INTRODUCTION
Aggregate functions, used to summarize large volumes of data based on a designated grouping, are widely employed in applications such as decision support, OnLine Analytical Processing (OLAP) and data warehousing (Taniar et al., 2000), because in such applications aggregated and summarized data are more important than detailed records (Datta et al., 1998). Aggregate operations may be applied to the output of the join of multiple tables holding potentially billions of records, and these tables may grow rapidly, especially in OLAP systems. Moreover, the output of these queries must be obtained in a reasonable processing time. For these reasons, parallel processing of such queries yields a huge performance gain, especially in PDBMS. However, the use of efficient parallel algorithms in PDBMS is fundamental in order to obtain acceptable performance (Bamha and Hains, 2000; Seetha and Yu, 1990).
Several parallel algorithms for evaluating "GroupBy-Join" queries have been presented in the literature (Shatdal and Naughton, 1995; Taniar et al., 2000), but these algorithms are inefficient due to their high communication cost: all the tuples of the relations are redistributed between processors, even though some of these tuples may not even contribute to the result of the join operation.
In addition, these algorithms fully materialize the intermediate results of the join operations, and their Input/Output cost is very high, since it is reasonable to assume that the output relation cannot fit in the main memory of every processor and must therefore be reread from disk in order to evaluate the aggregate function. Finally, these algorithms cannot solve the problem of data skew, because data redistribution is generally based on hashing data into buckets, and hashing is known to be inefficient in the presence of high frequencies (Bamha, 2005; Seetha and Yu, 1990).
In this paper, we present a new parallel algorithm for evaluating "GroupBy-Join" queries on Shared Nothing machines (a distributed architecture where each processor has its own memory and its own disks), when the join attributes are different from the group-by attributes. Our main contribution is that this algorithm does not need to materialize the join operation, unlike traditional algorithms where the join operation is evaluated first and then the group-by
and aggregate functions (Yan and Larson, 1994). It is
also insensitive to data skew and its communication
and Input/Output costs are reduced to a minimum.
In this algorithm, we partially evaluate the aggre-
gate function before redistributing the tuples. This
helps in reducing the cost of data redistribution. We
use the histograms of both relations in order to find
the tuples that participate in the result of the join oper-
ation. It is proved in (Bamha and Hains, 2005; Bamha
and Hains, 2000), using the BSP model, that his-
togram management has a negligible cost when com-
pared to the gain it provides in reducing the commu-
nication cost.
In traditional algorithms, all the tuples of the output of the join are redistributed using a hash function. On the contrary, in our algorithm we only redistribute the results of the semi-joins of the histograms, which are very small compared to the size of the input relations. This helps in reducing the amount of data transferred over the network and therefore the communication cost.
The performance of this algorithm is analyzed using
the BSP cost model which predicts for our algorithm
a linear speedup even for highly skewed data.
2 COMPUTATION OF "GROUPBY-JOIN" QUERIES
In DBMSs, we can distinguish two types of "GroupBy-Join" queries. In the first type, the join attributes and the group-by attributes are the same. In this case, it is preferable to carry out the group-by and aggregate functions first and then the join operation (Taniar et al., 2000), because this helps in reducing the size of the relations to be joined, and consequently decreases the communication cost and the query execution time. On the contrary, this cannot be applied to the second type of queries, where the join attributes are different from the group-by attributes (see (Al Hajj Hassan and Bamha, 2007) for a long and detailed version of this paper). In this paper, we focus on this second type of "GroupBy-Join" queries, and we present an algorithm which partially evaluates the aggregate functions before redistributing the tuples using histograms, thus reducing the communication cost as much as possible.
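To make the distinction concrete, here is an illustrative query of the second type, written over two relations R and S with the attribute names used in the next section (the query is our own example, not one drawn from a benchmark):

    SELECT R.y, S.z, SUM(S.u)
    FROM R, S
    WHERE R.x = S.x
    GROUP BY R.y, S.z

The join attribute x appears neither in the GROUP BY clause nor in the output, so the group-by cannot simply be pushed below the join as in the first type of queries.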
3 GROUPBY-JOIN QUERIES: A
NEW APPROACH
In this section, we present a detailed description of a new parallel algorithm used to evaluate the "GroupBy-Join" queries when the group-by attributes are different from the join attributes. We assume that the relation $R$ (resp. $S$) is evenly partitioned among processors by horizontal fragmentation such that $|R_i| \simeq \frac{|R|}{p}$ ($i = 1, \ldots, p$), where $p$ is the number of processors.
For simplicity of description and without loss of generality, we consider that the query has only one join attribute $x$ and that the group-by attribute set consists of one attribute $y$ of $R$ and another attribute $z$ of $S$. We also assume that the aggregate function is applied to the values of the attribute $u$ of $S$.
In the rest of this paper we use the following notation for each relation $T \in \{R, S\}$:

- $T_i$ denotes the fragment of relation $T$ placed on processor $i$, a sub-relation of $T$,
- $Hist^w(T)$ denotes the histogram¹ of relation $T$ with respect to the attribute $w$, i.e. a list of pairs $(v, n_v)$ where $n_v \neq 0$ is the number of tuples of relation $T$ having the value $v$ for the attribute $w$. The histogram is often much smaller and never larger than the relation it describes,
- $Hist^w(T_i)$ denotes the histogram of fragment $T_i$, while $Hist^w_i(T)$ is processor $i$'s fragment of the histogram of $T$,
- $Hist^w(T)(v)$ is the frequency ($n_v$) of value $v$ in relation $T$, while $Hist^w(T_i)(v)$ is its frequency in sub-relation $T_i$,
- $AGGR^w_{f,u}(T)$² is the result of applying the aggregate function $f$ to the values of the attribute $u$ of every group of tuples of $T$ having identical values of the group-by attributes $w$. $AGGR^w_{f,u}(T)$ is formed of a list of tuples $(v, f_v)$ where $f_v$ is the result of applying the aggregate function to the group of tuples having value $v$ for the attribute $w$ ($w$ may be formed of more than one attribute),
- $AGGR^w_{f,u}(T_i)$ denotes the result of applying the aggregate function to the attribute $u$ of relation $T_i$, while $AGGR^w_{f,u,i}(T)$ is processor $i$'s fragment of the result of applying the aggregate function to $T$,
- $AGGR^w_{f,u}(T)(v)$ (resp. $AGGR^w_{f,u}(T_i)(v)$) is the result $f_v$ of the aggregate function for the group of tuples having value $v$ for the group-by attribute $w$ in relation $T$ (resp. $T_i$),
- $\|T\|$ denotes the number of tuples of relation $T$, and $|T|$ denotes the size (expressed in bytes or number of pages) of relation $T$.

¹ Histograms are implemented as balanced trees (B-trees): a data structure that maintains an ordered set of data to allow efficient search and insert operations.
² $AGGR^w_{f,u}(T)$ is implemented as a B-tree.
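As a concrete, non-normative illustration of this notation, a histogram can be sketched as a plain mapping from values to frequencies; here a Python Counter stands in for the B-tree of footnote 1, which changes the cost of search and insert but not the interface:

    from collections import Counter

    # Hist^w(T): frequency of each value v of attribute w in relation T.
    # T is modeled as a list of tuples; w is the position of the attribute.
    def hist(T, w):
        return Counter(t[w] for t in T)

    R = [(1, 'a'), (1, 'b'), (2, 'a'), (1, 'a')]
    hist_x_R = hist(R, 0)     # Counter({1: 3, 2: 1})
    print(hist_x_R[1])        # Hist^x(R)(1) = 3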
The algorithm proceeds in six phases. To study
the cost of each phase we use the scalable and
portable Bulk-Synchronous Parallel (BSP) cost
model which is a programming model introduced
by L. Valiant (Valiant, 1990). In this algorithm, the
notation O(...) hides only small constant factors:
they depend only on the program implementation but
neither on data nor on the BSP machine parameters
(Al Hajj Hassan and Bamha, 2007).
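For reference, in the BSP model the execution time of a superstep is commonly bounded by an expression of the form
$$\max_{i=1,\ldots,p} w_i \;+\; g \max_{i=1,\ldots,p} h_i \;+\; l,$$
where $w_i$ is the local computation time of processor $i$, $h_i$ the number of words it sends or receives, $g$ the network's communication throughput ratio and $l$ the cost of a barrier of synchronization. The parameters $g$ and $l$ appearing in the costs below are exactly these BSP machine parameters.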
Phase 1: Creating local histograms
In this phase, the local histograms $Hist^{x,y}(R_i)$ ($i = 1, \ldots, p$) of blocks $R_i$ are created in parallel by a scan of the fragment $R_i$ on processor $i$ in time $c_{i/o} \max_{i=1,\ldots,p} |R_i|$, where $c_{i/o}$ is the cost of writing/reading a page of data from disk.
In addition, the local fragments $AGGR^{x,z}_{f,u}(S_i)$ ($i = 1, \ldots, p$) of blocks $S_i$ are also created in parallel on each processor $i$ by applying the aggregate function $f$ to every group of tuples having identical values of the couple of attributes $(x, z)$, in time $c_{i/o} \max_{i=1,\ldots,p} |S_i|$.
In this phase we also compute the frequency of each value of the attribute $x$ in $Hist^{x,y}(R_i)$ and $AGGR^{x,z}_{f,u}(S_i)$, which is needed in phase 2. So while creating $Hist^{x,y}(R_i)$ (resp. $AGGR^{x,z}_{f,u}(S_i)$), we also create on the fly their local histograms $Hist^x(R_i)$ (resp. $Hist^x(S_i)$) with respect to $x$, i.e. $Hist^x(R_i)$ (resp. $Hist^x(S_i)$) holds the frequency of each value of the attribute $x$ in $Hist^{x,y}(R_i)$ (resp. $AGGR^{x,z}_{f,u}(S_i)$), where we count tuples having the same values of the attributes $(x, y)$ (resp. $(x, z)$) only once.
We use the following algorithm to create $Hist^x(R_i)$; a similar one is used to create $Hist^x(S_i)$.

    Par (on each node in parallel) i = 1, ..., p
        Hist^x(R_i) = NULL
        For every tuple t that will be inserted or used to modify Hist^{x,y}(R_i) do
            If Hist^{x,y}(R_i)(t.x, t.y) = NULL Then
                freq1 = Hist^x(R_i)(t.x)
                If freq1 ≠ NULL Then
                    Increment the frequency of t.x in Hist^x(R_i)
                Else
                    Insert a new tuple (t.x, 1) into Hist^x(R_i)
                EndIf
            EndIf
        EndFor
    EndPar
In principle, this phase costs:
$$Time_{phase1} = O\Big(c_{i/o} \max_{i=1,\ldots,p} (|R_i| + |S_i|)\Big).$$
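As an illustration only, the following minimal Python sketch (our rendering, under the assumptions that fragments are lists of tuples and that plain dictionaries stand in for the B-trees of footnotes 1 and 2) builds $Hist^{x,y}(R_i)$, $AGGR^{x,z}_{f,u}(S_i)$ and the two derived $Hist^x$ histograms in a single scan of each fragment:

    from collections import defaultdict

    def phase1_local(R_i, S_i, f):
        """R_i: list of (x, y, ...) tuples; S_i: list of (x, z, u, ...) tuples;
        f: a commutative, associative aggregate, e.g. lambda a, b: a + b for SUM."""
        hist_xy_R = defaultdict(int)   # (x, y) -> frequency
        hist_x_R = defaultdict(int)    # x -> number of distinct (x, y) groups
        for t in R_i:
            x, y = t[0], t[1]
            if (x, y) not in hist_xy_R:    # first occurrence of this (x, y) group
                hist_x_R[x] += 1           # count each (x, y) group only once
            hist_xy_R[(x, y)] += 1

        aggr_xz_S = {}                 # (x, z) -> partial aggregate of u
        hist_x_S = defaultdict(int)    # x -> number of distinct (x, z) groups
        for t in S_i:
            x, z, u = t[0], t[1], t[2]
            if (x, z) not in aggr_xz_S:
                hist_x_S[x] += 1
                aggr_xz_S[(x, z)] = u
            else:
                aggr_xz_S[(x, z)] = f(aggr_xz_S[(x, z)], u)

        return hist_xy_R, hist_x_R, aggr_xz_S, hist_x_S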
Phase 2: Local semi-joins computation
In order to minimize the communication cost, only tuples of $Hist^{x,y}(R)$ and $AGGR^{x,z}_{f,u}(S)$ that will be present in the join result are redistributed. To this end, we compute the semi-joins $\overline{Hist}^{x,y}(R_i) = Hist^{x,y}(R_i) \ltimes AGGR^{x,z}_{f,u}(S)$ and $\overline{AGGR}^{x,z}_{f,u}(S_i) = AGGR^{x,z}_{f,u}(S_i) \ltimes Hist^{x,y}(R)$ using proposition 2 presented in (Bamha and Hains, 2005), where we apply the hash function to the tuples of $Hist^x(R_i)$ and $Hist^x(S_i)$. In fact the histograms computed in phase 1 hold exactly the same set of values of the attribute $x$ as the histograms of the fragments $R_i$ themselves; what differs is only the value of their frequency attribute, so the sizes $|Hist^x(R_i)|$ used below are unaffected (this also applies to $Hist^x(S_i)$). Hence the cost of this phase is (Bamha and Hains, 2005):
$$Time_{phase2} = O\Big(\max_{i=1,\ldots,p} \|Hist^{x,y}(R_i)\| + \max_{i=1,\ldots,p} \|AGGR^{x,z}_{f,u}(S_i)\| + \min\big(g\,|Hist^x(R)| + \|Hist^x(R)\|,\; g\,\tfrac{|R|}{p} + \tfrac{\|R\|}{p}\big) + \min\big(g\,|Hist^x(S)| + \|Hist^x(S)\|,\; g\,\tfrac{|S|}{p} + \tfrac{\|S\|}{p}\big) + l\Big),$$
where g is the BSP communication parameter and l
is the cost of a barrier of synchronization.
During semi-join computation, we store for each value $d \in Hist^x(R) \cap Hist^x(S)$ an extra piece of information called $index(d) \in \{1, 2, 3\}$ which allows us to decide whether, for a given value $d$, the frequencies of tuples of $Hist^{x,y}(R)$ and $AGGR^{x,z}_{f,u}(S)$ having the value $d$ are greater (resp. lesser) than a threshold frequency $f_0$. It also permits us to choose dynamically the probe and the build relation for each value $d$ of the join attribute. This choice reduces the global redistribution cost to a minimum.
In the rest of this paper, we use $f_0 = p \log(p)$ as the threshold frequency (Bamha and Hains, 2000).
For a given value $d \in Hist^x(R) \cap Hist^x(S)$ (this intersection is found while computing the semi-joins, cf. proposition 2 presented in (Bamha and Hains, 2005)), $index(d) = 3$ means that $Hist^x(R)(d) < f_0$ and $Hist^x(S)(d) < f_0$; $index(d) = 2$ means that $Hist^x(S)(d) \geq f_0$ and $Hist^x(S)(d) > Hist^x(R)(d)$; and $index(d) = 1$ means that $Hist^x(R)(d) \geq f_0$ and $Hist^x(R)(d) \geq Hist^x(S)(d)$.
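As an illustration (our sketch; in the algorithm this information is in fact computed on the fly during semi-join processing), $index(d)$ can be derived from the global frequencies as follows:

    import math

    def compute_index(freq_R, freq_S, p):
        """freq_R, freq_S: dicts d -> Hist^x(R)(d) and d -> Hist^x(S)(d)."""
        f0 = p * math.log(p)                     # threshold frequency f0 = p log(p)
        index = {}
        for d in freq_R.keys() & freq_S.keys():  # values in both histograms
            if freq_R[d] < f0 and freq_S[d] < f0:
                index[d] = 3                     # both frequencies below threshold
            elif freq_S[d] >= f0 and freq_S[d] > freq_R[d]:
                index[d] = 2                     # S side is the high-frequency side
            else:
                index[d] = 1                     # remaining case: freq_R >= f0 and freq_R >= freq_S
        return index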
Note that unlike the algorithms presented in (Shatdal and Naughton, 1995; Taniar et al., 2000), where both relations R and S are redistributed, we only redistribute $\overline{Hist}^{x,y}(R_i)$ and $\overline{AGGR}^{x,z}_{f,u}(S_i)$ to find the final result. This reduces the communication costs to a minimum.
At the end of this phase, we will divide $\overline{Hist}^{x,y}(R_i)$ and $\overline{AGGR}^{x,z}_{f,u}(S_i)$ on each processor $i$ into three sub-histograms each, such that $\overline{Hist}^{x,y}(R_i) = \bigcup_{j=1}^{3} Hist^{(j)x,y}(R_i)$ and $\overline{AGGR}^{x,z}_{f,u}(S_i) = \bigcup_{j=1}^{3} AGGR^{(j)x,z}_{f,u}(S_i)$, where all the tuples of $Hist^{(1)x,y}(R_i)$ (resp. $AGGR^{(1)x,z}_{f,u}(S_i)$) are associated to values $d$ such that $index(d) = 1$ (resp. $index(d) = 2$), those of $Hist^{(2)x,y}(R_i)$ (resp. $AGGR^{(2)x,z}_{f,u}(S_i)$) are associated to values $d$ such that $index(d) = 2$ (resp. $index(d) = 1$), and all the tuples of $Hist^{(3)x,y}(R_i)$ and $AGGR^{(3)x,z}_{f,u}(S_i)$ are associated to values $d$ such that $index(d) = 3$.
Phase 3: Creating the communication templates
The tuples of relations $Hist^{(3)x,y}(R_i)$ and $AGGR^{(3)x,z}_{f,u}(S_i)$ (which have very low frequencies for the join attribute) have an effect neither on Attribute Value Skew (AVS) nor on Join Product Skew (JPS), so they are redistributed using a hash function. However, the tuples of $Hist^{(1)x,y}(R_i)$ and $AGGR^{(1)x,z}_{f,u}(S_i)$ are associated to high frequencies of the join attribute, so they have an important effect on AVS and JPS. We will therefore use an appropriate redistribution algorithm in order to efficiently avoid both AVS and JPS (Bamha and Hains, 2000).
3.a To this end, we partition the histogram $Hist^x(R \bowtie S)$ (which is simply the intersection of $Hist^x(R)$ and $Hist^x(S)$) into two sub-histograms: $Hist^{(1,2)x}(R \bowtie S)$ and $Hist^{(3)x}(R \bowtie S)$, where the values $d \in Hist^{(1,2)x}(R \bowtie S)$ are associated to high frequencies of the join attribute (i.e. $index(d) = 1$ or $index(d) = 2$) while the values $d \in Hist^{(3)x}(R \bowtie S)$ are associated to low frequencies (i.e. $index(d) = 3$). This partition step is performed in parallel, on each processor $i$, by a local traversal of the histogram $Hist^x_i(R \bowtie S)$ in time:
$$Time_{3.a} = O\Big(\max_{i=1,\ldots,p} \|Hist^x_i(R \bowtie S)\|\Big).$$
3.b Communication templates for high frequencies: we first create a communication template: the list of messages which constitutes the relations' redistribution. This step is performed jointly by all processors, each one not necessarily computing the list of its own messages, so as to balance the overall process.
Each processor $i$ thus computes a set of necessary messages relating to the values $d$ it owns in $Hist^{(1,2)x}_i(R \bowtie S)$. The communication template of $Hist^{(1)x,y}(R)$ is derived by applying the following algorithm. We also apply the same algorithm to compute the communication template of $AGGR^{(1)x,z}_{f,u}(S)$, but we replace $Hist^x(R)$ by $Hist^x(S)$.
    if Hist^x(R)(d) mod p = 0 then
        each processor j will hold a block of size
        block_j(d) = Hist^x(R)(d) / p of tuples of value d
    else
        begin
            Pick a random value j_0 between 0 and (p − 1)
            if processor's index j is between j_0 and j_0 + (Hist^x(R)(d) mod p) then
                processor of index j will hold a block of size:
                block_j(d) = ⌊Hist^x(R)(d) / p⌋ + 1
            else
                processor of index j will hold a block of size:
                block_j(d) = ⌊Hist^x(R)(d) / p⌋
        end.
In the above algorithm, $\lfloor x \rfloor$ is the largest integral value not greater than $x$, and $block_j(d)$ is the number of tuples of value $d$ that processor $j$ should own after redistribution of the fragments $T_i$ of relation $T$. The absolute value of $Rest_j(d) = Hist_j(T)(d) - block_j(d)$ determines the number of tuples of value $d$ that processor $j$ must send (if $Rest_j(d) > 0$) or receive (if $Rest_j(d) < 0$).
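For concreteness, here is a small Python sketch of these two quantities (ours, not the paper's code; we make the wrap-around of processor indices modulo p explicit, which the algorithm above leaves implicit):

    import random

    def block_sizes(freq_d, p):
        """Target sizes block_j(d) for a value d of global frequency freq_d."""
        base, rest = divmod(freq_d, p)       # floor(freq_d / p) and freq_d mod p
        block = [base] * p
        j0 = random.randrange(p)             # random starting processor j_0
        for k in range(rest):                # 'rest' processors hold one extra tuple
            block[(j0 + k) % p] += 1
        return block

    def rest_values(owned_d, block):
        """Rest_j(d): tuples of value d owned by j minus its target block size;
        positive entries are senders, negative entries are receivers."""
        return [own - target for own, target in zip(owned_d, block)]

For example, with freq_d = 10 and p = 4, block_sizes returns two blocks of 3 and two blocks of 2, so the tuples of a skewed value d end up spread almost evenly over the processors.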
For $d \in Hist^{(1,2)x}_i(R \bowtie S)$, processor $i$ owns a description of the layout of tuples of value $d$ over the network. It may therefore determine the number of tuples of value $d$ which every processor must send/receive. This information constitutes the communication template. Only those $j$ for which $Rest_j(d) > 0$ (resp. $Rest_j(d) < 0$) send (resp. receive) tuples of value $d$. This step is thus completed in time:
$$Time_{3.b} = O\big(\|Hist^{(1,2)x}(R \bowtie S)\|\big).$$
The tuples associated to low frequencies (i.e. tuples having $d \in Hist^{(3)x}_i(R \bowtie S)$) have an effect neither on the AVS nor on the JPS. These tuples are simply mapped to processors using a hash function.
Thus the cost of creating the communication templates is:
$$Time_{phase3} = O\Big(\max_{i=1,\ldots,p} \|Hist^x_i(R \bowtie S)\| + \|Hist^{(1,2)x}(R \bowtie S)\|\Big).$$
Phase 4: Data redistribution
4.a Redistribution of tuples having $d \in Hist^{(1,2)x}_i(R \bowtie S)$: every processor $i$ holds, for every one of its local $d \in Hist^{(1,2)x}_i(R \bowtie S)$, the non-zero communication volumes it prescribes as a part of the communication template: $Rest_j(d) \neq 0$ for $j = 1, \ldots, p$. This information takes the form of sending orders sent to their target processors in a first superstep, followed by the actual redistribution superstep where processors obey all the orders they have received.
Each processor $i$ first splits the processor indices $j$ into two groups: those for which $Rest_j(d) > 0$ and those for which $Rest_j(d) < 0$. This is done by a sequential traversal of the $Rest_{..}(d)$ array.
Let $\alpha$ (resp. $\beta$) be the number of $j$'s where $Rest_j(d)$ is positive (resp. negative), and $Proc(k)_{k=1,\ldots,\alpha+\beta}$ the array of processor indices for which $Rest_j(d) \neq 0$, arranged in the manner that $Rest_{Proc(j)}(d) > 0$ for $j = 1, \ldots, \alpha$ and $Rest_{Proc(j)}(d) < 0$ for $j = \alpha + 1, \ldots, \alpha + \beta$.
A sequential traversal of $Proc(k)_{k=1,\ldots,\alpha+\beta}$ determines the number of tuples that each processor $j$ will send. The sending orders concerning attribute value $d$ are computed using the following procedure, whose maximal complexity is $O(\|Hist^{(1,2)x}(R \bowtie S)\|)$, because for a given $d$ no more than $(p-1)$ processors can send data, and each processor $i$ is in charge of the redistribution of tuples having $d \in Hist^{(1,2)x}_i(R \bowtie S)$.
    i := 1; j := α + 1;
    while (i ≤ α) do
        begin
            n_tuples := min(Rest_Proc(i)(d), −Rest_Proc(j)(d));
            order_to_send(Proc(i), Proc(j), d, n_tuples);
            Rest_Proc(i)(d) := Rest_Proc(i)(d) − n_tuples;
            Rest_Proc(j)(d) := Rest_Proc(j)(d) + n_tuples;
            if Rest_Proc(i)(d) = 0 then i := i + 1; endif
            if Rest_Proc(j)(d) = 0 then j := j + 1; endif
        end.
For each processor $i$ and $d \in Hist^{(1,2)x}_i(R \bowtie S)$, all the $order\_to\_send(j, i, \ldots)$ orders are sent to processor $j$ when $j \neq i$, in time $O(g\,|Hist^{(1,2)x}(R \bowtie S)| + l)$. Thus, this step costs:
$$Time_{4.a} = O\big(g\,|Hist^{(1,2)x}(R \bowtie S)| + \|Hist^{(1,2)x}(R \bowtie S)\| + l\big).$$
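The procedure above is essentially a two-pointer sweep matching senders against receivers; the following runnable Python rendering (ours) makes the bookkeeping concrete, writing the transferred quantity as min(Rest, −Rest) since the receivers' Rest values are negative:

    def sending_orders(rest, d):
        """rest: list of Rest_j(d), one per processor, summing to 0.
        Returns a list of orders (sender, receiver, d, n_tuples)."""
        senders = [j for j in range(len(rest)) if rest[j] > 0]
        receivers = [j for j in range(len(rest)) if rest[j] < 0]
        orders, i, j = [], 0, 0
        while i < len(senders):
            s, r = senders[i], receivers[j]
            n = min(rest[s], -rest[r])       # tuples moved from s to r
            orders.append((s, r, d, n))      # order_to_send(Proc(i), Proc(j), d, n)
            rest[s] -= n
            rest[r] += n
            if rest[s] == 0:
                i += 1
            if rest[r] == 0:
                j += 1
        return orders

For instance, sending_orders([3, -1, -2, 0], d) produces the orders (0, 1, d, 1) and (0, 2, d, 2).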
4.b Tuples of $Hist^{(3)x,y}(R_i)$ and $AGGR^{(3)x,z}_{f,u}(S_i)$ (i.e. tuples having $d \in Hist^{(3)x}_i(R \bowtie S)$) are associated to low frequencies, so they have an effect neither on the AVS nor on the JPS. These relations are redistributed using a hash function.
At the end of steps 4.a and 4.b, each processor $i$ has local knowledge of how the tuples of the semi-joins $\overline{Hist}^{x,y}(R_i)$ and $\overline{AGGR}^{x,z}_{f,u}(S_i)$ will be redistributed. Redistribution is then performed in time:
$$Time_{4.b} = O\Big(g\,\big(|\overline{Hist}^{x,y}(R_i)| + |\overline{AGGR}^{x,z}_{f,u}(S_i)|\big) + l\Big).$$
Thus the total cost of the redistribution phase is:
$$Time_{phase4} = O\Big(g\,\Big(\max_{i=1,\ldots,p}\big(|\overline{Hist}^{x,y}(R_i)| + |\overline{AGGR}^{x,z}_{f,u}(S_i)|\big) + |Hist^{(1,2)x}(R \bowtie S)|\Big) + \|Hist^{(1,2)x}(R \bowtie S)\| + l\Big).$$
We mention that we only redistribute the tuples of the semi-joins $\overline{Hist}^{x,y}(R_i)$ and $\overline{AGGR}^{x,z}_{f,u}(S_i)$, where $|\overline{Hist}^{x,y}(R_i)|$ and $|\overline{AGGR}^{x,z}_{f,u}(S_i)|$ are generally very small compared to $|R_i|$ and $|S_i|$. In addition, $|Hist^x(R \bowtie S)|$ is generally very small compared to $|Hist^{x,y}(R)|$ and $|AGGR^{x,z}_{f,u}(S)|$. Thus we reduce the communication cost to a minimum.
Phase 5: Local computation of the aggregate function
At this step, every processor has partitions of $\overline{Hist}^{x,y}(R)$ and $\overline{AGGR}^{x,z}_{f,u}(S)$. Using equation 2 in (Bamha, 2005), we can deduce that the tuples of $Hist^{(1)x,y}(R_i)$, $Hist^{(2)x,y}(R_i)$ and $Hist^{(3)x,y}(R_i)$ can be joined with the tuples of $AGGR^{(2)x,z}_{f,u}(S_i)$, $AGGR^{(1)x,z}_{f,u}(S_i)$ and $AGGR^{(3)x,z}_{f,u}(S_i)$ respectively. But the frequencies of tuples of $Hist^{(1)x,y}(R_i)$ and $AGGR^{(1)x,z}_{f,u}(S_i)$ are by definition greater than those of the corresponding (matching) tuples in $AGGR^{(2)x,z}_{f,u}(S_i)$ and $Hist^{(2)x,y}(R_i)$ respectively. So we will choose $Hist^{(1)x,y}(R_i)$ and $AGGR^{(1)x,z}_{f,u}(S_i)$ as the build relations, and $Hist^{(2)x,y}(R_i)$ and $AGGR^{(2)x,z}_{f,u}(S_i)$ as the probe relations. Hence, we need to duplicate the probe relations to all processors in time:
$$Time_{phase5.a} = O\big(g\,(|Hist^{(2)x,y}(R)| + |AGGR^{(2)x,z}_{f,u}(S)|) + l\big).$$
Now, using the following algorithm, we are able to compute the local aggregate function on every processor without needing to fully materialize the intermediate results of the join operation.
In this algorithm, we create on each processor $i$ the relation $AGGR^{y,z}_{f,u}((R \bowtie S)_i)$ that holds the local results of applying the aggregate function to every group of tuples having the same value of the couple of attributes $(y, z)$. $AGGR^{y,z}_{f,u}((R \bowtie S)_i)$ has the form $(y, z, v)$, where $y$ and $z$ are the group-by attributes and $v$ is the result of the aggregate function.
    (1)  Par (on each node in parallel) i = 1, ..., p
    (2)      AGGR^{y,z}_{f,u}((R ⋈ S)_i) = NULL;
    (3)      For every tuple t of relation Hist^{(1)x,y}(R_i) do
    (4)          For every entry v1 = AGGR^{(2)x,z}_{f,u}(S_i)(t.x, z) do
    (5)              v2 = AGGR^{y,z}_{f,u}((R ⋈ S)_i)(t.y, z);
    (6)              If v2 ≠ NULL Then
    (7)                  Update AGGR^{y,z}_{f,u}((R ⋈ S)_i)(t.y, z) = F(v1, v2),
                         where F() is the aggregate function;
    (8)              Else
    (9)                  Insert a new tuple (t.y, z, v1) into the histogram
                         AGGR^{y,z}_{f,u}((R ⋈ S)_i);
    (10)             EndIf
    (11)         EndFor
    (12)     EndFor
    (13)     Repeat steps (3)...(12) but replace Hist^{(1)x,y}(R_i) in (3)
             by AGGR^{(1)x,z}_{f,u}(S_i), and AGGR^{(2)x,z}_{f,u}(S_i)(t.x, z) in (4)
             by Hist^{(2)x,y}(R_i)(t.x, y);
    (14)     Repeat steps (3)...(12) but replace Hist^{(1)x,y}(R_i) in (3)
             by Hist^{(3)x,y}(R_i), and AGGR^{(2)x,z}_{f,u}(S_i)(t.x, z) in (4)
             by AGGR^{(3)x,z}_{f,u}(S_i)(t.x, z);
    (15) EndPar
The cost of applying this algorithm is:
$$Time_{phase5.b} = O\Big(c_{i/o} \max_{i=1,\ldots,p}\big(|Hist^{(1)x,y}(R_i) \bowtie AGGR^{(2)x,z}_{f,u}(S)| + |Hist^{(2)x,y}(R) \bowtie AGGR^{(1)x,z}_{f,u}(S_i)| + |Hist^{(3)x,y}(R_i) \bowtie AGGR^{(3)x,z}_{f,u}(S_i)|\big)\Big).$$
So the total cost of this phase is simply the sum of $Time_{phase5.a}$ and $Time_{phase5.b}$.
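The pseudocode above folds the combination of partial values into the abstract function F. As a concrete rendering for f = SUM (our sketch, not the paper's implementation): an entry of $Hist^{(1)x,y}(R_i)$ carries a frequency n, and each of the n matching R-tuples contributes the pre-aggregated value $AGGR^{x,z}_{f,u}(S)(x, z)$ to the group (y, z), so for SUM the weighting by n must be made explicit:

    from collections import defaultdict

    def local_aggregate_sum(hist_xy_R, aggr_xz_S):
        """hist_xy_R: dict (x, y) -> frequency n; aggr_xz_S: dict (x, z) -> SUM of u.
        Returns dict (y, z) -> partial SUM over the join, never materializing it."""
        by_x = defaultdict(list)               # index the S side by join value x
        for (x, z), s in aggr_xz_S.items():
            by_x[x].append((z, s))
        result = defaultdict(float)
        for (x, y), n in hist_xy_R.items():    # one entry per (x, y) group of R
            for z, s in by_x.get(x, []):       # every entry AGGR^{x,z}(S)(x, z)
                result[(y, z)] += n * s        # n matching R-tuples, each adding s
        return result

For an aggregate such as MAX the weighting disappears and the update becomes a maximum, which is why the algorithm keeps the combination abstract in F.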
Phase 6: Global computation of the aggregate function
In this phase, a global application of the aggregate function is carried out. For this purpose, every processor redistributes the local aggregation results $AGGR^{y,z}_{f,u}((R \bowtie S)_i)$ using a common hash function whose input attributes are $y$ and $z$. After hashing, every processor applies the aggregate function to the received messages in order to compute the global result $AGGR^{y,z}_{f,u}(R \bowtie S)$. The time of this step is:
$$Time_{phase6} = O\Big(\min\big(g\,|AGGR^{y,z}_{f,u}(R \bowtie S)| + \|AGGR^{y,z}_{f,u}(R \bowtie S)\|,\; g\,\tfrac{|R \bowtie S|}{p} + \tfrac{\|R \bowtie S\|}{p}\big) + l\Big),$$
where we apply proposition 1 in (Bamha and Hains, 2005) to redistribute $AGGR^{y,z}_{f,u}((R \bowtie S)_i)$.
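To close, a compact sketch (ours) of this final phase: every partial (y, z) group is routed to an owner processor by a common hash function, and owners merge what they receive:

    def global_aggregate(local_results, p, f=lambda a, b: a + b):
        """local_results: one dict {(y, z): partial value} per processor.
        f: the same commutative, associative aggregate used locally."""
        inbox = [[] for _ in range(p)]
        # Superstep 1: send each (y, z) group to the processor that owns it.
        for part in local_results:
            for key, v in part.items():
                inbox[hash(key) % p].append((key, v))
        # Superstep 2: every owner combines the partial values it received.
        final = [{} for _ in range(p)]
        for j in range(p):
            for key, v in inbox[j]:
                final[j][key] = f(final[j][key], v) if key in final[j] else v
        return final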
The global cost of evaluating the "GroupBy-Join" queries in this algorithm is of order:
$$\begin{aligned}
Time_{total} = O\Big( & \; c_{i/o} \max_{i=1,\ldots,p}(|R_i| + |S_i|) \\
& + \min\big(g\,|Hist^x(R)| + \|Hist^x(R)\|,\; g\,\tfrac{|R|}{p} + \tfrac{\|R\|}{p}\big) \\
& + \min\big(g\,|Hist^x(S)| + \|Hist^x(S)\|,\; g\,\tfrac{|S|}{p} + \tfrac{\|S\|}{p}\big) \\
& + g \max_{i=1,\ldots,p}\big(|\overline{Hist}^{x,y}(R_i)| + |\overline{AGGR}^{x,z}_{f,u}(S_i)|\big) \\
& + g\,|Hist^{(1,2)x}(R \bowtie S)| + \|Hist^{(1,2)x}(R \bowtie S)\| \\
& + g\,\big(|Hist^{(2)x,y}(R)| + |AGGR^{(2)x,z}_{f,u}(S)|\big) \\
& + c_{i/o} \max_{i=1,\ldots,p}\big(|Hist^{(1)x,y}(R_i) \bowtie AGGR^{(2)x,z}_{f,u}(S)| \\
& \qquad + |Hist^{(2)x,y}(R) \bowtie AGGR^{(1)x,z}_{f,u}(S_i)| + |Hist^{(3)x,y}(R_i) \bowtie AGGR^{(3)x,z}_{f,u}(S_i)|\big) \\
& + \min\big(g\,|AGGR^{y,z}_{f,u}(R \bowtie S)| + \|AGGR^{y,z}_{f,u}(R \bowtie S)\|,\; g\,\tfrac{|R \bowtie S|}{p} + \tfrac{\|R \bowtie S\|}{p}\big) \\
& + \max_{i=1,\ldots,p} \|Hist^{x,y}(R_i)\| + \max_{i=1,\ldots,p} \|AGGR^{x,z}_{f,u}(S_i)\| + l \Big).
\end{aligned}$$
Remark 1
In the traditional algorithms, the aggregate function is applied to the output of the join operation. The sequential evaluation of the "GroupBy-Join" queries therefore requires at least the following lower bound:
$$bound_{inf}^{1} = c_{i/o}\,\big(|R| + |S| + |R \bowtie S|\big).$$
Parallel processing with $p$ processors requires therefore:
$$bound_{inf}^{p} = \frac{1}{p}\, bound_{inf}^{1}.$$
Using our approach, the evaluation of the "GroupBy-Join" queries when the join attributes are different from the group-by attributes has an optimal asymptotic complexity when:
$$\max\big(|Hist^{(2)x,y}(R)|,\; |AGGR^{(2)x,z}_{f,u}(S)|,\; |Hist^{(1,2)x}(R \bowtie S)|\big) \;\leq\; c_{i/o}\,\max\Big(\frac{|R|}{p},\; \frac{|S|}{p},\; \frac{|R \bowtie S|}{p}\Big),$$
this is due to the fact that the local join results have almost the same size and all the terms in $Time_{total}$ are bounded by those of $bound_{inf}^{p}$. This inequality holds if we choose a threshold frequency $f_0$ greater than $p$ (which is the case for our threshold frequency $f_0 = p \log(p)$).
4 CONCLUSION
In this paper, we presented a parallel algorithm for computing "GroupBy-Join" queries in a distributed architecture when the group-by attributes and the join attributes are not the same. This algorithm can be used to reduce the execution time of the query efficiently, because we do not materialize the costly join operation, which is a necessary step in all the other algorithms presented in the literature that treat this type of query; we thus reduce the Input/Output cost. It also helps us to balance the load of all the processors even in the presence of AVS, and to avoid the JPS which may result from computing the intermediate join results.
In addition, the communication cost is reduced to a minimum owing to the fact that only histograms and the results of semi-joins, whose size is very small compared to the size of the input relations, are redistributed across the network.
The performance of this algorithm was analyzed using the BSP cost model, which predicts an asymptotically optimal complexity for our algorithm even for highly skewed data.
In our future work, we will implement this algorithm and extend it to a Grid environment.
REFERENCES
Al Hajj Hassan, M. and Bamha, M. (2007). An optimal evaluation of groupby-join queries in distributed architectures. Research Report RR-2007-01, LIFO, Université d'Orléans, France.
Bamha, M. (2005). An optimal and skew-insensitive join and multi-join algorithm for distributed architectures. In Proceedings of the International Conference on Database and Expert Systems Applications (DEXA'2005), 22-26 August, Copenhagen, Denmark, LNCS 3588, pages 616–625.
Bamha, M. and Hains, G. (2000). A skew insensitive al-
gorithm for join and multi-join operation on Shared
Nothing machines. In the 11th International Confer-
ence on Database and Expert Systems Applications
DEXA’2000, LNCS 1873, London, United Kingdom.
Bamha, M. and Hains, G. (2005). An efficient equi-semi-
join algorithm for distributed architectures. In Pro-
ceedings of the 5th International Conference on Com-
putational Science (ICCS’2005). 22-25 May, Atlanta,
USA, LNCS 3515, pages 755–763.
Datta, A., Moon, B., and Thomas, H. (1998). A case for
parallelism in datawarehousing and OLAP. In Ninth
International Workshop on Database and Expert Sys-
tems Applications, DEXA 98, IEEE Computer Society,
pages 226–231, Vienna.
Seetha, M. and Yu, P. S. (1990). Effectiveness of parallel joins. IEEE Transactions on Knowledge and Data Engineering, 2(4):410–424.
Shatdal, A. and Naughton, J. F. (1995). Adaptive paral-
lel aggregation algorithms. ACM SIGMOD Record,
24(2):104–114.
Taniar, D., Jiang, Y., Liu, K., and Leung, C. (2000).
Aggregate-join query processing in parallel database
systems. In Proceedings of The Fourth International
Conference/Exhibition on High Performance Comput-
ing in Asia-Pacific Region HPC-Asia2000, volume 2,
pages 824–829. IEEE Computer Society Press.
Valiant, L. G. (1990). A bridging model for parallel computation. Communications of the ACM, 33(8):103–111.
Yan, W. P. and Larson, P.-Å. (1994). Performing group-by before join. In Proceedings of the Tenth International Conference on Data Engineering, pages 89–100, Washington, DC, USA. IEEE Computer Society.