AN OPTIMAL EVALUATION OF GROUPBY-JOIN QUERIES IN
DISTRIBUTED ARCHITECTURES
M. Al Hajj Hassan and M. Bamha
LIFO, Université d'Orléans, B.P. 6759, 45067 Orléans Cedex 2, France
Keywords:
Parallel DataBase Management Systems (PDBMS), Parallel joins, Data skew, Join product skew, GroupBy-
Join queries, BSP cost model.
Abstract:
SQL queries involving join and group-by operations are fairly common in many decision support applications, where the size of the input relations is usually very large; the parallelization of these queries is therefore highly desirable in order to obtain an acceptable response time. The most significant drawbacks of the algorithms presented in the literature for treating such queries are that they are very sensitive to data skew and involve expensive communication and Input/Output costs in the evaluation of the join operation. In this paper, we present an algorithm that overcomes these drawbacks: it evaluates the "GroupBy-Join" query without directly evaluating the costly join operation, thus reducing its Input/Output and communication costs. Furthermore, the performance of this algorithm is analyzed using the scalable and portable BSP (Bulk Synchronous Parallel) cost model, which predicts a linear speedup even for highly skewed data.
1 INTRODUCTION
Aggregate functions, used to summarize large volumes of data based on a designated grouping, are widely employed in applications such as decision support, OnLine Analytical Processing (OLAP) and data warehousing (Taniar et al., 2000), because in such applications aggregated and summarized data are more important than detailed records (Datta et al., 1998). Aggregate operations may be applied to the output of the join of multiple tables holding potentially billions of records, and these tables may grow rapidly, especially in OLAP systems. Moreover, the output of these queries must be obtained in a reasonable processing time. For these reasons, parallel processing of such queries yields a huge performance gain, especially in PDBMS. However, the use of efficient parallel algorithms in PDBMS is fundamental in order to obtain acceptable performance (Bamha and Hains, 2000; Seetha and Yu, 1990).
Several parallel algorithms for evaluating "GroupBy-Join" queries have been presented in the literature (Shatdal and Naughton, 1995; Taniar et al., 2000), but these algorithms are inefficient due to their high communication cost: all the tuples of the relations are redistributed between processors, even though some of these tuples may not even contribute to the result of the join operation.
In addition, these algorithms fully materialize the intermediate results of the join operations, and their Input/Output cost is very high, since it is reasonable to assume that the output relation cannot fit in the main memory of every processor and must therefore be reread from disk in order to evaluate the aggregate function. Finally, these algorithms cannot solve the problem of data skew, because data redistribution is generally based on hashing data into buckets, and hashing is known to be inefficient in the presence of high frequencies (Bamha, 2005; Seetha and Yu, 1990).
In this paper, we present a new parallel algorithm for evaluating "GroupBy-Join" queries on Shared Nothing machines (a distributed architecture where each processor has its own memory and its own disks), when the join attributes are different from the group-by attributes. Our main contribution is that this algorithm does not need to materialize the join operation, unlike traditional algorithms where the join operation is evaluated first and then the group-by
and aggregate functions (Yan and Larson, 1994). It is
also insensitive to data skew and its communication
and Input/Output costs are reduced to a minimum.
In this algorithm, we partially evaluate the aggre-
gate function before redistributing the tuples. This
helps in reducing the cost of data redistribution. We
use the histograms of both relations in order to find
the tuples that participate in the result of the join oper-
ation. It is proved in (Bamha and Hains, 2005; Bamha
and Hains, 2000), using the BSP model, that his-
togram management has a negligible cost when com-
pared to the gain it provides in reducing the commu-
nication cost.
In traditional algorithms, all the tuples of the output of the join are redistributed using a hash function. On the contrary, in our algorithm we only redistribute the results of the semi-joins of the histograms, which are very small compared to the size of the input relations. This helps in reducing the amount of data transferred over the network and therefore the communication cost.
The performance of this algorithm is analyzed using
the BSP cost model which predicts for our algorithm
a linear speedup even for highly skewed data.
2 COMPUTATION OF "GROUPBY-JOIN" QUERIES
In DBMSs, we can distinguish two types of "GroupBy-Join" queries. In the first type, the join attributes and the group-by attributes are the same. In this case, it is preferable to carry out the group-by and aggregate functions first and then the join operation (Taniar et al., 2000), because this helps in reducing the size of the relations to be joined, and consequently decreases the communication cost and the query execution time. On the contrary, this cannot be applied to the second type of queries, where the join attributes are different from the group-by attributes (see (Al Hajj Hassan and Bamha, 2007) for a long and detailed version of this paper). In this paper, we focus on this second type of "GroupBy-Join" queries, and we present an algorithm which partially evaluates the aggregate functions before redistributing the tuples using histograms, thus reducing the communication cost as much as possible.
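To make the distinction concrete, here is an illustrative query of the second type, written over two relations R and S with the attribute names used in the next section (the query is our own example, not one drawn from a benchmark):

    SELECT R.y, S.z, SUM(S.u)
    FROM R, S
    WHERE R.x = S.x
    GROUP BY R.y, S.z

The join attribute x appears neither in the GROUP BY clause nor in the output, so the group-by cannot simply be pushed below the join as in the first type of queries.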
3 GROUPBY-JOIN QUERIES: A
NEW APPROACH
In this section, we present a detailed description of a new parallel algorithm used to evaluate the "GroupBy-Join" queries when the group-by attributes are different from the join attributes. We assume that the relation $R$ (resp. $S$) is evenly partitioned among processors by horizontal fragmentation such that $|R_i| \simeq \frac{|R|}{p}$ ($i = 1, \ldots, p$), where $p$ is the number of processors.
For simplicity of description and without loss of generality, we consider that the query has only one join attribute $x$ and that the group-by attribute set consists of one attribute $y$ of $R$ and another attribute $z$ of $S$. We also assume that the aggregate function is applied to the values of the attribute $u$ of $S$.
In the rest of this paper we use the following notation for each relation $T \in \{R, S\}$:

- $T_i$ denotes the fragment of relation $T$ placed on processor $i$, a sub-relation of $T$,
- $Hist^w(T)$ denotes the histogram¹ of relation $T$ with respect to the attribute $w$, i.e. a list of pairs $(v, n_v)$ where $n_v \neq 0$ is the number of tuples of relation $T$ having the value $v$ for the attribute $w$. The histogram is often much smaller and never larger than the relation it describes,
- $Hist^w(T_i)$ denotes the histogram of fragment $T_i$, while $Hist^w_i(T)$ is processor $i$'s fragment of the histogram of $T$,
- $Hist^w(T)(v)$ is the frequency ($n_v$) of value $v$ in relation $T$, while $Hist^w(T_i)(v)$ is its frequency in sub-relation $T_i$,
- $AGGR^w_{f,u}(T)$² is the result of applying the aggregate function $f$ to the values of the attribute $u$ of every group of tuples of $T$ having identical values of the group-by attributes $w$. $AGGR^w_{f,u}(T)$ is formed of a list of tuples $(v, f_v)$ where $f_v$ is the result of applying the aggregate function to the group of tuples having value $v$ for the attribute $w$ ($w$ may be formed of more than one attribute),
- $AGGR^w_{f,u}(T_i)$ denotes the result of applying the aggregate function to the attribute $u$ of relation $T_i$, while $AGGR^w_{f,u,i}(T)$ is processor $i$'s fragment of the result of applying the aggregate function to $T$,
- $AGGR^w_{f,u}(T)(v)$ (resp. $AGGR^w_{f,u}(T_i)(v)$) is the result $f_v$ of the aggregate function for the group of tuples having value $v$ for the group-by attribute $w$ in relation $T$ (resp. $T_i$),
- $\|T\|$ denotes the number of tuples of relation $T$, and $|T|$ denotes the size (expressed in bytes or number of pages) of relation $T$.

¹ Histograms are implemented as balanced trees (B-trees): a data structure that maintains an ordered set of data to allow efficient search and insert operations.
² $AGGR^w_{f,u}(T)$ is implemented as a B-tree.
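As a concrete, non-normative illustration of this notation, a histogram can be sketched as a plain mapping from values to frequencies; here a Python Counter stands in for the B-tree of footnote 1, which changes the cost of search and insert but not the interface:

    from collections import Counter

    # Hist^w(T): frequency of each value v of attribute w in relation T.
    # T is modeled as a list of tuples; w is the position of the attribute.
    def hist(T, w):
        return Counter(t[w] for t in T)

    R = [(1, 'a'), (1, 'b'), (2, 'a'), (1, 'a')]
    hist_x_R = hist(R, 0)     # Counter({1: 3, 2: 1})
    print(hist_x_R[1])        # Hist^x(R)(1) = 3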
The algorithm proceeds in six phases. To study
the cost of each phase we use the scalable and
portable Bulk-Synchronous Parallel (BSP) cost
model which is a programming model introduced
by L. Valiant (Valiant, 1990). In this algorithm, the
notation O(...) hides only small constant factors:
they depend only on the program implementation but
neither on data nor on the BSP machine parameters
(Al Hajj Hassan and Bamha, 2007).
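For reference, in the BSP model the execution time of a superstep is commonly bounded by an expression of the form
$$\max_{i=1,\ldots,p} w_i \;+\; g \max_{i=1,\ldots,p} h_i \;+\; l,$$
where $w_i$ is the local computation time of processor $i$, $h_i$ the number of words it sends or receives, $g$ the network's communication throughput ratio and $l$ the cost of a barrier of synchronization. The parameters $g$ and $l$ appearing in the costs below are exactly these BSP machine parameters.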
Phase 1: Creating local histograms
In this phase, the local histograms $Hist^{x,y}(R_i)$ ($i = 1, \ldots, p$) of blocks $R_i$ are created in parallel by a scan of the fragment $R_i$ on processor $i$ in time $c_{i/o} \max_{i=1,\ldots,p} |R_i|$, where $c_{i/o}$ is the cost of writing/reading a page of data from disk.
In addition, the local fragments $AGGR^{x,z}_{f,u}(S_i)$ ($i = 1, \ldots, p$) of blocks $S_i$ are also created in parallel on each processor $i$ by applying the aggregate function $f$ to every group of tuples having identical values of the couple of attributes $(x, z)$, in time $c_{i/o} \max_{i=1,\ldots,p} |S_i|$.
In this phase we also compute the frequency of each value of the attribute $x$ in $Hist^{x,y}(R_i)$ and $AGGR^{x,z}_{f,u}(S_i)$, which is needed in phase 2. So while creating $Hist^{x,y}(R_i)$ (resp. $AGGR^{x,z}_{f,u}(S_i)$), we also create on the fly their local histograms $Hist^x(R_i)$ (resp. $Hist^x(S_i)$) with respect to $x$, i.e. $Hist^x(R_i)$ (resp. $Hist^x(S_i)$) holds the frequency of each value of the attribute $x$ in $Hist^{x,y}(R_i)$ (resp. $AGGR^{x,z}_{f,u}(S_i)$), where we count tuples having the same values of the attributes $(x, y)$ (resp. $(x, z)$) only once.
We use the following algorithm to create $Hist^x(R_i)$; a similar one is used to create $Hist^x(S_i)$.

    Par (on each node in parallel) i = 1, ..., p
        Hist^x(R_i) = NULL
        For every tuple t that will be inserted or used to modify Hist^{x,y}(R_i) do
            If Hist^{x,y}(R_i)(t.x, t.y) = NULL Then
                freq1 = Hist^x(R_i)(t.x)
                If freq1 ≠ NULL Then
                    Increment the frequency of t.x in Hist^x(R_i)
                Else
                    Insert a new tuple (t.x, 1) into Hist^x(R_i)
                EndIf
            EndIf
        EndFor
    EndPar
In principle, this phase costs:
$$Time_{phase1} = O\Big(c_{i/o} \max_{i=1,\ldots,p} (|R_i| + |S_i|)\Big).$$
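As an illustration only, the following minimal Python sketch (our rendering, under the assumptions that fragments are lists of tuples and that plain dictionaries stand in for the B-trees of footnotes 1 and 2) builds $Hist^{x,y}(R_i)$, $AGGR^{x,z}_{f,u}(S_i)$ and the two derived $Hist^x$ histograms in a single scan of each fragment:

    from collections import defaultdict

    def phase1_local(R_i, S_i, f):
        """R_i: list of (x, y, ...) tuples; S_i: list of (x, z, u, ...) tuples;
        f: a commutative, associative aggregate, e.g. lambda a, b: a + b for SUM."""
        hist_xy_R = defaultdict(int)   # (x, y) -> frequency
        hist_x_R = defaultdict(int)    # x -> number of distinct (x, y) groups
        for t in R_i:
            x, y = t[0], t[1]
            if (x, y) not in hist_xy_R:    # first occurrence of this (x, y) group
                hist_x_R[x] += 1           # count each (x, y) group only once
            hist_xy_R[(x, y)] += 1

        aggr_xz_S = {}                 # (x, z) -> partial aggregate of u
        hist_x_S = defaultdict(int)    # x -> number of distinct (x, z) groups
        for t in S_i:
            x, z, u = t[0], t[1], t[2]
            if (x, z) not in aggr_xz_S:
                hist_x_S[x] += 1
                aggr_xz_S[(x, z)] = u
            else:
                aggr_xz_S[(x, z)] = f(aggr_xz_S[(x, z)], u)

        return hist_xy_R, hist_x_R, aggr_xz_S, hist_x_S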
Phase 2: Local semi-joins computation
In order to minimize the communication cost, only tuples of $Hist^{x,y}(R)$ and $AGGR^{x,z}_{f,u}(S)$ that will be present in the join result are redistributed. To this end, we compute the semi-joins $\overline{Hist}^{x,y}(R_i) = Hist^{x,y}(R_i) \ltimes AGGR^{x,z}_{f,u}(S)$ and $\overline{AGGR}^{x,z}_{f,u}(S_i) = AGGR^{x,z}_{f,u}(S_i) \ltimes Hist^{x,y}(R)$ using proposition 2 presented in (Bamha and Hains, 2005), where we apply the hash function to the tuples of $Hist^x(R_i)$ and $Hist^x(S_i)$. In fact the histograms computed in phase 1 hold exactly the same set of values of the attribute $x$ as the histograms of the fragments $R_i$ themselves; what differs is only the value of their frequency attribute, so the sizes $|Hist^x(R_i)|$ used below are unaffected (this also applies to $Hist^x(S_i)$). Hence the cost of this phase is (Bamha and Hains, 2005):
$$Time_{phase2} = O\Big(\max_{i=1,\ldots,p} \|Hist^{x,y}(R_i)\| + \max_{i=1,\ldots,p} \|AGGR^{x,z}_{f,u}(S_i)\| + \min\big(g\,|Hist^x(R)| + \|Hist^x(R)\|,\; g\,\tfrac{|R|}{p} + \tfrac{\|R\|}{p}\big) + \min\big(g\,|Hist^x(S)| + \|Hist^x(S)\|,\; g\,\tfrac{|S|}{p} + \tfrac{\|S\|}{p}\big) + l\Big),$$
where g is the BSP communication parameter and l
is the cost of a barrier of synchronization.
During semi-join computation, we store for each value $d \in Hist^x(R) \cap Hist^x(S)$ an extra piece of information called $index(d) \in \{1, 2, 3\}$ which allows us to decide whether, for a given value $d$, the frequencies of tuples of $Hist^{x,y}(R)$ and $AGGR^{x,z}_{f,u}(S)$ having the value $d$ are greater (resp. lesser) than a threshold frequency $f_0$. It also permits us to choose dynamically the probe and the build relation for each value $d$ of the join attribute. This choice reduces the global redistribution cost to a minimum.
In the rest of this paper, we use $f_0 = p \log(p)$ as the threshold frequency (Bamha and Hains, 2000).
For a given value $d \in Hist^x(R) \cap Hist^x(S)$ (this intersection is found while computing the semi-joins, cf. proposition 2 presented in (Bamha and Hains, 2005)), $index(d) = 3$ means that $Hist^x(R)(d) < f_0$ and $Hist^x(S)(d) < f_0$; $index(d) = 2$ means that $Hist^x(S)(d) \geq f_0$ and $Hist^x(S)(d) > Hist^x(R)(d)$; and $index(d) = 1$ means that $Hist^x(R)(d) \geq f_0$ and $Hist^x(R)(d) \geq Hist^x(S)(d)$.
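As an illustration (our sketch; in the algorithm this information is in fact computed on the fly during semi-join processing), $index(d)$ can be derived from the global frequencies as follows:

    import math

    def compute_index(freq_R, freq_S, p):
        """freq_R, freq_S: dicts d -> Hist^x(R)(d) and d -> Hist^x(S)(d)."""
        f0 = p * math.log(p)                     # threshold frequency f0 = p log(p)
        index = {}
        for d in freq_R.keys() & freq_S.keys():  # values in both histograms
            if freq_R[d] < f0 and freq_S[d] < f0:
                index[d] = 3                     # both frequencies below threshold
            elif freq_S[d] >= f0 and freq_S[d] > freq_R[d]:
                index[d] = 2                     # S side is the high-frequency side
            else:
                index[d] = 1                     # remaining case: freq_R >= f0 and freq_R >= freq_S
        return index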
Note that unlike the algorithms presented in (Shatdal and Naughton, 1995; Taniar et al., 2000), where both relations R and S are redistributed, we only redistribute $\overline{Hist}^{x,y}(R_i)$ and $\overline{AGGR}^{x,z}_{f,u}(S_i)$ to find the final result. This reduces the communication costs to a minimum.
At the end of this phase, we will divide $\overline{Hist}^{x,y}(R_i)$ and $\overline{AGGR}^{x,z}_{f,u}(S_i)$ on each processor $i$ into three sub-histograms each, such that $\overline{Hist}^{x,y}(R_i) = \bigcup_{j=1}^{3} Hist^{(j)x,y}(R_i)$ and $\overline{AGGR}^{x,z}_{f,u}(S_i) = \bigcup_{j=1}^{3} AGGR^{(j)x,z}_{f,u}(S_i)$, where all the tuples of $Hist^{(1)x,y}(R_i)$ (resp. $AGGR^{(1)x,z}_{f,u}(S_i)$) are associated to values $d$ such that $index(d) = 1$ (resp. $index(d) = 2$), those of $Hist^{(2)x,y}(R_i)$ (resp. $AGGR^{(2)x,z}_{f,u}(S_i)$) are associated to values $d$ such that $index(d) = 2$ (resp. $index(d) = 1$), and all the tuples of $Hist^{(3)x,y}(R_i)$ and $AGGR^{(3)x,z}_{f,u}(S_i)$ are associated to values $d$ such that $index(d) = 3$.
Phase 3: Creating the communication templates
The tuples of relations $Hist^{(3)x,y}(R_i)$ and $AGGR^{(3)x,z}_{f,u}(S_i)$ (which have very low frequencies for the join attribute) have an effect neither on Attribute Value Skew (AVS) nor on Join Product Skew (JPS), so they are redistributed using a hash function. However, the tuples of $Hist^{(1)x,y}(R_i)$ and $AGGR^{(1)x,z}_{f,u}(S_i)$ are associated to high frequencies of the join attribute, so they have an important effect on AVS and JPS. We will therefore use an appropriate redistribution algorithm in order to efficiently avoid both AVS and JPS (Bamha and Hains, 2000).
3.a To this end, we partition the histogram $Hist^x(R \bowtie S)$ (which is simply the intersection of $Hist^x(R)$ and $Hist^x(S)$) into two sub-histograms: $Hist^{(1,2)x}(R \bowtie S)$ and $Hist^{(3)x}(R \bowtie S)$, where the values $d \in Hist^{(1,2)x}(R \bowtie S)$ are associated to high frequencies of the join attribute (i.e. $index(d) = 1$ or $index(d) = 2$) while the values $d \in Hist^{(3)x}(R \bowtie S)$ are associated to low frequencies (i.e. $index(d) = 3$). This partition step is performed in parallel, on each processor $i$, by a local traversal of the histogram $Hist^x_i(R \bowtie S)$ in time:
$$Time_{3.a} = O\Big(\max_{i=1,\ldots,p} \|Hist^x_i(R \bowtie S)\|\Big).$$
3.b Communication templates for high frequencies: we first create a communication template: the list of messages which constitutes the relations' redistribution. This step is performed jointly by all processors, each one not necessarily computing the list of its own messages, so as to balance the overall process.
Each processor $i$ thus computes a set of necessary messages relating to the values $d$ it owns in $Hist^{(1,2)x}_i(R \bowtie S)$. The communication template of $Hist^{(1)x,y}(R)$ is derived by applying the following algorithm. We also apply the same algorithm to compute the communication template of $AGGR^{(1)x,z}_{f,u}(S)$, but we replace $Hist^x(R)$ by $Hist^x(S)$.
    if Hist^x(R)(d) mod p = 0 then
        each processor j will hold a block of size
        block_j(d) = Hist^x(R)(d) / p of tuples of value d
    else
        begin
            Pick a random value j_0 between 0 and (p − 1)
            if processor's index j is between j_0 and j_0 + (Hist^x(R)(d) mod p) then
                processor of index j will hold a block of size:
                block_j(d) = ⌊Hist^x(R)(d) / p⌋ + 1
            else
                processor of index j will hold a block of size:
                block_j(d) = ⌊Hist^x(R)(d) / p⌋
        end.
In the above algorithm, $\lfloor x \rfloor$ is the largest integral value not greater than $x$, and $block_j(d)$ is the number of tuples of value $d$ that processor $j$ should own after redistribution of the fragments $T_i$ of relation $T$. The absolute value of $Rest_j(d) = Hist_j(T)(d) - block_j(d)$ determines the number of tuples of value $d$ that processor $j$ must send (if $Rest_j(d) > 0$) or receive (if $Rest_j(d) < 0$).
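For concreteness, here is a small Python sketch of these two quantities (ours, not the paper's code; we make the wrap-around of processor indices modulo p explicit, which the algorithm above leaves implicit):

    import random

    def block_sizes(freq_d, p):
        """Target sizes block_j(d) for a value d of global frequency freq_d."""
        base, rest = divmod(freq_d, p)       # floor(freq_d / p) and freq_d mod p
        block = [base] * p
        j0 = random.randrange(p)             # random starting processor j_0
        for k in range(rest):                # 'rest' processors hold one extra tuple
            block[(j0 + k) % p] += 1
        return block

    def rest_values(owned_d, block):
        """Rest_j(d): tuples of value d owned by j minus its target block size;
        positive entries are senders, negative entries are receivers."""
        return [own - target for own, target in zip(owned_d, block)]

For example, with freq_d = 10 and p = 4, block_sizes returns two blocks of 3 and two blocks of 2, so the tuples of a skewed value d end up spread almost evenly over the processors.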
For $d \in Hist^{(1,2)x}_i(R \bowtie S)$, processor $i$ owns a description of the layout of tuples of value $d$ over the network. It may therefore determine the number of tuples of value $d$ which every processor must send/receive. This information constitutes the communication template. Only those $j$ for which $Rest_j(d) > 0$ (resp. $Rest_j(d) < 0$) send (resp. receive) tuples of value $d$. This step is thus completed in time:
$$Time_{3.b} = O\big(\|Hist^{(1,2)x}(R \bowtie S)\|\big).$$
The tuples associated to low frequencies (i.e. tuples having $d \in Hist^{(3)x}_i(R \bowtie S)$) have an effect neither on the AVS nor on the JPS. These tuples are simply mapped to processors using a hash function.
Thus the cost of creating the communication templates is:
$$Time_{phase3} = O\Big(\max_{i=1,\ldots,p} \|Hist^x_i(R \bowtie S)\| + \|Hist^{(1,2)x}(R \bowtie S)\|\Big).$$
Phase 4: Data redistribution
4.a Redistribution of tuples having $d \in Hist^{(1,2)x}_i(R \bowtie S)$: every processor $i$ holds, for every one of its local $d \in Hist^{(1,2)x}_i(R \bowtie S)$, the non-zero communication volumes it prescribes as a part of the communication template: $Rest_j(d) \neq 0$ for $j = 1, \ldots, p$. This information takes the form of sending orders sent to their target processors in a first superstep, followed by the actual redistribution superstep where processors obey all the orders they have received.
Each processor $i$ first splits the processor indices $j$ into two groups: those for which $Rest_j(d) > 0$ and those for which $Rest_j(d) < 0$. This is done by a sequential traversal of the $Rest_{..}(d)$ array.
Let $\alpha$ (resp. $\beta$) be the number of $j$'s where $Rest_j(d)$ is positive (resp. negative), and $Proc(k)_{k=1,\ldots,\alpha+\beta}$ the array of processor indices for which $Rest_j(d) \neq 0$, arranged in the manner that $Rest_{Proc(j)}(d) > 0$ for $j = 1, \ldots, \alpha$ and $Rest_{Proc(j)}(d) < 0$ for $j = \alpha + 1, \ldots, \alpha + \beta$.
A sequential traversal of $Proc(k)_{k=1,\ldots,\alpha+\beta}$ determines the number of tuples that each processor $j$ will send. The sending orders concerning attribute value $d$ are computed using the following procedure, whose maximal complexity is $O(\|Hist^{(1,2)x}(R \bowtie S)\|)$, because for a given $d$ no more than $(p-1)$ processors can send data, and each processor $i$ is in charge of the redistribution of tuples having $d \in Hist^{(1,2)x}_i(R \bowtie S)$.
    i := 1; j := α + 1;
    while (i ≤ α) do
        begin
            n_tuples := min(Rest_Proc(i)(d), −Rest_Proc(j)(d));
            order_to_send(Proc(i), Proc(j), d, n_tuples);
            Rest_Proc(i)(d) := Rest_Proc(i)(d) − n_tuples;
            Rest_Proc(j)(d) := Rest_Proc(j)(d) + n_tuples;
            if Rest_Proc(i)(d) = 0 then i := i + 1; endif
            if Rest_Proc(j)(d) = 0 then j := j + 1; endif
        end.
For each processor $i$ and $d \in Hist^{(1,2)x}_i(R \bowtie S)$, all the $order\_to\_send(j, i, \ldots)$ orders are sent to processor $j$ when $j \neq i$, in time $O(g\,|Hist^{(1,2)x}(R \bowtie S)| + l)$. Thus, this step costs:
$$Time_{4.a} = O\big(g\,|Hist^{(1,2)x}(R \bowtie S)| + \|Hist^{(1,2)x}(R \bowtie S)\| + l\big).$$
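The procedure above is essentially a two-pointer sweep matching senders against receivers; the following runnable Python rendering (ours) makes the bookkeeping concrete, writing the transferred quantity as min(Rest, −Rest) since the receivers' Rest values are negative:

    def sending_orders(rest, d):
        """rest: list of Rest_j(d), one per processor, summing to 0.
        Returns a list of orders (sender, receiver, d, n_tuples)."""
        senders = [j for j in range(len(rest)) if rest[j] > 0]
        receivers = [j for j in range(len(rest)) if rest[j] < 0]
        orders, i, j = [], 0, 0
        while i < len(senders):
            s, r = senders[i], receivers[j]
            n = min(rest[s], -rest[r])       # tuples moved from s to r
            orders.append((s, r, d, n))      # order_to_send(Proc(i), Proc(j), d, n)
            rest[s] -= n
            rest[r] += n
            if rest[s] == 0:
                i += 1
            if rest[r] == 0:
                j += 1
        return orders

For instance, sending_orders([3, -1, -2, 0], d) produces the orders (0, 1, d, 1) and (0, 2, d, 2).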
4.b Tuples of $Hist^{(3)x,y}(R_i)$ and $AGGR^{(3)x,z}_{f,u}(S_i)$ (i.e. tuples having $d \in Hist^{(3)x}_i(R \bowtie S)$) are associated to low frequencies, so they have an effect neither on the AVS nor on the JPS. These relations are redistributed using a hash function.
At the end of steps 4.a and 4.b, each processor $i$ has local knowledge of how the tuples of the semi-joins $\overline{Hist}^{x,y}(R_i)$ and $\overline{AGGR}^{x,z}_{f,u}(S_i)$ will be redistributed. Redistribution is then performed in time:
$$Time_{4.b} = O\Big(g\,\big(|\overline{Hist}^{x,y}(R_i)| + |\overline{AGGR}^{x,z}_{f,u}(S_i)|\big) + l\Big).$$
Thus the total cost of the redistribution phase is:
$$Time_{phase4} = O\Big(g\,\Big(\max_{i=1,\ldots,p}\big(|\overline{Hist}^{x,y}(R_i)| + |\overline{AGGR}^{x,z}_{f,u}(S_i)|\big) + |Hist^{(1,2)x}(R \bowtie S)|\Big) + \|Hist^{(1,2)x}(R \bowtie S)\| + l\Big).$$
We mention that we only redistribute the tuples of the semi-joins $\overline{Hist}^{x,y}(R_i)$ and $\overline{AGGR}^{x,z}_{f,u}(S_i)$, where $|\overline{Hist}^{x,y}(R_i)|$ and $|\overline{AGGR}^{x,z}_{f,u}(S_i)|$ are generally very small compared to $|R_i|$ and $|S_i|$. In addition, $|Hist^x(R \bowtie S)|$ is generally very small compared to $|Hist^{x,y}(R)|$ and $|AGGR^{x,z}_{f,u}(S)|$. Thus we reduce the communication cost to a minimum.
Phase 5: Local computation of the aggregate function
At this step, every processor has partitions of $\overline{Hist}^{x,y}(R)$ and $\overline{AGGR}^{x,z}_{f,u}(S)$. Using equation 2 in (Bamha, 2005), we can deduce that the tuples of $Hist^{(1)x,y}(R_i)$, $Hist^{(2)x,y}(R_i)$ and $Hist^{(3)x,y}(R_i)$ can be joined with the tuples of $AGGR^{(2)x,z}_{f,u}(S_i)$, $AGGR^{(1)x,z}_{f,u}(S_i)$ and $AGGR^{(3)x,z}_{f,u}(S_i)$ respectively. But the frequencies of tuples of $Hist^{(1)x,y}(R_i)$ and $AGGR^{(1)x,z}_{f,u}(S_i)$ are by definition greater than those of the corresponding (matching) tuples in $AGGR^{(2)x,z}_{f,u}(S_i)$ and $Hist^{(2)x,y}(R_i)$ respectively. So we will choose $Hist^{(1)x,y}(R_i)$ and $AGGR^{(1)x,z}_{f,u}(S_i)$ as the build relations, and $Hist^{(2)x,y}(R_i)$ and $AGGR^{(2)x,z}_{f,u}(S_i)$ as the probe relations. Hence, we need to duplicate the probe relations to all processors in time:
$$Time_{phase5.a} = O\big(g\,(|Hist^{(2)x,y}(R)| + |AGGR^{(2)x,z}_{f,u}(S)|) + l\big).$$
Now, using the following algorithm, we are able to compute the local aggregate function on every processor without needing to fully materialize the intermediate results of the join operation.
In this algorithm, we create on each processor $i$ the relation $AGGR^{y,z}_{f,u}((R \bowtie S)_i)$ that holds the local results of applying the aggregate function to every group of tuples having the same value of the couple of attributes $(y, z)$. $AGGR^{y,z}_{f,u}((R \bowtie S)_i)$ has the form $(y, z, v)$, where $y$ and $z$ are the group-by attributes and $v$ is the result of the aggregate function.
    (1)  Par (on each node in parallel) i = 1, ..., p
    (2)      AGGR^{y,z}_{f,u}((R ⋈ S)_i) = NULL;
    (3)      For every tuple t of relation Hist^{(1)x,y}(R_i) do
    (4)          For every entry v1 = AGGR^{(2)x,z}_{f,u}(S_i)(t.x, z) do
    (5)              v2 = AGGR^{y,z}_{f,u}((R ⋈ S)_i)(t.y, z);
    (6)              If v2 ≠ NULL Then
    (7)                  Update AGGR^{y,z}_{f,u}((R ⋈ S)_i)(t.y, z) = F(v1, v2),
                         where F() is the aggregate function;
    (8)              Else
    (9)                  Insert a new tuple (t.y, z, v1) into the histogram
                         AGGR^{y,z}_{f,u}((R ⋈ S)_i);
    (10)             EndIf
    (11)         EndFor
    (12)     EndFor
    (13)     Repeat steps (3)...(12) but replace Hist^{(1)x,y}(R_i) in (3)
             by AGGR^{(1)x,z}_{f,u}(S_i), and AGGR^{(2)x,z}_{f,u}(S_i)(t.x, z) in (4)
             by Hist^{(2)x,y}(R_i)(t.x, y);
    (14)     Repeat steps (3)...(12) but replace Hist^{(1)x,y}(R_i) in (3)
             by Hist^{(3)x,y}(R_i), and AGGR^{(2)x,z}_{f,u}(S_i)(t.x, z) in (4)
             by AGGR^{(3)x,z}_{f,u}(S_i)(t.x, z);
    (15) EndPar
The cost of applying this algorithm is:
$$Time_{phase5.b} = O\Big(c_{i/o} \max_{i=1,\ldots,p}\big(|Hist^{(1)x,y}(R_i) \bowtie AGGR^{(2)x,z}_{f,u}(S)| + |Hist^{(2)x,y}(R) \bowtie AGGR^{(1)x,z}_{f,u}(S_i)| + |Hist^{(3)x,y}(R_i) \bowtie AGGR^{(3)x,z}_{f,u}(S_i)|\big)\Big).$$
So the total cost of this phase is simply the sum of $Time_{phase5.a}$ and $Time_{phase5.b}$.
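The pseudocode above folds the combination of partial values into the abstract function F. As a concrete rendering for f = SUM (our sketch, not the paper's implementation): an entry of $Hist^{(1)x,y}(R_i)$ carries a frequency n, and each of the n matching R-tuples contributes the pre-aggregated value $AGGR^{x,z}_{f,u}(S)(x, z)$ to the group (y, z), so for SUM the weighting by n must be made explicit:

    from collections import defaultdict

    def local_aggregate_sum(hist_xy_R, aggr_xz_S):
        """hist_xy_R: dict (x, y) -> frequency n; aggr_xz_S: dict (x, z) -> SUM of u.
        Returns dict (y, z) -> partial SUM over the join, never materializing it."""
        by_x = defaultdict(list)               # index the S side by join value x
        for (x, z), s in aggr_xz_S.items():
            by_x[x].append((z, s))
        result = defaultdict(float)
        for (x, y), n in hist_xy_R.items():    # one entry per (x, y) group of R
            for z, s in by_x.get(x, []):       # every entry AGGR^{x,z}(S)(x, z)
                result[(y, z)] += n * s        # n matching R-tuples, each adding s
        return result

For an aggregate such as MAX the weighting disappears and the update becomes a maximum, which is why the algorithm keeps the combination abstract in F.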
Phase 6: Global computation of the aggregate function
In this phase, a global application of the aggregate function is carried out. For this purpose, every processor redistributes the local aggregation results $AGGR^{y,z}_{f,u}((R \bowtie S)_i)$ using a common hash function whose input attributes are $y$ and $z$. After hashing, every processor applies the aggregate function to the received messages in order to compute the global result $AGGR^{y,z}_{f,u}(R \bowtie S)$. The time of this step is:
$$Time_{phase6} = O\Big(\min\big(g\,|AGGR^{y,z}_{f,u}(R \bowtie S)| + \|AGGR^{y,z}_{f,u}(R \bowtie S)\|,\; g\,\tfrac{|R \bowtie S|}{p} + \tfrac{\|R \bowtie S\|}{p}\big) + l\Big),$$
where we apply proposition 1 in (Bamha and Hains, 2005) to redistribute $AGGR^{y,z}_{f,u}((R \bowtie S)_i)$.
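To close, a compact sketch (ours) of this final phase: every partial (y, z) group is routed to an owner processor by a common hash function, and owners merge what they receive:

    def global_aggregate(local_results, p, f=lambda a, b: a + b):
        """local_results: one dict {(y, z): partial value} per processor.
        f: the same commutative, associative aggregate used locally."""
        inbox = [[] for _ in range(p)]
        # Superstep 1: send each (y, z) group to the processor that owns it.
        for part in local_results:
            for key, v in part.items():
                inbox[hash(key) % p].append((key, v))
        # Superstep 2: every owner combines the partial values it received.
        final = [{} for _ in range(p)]
        for j in range(p):
            for key, v in inbox[j]:
                final[j][key] = f(final[j][key], v) if key in final[j] else v
        return final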
The global cost of evaluating the "GroupBy-Join" queries in this algorithm is of order:
$$\begin{aligned}
Time_{total} = O\Big( & \; c_{i/o} \max_{i=1,\ldots,p}(|R_i| + |S_i|) \\
& + \min\big(g\,|Hist^x(R)| + \|Hist^x(R)\|,\; g\,\tfrac{|R|}{p} + \tfrac{\|R\|}{p}\big) \\
& + \min\big(g\,|Hist^x(S)| + \|Hist^x(S)\|,\; g\,\tfrac{|S|}{p} + \tfrac{\|S\|}{p}\big) \\
& + g \max_{i=1,\ldots,p}\big(|\overline{Hist}^{x,y}(R_i)| + |\overline{AGGR}^{x,z}_{f,u}(S_i)|\big) \\
& + g\,|Hist^{(1,2)x}(R \bowtie S)| + \|Hist^{(1,2)x}(R \bowtie S)\| \\
& + g\,\big(|Hist^{(2)x,y}(R)| + |AGGR^{(2)x,z}_{f,u}(S)|\big) \\
& + c_{i/o} \max_{i=1,\ldots,p}\big(|Hist^{(1)x,y}(R_i) \bowtie AGGR^{(2)x,z}_{f,u}(S)| \\
& \qquad + |Hist^{(2)x,y}(R) \bowtie AGGR^{(1)x,z}_{f,u}(S_i)| + |Hist^{(3)x,y}(R_i) \bowtie AGGR^{(3)x,z}_{f,u}(S_i)|\big) \\
& + \min\big(g\,|AGGR^{y,z}_{f,u}(R \bowtie S)| + \|AGGR^{y,z}_{f,u}(R \bowtie S)\|,\; g\,\tfrac{|R \bowtie S|}{p} + \tfrac{\|R \bowtie S\|}{p}\big) \\
& + \max_{i=1,\ldots,p} \|Hist^{x,y}(R_i)\| + \max_{i=1,\ldots,p} \|AGGR^{x,z}_{f,u}(S_i)\| + l \Big).
\end{aligned}$$
Remark 1
In the traditional algorithms, the aggregate function is applied to the output of the join operation. The sequential evaluation of the "GroupBy-Join" queries therefore requires at least the following lower bound:
$$bound_{inf}^{1} = c_{i/o}\,\big(|R| + |S| + |R \bowtie S|\big).$$
Parallel processing with $p$ processors requires therefore:
$$bound_{inf}^{p} = \frac{1}{p}\, bound_{inf}^{1}.$$
Using our approach, the evaluation of the "GroupBy-Join" queries when the join attributes are different from the group-by attributes has an optimal asymptotic complexity when:
$$\max\big(|Hist^{(2)x,y}(R)|,\; |AGGR^{(2)x,z}_{f,u}(S)|,\; |Hist^{(1,2)x}(R \bowtie S)|\big) \;\leq\; c_{i/o}\,\max\Big(\frac{|R|}{p},\; \frac{|S|}{p},\; \frac{|R \bowtie S|}{p}\Big),$$
this is due to the fact that the local join results have almost the same size and all the terms in $Time_{total}$ are bounded by those of $bound_{inf}^{p}$. This inequality holds if we choose a threshold frequency $f_0$ greater than $p$ (which is the case for our threshold frequency $f_0 = p \log(p)$).
4 CONCLUSION
In this paper, we presented a parallel algorithm for computing "GroupBy-Join" queries in a distributed architecture when the group-by attributes and the join attributes are not the same. This algorithm can be used to reduce the execution time of the query efficiently, because we do not materialize the costly join operation, which is a necessary step in all the other algorithms presented in the literature that treat this type of query; we thus reduce the Input/Output cost. It also helps us to balance the load of all the processors even in the presence of AVS, and to avoid the JPS which may result from computing the intermediate join results.
In addition, the communication cost is reduced to a minimum owing to the fact that only histograms and the results of semi-joins, whose size is very small compared to the size of the input relations, are redistributed across the network.
The performance of this algorithm was analyzed using the BSP cost model, which predicts an asymptotically optimal complexity for our algorithm even for highly skewed data.
In our future work, we will implement this algorithm and extend it to a Grid environment.
REFERENCES
Al Hajj Hassan, M. and Bamha, M. (2007). An optimal evaluation of groupby-join queries in distributed architectures. Research Report RR-2007-01, LIFO, Université d'Orléans, France.
Bamha, M. (2005). An optimal and skew-insensitive join and multi-join algorithm for distributed architectures. In Proceedings of the International Conference on Database and Expert Systems Applications (DEXA'2005), 22-26 August, Copenhagen, Denmark, LNCS 3588, pages 616–625.
Bamha, M. and Hains, G. (2000). A skew insensitive al-
gorithm for join and multi-join operation on Shared
Nothing machines. In the 11th International Confer-
ence on Database and Expert Systems Applications
DEXA’2000, LNCS 1873, London, United Kingdom.
Bamha, M. and Hains, G. (2005). An efficient equi-semi-
join algorithm for distributed architectures. In Pro-
ceedings of the 5th International Conference on Com-
putational Science (ICCS’2005). 22-25 May, Atlanta,
USA, LNCS 3515, pages 755–763.
Datta, A., Moon, B., and Thomas, H. (1998). A case for
parallelism in datawarehousing and OLAP. In Ninth
International Workshop on Database and Expert Sys-
tems Applications, DEXA 98, IEEE Computer Society,
pages 226–231, Vienna.
Seetha, M. and Yu, P. S. (1990). Effectiveness of parallel joins. IEEE Transactions on Knowledge and Data Engineering, 2(4):410–424.
Shatdal, A. and Naughton, J. F. (1995). Adaptive paral-
lel aggregation algorithms. ACM SIGMOD Record,
24(2):104–114.
Taniar, D., Jiang, Y., Liu, K., and Leung, C. (2000).
Aggregate-join query processing in parallel database
systems. In Proceedings of The Fourth International
Conference/Exhibition on High Performance Comput-
ing in Asia-Pacific Region HPC-Asia2000, volume 2,
pages 824–829. IEEE Computer Society Press.
Valiant, L. G. (1990). A bridging model for parallel computation. Communications of the ACM, 33(8):103–111.
Yan, W. P. and Larson, P.-Å. (1994). Performing group-by before join. In Proceedings of the Tenth International Conference on Data Engineering, pages 89–100, Washington, DC, USA. IEEE Computer Society.