UNASSUMING VIEW-SIZE ESTIMATION TECHNIQUES IN OLAP
An Experimental Comparison
Kamel Aouiche and Daniel Lemire
LICEF, University of Quebec at Montreal, 100 Sherbrooke West, Montreal, Canada
Keywords:
Probabilistic estimation, skewed distributions, sampling, hashing.
Abstract:
Even if storage were infinite, a data warehouse could not materialize all possible views due to the running time
and update requirements. Therefore, it is necessary to estimate quickly, accurately, and reliably the size of
views. Many available techniques make particular statistical assumptions, and their error can be quite large.
Unassuming techniques exist, but they typically assume independent hashing, for which there is no known
practical implementation. We adapt an unassuming estimator due to Gibbons and Tirthapura: its theoretical
bounds do not make impractical assumptions. We compare this technique experimentally with stochastic
probabilistic counting, LOGLOG probabilistic counting, and multifractal statistical models. Our experiments
show that we can reliably and accurately (within 10%, 19 times out of 20) estimate view sizes over large data
sets (1.5 GB) within minutes, using almost no memory. However, only GIBBONS-TIRTHAPURA provides
universally tight estimates irrespective of the size of the view. For large views, probabilistic counting has a
small edge in accuracy, whereas the competitive sampling-based method (multifractal) we tested is an order
of magnitude faster but can sometimes provide poor estimates (relative error of 100%). In our tests, LOGLOG
probabilistic counting is not competitive. Experimental validation on the US Census 1990 data set and on the
Transaction Processing Performance Council (TPC-H) data set is provided.
1 INTRODUCTION
View materialization is presumably one of the most effective techniques for improving the query performance of data warehouses. Materialized views are physical structures that improve data access time by precomputing intermediary results. Typical OLAP queries defined on data warehouses consist of selecting and aggregating data with queries such as grouping sets (GROUP BY clauses). By precomputing many plausible groupings, we can avoid aggregates over large tables. However, materializing views requires additional storage space and induces maintenance overhead when refreshing the data warehouse.
One of the most important issues in data warehouse physical design is to select an appropriate configuration of materialized views. Several heuristics and methodologies have been proposed for the materialized view selection problem, which is NP-hard (Gupta, 1997). Most of these techniques exploit cost models to estimate the data access cost of materialized views, as well as their maintenance and storage costs. These cost estimations mostly depend on view-size estimation.
Several techniques have been proposed for view-size estimation: some require assumptions about the data distribution and others are “unassuming.” A common statistical assumption is uniformity (Golfarelli and Rizzi, 1998), but any skew in the data leads to an overestimate of the size of the view. Generally, while statistically assuming estimators are computed quickly, the most expensive step being the random sampling, their error can be large and cannot be bounded a priori.
In this paper, we consider several state-of-the-art statistically unassuming estimation techniques: GIBBONS-TIRTHAPURA (Gibbons and Tirthapura, 2001), probabilistic counting (Flajolet and Martin, 1985), and LOGLOG probabilistic counting (Durand and Flajolet, 2003).
While relatively expensive, unassuming estimators tend to provide good accuracy. To our knowledge, this is the first experimental comparison of unassuming view-size estimation techniques in a data warehousing setting.
2 RELATED WORK
Haas et al. (Haas et al., 1995) estimate the view size from the histogram of a sample: adaptively, they choose a different estimator based on the skew of the distribution. Faloutsos et al. (Faloutsos et al., 1996) obtain results nearly as accurate as Haas et al., that is, an error of approximately 40%, but they only need the dominant mode of the histogram, the number of distinct elements in the sample, and the total number of elements. In sample-based estimations, in the worst-case scenario, the histogram might be as large as the view size we are trying to estimate. Moreover, it is difficult to derive unassuming accuracy bounds since the sample might not be representative. However, a sample-based algorithm is expected to be an order of magnitude faster than an algorithm which processes the entire data set.
Probabilistic counting (Flajolet and Martin, 1985) and LOGLOG probabilistic counting (henceforth LOGLOG) (Durand and Flajolet, 2003) have been shown to provide very accurate unassuming view-size estimations quickly, but their estimates assume we have independent hashing. Because of this assumption, their theoretical bounds may not hold in practice. Whether this is a problem in practice is one of the contributions of this paper.
Gibbons and Tirthapura (Gibbons and Tirthapura, 2001) derived an unassuming bound (henceforth GIBBONS-TIRTHAPURA) that only requires pairwise independent hashing. It has been shown recently that with k-wise independent hashing for k > 2, the theoretical bound can be improved substantially (Lemire and Kaser, 2006). The benefit of GIBBONS-TIRTHAPURA is that, as long as the random number generator is truly random, the theoretical bounds hold irrespective of the size of the view or of other factors.
All unassuming estimation techniques in this paper (LOGLOG, probabilistic counting and GIBBONS-TIRTHAPURA) have an accuracy proportional to 1/√M, where M is a parameter controlling the memory usage.
3 ESTIMATION BY MULTIFRACTALS

We implemented the statistically assuming algorithm by Faloutsos et al. based on a multifractal model (Faloutsos et al., 1996). Nadeau and Teorey (Nadeau and Teorey, 2003) reported competitive results for this approach. Maybe surprisingly, given a sample, all that is required to learn the multifractal model is the number of distinct elements in the sample F'0, the number of elements in the sample N', the total number of elements N, and the number of occurrences of the most frequent item in the sample m_max. Hence, a very simple implementation is possible (see Algorithm 1). Faloutsos et al. erroneously introduced a tolerance factor ε in their algorithm: unlike what they suggest, it is not possible, unfortunately, to adjust the model parameter for an arbitrarily good fit; instead, we have to be content with the best possible fit (see line 9 and following).
Algorithm 1 View-size estimation using a multifractal distribution model.
1: INPUT: Fact table t containing N facts
2: INPUT: GROUP BY query on dimensions D1, D2, ..., Dd
3: INPUT: Sampling ratio 0 < p < 1
4: OUTPUT: Estimated size of GROUP BY query
5: Choose a sample t' in t of size N' = ⌊pN⌋
6: Compute g = GROUP BY(t')
7: let m_max be the number of occurrences of the most frequent tuple x1, ..., xd in g
8: let F'0 be the number of tuples in g
9: k ← ⌈log F'0⌉
10: while F < F'0 do
11:   p ← (m_max/N')^(1/k)
12:   F ← Σ_{a=0..k} (k choose a) (1 − (1 − p^(k−a) (1−p)^a)^(N'))
13:   k ← k + 1
14: p ← (m_max/N)^(1/k)
15: RETURN: Σ_{a=0..k} (k choose a) (1 − (1 − p^(k−a) (1−p)^a)^N)
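For illustration, here is a minimal, self-contained C++ sketch of Algorithm 1 (not our benchmarked implementation): the function names and the sample statistics in main are illustrative assumptions, the logarithm on line 9 is read as base 2, and line 12 is read as the usual occupancy expectation of the binomial multifractal model.

#include <cmath>
#include <cstdio>

// Expected number of distinct tuples under the binomial multifractal model:
// sum_{a=0}^{k} C(k,a) * (1 - (1 - p^(k-a) (1-p)^a)^n).
double expected_distinct(int k, double p, double n) {
    double sum = 0.0, binom = 1.0;               // binom holds C(k,a)
    for (int a = 0; a <= k; ++a) {
        double bucket = std::pow(p, k - a) * std::pow(1.0 - p, a);
        sum += binom * (1.0 - std::pow(1.0 - bucket, n));
        binom *= (k - a) / (a + 1.0);            // C(k,a+1) from C(k,a)
    }
    return sum;
}

// Fprime0: distinct tuples in the sample, Nprime: sample size,
// N: fact-table size, mmax: count of the most frequent sampled tuple.
double multifractal_estimate(double Fprime0, double Nprime, double N, double mmax) {
    int k = static_cast<int>(std::ceil(std::log2(Fprime0)));
    double F = 0.0;
    while (F < Fprime0) {                        // lines 10-13 of Algorithm 1
        double p = std::pow(mmax / Nprime, 1.0 / k);
        F = expected_distinct(k, p, Nprime);
        ++k;
    }
    return expected_distinct(k, std::pow(mmax / N, 1.0 / k), N);
}

int main() {
    // Hypothetical sample statistics, for illustration only.
    std::printf("estimated view size: %.0f\n",
                multifractal_estimate(5000, 12291, 2458285, 40));
}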
4 UNASSUMING VIEW-SIZE ESTIMATION
4.1 Independent Hashing
Hashing maps objects to values in a nearly random way. It has been used for efficient data structures such as hash tables and in cryptography. We are interested in hashing functions from tuples to [0, 2^L) where L is fixed (L = 32 in this paper). Hashing is uniform if P(h(x) = y) = 1/2^L for all x, y, that is, if all hashed values are equally likely. Hashing is pairwise independent if P(h(x1) = y ∧ h(x2) = z) = P(h(x1) = y) P(h(x2) = z) = 1/4^L for all x1, x2, y, z. Pairwise independence implies uniformity. Hashing is k-wise independent if P(h(x1) = y1 ∧ ··· ∧ h(xk) = yk) = 1/2^(kL) for all xi, yi. Finally, hashing is (fully) independent if it is k-wise independent for all k. It is believed that independent hashing is unlikely to be possible over large data sets using a small amount of memory (Durand and Flajolet, 2003).
Next, we show how k-wise independent hashing is easily achieved in a multidimensional data warehousing setting. For each dimension Di, we build a lookup table Ti, using the attribute values of Di as keys. Each time we meet a new key, we generate a random number in [0, 2^L) and store it in the lookup table Ti. This random number is the hashed value of this key. This table generates (fully) independent hash values in amortized constant time. In a data warehousing context, whereas dimensions are numerous, each dimension will typically have few distinct values: for example, there are only 8,760 hours in a year. Therefore, the lookup table will often use a few MiB or less. When hashing a tuple x1, x2, ..., xk in D1 × D2 × ··· × Dk, we use the value T1(x1) ⊕ T2(x2) ⊕ ··· ⊕ Tk(xk), where ⊕ is the EXCLUSIVE OR operator. This hashing is k-wise independent and requires amortized constant time. Tables Ti can be reused for several estimations.
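The following C++ sketch illustrates this lookup-table construction; the class name, the string-keyed tables, and the use of std::mt19937 as a stand-in for a truly random generator are our assumptions, not prescriptions.

#include <cstdint>
#include <cstdio>
#include <random>
#include <string>
#include <unordered_map>
#include <vector>

// d-wise independent tuple hashing: h(x) = T1(x1) XOR ... XOR Td(xd),
// with one lookup table of random 32-bit words per dimension.
class TupleHasher {
    std::vector<std::unordered_map<std::string, std::uint32_t>> tables;  // T_i
    std::mt19937 gen;  // stands in for a truly random generator

public:
    explicit TupleHasher(std::size_t dimensions, std::uint32_t seed = 1)
        : tables(dimensions), gen(seed) {}

    std::uint32_t hash(const std::vector<std::string>& tuple) {
        std::uint32_t h = 0;
        for (std::size_t i = 0; i < tuple.size(); ++i) {
            auto it = tables[i].find(tuple[i]);
            if (it == tables[i].end())  // new key: draw its value in [0, 2^32)
                it = tables[i].emplace(tuple[i], gen()).first;
            h ^= it->second;  // EXCLUSIVE OR across dimensions
        }
        return h;
    }
};

int main() {
    TupleHasher h(2);  // two dimensions, e.g., (hour, store)
    std::printf("%08x %08x\n",
                static_cast<unsigned>(h.hash({"14h", "store-7"})),
                static_cast<unsigned>(h.hash({"14h", "store-8"})));
}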
4.2 Probabilistic Counting
Our implementation of (stochastic) probabilistic counting (Flajolet and Martin, 1985) is given in Algorithm 2. Recently, a variant of this algorithm, LOGLOG, was proposed (Durand and Flajolet, 2003). Assuming independent hashing, these algorithms have standard error (or the standard deviation of the error) of 0.78/√M and 1.3/√M, respectively. These theoretical results assume independent hashing, which we cannot realistically provide. Thus, we do not expect these theoretical results to be always reliable.
4.3 GIBBONS-TIRTHAPURA

Our implementation of the GIBBONS-TIRTHAPURA algorithm (see Algorithm 4) hashes each tuple only once, unlike the original algorithm (Gibbons and Tirthapura, 2001). Moreover, the independence of the hashing depends on the number of dimensions used by the GROUP BY. If the view size is smaller than the memory parameter (M), the view-size estimation is without error. For this reason, we expect GIBBONS-TIRTHAPURA to perform well when estimating small and moderate view sizes.
Algorithm 2 View-size estimation using (stochastic) probabilistic counting.
1: INPUT: Fact table t containing N facts
2: INPUT: GROUP BY query on dimensions D1, D2, ..., Dd
3: INPUT: Memory budget parameter M = 2^k
4: INPUT: Independent hash function h from d-tuples to [0, 2^L)
5: OUTPUT: Estimated size of GROUP BY query
6: b ← M × L matrix (initialized at zero)
7: for tuple x ∈ t do
8:   x' ← π_{D1,D2,...,Dd}(x) {projection of the tuple}
9:   y ← h(x') {hash x' to [0, 2^L)}
10:  α ← y mod M
11:  i ← position of the first 1-bit in ⌊y/M⌋
12:  b_{α,i} ← 1
13: A ← 0
14: for α ∈ {0, 1, ..., M − 1} do
15:  increment A by the position of the first zero-bit in b_{α,0}, b_{α,1}, ...
16: RETURN: M/φ · 2^(A/M) where φ ≈ 0.77351
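For concreteness, here is a compact C++ rendering of Algorithm 2 operating on a stream of already-hashed 32-bit values; the synthetic stream in main (a pseudo-random generator playing the role of an independent hash function) and all names are illustrative assumptions.

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

double probabilistic_count(const std::vector<std::uint32_t>& hashes, int k) {
    const std::uint32_t M = 1u << k;      // number of bitmaps, M = 2^k
    const double phi = 0.77351;
    std::vector<std::uint32_t> b(M, 0);   // row alpha of the bitmap matrix
    for (std::uint32_t y : hashes) {
        std::uint32_t alpha = y % M;      // which bitmap
        std::uint32_t rest = y / M;       // floor(y/M)
        for (int i = 0; i < 32; ++i)      // set the bit at the first 1-bit position
            if ((rest >> i) & 1u) { b[alpha] |= (1u << i); break; }
    }
    double A = 0.0;                       // sum over bitmaps of first zero-bit position
    for (std::uint32_t row : b) {
        int j = 0;
        while (j < 32 && ((row >> j) & 1u)) ++j;
        A += j;
    }
    return (M / phi) * std::pow(2.0, A / M);
}

int main() {
    std::mt19937 gen(42);                 // stands in for an independent hash function
    std::vector<std::uint32_t> hashes;
    for (int key = 0; key < 20000; ++key) {
        std::uint32_t h = gen();          // hashed value of one distinct tuple
        hashes.insert(hashes.end(), 3, h);  // duplicates leave the estimate unchanged
    }
    std::printf("estimate: %.0f (exact: 20000)\n", probabilistic_count(hashes, 8));
}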
Algorithm 3 View-size estimation using LOGLOG.
1: INPUT: Fact table t containing N facts
2: INPUT: GROUP BY query on dimensions D1, D2, ..., Dd
3: INPUT: Memory budget parameter M = 2^k
4: INPUT: Independent hash function h from d-tuples to [0, 2^L)
5: OUTPUT: Estimated size of GROUP BY query
6: M ← 0, 0, ..., 0 (M counters)
7: for tuple x ∈ t do
8:   x' ← π_{D1,D2,...,Dd}(x) {projection of the tuple}
9:   y ← h(x') {hash x' to [0, 2^L)}
10:  j ← value of the first k bits of y in base 2
11:  z ← position of the first 1-bit in the remaining L − k bits of y (count starts at 1)
12:  M_j ← max(M_j, z)
13: RETURN: α_M M 2^((Σ_j M_j)/M) where α_M ≈ 0.39701 − (2π² + ln² 2)/(48M)
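A corresponding minimal C++ sketch of Algorithm 3 is given below; as before, it operates on pre-hashed 32-bit values, and the test stream in main is an illustrative assumption.

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

double loglog_count(const std::vector<std::uint32_t>& hashes, int k) {
    const int M = 1 << k, L = 32;
    std::vector<int> counter(M, 0);              // the counters M_j
    for (std::uint32_t y : hashes) {
        int j = static_cast<int>(y >> (L - k));  // bucket = first k bits of y
        int z = 1;  // first 1-bit position in the remaining L-k bits, counting from 1
        while (z <= L - k && !((y >> (L - k - z)) & 1u)) ++z;
        if (z > counter[j]) counter[j] = z;
    }
    const double pi = 3.14159265358979323846;
    double alpha = 0.39701 - (2 * pi * pi + std::log(2.0) * std::log(2.0)) / (48.0 * M);
    double sum = 0.0;
    for (int c : counter) sum += c;
    return alpha * M * std::pow(2.0, sum / M);
}

int main() {
    std::mt19937 gen(42);  // stands in for an independent hash function
    std::vector<std::uint32_t> hashes;
    for (int key = 0; key < 20000; ++key) hashes.push_back(gen());
    std::printf("estimate: %.0f (exact: 20000)\n", loglog_count(hashes, 8));
}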
The theoretical bounds given in (Gibbons and Tirthapura, 2001) assumed pairwise independence. The generalization below is from (Lemire and Kaser, 2006) and is illustrated by Figure 1.

Proposition 1 Algorithm 4 estimates the number of distinct tuples within relative precision ε, with a k-wise independent hash for k ≥ 2, by storing M distinct tuples (M ≥ 8k) and with reliability 1 − δ, where δ is
Algorithm 4 GIBBONS-TIRTHAPURA view-size estimation.
1: INPUT: Fact table t containing N facts
2: INPUT: GROUP BY query on dimensions D1, D2, ..., Dd
3: INPUT: Memory budget parameter M
4: INPUT: k-wise independent hash function h from d-tuples to [0, 2^L)
5: OUTPUT: Estimated size of GROUP BY query
6: ℳ ← empty lookup table
7: t ← 0
8: for tuple x ∈ t do
9:   x' ← π_{D1,D2,...,Dd}(x) {projection of the tuple}
10:  y ← h(x') {hash x' to [0, 2^L)}
11:  j ← position of the first 1-bit in y (count starts at 0)
12:  if j ≥ t then
13:    ℳ_{x'} ← j
14:    while size(ℳ) > M do
15:      t ← t + 1
16:      prune all entries in ℳ having value less than t
17: RETURN: 2^t size(ℳ)
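The following C++ sketch mirrors Algorithm 4 on pre-hashed 32-bit values; for brevity we key the buffer by the hashed value itself rather than by the tuple, which is an assumption of this sketch (hash collisions between distinct tuples are ignored).

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <random>
#include <unordered_map>
#include <vector>

double gibbons_tirthapura(const std::vector<std::uint32_t>& hashes, std::size_t M) {
    std::unordered_map<std::uint32_t, int> buffer;  // hashed tuple -> j
    int t = 0;                                      // current sampling level
    for (std::uint32_t y : hashes) {
        int j = 0;  // position of the first 1-bit in y (count starts at 0)
        while (j < 32 && !((y >> j) & 1u)) ++j;
        if (j >= t) {
            buffer[y] = j;
            while (buffer.size() > M) {             // over budget: raise the level
                ++t;
                for (auto it = buffer.begin(); it != buffer.end();)
                    if (it->second < t) it = buffer.erase(it); else ++it;
            }
        }
    }
    return std::pow(2.0, t) * static_cast<double>(buffer.size());
}

int main() {
    std::mt19937 gen(42);  // stands in for a k-wise independent hash function
    std::vector<std::uint32_t> hashes;
    for (int key = 0; key < 100000; ++key) hashes.push_back(gen());
    std::printf("estimate: %.0f (exact: ~100000)\n", gibbons_tirthapura(hashes, 2048));
}

Note the pruning of lines 14-16: each time the sampling level t is raised, only tuples whose hashed value has at least t trailing zero bits survive, so each surviving entry stands for about 2^t distinct tuples.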
[Figure 1: Bound on the estimation error (19 times out of 20) as a function of the number of tuples kept in memory (M ∈ [128, 2048]), according to Proposition 1, for GIBBONS-TIRTHAPURA view-size estimation with k-wise independent hashing (curves for k = 2, 4, 8).]
given by
$$\delta = \frac{k^{k/2}\,e^{-k/3}}{M^{k/2}} \left( 2^{k/2} + \frac{8^{k/2}}{\varepsilon^{k}\,(2^{k/2}-1)} \right).$$
More generally, we have
$$\delta \le \frac{k^{k/2}\,e^{-k/3}}{M^{k/2}} \left( \frac{\alpha^{k/2}}{(1-\alpha)^{k}} + \frac{4^{k/2}}{\alpha^{k/2}\,\varepsilon^{k}\,(2^{k/2}-1)} \right)$$
for 4k/M ≤ α < 1 and any k, M > 0.
In the case where hashing is 4-wise independent, as in some of the experiments below, we derive a more concise bound.

Corollary 1 With 4-wise independent hashing, Algorithm 4 estimates the number of distinct tuples within relative precision ε ≤ 5/√M, 19 times out of 20, for ε small.
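As a quick sanity check (our arithmetic, not a result from the references), the largest memory budget used in our experiments gives
$$\varepsilon \le \frac{5}{\sqrt{M}} = \frac{5}{\sqrt{2048}} \approx 0.11,$$
that is, a relative error of roughly 10%, 19 times out of 20, consistent with the accuracy quoted in the abstract.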
Table 1: Characteristics of the data sets.

                   US Census 1990   DBGEN
  # of facts       2,458,285        13,977,981
  # of views       20               8
  # of attributes  69               16
  Data size        360 MB           1.5 GB
Proof. We start from the second inequality of Proposition 1. Differentiating
$$\frac{\alpha^{k/2}}{(1-\alpha)^{k}} + \frac{4^{k/2}}{\alpha^{k/2}\,\varepsilon^{k}\,(2^{k/2}-1)}$$
with respect to α (for k = 4) and setting the result to zero, we get $3\alpha^{4}\varepsilon^{4} + 16\alpha^{3} - 48\alpha^{2} + 48\alpha - 16 = 0$ (recall that 4k/M ≤ α < 1). By multiscale analysis, we seek a solution of the form $\alpha = 1 - a\varepsilon^{r} + o(\varepsilon^{r})$, and we have that $\alpha \approx 1 - \frac{1}{2}\sqrt[3]{3/2}\,\varepsilon^{4/3}$ for ε small. Substituting this value of α, the expression above is approximately $128/(24\varepsilon^{4})$. The result follows by substituting into the second inequality. □
5 EXPERIMENTAL RESULTS
To benchmark the quality of the view-size estimation against the memory and speed, we have run tests over the US Census 1990 data set (Hettich and Bay, 2000) as well as on synthetic data produced by DBGEN (TPC, 2006). The synthetic data was produced by running the DBGEN application with a scale factor parameter equal to 2. The characteristics of the data sets are detailed in Table 1. We selected 20 and 8 views, respectively, from these data sets: all views in US Census 1990 have at least 4 dimensions, whereas only 2 views have at least 4 dimensions in the synthetic data set.

We used the GNU C++ compiler version 4.0.2 with the “-O2” optimization flag on a Centrino Duo 1.83 GHz machine with 2 GB of RAM running Linux kernel 2.6.13-15. No thrashing was observed. To ensure reproducibility, C++ source code is available freely from the authors.
For the US Census 1990 data set, the hashing lookup table is a simple array since there are always fewer than 100 attribute values per dimension. Otherwise, for the synthetic DBGEN data, we used the GNU/SGI STL extension hash_map, which is to be integrated into the C++ standard as unordered_map: it provides amortized O(1) inserts and queries. All other lookup tables are implemented using the STL map template, which has the same performance characteristics as a red-black tree. We used comma-separated values (CSV) text files (and pipe-separated files for DBGEN) and wrote our own C++ parsing code.
The test protocol we adopted (see Algorithm 5) has been executed for each estimation technique (LOGLOG, probabilistic counting and GIBBONS-TIRTHAPURA), GROUP BY query, random seed, and memory size. At each step corresponding to those parameter values, we compute the estimated sizes of the GROUP BYs and the time required for their computation. For the multifractal estimation technique, we computed in the same way the time and estimated size for each GROUP BY, sampling ratio value, and random seed.
Algorithm 5 Test protocol.
1: for GROUP BY query q ∈ Q do
2:   for memory budget m ∈ M do
3:     for random seed value r ∈ R do
4:       Estimate the size of GROUP BY q with memory budget m and random seed r
5:       Save estimation results (time and estimated size) in a log file
US Census 1990. Figure 2 plots the largest 95th-percentile error observed over 20 test estimations for various memory sizes M ∈ {16, 64, 256, 2048}. For the multifractal estimation technique, we represent the error for each sampling ratio p ∈ {0.1%, 0.3%, 0.5%, 0.7%}. The X axis represents the exact GROUP BY sizes. This 95th-percentile error can be related to the theoretical bound for ε with 19/20 reliability for GIBBONS-TIRTHAPURA (see Corollary 1): we see that this upper bound is verified experimentally. However, the error on “small” view sizes can exceed 100% for probabilistic counting and LOGLOG.
Synthetic data set. Similarly, we computed the 19/20 error for each technique over the DBGEN data set. We observed that the four techniques exhibit the same behaviour observed on the US Census data set. However, this time, the theoretical bound for the 19/20 error is larger because the synthetic data set has many views with fewer than 2 dimensions.
Speed. We have also computed the time needed by each technique to estimate view sizes. We do not report these times in detail because they are similar for each technique, except for the multifractal technique, which is the fastest one. In addition, we observed that the time does not depend on the memory budget because most of the time is spent streaming and hashing the data. For the multifractal technique, the processing time increases with the sampling ratio.

The time needed to estimate the size of all the views by GIBBONS-TIRTHAPURA, probabilistic counting and LOGLOG is about 5 minutes for the US Census 1990 data set and 7 minutes for the synthetic data set. For the multifractal technique, all the estimates are done in roughly 2 seconds. This time does not include the time needed for sampling the data, which can be significant: it takes 1 minute (resp. 4 minutes) to sample 0.5% of the US Census data set (resp. the synthetic TPC-H data set) because the data is not stored in a flat file.
6 DISCUSSION
Our results show that probabilistic counting and
LOGLOG do not entirely live up to their theoretical
promise. For small view sizes, the relative accuracy
can be very low.
When comparing the memory usage of the various techniques, we have to keep in mind that the memory parameter M can translate into different memory usages. The memory usage also depends on the number of dimensions of each view. Generally, GIBBONS-TIRTHAPURA will use more memory for the same value of M than either probabilistic counting or LOGLOG, though all of these can be small compared to the memory usage of the lookup tables Ti used for k-wise independent hashing. In this paper, the memory usage was always on the order of a few MiB, which is negligible in a data warehousing context.
View-size estimation by sampling can take minutes when data is not laid out in a flat file or indexed, but the time required for an unassuming estimation is even higher. Streaming and hashing the tuples accounts for most of the processing time, so for faster estimates, we could store all hashed values in a bitmap (one per dimension).
7 CONCLUSION AND FUTURE WORK
In this paper, we have provided unassuming techniques for view-size estimation in a data warehousing context. We adapted an estimator due to Gibbons and Tirthapura. We compared this technique experimentally with stochastic probabilistic counting, LOGLOG, and multifractal statistical models. We have demonstrated that among these techniques, only GIBBONS-TIRTHAPURA provides stable estimates irrespective of the size of views. For relatively large views, (stochastic) probabilistic counting has a small edge in accuracy, whereas the competitive sampling-based technique (multifractal) is an order of magnitude faster but can provide crude estimates. According to our experiments, LOGLOG was not faster
than either GIBBONS-TIRTHAPURA or probabilistic counting, and since it is less accurate than probabilistic counting, we cannot recommend it.

[Figure 2: 95th-percentile error (19/20 ε) as a function of exact view size for increasing values of M (US Census 1990). Panels: (a) GIBBONS-TIRTHAPURA, (b) Probabilistic counting, (c) LOGLOG, each for M ∈ {16, 64, 256, 2048} (panel (a) also shows 4-wise hashing with M = 2048); (d) Multifractal, for p ∈ {0.1%, 0.3%, 0.5%, 0.7%}.]

There is ample room for future work. Firstly, we plan to extend these techniques to other types of aggregated views (for example, views including HAVING clauses). Secondly, we want to precompute the hashed values for very fast view-size estimation. Furthermore, these techniques should be tested in a materialized view selection heuristic.
ACKNOWLEDGEMENTS
The authors wish to thank Owen Kaser for hardware
and software. This work is supported by NSERC
grant 261437 and by FQRNT grant 112381.
REFERENCES
Durand, M. and Flajolet, P. (2003). Loglog counting of large cardinalities. In ESA'03. Springer.

Faloutsos, C., Matias, Y., and Silberschatz, A. (1996). Modeling skewed distributions using multifractals and the 80-20 law. In VLDB'96, pages 307-317.

Flajolet, P. and Martin, G. N. (1985). Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182-209.

Gibbons, P. B. and Tirthapura, S. (2001). Estimating simple functions on the union of data streams. In SPAA'01, pages 281-291.

Golfarelli, M. and Rizzi, S. (1998). A methodological framework for data warehouse design. In DOLAP 1998, pages 3-9.

Gupta, H. (1997). Selection of views to materialize in a data warehouse. In ICDT 1997, pages 98-112.

Haas, P., Naughton, J., Seshadri, S., and Stokes, L. (1995). Sampling-based estimation of the number of distinct values of an attribute. In VLDB'95, pages 311-322.

Hettich, S. and Bay, S. D. (2000). The UCI KDD archive. http://kdd.ics.uci.edu, last checked on 32/10/2006.

Lemire, D. and Kaser, O. (2006). One-pass, one-hash n-gram count estimation. Technical Report TR-06-001, Dept. of CSAS, UNBSJ. Available from http://arxiv.org/abs/cs.DB/0610010.

Nadeau, T. and Teorey, T. (2003). A Pareto model for OLAP view size estimation. Information Systems Frontiers, 5(2):137-147.

TPC (2006). DBGEN 2.4.0. http://www.tpc.org/tpch/, last checked on 32/10/2006.