The test protocol we adopted (see Algorithm 5)
has been executed for each estimation technique
(LOGLOG, probabilistic counting and
GIBBONS-TIRTHAPURA), GROUP BY query, random seed
and memory size. At each step corresponding to those
parameter values, we compute the estimated sizes of
the GROUP BYs and the time required for their
computation. For the multifractal estimation technique,
we computed in the same way the time and the estimated
size for each GROUP BY, sampling ratio value and
random seed.
Algorithm 5 Test protocol.
for GROUP BY query q ∈ Q do
  for memory budget m ∈ M do
    for random seed value r ∈ R do
      Estimate the size of GROUP BY q with m memory budget and r random seed value
      Save estimation results (time and estimated size) in a log file
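The loop structure of Algorithm 5 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name run_protocol, the CSV column names and the stand-in estimator toy_estimate are hypothetical and do not come from the experiments; any callable taking (query, memory budget, seed) could be plugged in.

```python
import csv
import io
import time


def run_protocol(estimate, queries, memory_budgets, seeds, log):
    """Run one estimator over every (q, m, r) combination, as in Algorithm 5.

    `estimate` is any callable (query, memory, seed) -> estimated size.
    `log` is a writable text stream; one CSV row is written per run.
    Returns the number of runs performed.
    """
    writer = csv.writer(log)
    writer.writerow(["query", "memory", "seed", "estimated_size", "seconds"])
    runs = 0
    for q in queries:                      # GROUP BY query q in Q
        for m in memory_budgets:           # memory budget m in M
            for r in seeds:                # random seed value r in R
                start = time.perf_counter()
                size = estimate(q, m, r)
                elapsed = time.perf_counter() - start
                writer.writerow(["|".join(q), m, r, size, elapsed])
                runs += 1
    return runs


def toy_estimate(query, memory, seed):
    """Hypothetical stand-in for a real estimator (assumption, not the paper's)."""
    return len(query) * memory + seed


# Two made-up GROUP BY queries, four memory budgets, 20 seeds.
log = io.StringIO()
n_runs = run_protocol(
    toy_estimate,
    queries=[("age",), ("age", "sex")],
    memory_budgets=[16, 64, 256, 2048],
    seeds=range(20),
    log=log,
)
```

Passing an open file instead of the in-memory stream gives the log file of the last step; timing is taken around the estimator call only, so the cost of logging is excluded.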
US Census 1990. Figure 2 plots the largest 95th
percentile error observed over 20 test estimations
for various memory sizes M ∈ {16, 64, 256, 2048}.
For the multifractal estimation technique, we
represent the error for each sampling ratio
p ∈ {0.1%, 0.3%, 0.5%, 0.7%}. The X axis represents
the exact GROUP BY sizes. This 95th percentile error
can be related to the theoretical bound for ε with
19/20 reliability for GIBBONS-TIRTHAPURA (see
Corollary 1): we see that this upper bound is
verified experimentally. However, the error on
“small” view sizes can exceed 100% for
probabilistic counting and LOGLOG.
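As a concrete illustration of the 19/20 error, the sketch below computes a nearest-rank 95th percentile of the relative errors of repeated estimates. The function names and the sample numbers are made up for the example, and the nearest-rank convention is an assumption: the text does not say which percentile definition was used.

```python
def relative_error(estimate, exact):
    """Relative error |estimate - exact| / exact of a single run."""
    return abs(estimate - exact) / exact


def percentile_error(estimates, exact, num=19, den=20):
    """Nearest-rank num/den percentile of the relative errors.

    With 20 runs and the default 19/20, this is the 19th smallest
    error, i.e. the error that 19 runs out of 20 do not exceed.
    """
    errors = sorted(relative_error(e, exact) for e in estimates)
    rank = (num * len(errors) + den - 1) // den  # integer ceil(num * n / den)
    return errors[rank - 1]


# Made-up data: 20 estimates of a GROUP BY whose exact size is 1000.
estimates = [1000 + 10 * i for i in range(20)]
err_95 = percentile_error(estimates, exact=1000)
```

Here the sorted relative errors are 0%, 1%, ..., 19%, so the 19/20 error is the 19th smallest value, 18%; the single worst run (19%) is the one allowed to fall outside the bound.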
Synthetic data set. Similarly, we computed the
19/20 error for each technique on the DBGEN data
set. We observed that the four techniques behave
as they did on the US Census data set. This time,
however, the theoretical bound for the 19/20 error
is larger because the synthetic data set has many
views with fewer than 2 dimensions.
Speed. We have also computed the time needed for
each technique to estimate view sizes. We do not
plot this time because it is similar for all
techniques except the multifractal one, which is
the fastest. In addition, we observed that the time
does not depend on the memory budget because most
of the time is spent streaming and hashing the data.
For the multifractal technique, the processing time
increases with the sampling ratio.
The time needed to estimate the size of all
the views by GIBBONS-TIRTHAPURA, probabilistic
counting and LOGLOG is about 5 minutes for the
US Census 1990 data set and 7 minutes for the
synthetic data set. For the multifractal technique,
all the estimates are done in roughly 2 seconds. This
time does not include the time needed for sampling
the data, which can be significant: it takes 1 minute
(resp. 4 minutes) to sample 0.5% of the US Census
data set (resp. the synthetic TPC-H data set) because
the data is not stored in a flat file.
6 DISCUSSION
Our results show that probabilistic counting and
LOGLOG do not entirely live up to their theoretical
promise. For small view sizes, the relative accuracy
can be very low.
When comparing the memory usage of the various
techniques, we have to keep in mind that the
memory parameter M can translate into different
memory usage. The memory usage also depends on
the number of dimensions of each view. Generally,
GIBBONS-TIRTHAPURA will use more memory for
the same value of M than either probabilistic counting
or LOGLOG, though all of these can be small
compared to the memory usage of the lookup tables T_i
used for k-wise independent hashing. In this paper,
the memory usage was always of the order of a few
MiB, which is negligible in a data warehousing
context.
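To make the role of M more concrete, here is a textbook-style sketch of the Gibbons-Tirthapura idea. It is not necessarily the adapted variant used in the experiments. It keeps a buffer of at most M hashed values, so its footprint grows with M times the size of a stored value, whereas a LOGLOG register only holds a small counter. The use of BLAKE2b as the hash function is an assumption made for self-containment; it stands in for the k-wise independent hashing with lookup tables described in the text.

```python
import hashlib


def hash64(item, seed):
    """64-bit hash of an item (BLAKE2b stand-in, an assumption of this sketch)."""
    digest = hashlib.blake2b(
        repr(item).encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def trailing_zeros(h, width=64):
    """Number of trailing zero bits of h (width if h is zero)."""
    if h == 0:
        return width
    return (h & -h).bit_length() - 1


def gibbons_tirthapura(items, memory, seed=0):
    """Estimate the number of distinct items with a buffer of at most `memory` hashes."""
    level = 0
    buffer = set()
    for item in items:
        h = hash64(item, seed)
        if trailing_zeros(h) >= level:
            buffer.add(h)
            # When the buffer overflows, raise the level: each increment
            # keeps, in expectation, half of the stored hashed values.
            while len(buffer) > memory:
                level += 1
                buffer = {v for v in buffer if trailing_zeros(v) >= level}
    # Each surviving value represents about 2**level distinct items.
    return len(buffer) * 2 ** level


# Fewer distinct values than M: the buffer never overflows.
small = gibbons_tirthapura([i % 50 for i in range(1000)], memory=64)
# More distinct values than M: the result is an approximation.
large = gibbons_tirthapura(range(10000), memory=256, seed=1)
```

When the number of distinct values does not exceed M, the level stays at 0 and the count is exact up to hash collisions, which is consistent with the stable behaviour on small views reported above.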
View-size estimation by sampling can take minutes
when data is not laid out in a flat file or
indexed, but the time required for an unassuming
estimation is even higher. Streaming and hashing the
tuples accounts for most of the processing time, so
for faster estimates, we could store all hashed
values in a bitmap (one per dimension).
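One possible reading of this suggestion is sketched below: each dimension is hashed once through its own lookup table, the hashed columns are kept, and the hash of a tuple in any view is then obtained by XORing the cached words of the view's dimensions, without touching the raw data again. The table construction, the function names and the sample columns are assumptions made for illustration; the storage layout intended in the text is not specified.

```python
import random


def build_tables(columns, seed):
    """One lookup table per dimension: attribute value -> random 64-bit word."""
    rng = random.Random(seed)
    return {
        name: {v: rng.getrandbits(64) for v in sorted(set(col))}
        for name, col in columns.items()
    }


def hash_columns(columns, tables):
    """Hash every dimension once; the result can be cached and reused."""
    return {name: [tables[name][v] for v in col] for name, col in columns.items()}


def view_hashes(hashed, dims):
    """Yield one hash per tuple of the view on `dims` by XORing cached words."""
    for words in zip(*(hashed[d] for d in dims)):
        h = 0
        for w in words:
            h ^= w
        yield h


# Made-up fact table with three dimensions and four tuples.
columns = {
    "age": [30, 30, 41, 41],
    "sex": ["F", "M", "F", "F"],
    "state": ["NY", "NY", "CA", "NY"],
}
hashed = hash_columns(columns, build_tables(columns, seed=42))
distinct_age_sex = len(set(view_hashes(hashed, ("age", "sex"))))
```

The stream of hashes produced by view_hashes can be fed to any of the hashing-based estimators; the trade-off is the extra storage of one hashed column per dimension.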
7 CONCLUSION AND FUTURE WORK
In this paper, we have provided unassuming
techniques for view-size estimation in a data
warehousing context. We adapted an estimator due to
Gibbons and Tirthapura. We compared this technique
experimentally with stochastic probabilistic counting,
LOGLOG, and multifractal statistical models. We have
demonstrated that among these techniques, only
GIBBONS-TIRTHAPURA provides stable estimates
irrespective of the size of views. Otherwise,
(stochastic) probabilistic counting has a small edge
in accuracy for relatively large views, whereas the
competitive sampling-based technique (multifractal)
is an order of magnitude faster but can provide crude
estimates. According to our experiments, LOGLOG was
not faster