
 
that satisfy the conditions of a given query) could be 
drawn more efficiently using the k-MDI tree. 
The sampling scheme using the k-MDI tree index 
(Rudra et al., 2012) facilitates picking rich samples 
for queries with highly specific selection conditions. 
If the sample contains an adequate number of 
records that satisfy the query conditions, the average 
values can be estimated from these records. 
However, to estimate the sum (= avg × count) we need
to estimate both the average and the count of the
records that satisfy the query in the whole database.
Therefore, from the sample we need to project the 
number of records in the entire database that satisfy 
the given query conditions. Chaudhuri and Mukerjee
(1985) proposed an unbiased estimator based on
inverse simple random sampling without replacement
(SRSWOR), where random sampling is carried out on
a finite population until a predefined number of
domain members is observed. In this paper, we
propose adapting inverse SRSWOR to estimate
adequate sample sizes for queries using the k-MDI
tree index. The method is empirically evaluated on a
large real-world data warehouse.
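To make the projection step concrete, the following is a minimal Python sketch of inverse SRSWOR (our illustration under simplifying assumptions, not the implementation evaluated in this paper; the names records, predicate and measure are ours). Records are drawn without replacement until a preset number m of them satisfy the query predicate; the classical unbiased estimator (m − 1)/(n − 1) of the domain proportion then projects the matching count onto all N records, and the sum is recovered as avg × count:

import random

def inverse_srswor_estimates(records, predicate, measure, m, rng=random):
    """Draw records without replacement until m of them satisfy
    `predicate`, then project count / avg / sum onto all N records."""
    assert m >= 2                       # needed so that n - 1 > 0 below
    N = len(records)
    order = list(range(N))
    rng.shuffle(order)                  # random permutation = SRSWOR draw order
    values, n = [], 0
    for idx in order:
        n += 1
        if predicate(records[idx]):
            values.append(measure(records[idx]))
            if len(values) == m:
                break
    if len(values) < m:
        raise ValueError("fewer than m matching records in the population")
    count_hat = N * (m - 1) / (n - 1)   # unbiased for the matching count
    avg_hat = sum(values) / m           # average over the m matching records
    return count_hat, avg_hat, avg_hat * count_hat   # count, avg, sum

Using (m − 1)/(n − 1) rather than m/n corrects the upward bias introduced by stopping the draw exactly at the m-th match.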
The rest of the paper is organized as follows. In
Section 2, we briefly describe the k-ary multi-
dimensional index (k-MDI) tree structure and the
storage structure of data records. Section 3 discusses
how to pick adequate samples using inverse
SRSWOR. In Section 4, we discuss the results of our
experiments. Finally, Section 5 concludes the paper.
2 TERMS, DEFINITIONS AND 
k-MDI TREE INDEX 
In this section, we define some terms pertaining to
data warehousing, define the confidence interval, and
then review the k-MDI tree index for retrieving
relevant samples from a data warehouse.
2.1 Dimensions and Measures
To execute decision support queries, data is usually
structured in large databases called data warehouses.
Typically, a data warehouse is a relational database
with a large central table, called the fact table,
connected to other tables called dimension tables. For
example, consider the fact table Sales shown as Table
1. A dimension table Store, linked to StoreNo in this
fact table, will contain more information on each of
the stores, such as store name, location, state, and
country (Hobbs et al., 2003). Other dimension tables
could exist for items and dates. The remaining
attributes, like quantity and amount, are typically, but
not necessarily, numerical and are termed measures.
A typical decision support query aggregates a 
measure using functions such as Sum(), Avg() or 
Count(). The fact table Sales along with all its 
dimension tables forms a star schema. 
Table 1: Fact table SALES.

 StoreNo   Date        Item   Quantity    Amount
      21   12-Jan-11   iPad        223   123,455
      21   12-Jan-11   PC           20    24,800
      24   11-Jan-11   iMac         11     9,990
      77   25-Jan-11   PC           10    12,600
In decision support queries, a measure is of interest
for the calculation of averages, totals, and counts. For
example, a sales manager may want to know the total
sales quantity and amount for certain item(s) in a
certain period of time for a particular store, or even
all (or some) stores in a region. This may then allow
her to decide, at a point in time, whether to order
more or less stock as appropriate.
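As a concrete illustration of such a query (using the rows of Table 1; pandas is our assumed tool here, not something prescribed by the paper), the following sketch computes the total sales quantity and amount for PCs in January 2011 across all stores:

import pandas as pd

# Fact table Sales with the columns of Table 1.
sales = pd.DataFrame({
    "StoreNo":  [21, 21, 24, 77],
    "Date":     pd.to_datetime(["2011-01-12", "2011-01-12",
                                "2011-01-11", "2011-01-25"]),
    "Item":     ["iPad", "PC", "iMac", "PC"],
    "Quantity": [223, 20, 11, 10],
    "Amount":   [123455, 24800, 9990, 12600],
})

# Total quantity and amount for item 'PC' in January 2011, all stores.
pc_jan = sales[(sales["Item"] == "PC") &
               (sales["Date"].between("2011-01-01", "2011-01-31"))]
print(pc_jan[["Quantity", "Amount"]].sum())   # Quantity 30, Amount 37,400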
2.2 Multidimensional Indexing 
The k-ary multi-dimensional index tree (k-MDI tree)
proposed by Rudra et al. (2012) extends the ACE
Tree index (Joshi and Jermaine, 2008) to multiple
dimensions. The height of the k-MDI tree is limited
to the number of key attributes. As a multi-way tree 
index, it is relatively shallow even for a large 
number of key value ranges and so requires only a 
small number of disk accesses to traverse from the 
root to the leaf nodes. 
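The effect of the shallow, multi-way layout is easy to quantify. In the following back-of-the-envelope sketch, the fanout k = 100 and height h = 4 are our assumed figures, not values from the paper:

k, h = 100, 4            # assumed: up to k range splits per attribute, h key attributes
leaf_ranges = k ** h     # distinct multi-attribute key ranges covered by the leaves
node_accesses = h        # one index node read per level on a root-to-leaf path
print(f"{leaf_ranges:,} key ranges reachable in {node_accesses} node accesses")
# -> 100,000,000 key ranges reachable in 4 node accesses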
The k-MDI tree is a k-ary balanced tree (Bentley,
1975), as described below:

1.  The root node of a k-MDI tree corresponds to the
first attribute (dimension) in the index.
2.  The root points to k₁ (k₁ ≤ k) index nodes at level
2, with each node corresponding to one of the k₁
splits of the ranges for attribute a₁.
3.  Each of the nodes at level 2, in turn, points to up
to k₂ (k₂ ≤ k) index nodes at level 3,
corresponding to k₂ splits of the ranges of values
of attribute a₂; similarly for nodes at levels 3 to
h, corresponding to attributes a₃, ..., aₕ.
4.  At level h, each of up to kₕ₋₁ nodes points to up to
kₕ (kₕ ≤ k) leaf nodes that store data records.
5.  Each leaf node has h+1 sections; for sections 1 to
h, each section i contains a random subset of the
records in the key range of the node i in