CONSTRUCTION OF DECISION TREES USING DATA CUBE

Lixin Fu

Abstract

Data classification is an important problem in data mining. Traditional classification algorithms based on decision trees are widely used because they build models quickly and produce models that are easy to understand. However, existing decision tree algorithms must recursively partition the dataset into subsets according to some splitting criterion; that is, they repeatedly compute the records belonging to a node (called F-sets) and then compute the splits for that node. For large datasets, this requires multiple passes over the original dataset and is therefore infeasible in many applications. In this paper we present a new approach to constructing decision trees using a pre-computed data cube. We use statistics trees to compute the data cube and then build a decision tree on top of it. Mining aggregated data stored in a data cube is much more efficient than mining flat data files or relational databases directly. Since a data cube server is usually a required component of an analytical system for answering OLAP queries, we essentially provide “free” classification by eliminating the dominant I/O overhead of scanning the massive original dataset. Our new algorithm generates trees with the same prediction accuracy as existing decision tree algorithms such as SPRINT and RainForest while improving performance significantly. We also give a system architecture that integrates DBMS, OLAP, and data mining seamlessly.
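The idea of choosing a split from aggregated counts rather than from raw records can be illustrated with a minimal sketch. It is not the paper's statistics-tree algorithm: a plain Python Counter keyed by (attribute value, class label) stands in for one cuboid of the data cube, the Gini index is used only as an example splitting criterion, and every name (build_cuboid, split_gini, best_split, the toy records) is hypothetical.

```python
from collections import Counter

def build_cuboid(records, attr, label):
    """Aggregate (attribute value, class label) counts in a single pass.
    Assumption: this Counter stands in for a pre-computed cuboid."""
    return Counter((r[attr], r[label]) for r in records)

def gini(class_counts):
    """Gini impurity of a class-count distribution."""
    total = sum(class_counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in class_counts.values())

def split_gini(cuboid):
    """Weighted Gini of a multiway split, computed from counts only."""
    per_value = {}
    for (value, cls), count in cuboid.items():
        per_value.setdefault(value, Counter())[cls] += count
    total = sum(cuboid.values())
    return sum(sum(cc.values()) / total * gini(cc) for cc in per_value.values())

def best_split(cuboids):
    """Pick the attribute whose cuboid gives the lowest weighted Gini."""
    return min(cuboids, key=lambda a: split_gini(cuboids[a]))

# Toy data (hypothetical): "outlook" separates the classes, "windy" does not.
records = [
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "rain", "windy": "yes", "play": "yes"},
    {"outlook": "rain", "windy": "no", "play": "yes"},
]

# The records are scanned once to build the aggregates; split selection
# afterwards touches only the counts.
cuboids = {a: build_cuboid(records, a, "play") for a in ("outlook", "windy")}
chosen = best_split(cuboids)
```

Once the cuboids exist, evaluating a candidate split costs time proportional to the number of distinct (value, class) pairs rather than the number of records, which is the source of the savings the abstract describes.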

References

  1. Agarwal, S., R. Agrawal, et al. (1996). On The Computation of Multidimensional Aggregates. Proceedings of the International Conference on Very Large Databases, Mumbai (Bombay), India: 506-521.
  2. Beyer, K. and R. Ramakrishnan (1999). Bottom-Up Computation of Sparse and Iceberg CUBEs. Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (SIGMOD '99). C. Faloutsos. Philadelphia, PA: 359-370.
  3. Chan, C. Y. and Y. E. Ioannidis (1998). Bitmap Index Design and Evaluation. Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (SIGMOD '98), Seattle, WA: 355-366.
  4. Chaudhuri, S. and U. Dayal (1997). "An Overview of Data Warehousing and OLAP Technology." SIGMOD Record 26(1): 65-74.
  5. Chaudhuri, S., U. Fayyad, et al. (1999). Scalable Classification over SQL Databases. 15th International Conference on Data Engineering, March 23 - 26, 1999, Sydney, Australia: 470.
  6. Cheeseman, P. and J. Stutz (1996). Bayesian Classification (AutoClass): Theory and Results. Advances in Knowledge Discovery and Data Mining. R. Uthurusamy, AAAI/MIT Press: 153-180.
  7. Comer, D. (1979). "The Ubiquitous B-Tree." ACM Computing Surveys 11(2): 121-137.
  8. Duda, R. and P. Hart (1973). Pattern Classification and Scene Analysis. New York, John Wiley & Sons.
  9. Fu, L. (2003). Classification for Free. International Conference on Internet Computing 2003 (IC'03) June 23 - 26, 2003, Monte Carlo Resort, Las Vegas, Nevada, USA.
  10. Fu, L. and J. Hammer (2000). CUBIST: A New Algorithm For Improving the Performance of Ad-hoc OLAP Queries. ACM Third International Workshop on Data Warehousing and OLAP, Washington, D.C, USA, November: 72-79.
  11. Gehrke, J., V. Ganti, et al. (1999). BOAT - Optimistic Decision Tree Construction. Proc. 1999 Int. Conf. Management of Data (SIGMOD '99), Philadelphia, PA, June 1999: 169-180.
  12. Gehrke, J., R. Ramakrishnan, et al. (1998). RainForest - A Framework for Fast Decision Tree Construction of Large Datasets. Proceedings of the 24th VLDB Conference (VLDB '98), New York, USA, 1998: 416-427.
  13. Hammer, J. and L. Fu (2001). Improving the Performance of OLAP Queries Using Families of Statistics Trees. 3rd International Conference on Data Warehousing and Knowledge Discovery DaWaK 01, September, 2001, Munich, Germany: 274-283.
  14. Han, J. and M. Kamber (2001). Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers.
  15. Harinarayan, V., A. Rajaraman, et al. (1996). "Implementing data cubes efficiently." SIGMOD Record 25(2): 205-216.
  16. Inmon, W. H. (1996). Building the Data Warehouse. New York, John Wiley & Sons.
  17. Johnson, T. and D. Shasha (1997). "Some Approaches to Index Design for Cube Forests." Bulletin of the Technical Committee on Data Engineering, IEEE Computer Society 20(1): 27-35.
  18. Lakshmanan, L. V. S., J. Pei, et al. (2003). QC-Trees: An Efficient Summary Structure for Semantic OLAP. Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California, USA, June 9-12, 2003. A. Doan, ACM: 64-75.
  19. Lent, B., A. Swami, et al. (1997). Clustering Association Rules. Proceedings of the Thirteenth International Conference on Data Engineering (ICDE '97), Birmingham, U.K.: 220-231.
  20. Lu, H., R. Setiono, et al. (1995). NeuroRule: A Connectionist Approach to Data Mining. VLDB'95, Proceedings of 21th International Conference on Very Large Data Bases, September 11-15, 1995, Zurich, Switzerland. S. Nishio, Morgan Kaufmann: 478-489.
  21. Mehta, M., R. Agrawal, et al. (1996). SLIQ: A Fast Scalable Classifier for Data Mining. Advances in Database Technology - EDBT'96, 5th International Conference on Extending Database Technology, Avignon, France, March 25-29, 1996, Proceedings. G. Gardarin, Springer. 1057: 18-32.
  22. O'Neil, P. (1987). Model 204 Architecture and Performance. Proc. of the 2nd International Workshop on High Performance Transaction Systems, Asilomar, CA: 40-59.
  23. O'Neil, P. and D. Quass (1997). "Improved Query Performance with Variant Indexes." SIGMOD Record (ACM Special Interest Group on Management of Data) 26(2): 38-49.
  24. Quinlan, J. R. (1986). "Induction of Decision Trees." Machine Learning 1: 81-106.
  25. Quinlan, J. R. (1993). C4.5: Programs for Machine Learning, Morgan Kaufmann.
  26. Shafer, J., R. Agrawal, et al. (1996). SPRINT: A Scalable Parallel Classifier for Data Mining. VLDB'96, Proceedings of 22th International Conference on Very Large Data Bases, September 3-6, 1996, Mumbai (Bombay), India. N. L. Sarda, Morgan Kaufmann: 544-555.
  27. Sismanis, Y., A. Deligiannakis, et al. (2002). Dwarf: shrinking the PetaCube. Proceedings of the 2002 ACM SIGMOD international conference on Management of data (SIGMOD '02), Madison, Wisconsin: 464-475.
  28. Zhao, Y., P. M. Deshpande, et al. (1997). "An Array-Based Algorithm for Simultaneous Multidimensional Aggregates." SIGMOD Record 26(2): 159-170.


Paper Citation


in Harvard Style

Fu L. (2005). CONSTRUCTION OF DECISION TREES USING DATA CUBE. In Proceedings of the Seventh International Conference on Enterprise Information Systems - Volume 2: ICEIS, ISBN 972-8865-19-8, pages 119-126. DOI: 10.5220/0002509801190126


in Bibtex Style

@conference{iceis05,
author={Lixin Fu},
title={CONSTRUCTION OF DECISION TREES USING DATA CUBE},
booktitle={Proceedings of the Seventh International Conference on Enterprise Information Systems - Volume 2: ICEIS},
year={2005},
pages={119-126},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002509801190126},
isbn={972-8865-19-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the Seventh International Conference on Enterprise Information Systems - Volume 2: ICEIS
TI - CONSTRUCTION OF DECISION TREES USING DATA CUBE
SN - 972-8865-19-8
AU - Fu L.
PY - 2005
SP - 119
EP - 126
DO - 10.5220/0002509801190126