Learning in Dynamic Environments: Decision Trees for Data Streams

João Gama, Pedro Medas

2004

Abstract

This paper presents an adaptive learning system for the induction of a forest of trees from data streams that is able to detect concept drift. We extend our previous work on Ultra Fast Forest Trees (UFFT) with the ability to detect concept drift in the distribution of the examples. UFFT is an incremental algorithm that works online, processing each example in constant time and performing a single scan over the training examples. The system is designed for continuous data: it uses analytical techniques to choose the splitting criteria, and information gain to estimate the merit of each possible splitting test. The number of examples required to evaluate the splitting criterion is determined by the Hoeffding bound. For multi-class problems the algorithm builds a binary tree for each possible pair of classes, leading to a forest of trees. During the training phase the algorithm maintains a short-term memory: given a data stream, a fixed number of the most recent examples are kept in a data structure that supports constant-time insertion and deletion. When a test is installed, a leaf is transformed into a decision node with two descendant leaves, and the sufficient statistics of these leaves are initialized with the examples from the short-term memory that fall into them. To detect concept drift, we maintain at each inner node a naive Bayes classifier trained with the examples that traverse the node. While the distribution of the examples is stationary, the online error of the naive Bayes classifier decreases; when the distribution changes, it increases. In that case the test installed at the node is no longer appropriate for the current distribution of the examples, and the entire subtree rooted at that node is pruned. The methodology was tested on two artificial data sets and one real-world data set. The experimental results show good performance both in detecting the change of concept and in learning the new concept.
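To make the split rule concrete, here is a minimal sketch of the Hoeffding-bound test (the range R, the confidence parameter delta, and all function names are illustrative choices, not taken from the paper):

import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    # With probability 1 - delta, the mean of n i.i.d. observations that
    # lie in a range of size value_range is within epsilon of the true mean.
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain: float, second_gain: float, n: int,
                 delta: float = 1e-6, gain_range: float = 1.0) -> bool:
    # Install the best splitting test only when it beats the runner-up by
    # more than epsilon, so that with high probability the same test would
    # be chosen on an infinite sample.
    return (best_gain - second_gain) > hoeffding_bound(gain_range, delta, n)

Under this rule a leaf accumulates examples until the gap between the two best splitting tests exceeds epsilon, which shrinks as O(1/sqrt(n)).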
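The short-term memory can be sketched as a fixed-capacity buffer; collections.deque provides the constant-time insertion and deletion the abstract requires (the node.covers interface is a hypothetical stand-in for routing an example to a node, not part of the paper):

from collections import deque

class ShortTermMemory:
    # Sliding window over the most recent examples of the stream.
    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)  # O(1) append; oldest drops off

    def add(self, example) -> None:
        self.buffer.append(example)

    def examples_reaching(self, node):
        # Replay buffered examples that traverse the given node, used to
        # initialize the sufficient statistics of freshly created leaves.
        # node.covers(example) is an assumed helper.
        return [ex for ex in self.buffer if node.covers(ex)]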
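Finally, a sketch of the drift check kept at each inner node. The moving-average error comparison and the 0.05 tolerance are our own simplifications of the "error increases" signal, and predict/partial_fit is an assumed interface for the node's naive Bayes classifier:

from collections import deque

class InnerNodeMonitor:
    # Tracks the online (prequential) error of a naive Bayes classifier
    # trained on the examples that traverse an inner node.
    def __init__(self, nb_classifier, window: int = 1000):
        self.nb = nb_classifier          # assumed: predict(x), partial_fit(x, y)
        self.errors = deque(maxlen=window)
        self.min_error = float("inf")

    def update(self, x, y) -> bool:
        # Predict first, then learn from the example.
        self.errors.append(0.0 if self.nb.predict(x) == y else 1.0)
        self.nb.partial_fit(x, y)
        current = sum(self.errors) / len(self.errors)
        self.min_error = min(self.min_error, current)
        # A clear rise above the best error seen so far signals drift; the
        # caller then prunes the subtree rooted at this node and relearns,
        # seeding the new leaf from the short-term memory.
        return current > self.min_error + 0.05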



Paper Citation


in Harvard Style

Gama J. and Medas P. (2004). Learning in Dynamic Environments: Decision Trees for Data Streams. In Proceedings of the 4th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2004), ISBN 972-8865-01-5, pages 149-158. DOI: 10.5220/0002673001490158


in Bibtex Style

@conference{pris04,
author={João Gama and Pedro Medas},
title={Learning in Dynamic Environments: Decision Trees for Data Streams},
booktitle={Proceedings of the 4th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2004)},
year={2004},
pages={149-158},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002673001490158},
isbn={972-8865-01-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 4th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2004)
TI - Learning in Dynamic Environments: Decision Trees for Data Streams
SN - 972-8865-01-5
AU - Gama J.
AU - Medas P.
PY - 2004
SP - 149
EP - 158
DO - 10.5220/0002673001490158