Authors:
João Gama (1) and Pedro Medas (2)
Affiliations:
(1) LIACC - University of Porto; Fac. Economics, University of Porto, Portugal
(2) LIACC - University of Porto, Portugal
Keyword(s):
Concept Drift, Forest of Trees, Data Streams.
Abstract:
This paper presents an adaptive learning system that induces a forest of trees from data streams and is able to detect concept drift. We extend our previous work on Ultra Fast Forest of Trees (UFFT) with the ability to detect concept drift in the distribution of the examples. UFFT is an incremental algorithm that works online, processing each example in constant time and performing a single scan over the training examples. The system is designed for continuous data. It uses analytical techniques to choose the splitting criteria and information gain to estimate the merit of each possible splitting test. The number of examples required to evaluate the splitting criteria is sound, based on the Hoeffding bound. For multi-class problems, the algorithm builds a binary tree for each possible pair of classes, leading to a forest of trees. During the training phase the algorithm maintains a short-term memory: a fixed number of the most recent examples from the data stream are kept in a data structure that supports constant-time insertion and deletion. When a test is installed, a leaf is transformed into a decision node with two descendant leaves. The sufficient statistics of these leaves are initialized with the examples in the short-term memory that fall at these leaves. To detect concept drift, we maintain, at each inner node, a naive-Bayes classifier trained with the examples that traverse the node. While the distribution of the examples is stationary, the online error of naive-Bayes will decrease. When the distribution changes, the naive-Bayes online error will increase. In that case, the test installed at this node is no longer appropriate for the current distribution of the examples, and the entire subtree rooted at this node is pruned. This methodology was tested on two artificial data sets and one real-world data set. The experimental results show good performance both in detecting concept change and in learning the new concept.
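As an illustration of the Hoeffding-bound criterion mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It uses the standard form of the bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)). The function names, the default delta, and the choice R = 1 (the range of information gain in bits for a two-class tree) are assumptions made for this example.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Standard Hoeffding bound: with probability 1 - delta, the true mean of a
    variable with range `value_range` lies within this margin of the mean
    observed over n independent examples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def can_split(best_gain: float, second_gain: float, n: int,
              delta: float = 1e-6, value_range: float = 1.0) -> bool:
    """Hypothetical helper: accept the best splitting test once its observed
    advantage over the runner-up exceeds the Hoeffding margin.
    value_range = 1.0 assumes information gain measured in bits on a
    two-class problem; delta is an illustrative confidence setting."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


if __name__ == "__main__":
    # The margin shrinks as more examples reach the leaf.
    for n in (100, 1000, 10000):
        print(n, round(hoeffding_bound(1.0, 1e-6, n), 4))
```

Because the margin shrinks with the square root of n, a leaf that sees two nearly tied tests simply waits for more examples before a test is installed.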