Comparison of Adaboost and ADTboost
for Feature Subset Selection
Martin Drauschke and Wolfgang Förstner
Department of Photogrammetry, Institute of Geodesy and Geoinformation
University of Bonn, Nussallee 15, 53115 Bonn, Germany
Abstract. This paper addresses the problem of feature selection within classification processes. We present a comparison of feature subset selection with respect to two boosting methods, Adaboost and ADTboost. In our evaluation, we focus on three different criteria: the classification error and the efficiency of the process depending on the number of most appropriate features and on the number of training samples. To this end, we discuss both techniques and sketch their functionality, where we restrict both boosting approaches to linear weak classifiers. We propose a feature subset selection method, which we evaluate on synthetic and on benchmark data sets.
1 Introduction
Feature selection is a challenging task during image interpretation, especially if the images show highly structured objects with a rich diversity in variable environments, where individual features are only weakly correlated with the classification target. Feature selection is also used in data mining to extract useful and comprehensible information from data, cf. [9]. One classical approach is principal component analysis (PCA), which reduces the dimension of the feature space by projecting all features, cf. [1]. The resulting feature set of a PCA is not a subset of the candidate features, but consists of combinations of the original features. Thus, PCA is not an appropriate tool if one wants to obtain a true subset of features for further investigations. However, the feature weighting in Adaboost and ADTboost may be used directly as a heuristic for feature selection. This paper investigates the potential of these two methods on synthetic and benchmark data.
Typically, one uses a data set of training samples to select a subset of appropriate
features and to train a classifier. Then, the effect of this feature selection and the clas-
sification is evaluated on a different data set, the test samples. There are three different
costs which have to be observed during this evaluation:
1. the classification error and
2. the efficiency of the process depending on
(a) the number of most appropriate features and
(b) the number of training samples.
Problem Specification. The problem can be formalized as follows: Given is a data set of $N$ training samples $(\mathbf{x}_n, y_n)$, where $\mathbf{x}_n = [f_{n1}, \ldots, f_{nD}]$ is a $D$-dimensional feature
vector of real numbers, and $y_n$ is the class membership. In this paper, we only consider the binary case, i.e. $y_n \in \{-1, +1\}$, but all applied algorithms have already been introduced for the multi-class case, cf. [15] and [8], respectively. Our final goal is not only the classification of these feature vectors; we additionally want to select the best features in order to reduce the dimension of the feature space and to eliminate redundant features.
Structure of the Paper. In section 2, we give a review of feature subset selection principles and methods. Then, in section 3, we outline the Adaboost and ADTboost algorithms and show their similarities and differences w.r.t. feature selection. Furthermore, we demonstrate and discuss their functionality on a simple example. Our feature subset selection schemes with Adaboost and ADTboost are proposed in section 4. We have tested our approach on synthetic and benchmark data. The results of these experiments are presented and discussed in section 5. Finally, we summarize our work and give a short outlook on our research plans.
2 Standard Feature Selection Methods
If we want to select a subset of appropriate features from the total set of features with cardinality $D$, we have a choice between $2^D$ possibilities. If we deal with feature vectors with more than a few dozen components, an exhaustive search takes too long. Thus, we have to find other ways to select a subset of features.
One option is genetic algorithms, which select these subsets randomly. Although they are relatively insensitive to noise and require no explicit domain knowledge, the creation of mutated samples within the evolutionary process might lead to wrong solutions. Furthermore, genetic algorithms in combination with a wrapped classification method are computationally inefficient, cf. [10].
Alternatively, there are deterministic approaches for selecting subsets of relevant features. Forward selection methods start with an empty set and greedily add the best of the remaining features to this set. Conversely, backward elimination procedures start with the full set containing all features and greedily remove the most useless features from this set. According to [11], feature subset selection methods can also be characterized as filters and wrappers. Filters use evaluation methods for feature ranking that are independent of the learning method. For example, in [7] the squared Pearson correlation coefficient is proposed for determining the most relevant features.
In our experiments, the features have low correlation coefficients with the class target while being highly correlated with each other. Furthermore, the training samples do not form compact clusters in feature space. In such a setting, it is hard to find a single classifier that is able to separate the two classes. Thus, we look for a wrapper method, where the evaluation of the features is based on the learning results. Since we obtained bad classification results using only one classifier, we would like to merge the feature subset selection with a learning technique that combines several classifiers. This motivated us to pursue the concept of adaptive boosting (Adaboost), where a strong (or highly accurate) classifier is found by combining several weak (or less accurate) classifiers, cf. [14].
Such majority voting is also used in random forests. The decision trees in [2] randomly choose a feature from the whole feature set and take it for determining the best domain split. This method only works well if the number of random decision trees is large, especially if all features are nearly uncorrelated with the classes. In the experiments of [2], five times more decision trees were used than weak learners in Adaboost. Furthermore, if the number of decision trees is high, the number of (randomly) selected features is also quite high. Then, we do not benefit from feature subset selection, because almost every feature has been selected.
In [13], it is shown that "the lack of implicit feature selection within random forest can result in a loss of accuracy and efficiency, if irrelevant features are not removed". Therefore, we focus our work on Adaboost and ADTboost, which we briefly summarize in the next section.
3 Adaboost and ADTboost
In this section, we recapitulate Adaboost and ADTboost in order to lay the basis for our
feature selection scheme. Furthermore, we demonstrate their functionality on a simple
example. For more details, we would like to refer to the original publications or our
technical report [4].
Adaboost. The concept of adaptive boosting (Adaboost) is to find a strong (or highly
accurate) classifier by combining several weak (or less accurate) classifiers, cf. [14].
The first weak classifier (or best hypothesis) is learnt on equally weighted training samples $(\mathbf{x}_n, y_n)$. Then, before training the second weak classifier, the influence of all misclassified samples is increased by adjusting the weights of the feature vectors. Thus, the second classifier will focus especially on the previously misclassified samples. In the third step, the weights are adjusted once more depending on the classification result of the second weak classifier before training the third weak classifier, and so on.
Additionally, each weak classifier $h_t: \mathbf{x}_n \mapsto \{+1, -1\}$ is characterized by a predictive value $\alpha_t$ which depends on the classifier's success rate. After $T$ weak classifiers have been chosen, the result of the strong classifier $H$ is the sign of the weighted sum of the results of the weak classifiers:
$$H(\mathbf{x}_n) = \mathrm{sign}\left( \sum_t \alpha_t \, h_t(\mathbf{x}_n) \right). \qquad (1)$$
The discriminative power of the resulting classifier $H$ can be expected to be much higher than the discriminative power of each weak classifier $h_t$, cf. [12].
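To make this procedure concrete, the following minimal sketch implements Adaboost restricted to single-feature threshold classifiers (decision stumps), as used throughout this paper. It is our own illustrative Python sketch, not the authors' implementation; the function names, the exhaustive stump search and the fixed number of iterations are assumptions.

```python
import numpy as np

def train_adaboost_stumps(X, y, T=10):
    """Adaboost with single-feature threshold classifiers (decision stumps).
    X: (N, D) feature matrix, y: (N,) labels in {-1, +1}.
    Returns a list of stumps (feature index, threshold, polarity, alpha)."""
    N, D = X.shape
    w = np.full(N, 1.0 / N)                      # sample weights, initially uniform
    ensemble = []
    for _ in range(T):
        best, best_err = None, np.inf
        # exhaustive search over features, thresholds and polarities
        for d in range(D):
            for thr in np.unique(X[:, d]):
                for pol in (+1, -1):
                    pred = np.where(X[:, d] < thr, pol, -pol)
                    err = w[pred != y].sum()
                    if err < best_err:
                        best_err, best = err, (d, thr, pol)
        d, thr, pol = best
        eps = np.clip(best_err, 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1.0 - eps) / eps)  # predictive value of the weak classifier
        pred = np.where(X[:, d] < thr, pol, -pol)
        w *= np.exp(-alpha * y * pred)           # emphasize misclassified samples
        w /= w.sum()
        ensemble.append((d, thr, pol, alpha))
    return ensemble

def predict_adaboost(ensemble, X):
    """Strong classifier: sign of the alpha-weighted sum of stump decisions, cf. eq. (1)."""
    score = np.zeros(X.shape[0])
    for d, thr, pol, alpha in ensemble:
        score += alpha * np.where(X[:, d] < thr, pol, -pol)
    return np.sign(score)
```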
ADTboost. ADTboost is an extension of Adaboost which has been proposed in [6] and refined in [3]. The extensions of Adaboost towards ADTboost are the following: Primarily, the weak classifiers are put into a hierarchical order, the alternating decision tree. The tree consists of two different kinds of nodes which alternate along a path through the tree, cf. right part of fig. 1. Secondly, each decision node contains a weak classifier and has two prediction nodes containing the predictive values $\alpha_t^+$ and $\alpha_t^-$ as its
children. The weak classifiers in the upper levels of the tree act as preconditions on those classifiers below them. Third, the root node contains the predictive value $\alpha_0$ of the true-classifier. Thus, $\alpha_0$ is derived from the ratio of the numbers of samples of both classes and can therefore be interpreted as a prior classifier. In each iteration step, the best classifier candidate $c_j$ is determined in conjunction with a precondition $h_p$. Then, the new weak classifier is set to $h_t = h_p \wedge c_j$. Furthermore, the prediction of the $t$-th weak classifier is given by the rule $r_t(\mathbf{x}_n)$, which is defined by
$$r_t(\mathbf{x}_n) = \begin{cases} \alpha_t^+ & \text{if } h_p(\mathbf{x}_n) = +1 \text{ and } c_j(\mathbf{x}_n) = +1 \\ \alpha_t^- & \text{if } h_p(\mathbf{x}_n) = +1 \text{ and } c_j(\mathbf{x}_n) = -1 \\ 0 & \text{if } h_p(\mathbf{x}_n) = -1. \end{cases} \qquad (2)$$
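As a small illustration of eq. (2), the following hedged sketch evaluates the rule of a single decision node; representing the precondition and the candidate as $\{+1,-1\}$-valued functions is our own assumption, not the data structure used in [6] or [3].

```python
def evaluate_rule(x, precondition, candidate, alpha_plus, alpha_minus):
    """Prediction r_t(x) of one decision node, following eq. (2).
    precondition and candidate map a feature vector to {+1, -1}."""
    if precondition(x) == -1:
        return 0.0                                   # precondition fails: no contribution
    return alpha_plus if candidate(x) == +1 else alpha_minus

# The prediction of the alternating decision tree is then
# H(x) = sign(alpha_0 + sum over all decision nodes of r_t(x)).
```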
A Simple Example: Adaboost vs. ADTboost. We implemented Adaboost and ADTboost using the formulations of the algorithms in [15] and [3], respectively. Both algorithms are optimized with respect to finding the best weak classifier, determining the classifier's weight or the predictive values, respectively, and updating each sample's weight. Only two design decisions are left to the user. The first refers to the maximum number of iterations: we stop the training either after $T$ steps or earlier, if the error rate on the training samples falls below a threshold $\theta$. The second concerns the generation of the set of classifier candidates. Since we are interested in the subset selection of the given features, we consider only threshold classification on single features. Thereby, we avoid the combinatorial explosion that would occur if we also considered linear separation with two or more features.
Now, we want to discuss the functionality of both algorithms by presenting their workflows on a synthetic data set, which is also visualized in fig. 1. It consists of points which belong to two classes (red and blue). The target of the red class is $y_n = +1$, and membership in the blue class is encoded by $y_n = -1$. We work with a 2-dimensional feature vector whose domain is $\mathbf{x}_n \in [0, 1] \times [0, 1]$. The class membership of a given data point $\mathbf{x}_n = [f_{n1}, f_{n2}]$ is defined by
$$y_n = \begin{cases} +1 \text{ (red)} & \text{if } (f_{n1} < 0.4 \text{ and } f_{n2} > 0.4) \text{ or } (f_{n1} > 0.6 \text{ and } f_{n2} < 0.6) \\ -1 \text{ (blue)} & \text{else.} \end{cases} \qquad (3)$$
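For readers who want to reproduce the toy example, a minimal sketch for sampling such a data set could look as follows; the sample size, the random seed and the uniform sampling of the points are our assumptions, since the paper does not state how the points were drawn.

```python
import numpy as np

def make_toy_data(n_samples=500, seed=0):
    """Sample 2D points uniformly on [0,1]^2 and label them according to eq. (3)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n_samples, 2))
    f1, f2 = X[:, 0], X[:, 1]
    red = ((f1 < 0.4) & (f2 > 0.4)) | ((f1 > 0.6) & (f2 < 0.6))
    y = np.where(red, +1, -1)
    return X, y
```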
Besides the four domain bounds, we only need four other discriminative lines to describe the feature set. Therefore, we limit our set of classifier candidates to these eight items: the domain bounds and
$$\begin{aligned} c_1 &: \text{if } f_1 < 0.4 \text{ then red else blue} \\ c_2 &: \text{if } f_1 > 0.6 \text{ then red else blue} \\ c_3 &: \text{if } f_2 < 0.4 \text{ then red else blue} \\ c_4 &: \text{if } f_2 > 0.6 \text{ then red else blue} \end{aligned} \qquad (4)$$
Since the calculations are documented in detail in [4], we only want to present the results of Adaboost and ADTboost, respectively. Considering Adaboost, the final predictions of the strong classifier after four iterations are obtained for a given $\mathbf{x} = [f_1, f_2]$ by the sign of the weighted sum of the predictions of all weak classifiers:
$$\sum_t \alpha_t h_t(\mathbf{x}): \quad \begin{array}{c|ccc} & f_1 < 0.4 & 0.4 \le f_1 \le 0.6 & f_1 > 0.6 \\ \hline f_2 > 0.6 & 0.1149 & 0.5203 & 0.1609 \\ 0.4 \le f_2 \le 0.6 & 0.1849 & 0.2205 & 0.4607 \\ f_2 < 0.4 & 0.1609 & 0.5663 & 0.1149 \end{array} \qquad (5)$$
The bold entries in the table above document false predictions of the strong classifier. This applies to 32% of the data. When proceeding with a larger set of candidates, Adaboost could choose several other axis-parallel classifiers such as $f_1 < 0.5$, and so the classification could be improved further.
The boosting process using alternating decision trees starts with a prior classification using the first predictive value $\alpha_0 = 0.04$. Then, we determine the weak classifiers (including their preconditions) in the following order: $h_1 = \text{true} \wedge c_1$, $h_2 = h_1 \wedge \neg c_3$, $h_3 = \neg h_1 \wedge \neg c_4$ and $h_4 = h_3 \wedge c_2$. After four iterations, the resulting classifier achieves the classification rate of 100% that is expected in this example. The strong classifier determines the following predictions:
$$\sum_t r_t(\mathbf{x}): \quad \begin{array}{c|ccc} & f_1 < 0.4 & 0.4 \le f_1 \le 0.6 & f_1 > 0.6 \\ \hline f_2 \ge 0.6 & 0.79 & 1.11 & 0.57 \\ 0.4 < f_2 < 0.6 & 0.81 & 1.09 & 1.62 \\ f_2 \le 0.4 & 2.22 & 1.09 & 1.62 \end{array} \qquad (6)$$
Fig. 1. Simple example with 2 features and 2 classes (red and blue). Left: weak classifiers of Adaboost. Middle: weak classifiers of ADTboost. Right: alternating decision tree of the ADTboost result.
Comparison. Although we have demonstrated the functionality of Adaboost and ADTboost on a very simple synthetic data set of two non-overlapping classes, we are able to point out one major difference between both approaches. We designed our weak classifiers such that each classification within Adaboost is interpretable as a division of the 2D feature space into two half spaces. The $T$ linear classifiers form a partition of the feature space into up to $2^T$ sub-domains. The classification of all samples within one sub-domain is equal, and the weak classifiers with the highest predictive values are the most dominant ones concerning the final result.
In ADTboost, by contrast, the hierarchical order of these weak classifiers builds a multiple k-d-tree, see middle part of fig. 1. Multiple, because we could have several weak classifiers with the same precondition. In logical terms, the hierarchical compositions of classifiers describe AND relations. If several classifiers have the same precondition, they are composed by an OR relation.
ADTboost seems to favor expanding the alternating decision tree in depth. If the classes do not overlap much, but form clusters of complex shapes, ADTboost will select fewer weak classifiers than Adaboost, provided both methods are not terminated after a predefined number of iterations but when the error bound falls below a certain threshold. Unfortunately, this behavior of ADTboost also leads to overfitting. To avoid such overfitting, one could either restrict the depth of the alternating decision tree, or perform the splitting only on sufficiently large subsets. So far, we have not implemented such restrictions.
4 Feature Selection with Adaboost and ADTboost
If the number of iterations $T$ is significantly smaller than the number of features $D$, then the selection of the feature subset is already completed after the last iteration step. We then have $T$ weak classifiers $h_t$, each of which performs a decision on a single feature $f_{d(t)}$. Hence, the set of these at most $T$ features is the relevant subset.

If the classification results are only satisfying after many iterations with $T \gg D$, then (almost) all features could have been selected, and we do not necessarily obtain a subset of the most relevant features so easily. From our point of view, there are two ways to derive the most appropriate features.
In the first variant, we would search for the best weak classifiers until the maximal cardinality of the feature subset is reached. In all further iterations, we restrict the domain of the classifier candidates to these selected features. This strategy intervenes in the classification process of Adaboost and ADTboost, and the feature subset selection depends only on the time at which a weak classifier or a feature is chosen, respectively, and not on their relevance or their effect on the strong classifier.
The second variant performs the feature subset selection after the learning has stopped. Now, we evaluate the weak classifiers and their associated features after the training of the strong classifier has finished. This leads to a ranking of the weak classifiers and their features, respectively. Considering Adaboost, the impact of each weak classifier $h_t$ depends on the absolute value of $\alpha_t$. Then, the impact of a feature on the classification result can be measured by its contributive value $C$, which we define by
$$C(f_d) = \sum_t |\alpha_t| \quad \text{where } h_t \text{ works on } f_d. \qquad (7)$$
When the subset of relevant features shall have a cardinality of $S$, $S \ll D$, we choose the $S$ features with the highest contributive values. Regarding the simple example of the previous section, we obtain $C(f_1) = 0.5433$ and $C(f_2) = 0.3228$. Thus, the best feature is the first one.
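A hedged sketch of this ranking for the Adaboost case of eq. (7) is given below; it reuses the (feature, threshold, polarity, alpha) stump tuples of the earlier Adaboost sketch, which is our own representation and not part of the paper.

```python
from collections import defaultdict

def contributive_values(ensemble, D):
    """Eq. (7): sum |alpha_t| over all weak classifiers that work on feature f_d."""
    C = defaultdict(float)
    for d, thr, pol, alpha in ensemble:
        C[d] += abs(alpha)
    return [C[d] for d in range(D)]

def select_top_features(ensemble, D, S):
    """Return the indices of the S features with the highest contributive values."""
    C = contributive_values(ensemble, D)
    return sorted(range(D), key=lambda d: C[d], reverse=True)[:S]
```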
When adapting this proposal to ADTboost, we need to integrate two predictive values for each weak classifier, and we also need to consider the hierarchical order of the weak classifiers. Thus, we re-define the contributive values by
$$C(f_d) = \sum_t \omega_t \cdot \left( |\alpha_t^+| + |\alpha_t^-| \right) + \sum_p \omega_p \cdot \left( |\alpha_p^+| + |\alpha_p^-| \right), \qquad (8)$$
where $h_t$ is a classifier that uses $f_d$, $h_p$ is a precondition for classifiers that use feature $f_d$, and the $\omega$ describe the portion of the whole domain that a classifier is responsible for. Regarding the simple example of the previous section, we obtain the following contributive values
$$\begin{aligned} C(f_1) &= 1.00 \cdot (0.15 + 0.23) + 0.24 \cdot (1.46 + 1.43) + 0.40 \cdot (1.52 + 1.51) \\ &\quad + 0.60 \cdot (0.50 + 1.52) + 0.24 \cdot (1.46 + 1.43) = 4.1912 \\ C(f_2) &= 0.40 \cdot (1.52 + 1.51) + 0.60 \cdot (0.50 + 1.52) + 0.24 \cdot (1.46 + 1.43) = 3.1176. \end{aligned} \qquad (9)$$
The contributive value of feature $f_1$ is a sum of five components. The first two summands are the weighted predictive values that belong to the classifiers $h_1$ and $h_4$, respectively. The other three summands are added because $h_1$ is a precondition of the three classifiers $h_2$, $h_3$ and $h_4$. After this selection procedure, we need to eliminate the weak classifiers that work on unselected features, or prune the tree accordingly. Therefore, we also have to eliminate those classifiers of relevant features whose precondition involves an irrelevant one. In the experiments, the test samples have been classified using these reduced sets of weak classifiers in Adaboost or ADTboost, respectively.
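To illustrate the ADTboost ranking, here is a minimal sketch under the assumption that each decision node is stored with its feature index, its two predictive values, its weight ω and the list of features used along its precondition chain; this flat node list is our own simplification of the alternating decision tree, and the crediting rule is chosen so that it reproduces the worked numbers of eq. (9).

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ADTNode:
    feature: int                      # feature used by the node's classifier candidate
    alpha_plus: float                 # predictive value alpha_t^+
    alpha_minus: float                # predictive value alpha_t^-
    omega: float                      # portion of the domain the node is responsible for
    precondition_features: List[int] = field(default_factory=list)

def adt_contributive_values(nodes: List[ADTNode]) -> Dict[int, float]:
    """Contributive values in the spirit of eqs. (8) and (9): each node's weighted
    predictive values are credited to its own feature and to every feature
    appearing in its precondition chain."""
    C = defaultdict(float)
    for node in nodes:
        weight = node.omega * (abs(node.alpha_plus) + abs(node.alpha_minus))
        C[node.feature] += weight
        for f in node.precondition_features:
            C[f] += weight
    return dict(C)

# Nodes of the toy example with the values from eq. (9); the assignment of the
# 0.40 / 0.60 weights to h_2 and h_3 is our guess and does not affect the result.
nodes = [
    ADTNode(0, 0.15, 0.23, 1.00, []),       # h_1 = true AND c_1        (feature f_1)
    ADTNode(1, 1.52, 1.51, 0.40, [0]),      # h_2 = h_1 AND not c_3     (feature f_2)
    ADTNode(1, 0.50, 1.52, 0.60, [0]),      # h_3 = not h_1 AND not c_4 (feature f_2)
    ADTNode(0, 1.46, 1.43, 0.24, [1, 0]),   # h_4 = h_3 AND c_2         (feature f_1)
]
print(adt_contributive_values(nodes))       # approx. {0: 4.1912, 1: 3.1176}
```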
5 Experiments
Experiments on Synthetic Data. First, we did some experiments on synthetic data sets to analyse the behaviour of our feature selection algorithm. Here, we demonstrate the results on 10-dimensional samples. In each dimension, the features of both classes have been generated by uniform distributions with overlapping intervals. In each dimension, the classes overlap by between 30% and 100%, and the minimal classification error is 0.2%. Thus, it is impossible to find satisfying threshold classifiers using one feature only. We constructed a data set of 1000 samples, of which we used up to 350 randomly chosen training samples and 200 different samples for testing. Each test consists of several experiments in which we varied the number of samples used for training and the number of selected features for testing. Then, we repeated these tests 100 times.
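The exact intervals of this synthetic data set are not reported in the paper; the following sketch is only meant to illustrate the kind of data described above, with interval bounds and class proportions chosen by us.

```python
import numpy as np

def make_overlapping_data(n_samples=1000, n_features=10, seed=0):
    """Two classes whose features are uniform on overlapping intervals per dimension.
    The amount of overlap per feature is drawn at random (illustrative values only)."""
    rng = np.random.default_rng(seed)
    y = rng.choice([-1, +1], size=n_samples)
    X = np.empty((n_samples, n_features))
    for d in range(n_features):
        shift = rng.uniform(0.0, 0.7)        # larger shift -> smaller overlap of the intervals
        X[y == -1, d] = rng.uniform(0.0, 1.0, size=(y == -1).sum())
        X[y == +1, d] = rng.uniform(shift, 1.0 + shift, size=(y == +1).sum())
    return X, y
```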
Some meaningful test results are visualized in figs. 2 and 3. Fig. 2 shows how the error rate depends on the number of selected features when a small and a large number of training samples are used. If the number of training samples is very low (left), there is almost no noticeable difference between Adaboost and ADTboost, and the error rate is too high. If many training samples have been used to train the classifier (right), ADTboost always returns significantly better results. Additionally, fig. 3 shows how the error rate depends on the number of training samples. If we only choose the best feature, we prune the tree of the strong classifier too much. Then, the error rates of Adaboost and ADTboost are similarly bad. If we select more features, the error rates decrease, and the difference between Adaboost and ADTboost grows as well.
Fig. 2. Error rates of Adaboost (solid blue line) and ADTboost (dashed red line), where the train-
ing step was based on 30 (left) and 350 (right) samples from the synthetic data set.
Fig. 3. Error rates of Adaboost (solid blue line) and ADTboost (dashed red line), where only the best feature (left) or all ten features (right) have been used for testing on the synthetic data set.
Experiments on Benchmark Data. For our experiments, we chose the four benchmark data sets breast cancer, diabetes, german and heart¹, which have been edited as two-class problems by [12]. Due to the limited space, we mainly present the results of the breast cancer data set.

Our tests have a setup similar to the tests on the synthetic data. For each test, we randomly selected up to 350 samples for training and testing. Then, we randomly separated 100 samples (between 30% and 40% of the data) for testing, and the rest could be used for training. Again, we repeated these tests 100 times. The breast cancer data set contains 263 samples with nine features. Some of the average error rates of our tests are listed in tab. 1 and shown in fig. 4. We are very pleased that our lowest classification errors are in the same range as the results of [12].
¹ Available at http://theoval.sys.uea.ac.uk/matlab/default.html
Table 1. Average error rates on the breast cancer data set. The last two columns refer to the results in [12], where a single RBF classifier and Adaboost with more complex weak classifiers have been used, respectively.

samples used for training      |  10    30    50    80   110   150  |  RBF    AB
Adaboost,  1 feature selected  | 0.48  0.47  0.44  0.42  0.46  0.41 |
           5 features selected | 0.39  0.38  0.32  0.34  0.30  0.32 |
           all 9 features      | 0.37  0.35  0.31  0.30  0.28  0.29 | 0.276  0.265
ADTboost,  1 feature selected  | 0.48  0.27  0.30  0.26  0.28  0.24 |
           5 features selected | 0.48  0.28  0.30  0.26  0.30  0.28 |
           all 9 features      | 0.48  0.28  0.30  0.26  0.30  0.28 | 0.276  0.265
Fig. 4. Experiments on the breast cancer data set: Error rates of Adaboost (solid blue line) and
ADTboost (dashed red line), where the training step was based on 150 samples (left) or only the
best feature was selected (right).
For the other two benchmark data sets, we obtain error rates which are not in the same range as in [12], but almost twice as high. Thus, our linear weak classifiers are not suitable for these data sets. If the two classes are hardly separable by single-feature threshold classification, ADTboost either favors weak classifiers with many preconditions, whose classification fraction concerns only few samples, or the alternating decision tree has almost no tree structure, but contains many weak classifiers which share the same precondition.
The error rates of Adaboost decrease almost monotonically because of the reasonable ranking of the weak classifiers. The tree structure of the strong classifier of ADTboost leads to complications when choosing a subset from it. We assume that the pruned weak classifiers may have important effects on the classification result of several samples. If we have a tree with a large depth, any sensible feature selection scheme can lead to bad classification results. We need to investigate the unexpected behaviour of ADTboost, which shows an increasing error rate when more features are selected. We assume that this is due to the overfitting of ADTboost.
6 Conclusion and Outlook
We compared Adaboost and ADTboost with respect to our newly proposed feature subset selection method. So far, we have implemented both classification techniques considering only linear weak classifiers. Thus, we obtain reasonable and good results on synthetic data and on those benchmark data sets where linear classifiers are feasible. We are mainly interested in interpreting images of man-made scenes, e.g. facade images of buildings. We present our results on this real-world data in [5].
We are going to improve our feature subset selection scheme. To this end, we might expand our feature space and look for more complex weak classifiers on 2-dimensional feature planes such as $f_d \times f_d^2$. Furthermore, we want to extend the scheme towards multiclass classification.
Acknowledgment
This work has been done within the project eTraining for Interpreting Images of Man-Made Scenes (eTRIMS), which is funded by the European Union.
References
1. Chr. M. Bishop. Pattern Recognition and Machine Learning. Information Science and
Statistics. Springer, 2006.
2. L. Breiman. Random forests. Machine Learning, 45:5–32, 2001.
3. F. De Comité, R. Gilleron, and M. Tommasi. Learning multi-label alternating decision trees and applications. In Proc. CAP 2001, pages 195–210, 2001.
4. M. Drauschke. Feature subset selection with adaboost and adtboost. Technical report, De-
partment of Photogrammetry, University of Bonn, 2008.
5. M. Drauschke and W. Förstner. Selecting appropriate features for detecting buildings and building parts. In 21st ISPRS Congress, 2008 (to appear).
6. Y. Freund and L. Mason. The alternating decision tree learning algorithm. In Proc. 16th
ICML, pages 124–133, 1999.
7. I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of
Machine Learning Research, 3:1157–1182, 2003.
8. G. Holmes, B. Pfahringer, R. Kirkby, E. Frank, and M. Hall. Multiclass alternating decision
trees. In Proc. 13th ECML, LNCS 2430, pages 161–172. Springer, 2002.
9. H. Liu and H. Motoda. Feature Selection for Knowledge Discovery and Data Mining. Kluwer
Academic, 1998.
10. M. J. Martin-Bautista and M.-A. Vila. A survey of genetic feature selection in mining issues.
In Proc. CEC 1999, volume 2, pages 1314–1321, Washington D.C., USA, July 1999.
11. D. Mladenić. Feature selection for dimensionality reduction. In SLSFS, LNCS 3940, pages 84–102, 2006.
12. G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Machine Learning, 43(3):287–320, March 2001.
13. J. Rogers and St. Gunn. Identifying feature relevance using a random forest. In SLSFS,
LNCS 3940, pages 173–184, 2006.
14. R. E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990.
15. R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predic-
tions. Machine Learning, 37(3):297–336, 1999.