DECISION TREE INDUCTION FROM COUNTEREXAMPLES
Nicolas Cebron, Fabian Richter and Rainer Lienhart
Multimedia Computing Lab, University of Augsburg, Universitaetsstr. 6a, Augsburg, Germany
Keywords:
Decision trees, Counterexamples, Machine learning, Data mining, Decision making.
Abstract:
While it is well accepted in human learning to learn from counterexamples or mistakes, classic machine
learning algorithms still focus only on correctly labeled training examples. We replace this rigid paradigm by
using complementary probabilities to describe the probability that a certain class does not occur. Based on
the complementary probabilities, we design a decision tree algorithm that learns from counterexamples. In
a classification problem with K classes, K − 1 counterexamples correspond to one correctly labeled training
example. We demonstrate that good performance can still be obtained even when only a subset of the
counterexamples is available.
1 INTRODUCTION
The goal of supervised classification is to deduce a
function from examples in a dataset that maps input
objects to desired outputs. By using a set of labeled
training examples, we can train a classifier that can be
used to predict the nominal target variable for unseen
test data. To achieve this, the learner has to generalize
from the presented data to unseen situations. While
a plethora of algorithms for supervised classification
has been developed, only a few works deviate from
this classical setting.
In this paper, we focus our attention on decision
trees. Especially in multi-class problems, they are a
reliable and effective technique. They usually per-
form well and offer a simple representation in the form
of a tree or a set of rules that can be deduced from
it. They have been widely used in situations where a
decision must be made effectively and reliably, e.g.
in medical decision making (Podgorelec et al., 2002).
However, as for all inductive methods in machine learning,
the performance of this classifier depends on correctly
labeled training examples. Finding the correct
class label for an example when generating a training
set for the classifier can be difficult, especially
when there is a large number of possible classes. In
the work of (Joshi et al., 2010), it has been shown that
the human error rate and the time needed to find the
correct label grow with the number of classes; at the
same time, user distress increases. In some situations,
it might not even be possible for the human
expert to determine the correct class label out of the
many possible class labels. In a normal classification
setting, we would have to ignore such an example.
As an example, consider the domain of medical
decision making, where there are two common situations
in which the human expert has difficulty providing
the correct class label:
1. Ambiguous Information: different class labels
(e.g. diseases) may be possible, but there is a lack
of information to explicitly choose one of them.
For example, it is unclear whether a person with
headache symptoms is suffering from a cold or
has the flu (or another type of disease).
2. Rare Cases: the determination of the class label
may be difficult because of missing expertise in a
special field. For example, it may be difficult to
classify rare (so-called orphan) diseases.
In this work, we want to introduce a new paradigm
in supervised classification: we do not obtain the la-
bel information itself, but the labels of the classes that
this example does not belong to. We call these exam-
ples counterexamples. For the preceding examples in
medical decision making, it can be very easy to spec-
ify the diseases that are not likely (e.g. not typhlitis,
not heartburn, etc. for a headache symptom) in order
to narrow down the set of possible classes. We argue
that in many real world settings, it is much easier for
the human expert to provide a counterexample instead
of determining the correct class label. This applies
not only to the domain of medical decision making
but also to many other domains such as image,
music, or text classification.
Within our new framework of classification with
counterexamples, we can gain information from almost
every example in a dataset. Moreover, the framework
remains open: information from examples and from
counterexamples can be combined seamlessly. In a
classification problem with K disjoint classes, this setting
of course induces a loss of information, as in practice we
can expect to observe fewer than K − 1 labels for each
counterexample (if all K − 1 labels were given, we could
deduce the correct class label directly). The question
that we aim to answer in this paper is: how much
does this loss of information influence the resulting
classification model?
To the best of our knowledge, this is the first work
that considers feedback in the form of counterexam-
ples in a multiclass setting. Some works have inves-
tigated negative feedback in the image retrieval pro-
cess (Ashwin et al., 2001), (Mueller et al., 2000). As
the retrieval process corresponds to a two-class prob-
lem, these works only share the general idea of a dif-
ferent form of feedback with this work. At first sight,
our work seems to be related to the domain of mul-
tilabel classification (Tsoumakas and Katakis, 2007),
where a mapping from an example to a set of class
labels is sought. However, our goal is to predict one
class label from the set of counterexamples.
In order to quantify the information from counterexamples,
we introduce complementary probabilities for
counterexamples in section 2. In section 3, we introduce
the decision tree learning algorithm for counterexamples.
We present results on different benchmark datasets in
section 4 and finally draw conclusions in section 5.
2 COUNTEREXAMPLE
PROBABILITIES
We begin by recapitulating the basic laws of probabil-
ity theory: the probability of an event is the fraction of
times that the event occurs out of the total number of
trials, in the limit that the total number of trials goes
to infinity. In our case, the probabilities correspond to
the events that a certain class occurs in a set of exam-
ples. We denote the probability for class k by p(k).
By definition, the probabilities must lie in the inter-
val [0, 1], and if the events are mutually exclusive and
include all outcomes, their probabilities must sum to
one:
$$0 \le p(k) \le 1 \qquad (1)$$

$$\sum_{k=1}^{K} p(k) = 1 \qquad (2)$$
In order to work with counterexamples and to
quantify the number of classes that are not contained
in a set of examples, we need to define complementary
probabilities. A complementary probability for
event k, denoted by $\bar{p}(k)$, describes the probability that
event k does not occur in a set of examples. By definition,
the probability that event k does not occur is the
sum of the probabilities of all other events:

$$\bar{p}(k) = \sum_{j=1,\, j \neq k}^{K} p(j) \qquad (3)$$

The relation between $p(k)$ and $\bar{p}(k)$ is given by
$p(k) = 1 - \bar{p}(k)$.
Like normal probabilities, $\bar{p}(k)$ must lie in the interval
[0, 1]. However, as the corresponding events are not mutually
exclusive (a counterexample may have more than
one class that it does not belong to), we need to adapt
the restriction from equation 2, taking into account the
definition of $\bar{p}(k)$:
$$\sum_{k=1}^{K} \bar{p}(k)
= \sum_{k=1}^{K} \left[ \sum_{j=1,\, j \neq k}^{K} p(j) \right]
= \sum_{k=1}^{K} \left[ 1 - p(k) \right]
= K - \sum_{k=1}^{K} p(k)
= K - 1 \qquad (4)$$
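As a small numerical illustration (not part of the original paper), the following Python sketch converts class probabilities into complementary probabilities and checks equations (2)-(4) on a toy example:

```python
import numpy as np

def complementary_probabilities(p):
    """Complementary probability of class k: the summed probability of
    all other classes, p_bar(k) = sum_{j != k} p(j), as in equation (3)."""
    p = np.asarray(p, dtype=float)
    return p.sum() - p  # equals 1 - p(k) when the p(k) sum to one

# Toy example with K = 4 classes.
p = np.array([0.5, 0.2, 0.2, 0.1])
p_bar = complementary_probabilities(p)

print(p_bar)                                # [0.5 0.8 0.8 0.9]
print(np.allclose(p_bar, 1.0 - p))          # relation p(k) = 1 - p_bar(k)
print(np.isclose(p_bar.sum(), len(p) - 1))  # equation (4): sum is K - 1
```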
Having established the basic laws for complementary
probabilities and rules to transform probabilities into
complementary probabilities and vice versa, we can
use them in the design of a decision tree that learns
from counterexamples in the next section.
3 DECISION TREE INDUCTION
We assume that instead of having one class label
for each example, we have a vector $\vec{y} = (y_1, \ldots, y_K)$,
where each entry $y_k \in \{0, 1\}$ indicates whether we
know that this example does not belong to class k ($y_k = 1$)
or whether we have no information concerning
class k for this example ($y_k = 0$).
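A natural way to estimate $\bar{p}(k)$ inside a tree node is the fraction of examples in the node whose counterexample vector marks class k as excluded. The paper does not spell out this estimator, so the following Python sketch (not taken from the original Weka-based implementation) should be read as an assumption for illustration:

```python
import numpy as np

def estimate_p_bar(Y):
    """Estimate complementary probabilities for one tree node.

    Y is an (n_examples, K) binary matrix of counterexample vectors:
    Y[i, k] = 1 means example i is known NOT to belong to class k,
    Y[i, k] = 0 means no information about class k for example i.
    """
    Y = np.asarray(Y, dtype=float)
    return Y.mean(axis=0)  # fraction of "not class k" marks per class

# Node with 3 examples and K = 3 classes.
Y = np.array([[1, 0, 1],
              [1, 0, 0],
              [1, 1, 0]])
print(estimate_p_bar(Y))  # [1.         0.33333333 0.33333333]
```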
The main difference between learning a tree from
examples and learning a tree from counterexamples
is the notion of purity of a data partition. Figure 1
illustrates the situation of learning a decision tree
from counterexamples. Each partition can now contain
multiple class labels of the classes that the examples
do not belong to. The goal is to find partitions
that have high complementary probabilities for K − 1
classes, as they correspond to a 'pure' distribution of
the class label.

Figure 1: Decision tree with four partitions based on counterexamples.
We use the mapping between complementary probabilities
and probabilities defined in section 2 to derive a new
definition of the entropy (Shannon, 2001), which is
commonly used to judge the quality of a data partition X:

$$H(X) = -\sum_{k=1}^{K} \big(1 - \bar{p}(k)\big) \ln\big(1 - \bar{p}(k)\big) \qquad (5)$$
As can be seen in figure 1, the leaf nodes of our de-
cision tree contain a distribution of complementary
class probabilities. We can output this distribution or
transform it to normal class probabilities and use the
majority class as a decision.
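The adapted impurity of equation (5) and the leaf decision rule described above can be sketched as follows in Python; the function names and the small epsilon guard against log(0) are our own illustrative choices, not part of the paper:

```python
import numpy as np

def counterexample_entropy(p_bar, eps=1e-12):
    """Entropy of a node based on complementary probabilities,
    H(X) = -sum_k (1 - p_bar(k)) * ln(1 - p_bar(k)), cf. equation (5)."""
    q = 1.0 - np.asarray(p_bar, dtype=float)  # derived class probabilities
    q = np.clip(q, eps, 1.0)                  # guard against log(0)
    return -np.sum(q * np.log(q))

def leaf_decision(p_bar):
    """Transform complementary probabilities back to class probabilities
    and return the majority class."""
    return int(np.argmax(1.0 - np.asarray(p_bar, dtype=float)))

# A 'pure' node: K - 1 classes have complementary probability 1.
print(counterexample_entropy([1.0, 1.0, 0.0]))  # prints a value close to 0
print(leaf_decision([1.0, 1.0, 0.0]))           # 2
```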
4 RESULTS
The algorithms were implemented within the framework
of the Weka (Hall et al., 2009) and Mulan
(Tsoumakas et al., 2011) software packages.
Each experiment has been repeated 500
times. In each iteration, we split up the dataset ran-
domly and use 30% for training and 70% for test-
ing. We deduce the corresponding complementary
class probabilities from the original class probabili-
ties for each example. We then remove 0% (corre-
sponds to a fully labeled dataset) to 90% of informa-
tion from the $\vec{y}$ vectors in the training dataset (plotted
on the x-axis) and plot the accuracy as a boxplot on
the y-axis. As we remove information from the entries
in $\vec{y}$, the complementary probabilities $\bar{p}(k)$ become
smaller. However, this is not an issue, as H(X) scales
monotonically with information removal as shown in
the Appendix.
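To make the removal protocol concrete, here is a possible reconstruction in Python (our own sketch; the original experiments use Weka and Mulan, and the exact sampling scheme is an assumption):

```python
import numpy as np

def make_counterexample_vectors(labels, K, remove_fraction, rng):
    """Build counterexample vectors from ground-truth labels and then
    randomly remove a fraction of the 'not this class' entries."""
    n = len(labels)
    Y = np.ones((n, K), dtype=int)
    Y[np.arange(n), labels] = 0          # a full vector has K - 1 ones
    ones = np.argwhere(Y == 1)
    n_remove = int(remove_fraction * len(ones))
    drop = rng.choice(len(ones), size=n_remove, replace=False)
    Y[ones[drop, 0], ones[drop, 1]] = 0  # erase selected counterexample entries
    return Y

rng = np.random.default_rng(0)
labels = np.array([0, 2, 1, 2])
print(make_counterexample_vectors(labels, K=3, remove_fraction=0.5, rng=rng))
```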
4.1 Contact Lenses
The lenses dataset consists of 24 examples. The goal
is to predict whether a person should be fitted with
hard or soft contact lenses or no contact lenses based
on four attributes. Figure 2 shows the accuracy of
the decision tree that is induced from counterexamples
for a varying amount of information. We can observe
a linear decline of accuracy with the amount of
information removed. As the dataset is very small,
removing information has a strong impact on the resulting
decision tree. However, if we remove up to 30%
of the information, we still obtain acceptable accuracy.

Figure 2: Information vs. accuracy on lenses dataset.
4.2 Balance Scale
In the balance scale dataset, each example is classi-
fied as having the balance scale tip to the right, tip to
the left, or be balanced. The four attributes are the
left weight, the left distance, the right weight, and the
right distance. Figure 3 shows the accuracy of the de-
cision tree that is induced from counterexamples for a
varying amount of information. The decline in accu-
racy is not as steep as for the contact lenses dataset,
which is due to the larger number of 187 examples in
the training set.

Figure 3: Information vs. accuracy on balance scale dataset.
4.3 Nursery
The nursery dataset was derived from a hierarchical
decision model originally developed to rank applica-
tions for nursery schools in five different classes. It
contains 12960 examples. As the accuracy hardly declines
between 0% and 99% of information removed, we plot
the experiment for 99% to 99.9% of information removed
in Figure 4. We can observe that the accuracy declines
only very late.

Figure 4: Information vs. accuracy on nursery dataset.
5 CONCLUSIONS
In this work, we have presented a new approach to
induce a decision tree classifier from counterexam-
ples. Based on complementary probabilities we have
adapted the entropy measure in order to work with
this new type of human feedback. Normal exam-
ples can also be integrated seamlessly by deducing
the complementary class probabilities from the given
class probabilities. We have observed that this ap-
proach works well even if we remove a significant
amount of information from the training examples.
This shows that we can learn from counterexamples in
a practical setting, where the user typically provides
fewer than K − 1 class labels. We hope that this work
inspires future work in the community on different
forms of feedback in machine learning.
REFERENCES
Ashwin, T., Jain, N., and Ghosal, S. (2001). Improving
image retrieval performance with negative relevance
feedback. Acoustics, Speech, and Signal Processing,
IEEE International Conference on, 3:1637–1640.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann,
P., and Witten, I. H. (2009). The weka data mining
software: an update. SIGKDD Explor. Newsl., 11:10–
18.
Joshi, A. J., Porikli, F., and Papanikolopoulos, N. (2010).
Breaking the interactive bottleneck in multi-class clas-
sification with active selection and binary feedback. In
CVPR, pages 2995–3002. IEEE.
Mueller, H., Mueller, W., Squire, D. M., Marchand-Maillet,
S., and Pun, T. (2000). Strategies for positive and neg-
ative relevance feedback in image retrieval. In Pro-
ceedings of the International Conference on Pattern
Recognition - Volume 1, volume 1, pages 1043–1046,
Washington, DC, USA. IEEE Computer Society.
Podgorelec, V., Kokol, P., Stiglic, B., and Rozman, I.
(2002). Decision trees: An overview and their use
in medicine. J. Med. Syst., 26:445–463.
Shannon, C. E. (2001). A mathematical theory of commu-
nication. SIGMOBILE Mob. Comput. Commun. Rev.,
5(1):3–55.
Tsoumakas, G. and Katakis, I. (2007). Multi label classi-
fication: An overview. International Journal of Data
Warehouse and Mining, 3(3):1–13.
Tsoumakas, G., Spyromitros-Xioufis, E., Vilcek, J., and
Vlahavas, I. (2011). Mulan: A java library for multi-
label learning. Journal of Machine Learning Re-
search. (to appear).
APPENDIX
We use the scalar $\alpha \ge 1$ to compensate for the decline
in $\bar{p}(k)$ due to the information removal. Scaling the
terms $1 - \bar{p}(k)$ in equation 5 by $\alpha$ yields

$$\begin{aligned}
H_\alpha(X) &= -\sum_{k=1}^{K} \alpha\big(1 - \bar{p}(k)\big) \ln\!\big(\alpha(1 - \bar{p}(k))\big) \\
&= -\sum_{k=1}^{K} \alpha\big(1 - \bar{p}(k)\big)\big[\ln(\alpha) + \ln\big(1 - \bar{p}(k)\big)\big] \\
&= -\alpha \sum_{k=1}^{K} \big(1 - \bar{p}(k)\big) \ln\big(1 - \bar{p}(k)\big)
   - \alpha \ln(\alpha) \sum_{k=1}^{K} \big(1 - \bar{p}(k)\big) \\
&= \alpha H(X) - \alpha \ln(\alpha),
\end{aligned}$$

where the last step uses $\sum_{k=1}^{K} (1 - \bar{p}(k)) = 1$, which follows from
equation 4. Since this is a monotonically increasing affine transformation of
H(X) for fixed $\alpha$, the relative comparison of data partitions is not
affected by the information removal.
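A quick numerical check of this identity (an illustrative sketch with arbitrary toy values, not part of the paper):

```python
import numpy as np

def H(q):
    """Entropy over the derived class probabilities q(k) = 1 - p_bar(k)."""
    return -np.sum(q * np.log(q))

q = np.array([0.6, 0.3, 0.1])  # 1 - p_bar(k) for a toy node; sums to 1
alpha = 2.5
lhs = -np.sum(alpha * q * np.log(alpha * q))
rhs = alpha * H(q) - alpha * np.log(alpha)
print(np.isclose(lhs, rhs))  # True
```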