DISPERSION EFFECT ON GENERALISATION ERROR IN CLASSIFICATION

Experimental Proof and Practical Algorithm

Benoît Gandar 1,2, Gaëlle Loosli 1

1 Clermont Université, Université Blaise Pascal, 63000 Clermont-Ferrand, France
and CNRS, UMR 6158, LIMOS, 63 173 Aubière, France

Guillaume Deffuant 2

2 Cemagref de Clermont-Ferrand, Laboratoire LISC, 24 avenue des Landais, 63 172 Aubière Cedex 1, France

Keywords: Machine Learning, Classification, Space Filling Design, Dispersion.

Abstract: Recent theoretical work proposes dispersion as a criterion for generating learning points. The aim of this paper is to convince the reader, with experimental evidence, that dispersion is a good criterion in practice for generating learning points for classification problems. The problem of generating learning points then amounts to generating points with the lowest possible dispersion. Consequently, we present the low dispersion algorithms existing in the literature, analyze them, and propose a new algorithm.

1 INTRODUCTION

In the context of classification tasks, we address the question of the optimal position of points in the feature space, in the case where one has to define a given number of training points anywhere in the space. These training points will be called a sequence from now on. For that purpose, we use the dispersion of the sequence, which is an estimator of its spread. Dispersion is defined in the next section, along with references to a paper stating that this criterion is likely to be more efficient for classification tasks than other well-known criteria such as discrepancy. Our point in this position paper is to convince the reader that a) dispersion is in practice a good criterion and b) we can provide an efficient algorithm to use it, even though evaluating dispersion is costly.

2 DEFINITION OF DISPERSION

Dispersion is an estimator of the spread of a sequence used in numerical optimization. It is typically used in iterative algorithms to approximate the extremum of a non-differentiable function on a compact set. The approximation error can also be expressed theoretically as a function of the dispersion (Niederreiter, 1992).

Dispersion of a Sequence: Let $I^s$ be the unit cube in dimension $s$, equipped with the Euclidean distance $d$. The dispersion of a sequence $x = \{x_1, \dots, x_n\}$ is defined by:

$$\delta(x) = \sup_{y \in I^s} \; \min_{i=1,\dots,n} d(y, x_i)$$

Figure 1: Estimation of dispersion. The dispersion is the radius of the largest ball containing no point.

The dispersion of a sequence is thus the radius of the largest empty ball of $I^s$ (see Figure 1).
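Since evaluating this supremum exactly is costly, a simple Monte Carlo approximation (from below) can be obtained by probing the cube at random. The following Python sketch illustrates the definition; the function and parameter names are our own.

    import numpy as np
    from scipy.spatial import cKDTree

    def dispersion_estimate(x, n_probe=100_000, rng=None):
        """Monte Carlo estimate (from below) of delta(x) for a
        sequence x of shape (n, s) in the unit cube I^s."""
        rng = np.random.default_rng(rng)
        probes = rng.random((n_probe, x.shape[1]))  # random y in I^s
        nearest, _ = cKDTree(x).query(probes)       # min_i d(y, x_i)
        return nearest.max()                        # approximates the sup over y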

Bounds on the Dispersion of a Sequence: (Niederreiter, 1992) shows that the dispersion of a sequence $x = \{x_1, \dots, x_n\}$ in $I^s$ is lower bounded by:

$$\delta(x) \ge \frac{1}{2\,\lfloor \sqrt[s]{n} \rfloor}. \qquad (1)$$

Moreover, he proves that for each dimension $s$, there exists a sequence $x$ in $I^s$ such that:

$$\lim_{n \to +\infty} \sqrt[s]{n}\;\delta(x) = \frac{1}{\log(4)}.$$

As a consequence, for each dimension $s$, there exists a sequence $x$ in $I^s$ such that $\delta(x) = O\left(\frac{1}{\sqrt[s]{n}}\right)$. Given the previous inequality, this order is the lowest possible.
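For concreteness, the lower bound (1) is straightforward to evaluate; this small Python sketch computes it for the experimental setting used later in the paper (2000 points in dimension 5).

    import math

    def dispersion_lower_bound(n, s):
        """Inequality (1): delta(x) >= 1 / (2 * floor(n**(1/s)))."""
        return 1.0 / (2 * math.floor(n ** (1.0 / s)))

    print(dispersion_lower_bound(2000, 5))  # 0.125, since floor(2000**(1/5)) = 4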

3 EXPERIMENTAL EVIDENCE

In order to illustrate the theoretical results and to observe the effects of decreasing the dispersion of training points, we present experiments involving the following three steps:

1. Learning with k-NN from random samples of fixed size for different classification functions, and estimating the generalization error (a minimal sketch of this step is given after this list).

2. Then we decrease the sequence dispersion by running a few iterations of the algorithm described in Section 5. We apply the classification algorithm using these new samples and estimate the generalization error. This step is repeated until the dispersion algorithm stops.

3. Finally, we plot the generalization error rate against the (decreasing) dispersion rate with box-and-whisker plots. Note that the dispersion is observed on the sequences and is not a parameter. Each box has lines at the lower quartile, median, and upper quartile values. Whiskers extend from each end of the box to the adjacent values in the data; the most extreme values lie within 1.5 times the interquartile range from both ends of the box. Adjacent values represent 86.6% of the population for a Gaussian distribution.
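As a minimal sketch of step 1, assuming scikit-learn and a labelling function f standing in for our classification rules (the names and the default choice k = 1 are our assumptions), the generalization error can be estimated as follows.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def knn_generalisation_error(f, train_x, n_test=5000, k=1, rng=None):
        """Train k-NN on the sequence train_x (shape (n, s)) labelled
        by f, then estimate the error on uniform test points in I^s."""
        rng = np.random.default_rng(rng)
        clf = KNeighborsClassifier(n_neighbors=k).fit(train_x, f(train_x))
        test_x = rng.random((n_test, train_x.shape[1]))
        return float(np.mean(clf.predict(test_x) != f(test_x)))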

We have generated two types of classification rules in spaces of dimension 2 to 5:

• The first type corresponds to simple classification problems: the classification boundaries have small variations and are smooth.

• The second type corresponds to difficult classification problems: the classification boundaries have larger variations and are less smooth; moreover, the classification surfaces have more connected components.

Experimental Protocol: We ran experiments with 2000 learning points and 5000 test points on 500 learning problems, in spaces of dimension 2 to 5. The results are similar in all tested dimensions, and we report those for dimension 5 in Figures 2 and 3.

Figure 2: Relative decrease of generalization error varying with the relative decrease of dispersion for 500 simple learning problems in dimension 5 with 2000 learning points (if $\delta_i$ is the initial dispersion and $\delta_t$ the dispersion after several iterations of the algorithm, the relative decrease is $\frac{\delta_i - \delta_t}{\delta_i}$). The boxes represent the distribution of error rates.

Figure 3: Relative decrease of generalization error varying with the relative decrease of dispersion for 500 hard learning problems in dimension 5 with 2000 learning points.

Conclusion and Discussion: We can see that the median lines (middle lines of the boxes) are below zero for the majority of problems. This shows that decreasing the dispersion reduces the generalization error in more than 50% of cases. We can also note that the larger the decrease in dispersion, the lower the generalization error rate. The upper lines of the boxes (which mark 75% of the population) are also below zero once the dispersion minimization algorithm has converged. Moreover, it seems that the more complex the learning problems are, the more sensitive the k-NN algorithm is to the dispersion of the learning points.

4 LOW DISPERSION ALGORITHMS

In this section, we aim at generating a sequence of fixed size with the lowest possible dispersion. The pioneering work of (Johnson et al., 1990) on the generation of low dispersion sequences proposed two different criteria: a criterion to minimize, called minimax, and a criterion to maximize, called maximin. We first present the minimax criterion and the algorithms based on it; we then do the same for the maximin criterion. In each case, we explain why the algorithms presented are not optimal.

4.1 Minimax

Algorithms based on the minimax criterion try to minimize the maximal distance between the points of the sequence and candidate points.

Algorithms based on a Point Swapping Process: (Johnson et al., 1990) proposed an algorithm which reduces the minimax criterion by simple point swaps, starting from a random sequence. The convergence of the algorithm is guaranteed, but there is no guarantee that the best possible sequence is generated.
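The following Python sketch conveys the swap idea (it is not the authors' exact procedure): the minimax cost is estimated over random probe points, and a swap between the sequence and a candidate set is accepted whenever it lowers that cost. All names and parameters are our assumptions.

    import numpy as np
    from scipy.spatial import cKDTree

    def minimax_swap(candidates, k, n_probe=20_000, rng=None):
        """Greedy swapping sketch: keep a k-subset of `candidates` and
        accept any swap that lowers the estimated minimax distance."""
        rng = np.random.default_rng(rng)
        probes = rng.random((n_probe, candidates.shape[1]))

        def cost(idx):
            # estimated sup_y min_i d(y, x_i), over the probe set
            return cKDTree(candidates[idx]).query(probes)[0].max()

        idx = rng.choice(len(candidates), size=k, replace=False).tolist()
        best, improved = cost(idx), True
        while improved:                      # stop at a local optimum
            improved = False
            for i in range(k):
                for j in set(range(len(candidates))) - set(idx):
                    trial = idx[:i] + [j] + idx[i + 1:]
                    c = cost(trial)
                    if c < best:
                        idx, best, improved = trial, c, True
        return candidates[idx]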

Algorithms based on an Adding Points Process: (Lindemann and LaValle, 2004) developed an incremental algorithm that reduces the dispersion of sequences by adding points. They thus reduce the dispersion while sidestepping the initial global minimization problem. However, since this method reduces the dispersion incrementally by adding points, it is not optimal when the user has a fixed maximal number of points.

4.2 Maximin

An alternative to minimizing the dispersion consists in maximizing the distances between the points of the sequence. This is easier to optimize numerically and corresponds to the maximin criterion defined by $\delta_2(x) = \inf_{(x_1, x_2) \in x^2,\, x_1 \neq x_2} d(x_1, x_2)$. We can once more distinguish two types of algorithms.

Algorithms based on a Deleting Points Process: (Sergent et al., 1997) proposed to generate a sequence with many points and to delete superfluous points until a sequence with a fixed minimal distance between points is obtained. The major disadvantage of this algorithm is that the size of the final sequence is nearly impossible to predict.

Algorithms based on an Adding Points Process: It is also possible to generate a low dispersion sequence by adding points, according to the maximin criterion, to an initial sequence whose dispersion is high. However, algorithms based on the maximization of $\delta_2(x)$ generally push points towards the boundary of the cube. This effect can be avoided by considering the criterion $\delta_3(x) = \inf_{(x_1, x_2) \in x^2} d(x_1, \{x_2\} \cup I^s_0)$, where $I^s_0 = \{x \in \mathbb{R}^s \mid x \notin I^s\}$. A set which maximizes $\delta_3(x)$ is called a maximin set. Inspired by the work of (Lindemann and LaValle, 2004), (Teytaud et al., 2007) developed an algorithm based on the criterion $\delta_3(x)$ and on random trees to explore the space. In the first step, they generate a point in the middle of the unit cube. Then, at each step, they add the point which maximizes the criterion $\delta_3(x)$. Each added point is therefore optimal with respect to the previous configuration, but the resulting configuration is not necessarily the best one for the total number of points.
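A rough Python sketch of this incremental construction (drawing random candidates instead of growing random trees, so it only captures the spirit of the method; all names are our assumptions):

    import numpy as np
    from scipy.spatial import cKDTree

    def greedy_maximin(n, s, n_cand=2000, rng=None):
        """Greedily add the candidate furthest from both the current
        sequence and the cube boundary (the delta_3 criterion)."""
        rng = np.random.default_rng(rng)
        pts = [np.full(s, 0.5)]                    # first point: cube centre
        for _ in range(n - 1):
            cand = rng.random((n_cand, s))
            d_seq = cKDTree(np.array(pts)).query(cand)[0]        # dist to sequence
            d_border = np.minimum(cand, 1.0 - cand).min(axis=1)  # dist to I^s_0
            pts.append(cand[np.minimum(d_seq, d_border).argmax()])
        return np.array(pts)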

To Summarize. None of these methods is optimal: to generate a low dispersion sequence, the total number of points must be taken into account from the start. This is the purpose of the next section.

5 A NEW LOW DISPERSION ALGORITHM

We propose an algorithm using the properties of dispersion established by (Niederreiter, 1992), based on the maximin criterion together with spring variables. These spring variables are functions of the distances between the points of the sequence and between the points and the borders of the unit cube. The algorithm has good properties in practice: for an appropriate number of points it converges to the Sukharev grid, which minimizes dispersion, and it generally converges quickly. Its complexity is about $O(n^2 k)$ for a sequence of size $n$ and $k$ iterations.

Sketch of the Algorithm: The basic idea of the algorithm is to achieve two tasks: spread the points as much as possible, while keeping them inside the unit cube and not too close to its boundaries (similarly to Sukharev grids). Hence we have defined two steps, each dealing with one of the tasks. The algorithm iterates over these steps until convergence. In practice, we also have to deal with some local minima (leading to oscillations) and with the stopping criteria.

Description of Each Step:

Initialization. We randomly generate a sequence $S = \{x_i\}_{i=1,\dots,n}$ of $n$ points in dimension $s$, preferably in the middle of the unit cube. At each step, each point pushes away its neighbors that are closer than $d_m = \frac{1}{\lfloor \sqrt[s]{n} \rfloor}$. Indeed, we know from inequality (1) that $\delta(S) \geq d_m / 2$.
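A minimal initialization sketch (the width of the centre bias is our assumption):

    import math
    import numpy as np

    def init_sequence(n, s, rng=None):
        """Random start biased towards the middle of the unit cube."""
        rng = np.random.default_rng(rng)
        return 0.5 + 0.25 * (rng.random((n, s)) - 0.5)

    def neighbour_threshold(n, s):
        """d_m = 1 / floor(n**(1/s)), the neighbour distance threshold."""
        return 1.0 / math.floor(n ** (1.0 / s))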

Spreading the Points. For each point $x$ of $S$, we only consider as neighbors the points $x_i$ of $S$ at distance less than $d_m$. We compute a spring variable between $x$ and each neighbor $x_i$, defined by $\left(\frac{2 d_m - d(x, x_i)}{2 d_m}\right)^p$. The parameter $p$ is a positive integer, and we have observed experimentally that $p = 4$ is satisfactory. With $p = 4$, the value of the spring variable ranges from 1/16 (farthest neighbors, at distance $d_m$) to 1 (nearest neighbors). Then, we move the point $x$ proportionally to each spring variable in the direction of the vector $x - x_i$: the closer the points are, the more the algorithm spaces them out. Moreover, the proportionality coefficient decreases over time until a threshold value, after which it remains constant.
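One spreading iteration could look like the following Python sketch (the step size is our assumption, and the decreasing proportionality schedule is omitted for brevity):

    import numpy as np

    def spread_step(x, d_m, step=0.1, p=4):
        """Push each point away from neighbours closer than d_m, with
        intensity ((2*d_m - d) / (2*d_m))**p for a neighbour at distance d."""
        moves = np.zeros_like(x)
        for i in range(len(x)):
            diff = x[i] - x                      # vectors x - x_i
            dist = np.linalg.norm(diff, axis=1)
            mask = (dist > 0) & (dist < d_m)     # neighbours of x only
            if mask.any():
                spring = ((2 * d_m - dist[mask]) / (2 * d_m)) ** p
                unit = diff[mask] / dist[mask][:, None]
                moves[i] = (spring[:, None] * unit).sum(axis=0)
        return x + step * moves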

This process is similar to the minimax approach and pushes the points outside of the unit cube. Generating a low dispersion sequence therefore amounts to minimizing a criterion under box constraints. We thus apply these constraints: the cube's borders are repulsive in the direction of the cube's center, depending on the distance between the points and the borders.

Applying Box Constraints. In order to keep points inside the hypercube, we apply a repulsive force on points near the borders. These points are detected as those having a coordinate lower than $\frac{d_m}{2} + \varepsilon_m$ or greater than $1 - \frac{d_m}{2} - \varepsilon_m$: these values are the coordinates of the extremal points of a Sukharev grid, with a tolerance of $\varepsilon_m = \frac{d_m}{4}$. The intensity of this force has the same properties as the forces used in the step called Spreading the Points. Globally, we perform a local dispersion minimization which becomes global as the process is iterated.
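A sketch of the corresponding constraint step (the exact force law is our assumption; we reuse the exponent p from the spreading step):

    import numpy as np

    def box_step(x, d_m, step=0.1, p=4):
        """Push coordinates closer than d_m/2 + eps_m to a face of the
        unit cube back towards the centre."""
        eps_m = d_m / 4.0
        lo, hi = d_m / 2.0 + eps_m, 1.0 - d_m / 2.0 - eps_m
        y = x.copy()
        near0 = y < lo                           # too close to a "0" face
        near1 = y > hi                           # too close to a "1" face
        y[near0] += step * ((lo - y[near0]) / lo) ** p
        y[near1] -= step * ((y[near1] - hi) / (1.0 - hi)) ** p
        return y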

Avoiding Local Minimum Configurations. Iteratively applying the two previous steps can lead to local minima and oscillations: a large number of points can become aligned on an edge. The repulsive forces then push these points in the same direction with the same intensity, while the tendency of the Spreading the Points step to push points outside the hypercube cancels the action of the repulsive forces, so these points oscillate. In order to prevent such local minimum configurations from developing, after a few iterations we randomly select a point on each border in each dimension and change its coordinate along that dimension to a random value near the middle of the cube.

Stopping Criteria: Different stopping criteria can be used to end the iterations: a maximal number of iterations, the stabilization of the dispersion, or a minimum threshold on the average displacement of the points during one iteration. This point still requires further exploration.
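Putting the pieces together with the average-displacement criterion (the thresholds are our assumptions, and the local-minimum perturbation is omitted), a minimal driver reusing the sketches above could be:

    import numpy as np

    def low_dispersion_sequence(n, s, max_iter=500, tol=1e-4, rng=None):
        """Iterate the spreading and box-constraint steps until the
        points barely move (init_sequence, neighbour_threshold,
        spread_step and box_step are defined in the sketches above)."""
        x = init_sequence(n, s, rng)
        d_m = neighbour_threshold(n, s)
        for _ in range(max_iter):
            x_new = box_step(spread_step(x, d_m), d_m)  # one full iteration
            if np.abs(x_new - x).mean() < tol:          # average displacement
                return x_new
            x = x_new
        return x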

6 CONCLUSIONS

In this paper, we experimentally illustrate the theoretical result established by (Gandar et al., 2009) showing that dispersion is probably a pertinent criterion for generating samples for classification, and we address the question of generating the best low dispersion samples. We provide a fairly simple algorithm able to minimize the dispersion of a sequence of fixed size.

In experimental design, grids are usually used to select sets of parameters for experiments. However, using grids imposes hard limits on the number of parameters that can be explored (often fewer than 6). We believe that being able to efficiently generate low dispersion sequences can help in this context, since the number of points can be fixed to any value in any dimension (obviously, the number of points has to be realistic with respect to the dimension). In active learning, the learning algorithm has to select which training points will be used (implying that their labels are asked for, which has a cost). Most of the time the training points pre-exist, but it happens that one can ask for any point in the space. In that particular case, given a limited budget for labels (which determines the maximum number of training points), the proposed algorithm could be applied directly. Concerning the selection task, our algorithm can be adapted to select an existing point near the ideal position. This is our future work.

REFERENCES

Gandar, B., Loosli, G., and Deffuant, G. (2009). How to optimize sample in active learning: Dispersion, an optimum criterion for classification? In European Conference ENBIS, European Network for Business and Industrial Statistics.

Johnson, M., Moore, L., and Ylvisaker, D. (1990). Minimax and maximin distance designs. Journal of Statistical Planning and Inference, 26(2):131–148.

Lindemann, S. and LaValle, S. (2004). Incrementally Reducing Dispersion by Increasing Voronoi Bias in RRTs. In IEEE International Conference on Robotics and Automation.

Niederreiter, H. (1992). Random Number Generation and Quasi-Monte Carlo Methods. Society for Industrial and Applied Mathematics.

Sergent, M., Phan Tan Luu, R., and Elguero, J. (1997). Statistical Analysis of Solvent Scales. Anales de Química, 93(Part 1):3–6.

Teytaud, O., Gelly, S., and Mary, J. (2007). Active learning in regression, with application to stochastic dynamic programming. In Proceedings of the International Conference on Informatics in Control, Automation and Robotics.
