CLASSIFICATION BY SUCCESSIVE NEIGHBORHOOD
David Grosser, Henri Ralambondrainy and Noel Conruyt
Laboratoire d'Informatique et Mathématiques, Université de la Réunion, 97490 Sainte-Clotilde, France
Keywords:
Classification, Similarity, Nearest neighbors, Structured data, Systematics.
Abstract:
The formalization of scientific knowledge in the life sciences by experts in biology or Systematics produces arborescent representations whose values may be present, absent or unknown. To improve the robustness of the classification process for such complex objects, which are often only partially described, we propose a new classification method that is iterative, interactive and semi-directed. It combines inductive techniques for the choice of discriminating variables with a search for nearest neighbors based on various similarity measures that take into account both the structures and the values of the objects in the neighborhood computation.
1 INTRODUCTION
Systematics is the scientific discipline that deals with listing, describing, naming, classifying and identifying living beings. In the frame of environmental sciences, the acquisition and production of knowledge on biological specimens and taxa is an essential part of the work of systematicians (Winston, 1999). Indeed, being able to describe, classify and identify a specimen from morphological characters is a first step for monitoring biodiversity, because it gives access to information relative to its species name (Biology, Geography, Ecology, etc.). This process can be assisted with computer science decision support tools. In return, such complex domains deliver interesting symbolic and numerical knowledge representation and processing problems to the knowledge engineering and computer science community.
Indeed, classical discrimination methods developed in the frame of data analysis or machine learning, such as classification or decision trees (Breiman et al., 1984), (Quinlan, 1986), or more recent methods developed in the data mining field, such as association rule mining (Piatetsky-Shapiro, 1991) or multifactor dimensionality reduction (Zhu and Davidson, 2007), are not sufficient, because they do not cope with relations between attributes or with missing data, and they are not very tolerant of errors in descriptions.
The problem we are faced with is to determine the class of a structured description that is only partially filled in and possibly contains errors, from a reference case base classified a priori into k classes by qualified experts. The proposed discrimination method proceeds by inference of successive neighborhoods. It is inductive, interactive, iterative and semi-directed. It combines inductive techniques for selecting discriminatory variables with a neighbor search, with the help of a similarity measure that takes into account both the structure (dependencies between variables) and the content (missing and unknown values).
2 DATA REPRESENTATION
Within a knowledge base, observations are described with the help of descriptive models. A descriptive model represents ontological knowledge about the considered domain and specifies the structure and organization of the descriptors.
2.1 The Descriptive Model
The descriptive model (fig. 1), or schema, is a rooted tree M = (A, U), where A is a set of nodes (attributes) and U a set of edges. Leaves are single classical attributes, numerical or nominal ones, called "basic attributes". Internal nodes are "structured attributes", sub-trees made of several attributes. For example, A_j : <A_1, ..., A_p> denotes a structured attribute where A_j is the root of the sub-tree and A_1, ..., A_p are the sons, structured or basic attributes, the components of A_j.
Figure 1: This example shows part of the description of a specimen belonging to the genus Stylocoeniella.
2.2 Object Representation
An object, described by a schema M, is a tree derived from M where each basic attribute has been valued (fig. 1). A skeleton represents the structure of an object; it describes the state of each component: present (+), absent (or missing) (−), or unknown (∗). We denote by S = {+, −, ∗}. A map σ : A → S defines a labeled rooted tree H_σ = (A_σ, U) where A_σ = {(A_j, σ(A_j)) | A_j ∈ A}. The skeleton of an object is represented by H_σ. The nodes of a skeleton must respect the following consistency constraints: for each structured attribute B : <B_l>_{l∈L}, we must have:

1. "The sons of a missing node must be missing": if σ(B) = − then σ(B_l) = −, for l ∈ L;

2. "The sons of an unknown node are unknown or missing": if σ(B) = ∗ then σ(B_l) = ∗|−, for l ∈ L;

3. "The sons of a present node may be present, missing or unknown": if σ(B) = + then σ(B_l) = +|−|∗, for l ∈ L.
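To make these constraints concrete, here is a minimal Python sketch, assuming a simple representation in which a skeleton node carries a state from S and a list of sons. The Node class, the string encoding of S and the attribute names in the example are illustrative, not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical encoding of the state set S = {+, -, *}.
PRESENT, MISSING, UNKNOWN = "+", "-", "*"

@dataclass
class Node:
    """A node (A_j, sigma(A_j)) of a skeleton H_sigma."""
    name: str
    state: str                          # one of PRESENT, MISSING, UNKNOWN
    children: List["Node"] = field(default_factory=list)

def is_consistent(node: Node) -> bool:
    """Check the three consistency constraints, recursively."""
    for child in node.children:
        if node.state == MISSING and child.state != MISSING:
            return False                # 1. sons of a missing node are missing
        if node.state == UNKNOWN and child.state == PRESENT:
            return False                # 2. sons of an unknown node are unknown or missing
        if not is_consistent(child):    # 3. a present node allows any son state
            return False
    return True

# Example: a present node with an unknown son whose own son is missing.
skeleton = Node("corallite", PRESENT,
                [Node("septa", UNKNOWN, [Node("teeth", MISSING)])])
assert is_consistent(skeleton)
```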
We denote by H the set of consistent skeletons. Assume we are given a set of basic attribute names A_q and corresponding domains D_q, for q ∈ Q. For any object, a basic attribute A_q is valued in D_q only if the attribute is present. Missing values are denoted by ⊥ and unknown values by ∗, so that Γ_q = D_q ∪ {⊥} ∪ {∗} is the new domain of a basic attribute A_q. The set of values of an object o is v_o = (v_q)_{q∈Q} where v_q ∈ Γ_q. Let Γ_Q = ∏_{q∈Q} Γ_q; an object is described by its skeleton and values: o = (H_{σ_o}, v_o) ∈ E = H × Γ_Q.
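Continuing the sketch above, an object can be represented as a consistent skeleton paired with one value per basic attribute, with None standing for ⊥ and the string "*" for an unknown value. This is an assumed encoding; Obj and the field names are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, Union

# A value v_q lives in Gamma_q = D_q ∪ {⊥} ∪ {*}: a domain value,
# None (standing for the missing value ⊥) or "*" (unknown).
Value = Union[float, str, None]

@dataclass
class Obj:
    """An object o = (H_sigma_o, v_o) in E = H x Gamma_Q."""
    skeleton: "Node"              # a consistent skeleton, as sketched above
    values: Dict[str, Value]      # v_o, keyed by basic-attribute name

# Reuses the skeleton built in the previous sketch.
specimen = Obj(skeleton, {"septa_number": 24, "teeth_shape": None})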
3 SIMILARITY MEASURES
In this section, we propose a function to evaluate the similarity level between a pair of skeletons. The weight m(A_j) of the attribute A_j is the number of attributes it is made of. It means that the contribution of a structured attribute to the similarity evaluation of skeletons depends on the number of its components. Given a comparison map λ : S × S → [0, 1] (Table 1), the structural weighted similarity measure between the skeletons H_{σ_1} and H_{σ_2} is defined as:

ζ_SW(H_{σ_1}, H_{σ_2}) = Σ_{j∈J} m(A_j) λ({σ_1(A_j), σ_2(A_j)}) / Σ_{j∈J} m(A_j).
If we take α¹_j = β¹_j = α²_j = γ_j = β²_j = 0 (Table 1) and a weight equal to 1 for all attributes, then the proposed function is the well-known Sokal index (Sneath and Sokal, 1973). Another possible choice is to take β¹_j = β²_j = γ_j = 0 for the comparison of unknown values and to define λ({σ_1(A_j) = +, σ_2(A_j) = −}) = α¹_j as the ratio of missing sons of A_j in the skeleton H_{σ_1}.
Table 1: Values of the comparison map λ for a given node A_j.

                H_σ2
  H_σ1     +        −        ∗
   +       1        α¹_j     β¹_j
   −       α²_j     1        γ_j
   ∗       β²_j     γ_j      1
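As an illustration, here is a sketch of ζ_SW under the Node representation sketched in Section 2.2. The LAM dictionary instantiates Table 1 with all α, β and γ set to 0, the choice that reduces to the Sokal index when all weights equal 1; the function names are illustrative.

```python
def weight(node: Node) -> int:
    """m(A_j): the number of attributes the sub-tree rooted at A_j is made of."""
    return 1 + sum(weight(c) for c in node.children)

def iter_nodes(node: Node):
    """Pre-order traversal; both skeletons share the schema, hence the same order."""
    yield node
    for c in node.children:
        yield from iter_nodes(c)

# Comparison map of Table 1, here with alpha = beta = gamma = 0.
LAM = {(a, b): 1.0 if a == b else 0.0
       for a in (PRESENT, MISSING, UNKNOWN)
       for b in (PRESENT, MISSING, UNKNOWN)}

def zeta_sw(h1: Node, h2: Node) -> float:
    """Structural weighted similarity between two skeletons of one schema."""
    num = den = 0.0
    for n1, n2 in zip(iter_nodes(h1), iter_nodes(h2)):
        w = weight(n1)
        num += w * LAM[(n1.state, n2.state)]
        den += w
    return num / den
```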
For some applications, users wish to emphasize present attributes rather than missing or unknown ones. For that purpose, we define the structural recursive similarity measure built from a map λ_r as follows: if A_j is a basic attribute present in both H_{σ_1} and H_{σ_2}, i.e. σ_1(A_j) = σ_2(A_j) = +, then

λ_r(+, +) = 1,   (1)

else A_j : <A_l>_{l∈L} is a structured attribute and

λ_r(+, +) = (1 + Σ_l m(A_l) λ_r({σ_1(A_l), σ_2(A_l)})) / m(A_j).   (2)

This similarity measure is recursively computed from the root A:

ζ_SR(H_{σ_1}, H_{σ_2}) = λ_r({σ_1(A), σ_2(A)}).   (3)
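A corresponding sketch of the recursive measure, reusing weight and LAM from the previous block, so that the non-(+, +) cases fall back to the Table 1 values:

```python
def lam_r(n1: Node, n2: Node) -> float:
    """Recursive comparison map lambda_r, equations (1) and (2)."""
    if n1.state == n2.state == PRESENT:
        if not n1.children:                        # basic attribute: eq. (1)
            return 1.0
        s = sum(weight(c1) * lam_r(c1, c2)         # structured attribute: eq. (2)
                for c1, c2 in zip(n1.children, n2.children))
        return (1.0 + s) / weight(n1)
    return LAM[(n1.state, n2.state)]               # other pairs: Table 1

def zeta_sr(h1: Node, h2: Node) -> float:
    """Structural recursive similarity, computed from the root A (eq. 3)."""
    return lam_r(h1, h2)
```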
4 CLASSIFICATION BY SUCCESSIVE NEIGHBORHOOD
The proposed classification method determines the membership in a particular class of an individual that is partially described by a user and possibly contains some errors. It takes into account
dependence relations between attributes. The principle consists in selecting a neighborhood V, i.e. a set of descriptions (individuals or classes) close to the current description, with the help of the similarity measure. A set of candidate classes is computed from the neighbor set. The method then seeks to supplement the missing information, first by applying coherency rules, and second by proposing a set of discriminant attributes. A new neighborhood is then computed on the basis of the new partial description. The process is reiterated until a homogeneous set of descriptions is obtained.
4.1 Neighborhood Classification Algorithm
The iterative process to predict the class of a specimen, from a given description e, is made of the following steps (a minimal sketch of the loop is given after the list):

1. Initialize the radius ρ_0 to the maximum distance from e to the set of observations;

2. Determine the set of objects inside the sphere of radius ρ_m centered at e;

3. Compute the classification scores of the a priori classes;

4. Compute a new radius value from the set of neighbors;

5. Repeat 2, 3, 4 until the stopping condition is satisfied.
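Putting the five steps together, here is a minimal sketch of the loop, assuming a dissimilarity d (for instance 1 − ζ_SW, extended to valued objects), a pre-classified case base, and the thresholds of Section 4.1.3. The function name and signature are assumptions, not the authors' implementation.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple

def classify_by_neighborhood(e, objects: Sequence, labels: Sequence[str],
                             d: Callable, r0: float = 2.0,
                             min_size: int = 15) -> Tuple[str, list]:
    """Iterative neighborhood classification of a description e."""
    prior = Counter(labels)                          # class sizes |C_l| in O
    n = len(objects)
    neighbors = list(zip(objects, labels))
    radius = max(d(e, o) for o in objects)           # step 1: rho_0 = D_max(e, O)
    while True:
        neighbors = [(o, l) for o, l in neighbors    # step 2: sphere of radius rho_m
                     if d(e, o) < radius]
        if not neighbors:
            return "unresolved", neighbors           # degenerate empty neighborhood
        counts = Counter(l for _, l in neighbors)    # step 3: scores R_l
        scores = {l: (counts[l] / len(neighbors)) / (prior[l] / n)
                  for l in counts}
        best = max(scores, key=scores.get)
        if scores[best] > r0 or len(neighbors) <= min_size:
            return best, neighbors                   # stopping condition (4.1.3)
        radius = max(d(e, o) for o, _ in neighbors)  # step 4: D_max (or D_disp)
```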
4.1.1 Neighbors Set
The neighbors of e at iteration m are the objects inside the sphere of radius ρ_m centered at e:

N^(m) = {o ∈ O | d(e, o) < ρ_m}.

The radius value is determined from the maximum distance, a dissimilarity measure between e and a set A:

D_max(e, A) = max_{a∈A} d(e, a),

so that ρ_m is written ρ_m = D_max(e, N^(m−1)). It is easy to show that {ρ_m} is a decreasing sequence. If each object o_i has a normalized weight p_i, then the dispersion distance

D_disp(e, A) = Σ_{o_i∈A} p_i d²(e, o_i)

is also suitable; we can show that {ρ_m = D_disp(e, N^(m−1))} is still a decreasing sequence.
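Under the same assumptions, both radius rules are short functions; uniform weights p_i = 1/|A| are used as an illustrative default for the dispersion variant.

```python
def d_max(e, A, d):
    """D_max(e, A): the maximum distance from e to the set A."""
    return max(d(e, a) for a in A)

def d_disp(e, A, d, p=None):
    """D_disp(e, A): weighted dispersion of the squared distances."""
    p = p if p is not None else [1.0 / len(A)] * len(A)   # normalized weights p_i
    return sum(p_i * d(e, a) ** 2 for p_i, a in zip(p, A))
```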
4.1.2 Class Classification Score
The examples O are pre-classified into classes denoted by {C_l}_{l∈K}. Let

Pr(C_l | N^(m)) = |C_l ∩ N^(m)| / |N^(m)|

be the probability of the cluster C_l given the set of neighbors N^(m), i.e. the relative frequency of the cluster C_l in the set of neighbors N^(m). The label assigned to the input e will be chosen among the clusters whose probability Pr(C_l | N^(m)) is significantly different from the prior probability of the cluster, Pr(C_l | O) = |C_l| / |O|. A usual statistical significance test on frequencies, or a user-defined threshold, may be used for this purpose. Then, the classification score of the class C_l at iteration m is

R_l = Pr(C_l | N^(m)) / Pr(C_l | O).
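For instance, given the neighbor labels, the scores reduce to a ratio of relative frequencies; a Counter-based sketch (names illustrative), with the numbers in the final comment taken from the setosa line for e_2 in Table 3:

```python
from collections import Counter

def classification_scores(neighbor_labels, all_labels):
    """R_l = Pr(C_l | N^(m)) / Pr(C_l | O) for every class present in N^(m)."""
    freq_n = Counter(neighbor_labels)
    freq_o = Counter(all_labels)
    return {l: (freq_n[l] / len(neighbor_labels)) / (freq_o[l] / len(all_labels))
            for l in freq_n}

# e.g. 48 setosa among 69 neighbors, against a prior of 50/150,
# gives R = (48/69) / (1/3) ≈ 2.08, as in Table 3.
```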
4.1.3 Stopping Condition
The initial value of the radius is ρ_0 = D_max(e, O). As the sequence of radii is decreasing, the sequence of sets of neighbors {N^(m)} is also decreasing. The iterative process stops when the maximum classification score R_l of a class C_l is greater than a threshold r_0 > 1; in practice r_0 = 2, and the class with the best score is proposed to the user as the label of the specimen. Otherwise, a minimum size of the neighbor set (ten percent of the population size, for example) is used as a stopping criterion. We can notice that it is easier for the user to fix the previous thresholds than to give a good number k of nearest neighbors.
4.1.4 Discriminant Attributes Selection
An ordered list of informative variables is computed at each iteration of the classification process. The first element is presented as a question to the user, who can choose an alternative variable from the list or give an unknown answer. The list is built according to several criteria (a sketch of the second one is given after the list):

1. Attribute potentiality. At each step, the method considers only the attributes that can be indicated, i.e., those for which there exists a chain of components (nodes) whose presence is proven.

2. Discriminant power. A choice among different classical criteria used in machine learning to compute the information gain, such as the Shannon entropy or the Gini index.

3. Background knowledge, by using attribute weights in the descriptive model.
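As an illustration of the second criterion, here is a sketch of the Shannon-entropy information gain of a candidate attribute over the current neighborhood; the names are illustrative, and the Gini index would be a drop-in replacement for the entropy function.

```python
import math
from collections import Counter

def entropy(labels) -> float:
    """Shannon entropy of a class distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels) -> float:
    """Expected entropy reduction when splitting on one attribute's values."""
    gain = entropy(labels)
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain
```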
5 ILLUSTRATIVE EXAMPLE
The famous Anderson's Iris data set is used to illustrate the neighborhood classification algorithm. This data set contains 150 plants from 3 species, characterized by 4 attributes. The task is to predict the class of a given specimen; its description e is a slight modification of the description of a plant of the setosa species (see Table 2).
Table 2: Specimen.

  Specimen   petal width   petal length   sepal width   sepal length
  e_1        −             3.1            −             −
  e_2        −             3.1            −             2.19
  e          7             3.1            5.97          2.19
To test the performance of the algorithm, we start by identifying the given specimen using only the petal length value, the other values being considered as "unknown" (specimen e_1 in Table 2). After 16 iterations (Table 3), the algorithm stops because the size of the neighbor set is less than 15 (the threshold has been fixed at ten percent of the population size). The scores of the 3 classes are not satisfactory, as they are lower than 2. The attribute sepal length is proposed to be assigned a value, according to its discriminatory ability. The algorithm is then applied to the resulting specimen e_2. After 46 iterations, the class setosa has the best score, greater than 2, and is proposed as the specimen label. In contrast to decision tree methods, we can notice that the identification process can be performed with attributes that do not necessarily have a good discriminatory ability.
Table 3: Experiments.

  Cluster                 versicolor   virginica   setosa
  Size Cluster (C)        50           50          50
  Pr(C|O) (%)             33.3         33.3        33.3

  Identification of e_1: versicolor? (iteration = 16)
  Size Neighbor (N)       5            3           4
  Pr(C|N) (%)             41.7         25          33.3
  Cluster score R         1.25         0.75        1

  Identification of e_2: setosa (iteration = 46)
  Size Neighbor (N)       0            21          48
  Pr(C|N) (%)             0            30.4        69.6
  Cluster score R         0            0.91        2.08

  Identification of e: setosa (iteration = 66)
  Size Neighbor (N)       0            24          49
  Pr(C|N) (%)             0            32.9        67.1
  Cluster score R         0            0.99        2.01
6 CONCLUSIONS AND PERSPECTIVES
To identify a biological object and associate a taxon with it, systematicians most of the time proceed in two phases. The synthetic phase, by global observation of the most visible characters, reduces the field of investigation. The analytical phase, by precise observation of discriminating attributes, refines the search until the result is obtained. The classification by successive neighborhood from a partial description that we propose has the advantage of corresponding to the reasoning followed by biologists. Starting from a partial description, generally containing the most visible or easiest to observe and describe attributes, the method suggests the relevant information needed to determine the most probable class. Moreover, it is error tolerant, because an erroneous piece of information can nevertheless lead to a satisfactory result, due to the fact that a soft matching using the similarity function is carried out on filled values rather than a strict one.

We expect the method to be generic and applicable to any field where structured or semi-structured data are considered, such as the XML data format, RDF graph structures or OWL ontologies. It suffices to provide a generalization operator and a similarity index adapted to the considered data. This method is currently being evaluated on the knowledge base on corals of the Mascareignes archipelago, which counts approximately 150 taxa and 800 complex descriptions.
REFERENCES
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth, Belmont.

Piatetsky-Shapiro, G. (1991). Discovery, analysis, and presentation of strong rules. In Knowledge Discovery in Databases. AAAI/MIT Press, Cambridge, MA.

Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1:81-106.

Sneath, P. H. A. and Sokal, R. R. (1973). Numerical Taxonomy. W. H. Freeman.

Winston, J. E. (1999). Describing Species: Practical Taxonomic Procedure for Biologists. Columbia University Press, New York.

Zhu, X. and Davidson, I. (2007). Knowledge Discovery and Data Mining: Challenges and Realities with Real World Data. Idea Group Inc.