OUTLIER DETECTION AND VISUALISATION IN HIGH

DIMENSIONAL DATA

Baya Lydia BOUDJELOUD, François POULET

ESIEA Recherche

38, rue des docteurs Calmette et Guérin

Parc Universitaire de Laval-Changé, 53000 Laval

Keywords: Data Mining, outlier detection, data visualisation, genetic algorithm, high dimensional data.

Abstract: The outlier detection problem has important app

lications in the field of fraud detection, network robustness

analysis, and intrusion detection. Such applications have to deal with high dimensional data sets with

hundreds of dimensions. However, in high dimensional space, the data are sparse and the notion of

proximity fails to retain its meaningfulness. Many recent algorithms use heuristics such as genetic

algorithms, the taboo search... in order to palliate these difficulties in high dimensional data. We present in

this paper a new hybrid algorithm for outlier detection in high dimensional data. We evaluate the

performances of the new algorithm on different high dimensional data sets, and visualise its results for some

data sets.

1 INTRODUCTION

The data stored in the world are rapidly growing. This

growth of databases has far outpaced the human ability to

interpret this data creating a need for automated analysis

of databases. Knowledge Discovery in Databases (KDD)

is the non-trivial process of identifying valid, novel,

potentially useful, and ultimately understandable patterns

in data (Fayyad & al. 1996). The KDD process is

interactive and iterative, involving numerous steps. Data

mining is one of the steps of KDD which has attracted a

lot of research. This paper focus on outlier detection, an

important part of data mining. An outlier is a data subset,

an observation or a point that is considerably dissimilar,

distinct, or inconsistent with the remainder of data. It may

seem that outliers are noise and must be identified and

eliminated. The outlier detection has many applications

such as the fraud detection, pharmaceutical research,

financial applications, marketing, etc. Let us consider for

example the problem of detecting credit card fraud. A

major problem that credit card companies face is illegal

use of lost or stolen credit cards. Detecting and preventing

such use is critical since credit card companies assume

liability for unauthorized expenses on lost or stolen cards.

Since the usage pattern for a stolen card is unlikely to be

similar to its usage prior to being stolen, the new usage

points are probably outliers. Detecting these outliers is

clearly an important task. The problem is then to define

this dissimilarity between objects, which is what

characterizes an outlier. Typically, this is estimated by a

function calculating the distance between objects, the

following task consists in determining the objects more

distant from the mass. Some difficulties appear when we

want to know the fact that some attribute (dimensions)

have importance in the detection of some outliers,

particularly in high dimensional data. Indeed, some

objects can be outliers for some dimensions, while these

same elements cannot be considered as outliers for other

dimensions. In the opposite case, some dimensions can be

relevant for an outlier, while these same features can be

irrelevant for another outlier. Many recent algorithms use

the concept of proximity in order to find outliers based on

their relationship to the rest of the data. However, in high

dimensional space, the data are sparse and the notion of

proximity fails to retain its meaningfulness. Indeed, the

sparsity of high dimensional data implies that every point

is an almost equally good outlier from the perspective of

proximity-based definitions. Consequently, for high

dimensional data, the notion of finding meaningful outliers

becomes substantially more complex and non-obvious.

Outlier detection is an integral part of data mining and has

attracted much attention recently; (Barnett & al. 1994,

Knorr & al. 1998, Aggarwal & al. 2001). We can find a

485

Lydia BOUDJELOUD B. and POULET F. (2004).

OUTLIER DETECTION AND VISUALISATION IN HIGH DIMENSIONAL DATA.

In Proceedings of the Sixth International Conference on Enterprise Information Systems, pages 485-488

DOI: 10.5220/0002656004850488

 SciTePress

synthesis of outlier detection in (Rock & al. 1999). We

propose in this paper, a hybrid algorithm for outlier

detection in high dimensional data, implying a genetic

algorithm for the selection of a subset of attributes, and a

distance-based approach for outlier detection in this subset

of attributes.

2 HYBRID ALGORITHM FOR

OUTLIER DETECTION

Genetic algorithms (Holland 1975) are stochastic search

techniques based on the mechanism of natural selection

and reproduction. GA starts with an initial set of random

solutions called population and each individual in the

population is called a chromosome representing a

combination of dimensions. Each chromosome is

composed of a set of elements called genes. At each

algorithm iteration called generation, all newly generated

chromosomes are evaluated by a fitness function to

determine their qualities. Good chromosomes are selected

to produce offspring chromosomes through genetic

operators: crossover and mutation. After several

generations, the algorithm converges to a chromosome

that is very likely to be an optimal or a solution very close

to the optimum. Our genetic algorithm starts with a

population of 30 individuals (chromosomes), every

individual is made of 4 genes (the number of genes can

vary from 2 to 8 or plus, according to the data set). The

individuals are made of a subset of dimensions (attributes)

that describe the data set. We evaluate each chromosome

of the population by an Euclidean distance-based

procedure. For a subset of attributes, we compute the

distance between every element and all the elements of the

data set (Procedure 1), and procedure 2, we determine the

gravity centre of the data set in the subset of attributes, and

then we calculate the distance between every element and

the gravity centre of the data set. These procedures get the

element that is far from the remainder of data, and the

distance that separates this point from the remainder of

data. Once the population is evaluated and sorted, we

operate a crossover on two parents chosen randomly.

Next, one of the children is mutated with a probability of

1/10, and an individual is substitute randomly in the

second part of population, under the median. The genetic

algorithm finishes after a maximum number of iterations,

or after a maximum number of crossovers or of mutations,

without improvement of the solution.

The algorithm complexity depends on the number of times

where we call the two procedures. We have 4 variants of

algorithm. AG_P1_Opt: Genetic algorithm using the

optimized cut-point crossover, and the procedure 1 as

evaluation function. AG_P2_Opt: Genetic algorithm

using the optimized cut-point crossover, and the procedure

2 as evaluation function. AG_P1_Ran: Genetic algorithm

using the random cut-point crossover, and the procedure 1

as evaluation function. AG_P2_Ran: Genetic algorithm

using the random cut-point, and the procedure 2 as

evaluation function.

3 TESTS AND VISUALISING

RESULTS

In order to test our algorithm, we compare the results of

the different versions of our algorithm. They were

implemented in C/C++ on a PC pentium IV, 1,7 GHz,

Linux. We tested the outlier detection on the following

data sets: Shuttle, Picture Segmentation, (from the UCI

machine learning repository) and Lung Cancer (from

Jinyan & al. 2002). These data sets are described in table

1. First, we evaluate these algorithms for D = 4, size of the

attributes subset (dimensions). The results are shown in

table 2.

Data sets classes Attributes data points

Shuttle 7 9 43500

Segmentation 7 19 2310

Lung Cancer 2 12533 32

Table 1: data sets description

data set Shuttle Segmen-

tation

Lung

Outlier - 1683

_P1

Opt

Attri-

butes

subset

- 4-6-8-17

5936-3834

-8431-350

Outlier 26711 1683 10 AG

_P2

Opt

Attri-

butes

subset

0-2-4-8 5-6-7-8 2823-

11953

-5936-

12232

Outlier - 1683 10 AG

_P1

Ran

Attri-

butes

subset

- 4-5-6-8 2330-890-

5936-7230

Outlier 26711 1683 10 AG

_P2

Ran

Attri-

butes

subset

0-2-4-8 5-6-7-8 7408-202-

5936-

12307

Table 2: Results of the algorithms with D=4

There is no result for AG_P1 with Shuttle data set because

the number of data points is too large. Indeed, it is

necessary to calculate and store in memory the distances

between an individual and each of the others, for all the

individuals (objects). This requires a memory size larger

than the available one (even with virtual memory i.e. hard

disk swap file). In the case of the procedure P2, this

ICEIS 2004 - ARTIFICIAL INTELLIGENCE AND DECISION SUPPORT SYSTEMS

486

calculation is carried out once: all individuals are

compared with the gravity centre. We note that AG_P2

(randomly or optimised) detect the outlier point in the

same attribute subset, (0, 2, 4 and 8) for the Shuttle data

set and (5, 6, 7 and 8) for the Segmentation data set. For

the Lung cancer data set, we found the same outlier set,

but only one attribute (5936) is found in all cases. The

number of attributes of the Lung cancer data set (12533

attributes) can explain this result. The four algorithms

found the same outlier point in the same attributes subset

for all data sets (except lung, we explained why). Then we

tried to evaluate the importance of the attribute subset size

(D). The results are shown in table3 (Shuttle), table 4

(Segmentation) and table 5 (Lung).

Shuttle D=1 D=2 D=3 D=4 D=9

Attri-

butes

subset

4 2-4 0-2-4 0-2-

4-8

0-…-8

Outlier 26711 26711 26711 2671

26711

Segmen

tation

D=1 D=2 D=3 D=4 D=6

Attri-

butes

subset

8 6-8 6-7-

5-6-

7-8

2-5-

4-6-

7-8

0-…-

Outlier 1683 1683 1683 1683 1683 1683

Lung D=1 D=2 D=4 D=6 D=

6000

12533

Attri-

butes

sub

set

5936 5936

6038

5103-

3476-

5936-

2329

11472

3613-

3086-

10507

5936-

1430

5936

-….-

1730

0-…-

12532

Out-

lier

10 10 10 10 10 10

For all the data sets we can see the results are the same

whatever the subset dimension is. This is only this

particular value of one attribute that makes the point

significantly different from the other ones. There is no

need to compute the distance with all the attributes.

Then we visualise these results using both parallel-

coordinates (Inselberg, 1985) and 2D (Fig.3a and 3b) or

3D (Fig. 1) scatter-plot matrices (Becker, 1987), to try to

explain why these points are different from the other ones.

The 2D scatter-plot matrices are the 2D projections of the

data points according to all possible pairs of attributes and

the 3D scatter plot matrix is a 3D projection of the n-

dimensional data points.

These kinds of visualisation tools allow the user to see

how far from the others is the outlier. For example in

figure 3, we can easily see the outlier has extreme values

along three attributes (the left ones, the first two being

minimum values and the last one being a maximum

value).

4 CONCLUSION AND FUTURE

WORK

In this paper, we have presented a hybrid algorithm for

outlier detection, which is especially suited for high

dimensional data sets. Conventional approaches compute

the distance with the all the attributes and so are unable to

deal with large number of attributes (because of the

computational cost). Here, we have only to find the best

significant attribute subset to efficiently detect the outliers.

The main idea is to combine attributes in a reduced subset

and find the combination where we can detect the best

outlier point, the point that is farthest from the other ones

in the whole data set. Some numerical tests have shown

that the new algorithm is able to significantly reduce the

research space in term of dimensions without any loss of

quality result. Then we visualise the obtained results with

scatter-plot matrices and parallel coordinates to try to

explain the results and show the attributes relevant for

making a point an outlier. These visualisation tools show

the outlier point is farthest from the other ones. A first

forthcoming improvement will be to try to qualify the

outlier: is it an error or only a point significantly different

from the other ones? We also will try to extend this

algorithm for the clustering task in high dimensional data.

We think that it must be possible to find good clusters in

reduced dimensional data set. Another subject will be to

try to find a low cost evaluation function, like a function

evaluating the combination of attribute subset to improve

the execution time.

Table 3: AG_P2_Opt Results (Shuttle data set)

Table 4: AG_P2_Opt Results (Segmentation data set)

Table 5: AG_P2_Opt Results (Lung Cancer data set)

REFERENCES

Aggarwal C.C., Yu P.S., 2001. Outlier detection for high

dimensional data, ACM Press New York, NY, USA,

Periodical-Issue-Article, pp 37 - 46.

Barnett V., Lewis T., 1994. Outliers in statistical data,

John Wiley.

Becker, R., Cleveland, W. and Wilks, A. 1987 "Dynamic

graphics fordata analysis," Statistical Science, 2, pp

355-395.

Fayyad U. , Piatetsky-Shapiro G., Smyth P. , 1996. From

Data Mining to Knowledge Discovery in Databases, AI

Magazine Vol. 17, No. 3, pp 37-54.

OUTLIER DETECTION AND VISUALISATION IN HIGH DIMENSIONAL DATA

487

Holland J., 1975. Adaptation in Natural and Artificial

Systems. University of Michigan Press, Ann Arbor,.

Inselberg, A., 1985: The Plane with Parallel Coordinates,

Special Issue on Computational Geometry, The Visual

Computer, Vol.1, pp.69-97.

Jinyan L., Huiqing L., 2002. Kent ridge bio-medical dat

set repository, http://sdmc.-lit.org.sg/GEDatasets.

Knorr E., Ng R., 1998. Algorithms for mining distance-

based outliers in large data sets. VLDB Conference

Proceedings, September.

Rocke D. M. and Woodruff D. L., 1999. A Synthesis of

Outlier Detection and Cluster Identification, Working

Paper, University of California.

Figure 3a: Visualisation AG_P2_Opt Results

for Lung Cancer data set

Figure 1: Visualisation AG_P2_Opt Results

for Segmentation data set

Figure 3b: Visualisation AG_P2_Opt Result for

Lung Cancer data set

Figure 2: Visualisation AG_P2_Opt Results

for Shuttle data set

ICEIS 2004 - ARTIFICIAL INTELLIGENCE AND DECISION SUPPORT SYSTEMS

488