POP: A Parallel Optimized Preparation of Data for Data Mining
Christian Ernst¹, Youssef Hmamouche² and Alain Casali²
¹ Ecole des Mines de St Etienne and LIMOS, CNRS UMR 6158, Gardanne, France
² Laboratoire d'Informatique Fondamentale de Marseille, CNRS UMR 7279, Aix Marseille Université, Marseille, France
Keywords:
Data Mining, Data Preparation, Outliers, Discretization Methods, Parallelism and Multicore Encoding.
Abstract:
In light of the fact that data preparation has a substantial impact on data mining results, we provide an original
framework for automatically preparing the data of any given database. For each attribute of the database, our
research focuses on two points: (i) specifying an optimized outlier detection method, and (ii) identifying the
most appropriate discretization method. Concerning the former, we show that the detection of an outlier
depends on whether the data distribution is normal or not. Concerning the latter, what matters when discerning
the best discretization method is the shape followed by the density function of the attribute's distribution law.
For this reason, we propose an automatic choice of the optimized discretization method based on a multi-criteria
(Entropy, Variance, Stability) evaluation. Processing is performed in parallel using multicore capabilities.
The experiments conducted validate our approach, showing that the best discretization method is not always
the same one.
1 INTRODUCTION AND MOTIVATION
Parallel architectures have become a standard in modern computing. Historically,
processors were designed to provide parallel instructions, then operating systems
were built to take advantage of them, in particular in the task (and thread)
manager. As a result of these advances, applications can be distributed over
several cores. Consequently, multicore applications run faster, since they require
less processor time to execute. However, they may need more memory, since each
thread requires its own amount of memory.
On the other hand, many methods exist to prepare
data (Pyle, 1999), even if data preparation is little developed
in the literature: the emphasis is most often put on the mining step alone.
It is obvious that raw input data must be prepared in any KDD (Knowledge
Discovery in Databases) system before the mining step.
This is for two main reasons: (i) if each value of each
column is considered as a single item, there will be a
combinatorial explosion of the search space, and thus
very large response times and/or few values returned,
and (ii) we cannot expect this task to be performed by
an expert, because manual cleaning is time consuming
and subject to many errors. The data preparation step
is generally divided into:
a) Preprocessing: Which consists in reducing the
data structure by eliminating columns and rows of
low significance (Stepankova et al., 2003). Out-
liers are removed at this step. In addition, we can
perform an elimination of concentrated data by re-
moving columns having a small standard devia-
tion or containing too few distinct values;
b) Transformation: Discretization replaces continuous values with intervals of
values (also called bins, clusters, classes, etc.), which represent knowledge
more concisely, in a way that is easier to use and more comprehensible than
continuous values. Many discretization algorithms (see Section 4.1) have been
proposed over the years to achieve this goal.
But data preparation work often focuses on a single
parameter (discretization method, outlier detection,
null value management, etc.). The associated proposals
only highlight their advantages compared to others.
There is no global or automatic approach taking advantage of all of them.
However, the better the data are prepared, the better
the results are. Previously, in (Ernst and Casali, 2011), we
proposed a simple but efficient approach for preparing
input data in order to transform them into a set
of intervals, to which we apply specific mining algorithms
to detect: Correlation Rules (Casali and Ernst,
2013), and Association Rules (Agrawal et al., 1996).
The reasons for which we have decided to reconsider
our previous work are: (i) To improve the detection
of outliers with regard to the data distribution (nor-
mal or not), (ii) To propose an automatic choice of
the best discretization method, and (iii) To parallelize
these jobs.
Globally, the objective of the presented work is
to automatically determine the values of most of the
data preparation variables, with a focus on outlier
and discretization management. In terms of implementation,
the corresponding "tasks" are performed for
each attribute in parallel. Finally, we carry out
experiments to validate the approach.
The paper is organized as follows: Section 2
presents the "current" aspects of multicore programming
that we use in our work. Sections 3 and 4 are respectively
dedicated to outlier detection and to discretization methods.
Each of these sections is composed of two
parts: (i) related work, and (ii) our approach for improving it.
In Section 5, we show the results of our
first experiments. The last section summarizes
our contribution and outlines some research perspectives.
2 NEW FEATURES IN
MULTICORE ENCODING
Multicore processing is not a new concept; however,
the technology only became mainstream in the mid 2000s
with Intel and AMD. Since then, novel software environments
able to take advantage of the different existing processors
simultaneously have been designed (Cilk++, OpenMP,
TBB, etc.). They are based on the fact that loops
are the key area where splitting work across all
available hardware resources increases application performance.
We focus hereafter on the relevant versions of the
Microsoft .NET framework for C++ proposed since
2010. These enhance support for parallel programming
through several utilities, among which the Task Parallel
Library. This component entirely hides the multithreading
activity on the cores: the job of spawning
and terminating threads, as well as scaling the number
of threads according to the number of available cores,
is done by the library itself.
The Parallel Patterns Library (PPL) is the
corresponding tool available in the Visual C++
environment. The PPL operates on small units of
work called Tasks, each of which is defined by a λ-calculus
expression (see below). The PPL defines
three kinds of facilities for parallel processing, of which
only the templates for parallel algorithms
are of interest for this presentation.
Among the algorithms defined as templates for
initiating parallel execution on multiple cores, we focus
on the parallel_invoke algorithm used in the presented
work (see Sections 3.2 and 4). It executes a
set of two or more independent Tasks in parallel. Another
novelty introduced by the PPL is the use of λ expressions,
now included in the C++11 language standard:
these remove all need for scaffolding code, allowing
a "function" to be defined in-line in another statement,
as in the example provided by Listing 1. The element
in the square brackets is called the capture specification:
it tells the compiler that a λ function
is being created and that each listed local variable is being
captured by reference. The final part is the function
body.
// Returns the result of adding a value to itself
template <typename T> T twice(const T& t) {
    return t + t;
}

int n = 54; double d = 5.6; string s = "Hello";
// Call the function on each value concurrently
parallel_invoke(
    [&n] { n = twice(n); },
    [&d] { d = twice(d); },
    [&s] { s = twice(s); }
);

Listing 1: Parallel execution of 3 simple tasks.
Listing 1 also shows the limits of parallelism. It is
widely agreed that applications that may benefit from
using more than one processor necessitate: (i) Oper-
ations that require a substantial amount of processor
time, measured in seconds rather than milliseconds,
and (ii), Operations that can be divided into signifi-
cant units of calculation which can be executed inde-
pendently of one another. So the chosen example does
not fit parallelization, but is used to illustrate the new
features introduced by multicore programming tech-
niques.
More details about parallel algorithms and the λ
calculus can be found in (Casali and Ernst, 2013).
3 DETECTING OUTLIERS
An outlier is an atypical or erroneous value, corresponding
for example to a false measurement or an erroneous input.
Outlier detection is an unsupervised problem: outliers are
values that deviate too greatly in comparison with the other
data. In other words,
they are associated with a significant deviation from
the other observations (Aggarwal and Yu, 2001). In
this section, we present some outlier detection methods related
to our approach, using only uni-variate data as input.
The following notations are used to describe outliers:
X is a numeric attribute of a database relation, sorted in
increasing order. x is an arbitrary value, X_i is the i-th
value, N the size of X, σ its standard deviation, µ its mean,
and s a central tendency parameter (variance, inter-quartile
range, ...). X_1 and X_N are respectively the minimum and the
maximum values of X. p is a probability, and k a parameter
specified by the user, or computed by the system.
3.1 Related Work
We discuss hereafter four of the main uni-variate
outlier detection methods.
Elimination After Standardizing the Distribution:
This is the most conventional cleaning method (Aggarwal and Yu, 2001).
It consists in taking into account σ and µ to determine the limits beyond which
aberrant values are eliminated. For an arbitrary distribution,
the Bienaymé-Tchebyshev inequality states that the probability
that the absolute deviation between a variable and its average
exceeds p times the standard deviation is less than or equal to 1/p²:

$$P\left(\frac{|x - \mu|}{\sigma} \geq p\right) \leq \frac{1}{p^2} \quad (1)$$

The idea is that we can set a threshold probability, as a
function of σ and µ, above which we accept values as
non-outliers. For example, with p = 4.47, the risk of
wrongly considering a value x satisfying |x − µ|/σ ≥ p as an outlier
is bounded by 0.05.
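As an illustration only (this is not our actual implementation), the following sketch shows how such a rule could be applied to a column; the function name and the use of a plain vector are assumptions made for the example:

#include <cmath>
#include <vector>

// Hypothetical helper: flags values x with |x - mu| / sigma >= p as outliers.
// Returns a boolean mask aligned with the input column.
std::vector<bool> chebyshev_outliers(const std::vector<double>& column, double p) {
    double mu = 0.0;
    for (double x : column) mu += x;
    mu /= column.size();

    double var = 0.0;
    for (double x : column) var += (x - mu) * (x - mu);
    double sigma = std::sqrt(var / column.size());

    std::vector<bool> is_outlier(column.size(), false);
    for (std::size_t i = 0; i < column.size(); ++i)
        is_outlier[i] = sigma > 0.0 && std::fabs(column[i] - mu) / sigma >= p;
    return is_outlier;
}

With p = 4.47, at most 5% of the values of any distribution can be flagged in this way.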
Algebraic Method: This method, presented in
(Grun-Rehomme et al., 2010), uses the relative distance
of a point to the "center" of the distribution, defined by
d_i = |X_i − µ| / σ. Outliers are detected outside of
the interval [µ − k × Q_1, µ + k × Q_3], where k is generally
fixed to 1.5, 2 or 3, and Q_1 and Q_3 are the first and
the third quartiles respectively.
Box Plot: This method, attributed to Tukey (Tukey,
1976), is based on the difference between the quartiles
Q_1 and Q_3. It distinguishes two categories of extreme
values, determined outside the lower bound (LB) and
the upper bound (UB):

$$LB = Q_1 - k \times (Q_3 - Q_1), \qquad UB = Q_3 + k \times (Q_3 - Q_1) \quad (2)$$
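Again for illustration purposes, a possible sketch of Equation (2) is given below; the quantile estimator (linear interpolation on sorted data) and the function names are our own assumptions:

#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical helper: empirical quantile by linear interpolation on sorted data.
double quantile(std::vector<double> v, double q) {
    std::sort(v.begin(), v.end());
    double pos = q * (v.size() - 1);
    std::size_t lo = static_cast<std::size_t>(pos);
    std::size_t hi = std::min(lo + 1, v.size() - 1);
    return v[lo] + (pos - lo) * (v[hi] - v[lo]);
}

// Tukey's fences: returns {LB, UB} for a given k (typically 1.5, 2 or 3).
std::pair<double, double> tukey_fences(const std::vector<double>& column, double k) {
    double q1 = quantile(column, 0.25);
    double q3 = quantile(column, 0.75);
    double iqr = q3 - q1;
    return {q1 - k * iqr, q3 + k * iqr};
}

Calling tukey_fences(column, 3) reproduces the configuration retained in Section 3.2.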
Grubbs' Test: Grubbs' method, presented in
(Grubbs, 1969), is a statistical test for abnormally low or high
data. It uses the difference between the
average and the extreme values of the sample. The
test is based on the assumption that the data have
a normal distribution. The statistic used is:

$$T = \max\left(\frac{X_N - \mu}{\sigma}, \frac{\mu - X_1}{\sigma}\right)$$

The hypothesis that the tested value (X_1 or X_N) is not
an outlier is rejected at significance level α if:

$$T > \frac{N-1}{\sqrt{N}} \sqrt{\frac{\beta^2}{N - 2 + \beta^2}} \quad (3)$$

where β = t_{α/(2N), N−2} is the quantile of order α/(2N)
of the Student distribution with N − 2 degrees of freedom.
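A rough sketch of the corresponding check follows; since the standard C++ library provides no Student quantile, the critical value of Equation (3) is assumed to be precomputed and passed in, and the function name is our own:

#include <algorithm>
#include <cmath>
#include <vector>

// Grubbs statistic T = max((X_N - mu) / sigma, (mu - X_1) / sigma).
// 'critical' is the right-hand side of Equation (3), precomputed elsewhere.
bool grubbs_flags_extreme(const std::vector<double>& column, double critical) {
    double mu = 0.0;
    for (double x : column) mu += x;
    mu /= column.size();

    double var = 0.0;
    for (double x : column) var += (x - mu) * (x - mu);
    double sigma = std::sqrt(var / column.size());

    auto mm = std::minmax_element(column.begin(), column.end());
    double t = std::max((*mm.second - mu) / sigma, (mu - *mm.first) / sigma);
    return t > critical;  // true: the most extreme value is considered an outlier
}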
3.2 An Original Method for Outlier
Detection
Most of the existing outlier detection methods assume
that the distribution is normal. However, we found
that in reality many samples have asymmetric and
multimodal distributions, and the use of these methods
can have a significant influence on the data mining
step. In such a case, we must process each "distribution"
using the appropriate method. The considered
approach consists in eliminating outliers in each column
based on the normality of the data, in order to minimize
the risk of eliminating normal values.
Many tests have been proposed in the literature to
evaluate the normality of a distribution:
Kolmogorov-Smirnov (Lilliefors, 1967), Shapiro-Wilks,
Anderson-Darling, Jarque-Bera (Jarque and Bera, 1980), etc.
Although the first gives the best results whatever the
distribution of the analyzed data may be, it is nevertheless
much more time consuming to compute than the others.
This is why we chose the Jarque-Bera test (noted JB hereafter),
which is much simpler to implement than the others, as shown below:
$$JB = \frac{n}{6}\left(\gamma_3^2 + \frac{\gamma_2^2}{4}\right) \quad (4)$$

This test follows a χ² law with two degrees of
freedom, and uses the Skewness (γ_3) and Kurtosis (γ_2)
statistics, defined as follows:

$$\gamma_3 = E\left[\left(\frac{x - \mu}{\sigma}\right)^3\right] \quad (5)$$

$$\gamma_2 = E\left[\left(\frac{x - \mu}{\sigma}\right)^4\right] - 3 \quad (6)$$
If the JB test is not significant (the variable is normally
distributed), then Grubbs' test is used, with the
significance level systematically set to 5%; otherwise the
Box plot method is used, with parameter k automatically set to 3.
Figure 1 summarizes the process we chose to
detect and eliminate outliers.
Figure 1: The outlier detection process.
Finally, the computation of γ_3 and γ_2 to evaluate
the value of JB, as well as the calculation of the Grubbs and Box
Plot statistics, are performed in parallel in the
manner shown in Listing 1 (cf. Section 2), in order
to speed up response times. Other statistics used
in the next section are collected simultaneously here.
Because the corresponding algorithm is very simple
(the computation of each statistic is considered as a
single task), we do not present it.
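For illustration only, a minimal sketch of what such a parallel collection of statistics could look like is given below, using the parallel_invoke template of Section 2; the structure and function names are assumptions, not our actual code:

#include <cmath>
#include <ppl.h>      // concurrency::parallel_invoke (Visual C++ PPL)
#include <vector>

struct ColumnStats {   // hypothetical container for the per-column statistics
    double mean = 0, sigma = 0, skewness = 0, kurtosis = 0, jarque_bera = 0;
};

ColumnStats collect_stats(const std::vector<double>& col) {
    ColumnStats s;
    double n = static_cast<double>(col.size());
    // Mean and standard deviation are needed by the other statistics,
    // so they are computed first, sequentially.
    for (double x : col) s.mean += x;
    s.mean /= n;
    double m2 = 0;
    for (double x : col) m2 += (x - s.mean) * (x - s.mean);
    s.sigma = std::sqrt(m2 / n);
    // The two remaining independent statistics are computed as parallel tasks.
    concurrency::parallel_invoke(
        [&] {                       // Skewness, Equation (5)
            double m3 = 0;
            for (double x : col) m3 += std::pow((x - s.mean) / s.sigma, 3);
            s.skewness = m3 / n;
        },
        [&] {                       // Excess kurtosis, Equation (6)
            double m4 = 0;
            for (double x : col) m4 += std::pow((x - s.mean) / s.sigma, 4);
            s.kurtosis = m4 / n - 3.0;
        }
    );
    // Jarque-Bera statistic, Equation (4), combines the two results.
    s.jarque_bera = n / 6.0 * (s.skewness * s.skewness + s.kurtosis * s.kurtosis / 4.0);
    return s;
}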
4 DISCRETIZATION METHODS
Discretization methods, like outlier management
methods, apply to columns of numerical values.
However, in a previous work, we also integrated
other types of column values, such as strings, by
performing a kind of translation of such values (based
on their frequency) into numerical ones (Ernst and
Casali, 2011). This is why our approach, a priori
dedicated to numerical values, can easily be extended
to any given database.
The discretization of an attribute consists in finding
NbBins disjoint intervals which will further represent
it in a consistent way. The final objective of discretization
methods is to ensure that the mining part
of the KDD process generates efficient results. In our
approach, we use only direct discretization methods,
in which NbBins must be known in advance and represents
the upper limit for every column of the input
data. NbBins was a parameter fixed by the end-user
in the previously mentioned work. As an alternative,
the literature proposes several formulas (Rooks-Carruthers,
Huntsberger, Scott, etc.) for computing
such a number. We use the Huntsberger formula,
the best from a theoretical point of view (Cauvin et al., 2008),
given by NbBins = 1 + 3.3 × log_10(N).
We apply the formula to the non-null values of each
column.
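As an example, a one-line helper for this rule might look as follows (the truncation of the result, which matches the 4 bins of Example 1 below, is our interpretation):

#include <cmath>
#include <cstddef>

// Huntsberger's rule: NbBins = 1 + 3.3 * log10(N),
// computed over the N non-null values of a column.
int huntsberger_bins(std::size_t n_non_null) {
    return static_cast<int>(1.0 + 3.3 * std::log10(static_cast<double>(n_non_null)));
}

With the 12 values of sample SX used in Example 1, 1 + 3.3 × log10(12) ≈ 4.56, truncated to 4 bins.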
4.1 Related Work
In this section, we only discuss the final discretization
methods that have been kept for this work. This
is because other implemented methods have not
revealed themselves to be as efficient as expected
(such as Embedded Means Discretization for ex-
ample), or are otherwise not a worthy alternative to
the presented ones (Quantiles based Discretization).
The methods we use are: Equal Width Discretization
(EWD), Equal Frequency Fisher-Jenks Discretization
(EFD-Jenks), AVerage and STandard deviation
based discretization (AVST), and K-Means. These
methods, which are unsupervised and static (Mitov
et al., 2009), have been widely discussed in the
literature: See for example (Cauvin et al., 2008) for
EWD and AVST, (Jenks, 1967) for EFD-Jenks, or
(Kanungo et al., 2002), (Arthur et al., 2011) and
(Jain, 2010) for KMEANS. For these reasons, we
only summarize their main characteristics and their
field of applicability in Table 1.
Let us underline that the computed NbBins value
is in fact an upper limit, which is not always reached, depending
on the applied discretization method. Thus, EFD-Jenks
and KMEANS most of the time generate fewer
than NbBins bins. This implies that other methods,
which determine the number of bins differently (for example
through iteration steps), may apply if this number
can be upper bounded.
Example 1. Let us consider the numeric attribute
representing the weight of several persons, SX =
{59.04, 60.13, 60.93, 61.81, 62.42, 64.26, 70.34, 72.89,
74.42, 79.40, 80.46, 81.37}. SX contains 12 values,
so by applying the Huntsberger formula, if we aim
to discretize this set, we have to use 4 bins.
Table 2 shows the bins obtained by applying all
the discretization methods proposed in Table 1. Table
3 shows the number of values of SX belonging to each
bin associated with every discretization method.
As can be seen, no two discretization methods produce
the same set of bins. As a consequence, the distribution
of the values of SX differs depending on the method used.
4.2 Discretization Methods and
Statistical Characteristics
As seen in the previous section, when attempting to discern
the best discretization method for a column, the shape of its
distribution is very important. We characterize the shape
of a distribution according to four criteria: (i) Multimodal,
(ii) Symmetric or Antisymmetric, (iii) Uniform, and (iv) Normal.
This is done in order to determine which discretization method(s) may apply.
Table 1: Summary of the discretization methods used.

EWD — Principle: this simple to implement method creates intervals of equal width. Applicability: not applicable for asymmetric or multimodal distributions.

EFD-Jenks — Principle: Jenks' method provides classes with, if possible, the same number of values while minimizing the internal variance of intervals. Applicability: the method is effective from all statistical points of view, but presents algorithmic complexity in the generation of the bins.

AVST — Principle: bins are symmetrically centered on the mean and have a width equal to the standard deviation. Applicability: intended only for normal distributions.

KMEANS — Principle: based on the Euclidean distance, the method determines a partition minimizing the quadratic error between the mean and the points of each interval. Applicability: one disadvantage of this method is its exponential complexity, so the computation time can be long; it is applicable to any form of distribution.
Table 2: Set of bins associated to sample SX.

EWD:       Bin1 = [59.04, 64.62[, Bin2 = [64.62, 70.21[, Bin3 = [70.21, 75.79[, Bin4 = [75.79, 81.37]
EFD-Jenks: Bin1 = [59.04, 60.94], Bin2 = ]60.94, 64.26], Bin3 = ]64.26, 74.42], Bin4 = ]74.42, 81.37]
AVST:      Bin1 = [59.04, 60.53[, Bin2 = [60.53, 68.65[, Bin3 = [68.65, 76.78[, Bin4 = [76.78, 81.37]
KMEANS:    Bin1 = [59.04, 61.37[, Bin2 = [61.37, 67.3[,  Bin3 = [67.3, 77.95[,  Bin4 = [77.95, 81.37]
Table 3: Population of each bin of sample SX.

Method      Bin1  Bin2  Bin3  Bin4
EWD           6     0     3     3
EFD-Jenks     3     3     3     3
AVST          2     4     4     2
KMEANS        3     3     4     2
Some of these tests use statistics introduced in Section 3.2. More
precisely, we perform the following tests, which must be
applied in the presented order:
Multimodal Distributions: We use the Kernel
method presented in (Silverman, 1986) to characterize
multimodal distributions. The method estimates the density
function of the sample by building a continuous function,
and then calculates the number of peaks using its second
derivative; this allows us to approximate the shape of
the distribution automatically. Multimodal distributions are
those which have more than one peak.
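For illustration, a simplified sketch of such a mode count is given below; it counts the local maxima of the estimated density on a regular grid rather than using the second derivative, and the Silverman rule-of-thumb bandwidth mentioned in the comment is an assumption:

#include <algorithm>
#include <cmath>
#include <vector>

// Gaussian kernel density estimate of the sample, evaluated at x.
double kde(const std::vector<double>& sample, double x, double h) {
    const double inv_sqrt_2pi = 0.3989422804014327;
    double sum = 0.0;
    for (double xi : sample) {
        double u = (x - xi) / h;
        sum += std::exp(-0.5 * u * u);
    }
    return sum * inv_sqrt_2pi / (sample.size() * h);
}

// Counts the modes (local maxima) of the estimated density on a regular grid.
// 'h' is the bandwidth, e.g. Silverman's rule of thumb 1.06 * sigma * n^(-1/5).
int count_modes(const std::vector<double>& sample, double h, int grid = 512) {
    double lo = *std::min_element(sample.begin(), sample.end()) - 3 * h;
    double hi = *std::max_element(sample.begin(), sample.end()) + 3 * h;
    std::vector<double> f(grid);
    for (int i = 0; i < grid; ++i)
        f[i] = kde(sample, lo + (hi - lo) * i / (grid - 1), h);
    int modes = 0;
    for (int i = 1; i + 1 < grid; ++i)
        if (f[i] > f[i - 1] && f[i] > f[i + 1]) ++modes;  // local maximum
    return modes;
}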
Symmetric and Antisymmetric Distributions: To
characterize antisymmetric distributions in a next
step, we use the Skewness, noted γ_3 (cf. Equation (5)).
The distribution is symmetric if γ_3 = 0. In practice,
this rule is too strict, so we relaxed it by imposing
limits around 0, in order to set a fairly tolerant rule which
allows us to decide whether a distribution is considered
antisymmetric or not. The associated method is
based on a statistical test; the null hypothesis is that
the distribution is symmetric.
Consider the statistic:

$$T_{Skew} = \frac{N}{6}\,\gamma_3^2$$

Under the null hypothesis, T_Skew follows a χ² law with one
degree of freedom. The distribution is then considered
antisymmetric at significance level α = 5% if T_Skew > 3.8415.
Uniform Distributions: We then use the normalized
Kurtosis, noted γ_2 (cf. Equation (6)), to measure the peakedness
of the distribution, i.e. the grouping of probability densities
around the average, compared with the normal
distribution. When γ_2 is close to
zero, the distribution has a normalized peakedness. A
statistical test is used again to automatically decide
whether the distribution has a normalized peakedness
or not; the null hypothesis is that it has.
Consider the statistic:

$$T_{Kurto} = \frac{N}{6}\cdot\frac{\gamma_2^2}{4}$$

Under the null hypothesis, T_Kurto follows a χ² law with one
degree of freedom. The null hypothesis is rejected at
significance level α = 0.05 if T_Kurto > 6.6349.
Normal Distributions: We use the Jarque-Bera test
(cf. Equation (4)).
These four successive tests allow us to character-
ize the shape of the (density function of the) distribu-
tion of each column. Combined with the main charac-
teristics of the discretization methods presented in the
last section, we get Table 4: This summarizes which
discretization method(s) can be invoked depending on
specific column statistics.
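The decision logic of these four tests can be summarized by the following sketch (ours, with hypothetical names; the JB threshold of 5.991 for α = 5% is our addition, the other thresholds are those given above):

#include <cstddef>

struct Shape {
    bool multimodal, antisymmetric, normalized_peakedness, normal;
};

// Classifies the shape of a column's distribution from its statistics:
// n values, skewness g3, excess kurtosis g2, and number of density peaks.
Shape classify_shape(std::size_t n, double g3, double g2, int n_peaks) {
    Shape s;
    s.multimodal = n_peaks > 1;
    double t_skew  = n / 6.0 * g3 * g3;          // chi^2(1) under symmetry
    double t_kurto = n / 6.0 * (g2 * g2 / 4.0);  // chi^2(1) under normalized peakedness
    s.antisymmetric         = t_skew  > 3.8415;
    s.normalized_peakedness = t_kurto <= 6.6349;
    double jb = n / 6.0 * (g3 * g3 + g2 * g2 / 4.0);
    s.normal = jb <= 5.991;                      // chi^2(2), alpha = 5% (assumed threshold)
    return s;
}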
Example 2. Continuing Example 1, the Kernel Den-
sity Estimation method (Zambom and Dias, 2012) is
used to build the density function of sample SX (cf.
Figure 2).
Figure 2: Density function of sample SX using Kernel Density Estimation.
As we can see, the density function has two modes,
and is almost symmetric and normal. Since the density
function is multimodal, we could stop at this
step: as shown in Table 4, only EFD-Jenks and
KMEANS produce interesting results according to our
proposal. For the needs of this example, let us nevertheless
perform the other tests. Since γ_3 = 0.05, the distribution is
almost symmetric; as mentioned in Section 4.2, whether the
distribution is considered symmetric or not depends on the
fixed threshold. The distribution is
not antisymmetric because T_Skew = 0.005. The distribution
is not uniform since γ_2 = −1.9. As a consequence,
T_Kurto = 1.805, and we have to reject the uniformity
test. The Kolmogorov-Smirnov test results indicate
that the probability that the distribution follows
a normal law is 86.9% with α = 0.05. Here again,
whether the distribution is considered normal or not
depends on the fixed threshold.
4.3 A Multi-criteria Approach for
Finding the Best Discretization
Method
Discretization must preserve the initial statistical characteristics
as well as the homogeneity of the intervals, and
reduce the size of the final data produced. The discretization
objectives are therefore numerous and contradictory. For
this reason, we chose a multi-criteria analysis to evaluate
the available applicable discretization methods. We use three criteria:
1. Entropy (H): The entropy H measures the uniformity
of intervals. The higher the entropy, the more adequate
the discretization is from the viewpoint
of the number of elements in each interval:

$$H = -\sum_{i=1}^{NbBins} p_i \log_2(p_i) \quad (7)$$

where p_i is the number of points of interval i
divided by the total number of points (N), and
NbBins is the number of intervals. The maximum
of H is obtained by discretizing the attribute into
NbBins intervals with the same number of elements;
in this case, H reduces to log_2(NbBins).
2. Index of Variance (J): This index, introduced
in (Lindman, 2012), measures the interclass variance
proportionally to the total variance. The
closer the index is to 1, the more homogeneous
the discretization is:

$$J = 1 - \frac{\text{Intra-intervals variance}}{\text{Total variance}} \quad (8)$$
3. Stability (S): Corresponds to the maximum distance
between the distribution functions before
and after discretization. Let F_1 and F_2 be the attribute
distribution functions respectively before
and after discretization:

$$S = \sup_x\left(|F_1(x) - F_2(x)|\right) \quad (9)$$
The goal is to find solutions that present a compro-
mise between the various performance measures.
The evaluation of these methods should be done
automatically, so we are in the category of a priori
approaches, where the decision-maker intervenes just
before the evaluation process step.
Aggregation methods are among the most widely
used methods in multi-criteria analysis. The principle
is to reduce the problem to a single-criterion one. In this
category, the weighted sum method builds
a unique criterion function by associating a weight
with each criterion (Roy and Vincke, 1981), (Clímaco,
2012) and (Pardalos et al., 2013). This method is
limited by the choice of the weights, and requires
comparable criteria. The method of inequality constraints
maximizes a single criterion by adding
constraints on the values of the other criteria (Zopounidis
and Pardalos, 2010). Its disadvantage is
the choice of the thresholds of the added constraints.
Table 4: Applicability of discretization methods = f(distribution's shape).

Method      Normal  Uniform  Symmetric  Antisymmetric  Multimodal
EWD           *        *         *
EFD-Jenks     *        *         *           *             *
AVST          *
KMEANS        *        *         *           *             *
For these reasons, we chose the method that minimizes
the Euclidean distance to the target point
(H = log_2(NbBins), J = 1, S = 0).
Definition 1. Let D be an arbitrary discretization
method, and V_D a measure of segmentation quality using
the specified multi-criteria analysis:

$$V_D = \sqrt{(H_D - \log_2(NbBins))^2 + (J_D - 1)^2 + S_D^2} \quad (10)$$
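As an illustration (our own sketch, with hypothetical names), Equation (10) translates directly into code once the three criteria have been computed for a candidate method:

#include <cmath>

// Distance of a discretization method to the ideal point
// (H = log2(NbBins), J = 1, S = 0), as in Equation (10).
double quality_distance(double h, double j, double s, int nb_bins) {
    double dh = h - std::log2(nb_bins);
    double dj = j - 1.0;
    return std::sqrt(dh * dh + dj * dj + s * s);
}

The applicable method with the smallest returned value is then retained, as stated in Proposition 1 below.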
The following proposition is the main result of this
article: it indicates how we choose the best method
among all the available and applicable ones.
Proposition 1. Let DM be a set of discretization
methods; the method D ∈ DM that minimizes V_D (see
Equation (10)) is the most appropriate discretization method.
Example 3. Continuing Example 1, Table 5 shows
the evaluation results for all the discretization methods
at our disposal. Let us underline that, for the needs
of our example, all the values are computed for every
discretization method, and not only for the ones which
would have been selected after the step proposed in
Section 4.2 (cf. Table 4).
Table 5: Evaluation of discretization methods.

Method      H      J      S      V_D
EWD         1.5    0.972  0.25   0.313
EFD-Jenks   2      0.985  0.167  0.028
AVST        1.92   0.741  0.167  0.101
KMEANS      1.95   0.972  0.167  0.031
The results show that EFD-Jenks and KMEANS
are the two methods that obtain the lowest values for
V_D. The values obtained by the EWD and AVST methods
are the worst: this is consistent with the optimization
proposed in Table 4, since the sample distribution is
multimodal.
As a result of Table 4 and of Proposition 1, we define
the POP (Parallel Optimized Preparation of data)
method, see Algorithm 1. For each attribute, after
constructing Table 4, each applicable discretization
method is invoked and evaluated, in order to finally
keep the most appropriate one. The contents of these
two tasks (three when including the statistics computations)
are executed in parallel using the parallel_invoke
template (cf. Section 2).
Algorithm 1: POP: Parallel Optimized Preparation of Data.

Input: X, set of numeric values to discretize; DM, set of applicable discretization methods
Output: Best set of bins for X

1  Parallel Invoke: for each method D ∈ DM do
2      Compute γ_2, γ_3 and perform the Jarque-Bera test;
3  end
4  Parallel Invoke: for each method D ∈ DM do
5      Remove D from DM if it does not satisfy the criteria given in Table 4;
6  end
7  Parallel Invoke: for each method D ∈ DM do
8      Discretize X according to D;
9      V_D = √((H_D − log_2(NbBins))² + (J_D − 1)² + S_D²);
10 end
11 D = argmin({V_D, D ∈ DM});
12 return the set of bins obtained at line 8 according to D
5 EXPERIMENTAL ANALYSIS
In this section, we present some experimental results
by evaluating three samples. We decided to imple-
ment POP using the MineCor KDD software when
mining Correlation Rules (Ernst and Casali, 2011),
and using R Project when searching for Association
Rules. Sample 1 is a randomly generated file that contains
heterogeneous values. Sample 2 and Sample 3
correspond to real data, representing measurements
provided by a microelectronics manufacturer (STMicroelectronics)
after completion of the front-end process.
Indeed, the applicative aspect of our
work is to determine, in this domain, which parameters
have the most impact on a specific parameter, the
yield (a posteriori process control). So if the proposed
datasets seem a bit small, they correspond to effective
data on which we actually perform mining.
Table 6: Characteristics of the databases used.

Sample     Number of columns  Number of rows  Type
Sample 1          10                468        generated
Sample 2        1281                296        real
Sample 3           8                727        real
Table 6 summarizes the characteristics of the three samples.
Experiments were performed on a 4-core computer
(a DELL Workstation with a 2.8 GHz processor
and 12 GB RAM running the Windows 7 64-bit OS).
First, let us underline that we shall not focus
in this section on performance issues. Of course,
we have chosen to parallelize the underlying tasks
in order to improve response times. As is easy
to understand, each of the parallel_invoke loops has
a computational time which is close to that of the most
consuming calculus inside the loop. Parallelism
allows us to compute and then to evaluate different
"possibilities" in order to choose the most efficient
one for our purpose, without wasting time compared
to processing a single "possibility".
Moreover, we can easily add other tasks to each
loop (statistics computations, discretization methods,
evaluation criteria), and the last assertion remains true.
Some physical limits exist: no more than seven
tasks can be launched simultaneously within the
2010 C++ Microsoft .NET / PPL environment. And
each individual described task does not require more
than a few seconds to execute, even on the Sample 2
database.
Concerning outlier management, we recall that in
the previous versions of our software, we used the
single standardization method with p set by the user
(Ernst and Casali, 2011). With the new approach
presented in Section 3.2, we notice an improvement
in the detection of true positive or false negative
outliers by a factor of 2%.
We focus hereafter on experiments performed in
order to compare the different available discretization
methods on the three samples. Figures 3(a), 4(a) and
5(a) refer to experiments when mining Association Rules;
Figures 3(b), 4(b) and 5(b) correspond
to experiments when mining Correlation Rules.
When finding Association Rules, the minimum confidence
(MinConf) threshold has been arbitrarily set to
0.5. The different figures provide the number of Association
or Correlation Rules respectively, while the
minimum support (MinSup) threshold varies. Each
figure is composed of five curves: one for each of the
four discretization methods presented in Table 4, and
one for our global method (POP). Each method is individually
applied on each column of the considered
database.
Analyzing the Association Rules detection process,
the experiments show that POP gives the best results
(fewest rules), and EWD the worst. Using
real data, the number of rules is reduced by a factor
of between 5% and 20%. This reduction factor
is even better using synthetic (generated) data and
a low MinSup threshold. When mining Correlation
Rules on synthetic data, the method which gives the
best results with high thresholds is KMEANS, while
it is POP when the support is low. This can be explained
by the fact that the generated data are sparse
and multimodal. When examining the results on real
databases, POP gives good results. However, let us
underline that the EFD-Jenks method produces unexpected
results: either we have few rules (Figures 3(a)
and 3(b)), or we have a lot (Figures 4(a) and 4(b)) with
a low threshold. We suppose that the high number of
used bins is at the root of this result.
6 CONCLUSION AND FUTURE
WORK
In this paper, we presented a new approach for automatic
data preparation: no parameter has to be provided
by the end-user. This step is generally split into
two sub-steps: (i) detecting and eliminating outliers,
and (ii) applying a discretization method in order to
transform any column into a set of bins. We show that
the detection of outliers depends on whether the data
distribution is normal or not. As a consequence, the same
pruning method is not always applied (Box plot vs. Grubbs'
test). Moreover, when trying to find the best discretization
method, what is important is not the law
followed by the column, but the shape followed by
its distribution law. This is why we propose an automatic
choice of the most appropriate discretization
method based on a multi-criteria approach. Experimental
evaluations performed using real and synthetic
data validate our approach, showing that we can
reduce the number of Association and of Correlation
Rules using an adequate discretization method.
Figure 3: Execution on Sample 1. (a) Results for APriori; (b) Results for MineCor.
Figure 4: Execution on Sample 2. (a) Results for APriori; (b) Results for MineCor.
Figure 5: Execution on Sample 3. (a) Results for APriori; (b) Results for MineCor.

For future work, we aim (i) To add other
discretization methods (Khiops, Chimerge, Fayyad-
Irani, etc.) to our system, (ii) To measure the qual-
ity of the obtained rules using classification methods
(based on association rules or decision trees), (iii) To
apply our methodology with other data mining tech-
niques (decision tree, SVM, neural network) and (iv)
To perform more experiments using other databases.
REFERENCES
Aggarwal, C. and Yu, P. (2001). Outlier detection for high
dimensional data. In Mehrotra, S. and Sellis, T. K.,
editors, SIGMOD Conference, pages 37–46. ACM.
Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., and
Verkamo, A. I. (1996). Fast discovery of association
rules. In Advances in Knowledge Discovery and Data
Mining, pages 307–328. AAAI/MIT Press.
Arthur, D., Manthey, B., and Röglin, H. (2011). Smoothed
analysis of the k-means method. Journal of the ACM
(JACM), 58(5):19.
Casali, A. and Ernst, C. (2013). Extracting correlated pat-
terns on multicore architectures. In Availability, Reli-
ability, and Security in Information Systems and HCI
- IFIP WG 8.4, 8.9, TC 5 International Cross-Domain
Conference, CD-ARES 2013, Regensburg, Germany,
September 2-6, 2013. Proceedings, pages 118–133.
Cauvin, C., Escobar, F., and Serradj, A. (2008). Cartographie
thématique. 3. Méthodes quantitatives et transformations
attributaires. Lavoisier.
Clímaco, J. (2012). Multicriteria Analysis: Proceedings
of the XIth International Conference on MCDM, 1–6
August 1994, Coimbra, Portugal. Springer Science &
Business Media.
Ernst, C. and Casali, A. (2011). Data preparation in the
MineCor KDD framework. In IMMM 2011, The First
International Conference on Advances in Information
Mining and Management, pages 16–22.
Grubbs, F. E. (1969). Procedures for detecting outlying ob-
servations in samples. Technometrics, 11(1):1–21.
Grun-Rehomme, M., Vasechko, O., et al. (2010). Méthodes
de détection des unités atypiques: Cas des enquêtes
structurelles ukrainiennes. In 42èmes Journées de
Statistique.
Jain, A. K. (2010). Data clustering: 50 years beyond k-
means. Pattern Recognition Letters, 31(8):651–666.
Jarque, C. M. and Bera, A. K. (1980). Efficient tests for
normality, homoscedasticity and serial independence
of regression residuals. Economics Letters, 6(3):255–
259.
Jenks, G. (1967). The data model concept in statistical map-
ping. In International Yearbook of Cartography, vol-
ume 7, pages 186–190.
Kanungo, T., Mount, D. M., Netanyahu, N. S., Piatko,
C. D., Silverman, R., and Wu, A. Y. (2002). An effi-
cient k-means clustering algorithm: Analysis and im-
plementation. Pattern Analysis and Machine Intelli-
gence, IEEE Transactions on, 24(7):881–892.
Lilliefors, H. W. (1967). On the kolmogorov-smirnov
test for normality with mean and variance un-
known. Journal of the American Statistical Associ-
ation, 62(318):399–402.
Lindman, H. R. (2012). Analysis of variance in experimen-
tal design. Springer Science & Business Media.
Mitov, I., Ivanova, K., Markov, K., Velychko, V., Stanchev,
P., and Vanhoof, K. (2009). Comparison of discretiza-
tion methods for preprocessing data for pyramidal
growing network classification method. New Trends
in Intelligent Technologies, Sofia, pages 31–39.
Pardalos, P. M., Siskos, Y., and Zopounidis, C. (2013). Ad-
vances in multicriteria analysis, volume 5. Springer
Science & Business Media.
Pyle, D. (1999). Data Preparation for Data Mining. Mor-
gan Kaufmann.
Roy, B. and Vincke, P. (1981). Multicriteria analysis: sur-
vey and new directions. European Journal of Opera-
tional Research, 8(3):207–218.
Silverman, B. W. (1986). Density estimation for statistics
and data analysis, volume 26. CRC press.
Stepankova, O., Aubrecht, P., Kouba, Z., and Miksovsky, P.
(2003). Preprocessing for data mining and decision
support. In Publishers, K. A., editor, Data Mining
and Decision Support: Integration and Collaboration,
pages 107–117.
Tukey, J. W. (1976). Exploratory data analysis. 1977. Mas-
sachusetts: Addison-Wesley.
Zambom, A. Z. and Dias, R. (2012). A review of kernel
density estimation with applications to econometrics.
arXiv preprint arXiv:1212.2812.
Zopounidis, C. and Pardalos, P. (2010). Handbook of mul-
ticriteria analysis, volume 103. Springer Science &
Business Media.