EVALUATING PREDICTION STRATEGIES IN AN ENHANCED
META-LEARNING FRAMEWORK
Silviu Cacoveanu, Camelia Vidrighin and Rodica Potolea
Technical University of Cluj-Napoca, Cluj-Napoca, Romania
Keywords: Meta-learning Framework, Data Set Features, Performance Metrics, Prediction Strategies.
Abstract: Finding the best learning strategy for a new domain/problem can prove to be an expensive and time-
consuming process even for experienced analysts. This paper presents several enhancements to a meta-
learning framework we have previously designed and implemented. Its main goal is to automatically
identify the most reliable learning schemes for a particular problem, based on the knowledge acquired about
existing data sets, while minimizing the work done by the user but still offering flexibility. The main
enhancements proposed here refer to the addition of several classifier performance metrics, including two
original metrics, for widening the evaluation criteria, the addition of several new benchmark data sets for
improving the outcome of the neighbor estimation step, and the integration of complex prediction strategies.
Systematic evaluations have been performed to validate the new context of the framework. The analysis of
the results revealed new research perspectives in the meta-learning area.
1 INTRODUCTION
Ever since the beginning of the information age,
companies, organizations, universities all around the
world have been taking advantage of technological
advances to store large amounts of data. The
increase of virtual space made available by new and
cheaper storage devices has encouraged people to
keep records specific to all activities taking place in
their institutions. Advances in database systems have
made searching and organizing the stored data easier
for experts – but the technology for automatically
obtaining valuable information from this data has
only recently started to gain popularity. Data mining
approaches employ methods from different fields,
such as statistics, artificial intelligence and meta-
learning, to induce models from large amounts of
data. These models enable the characterization of
data by summarizing the class under study in general
terms, discovering frequently occurring
subsequences (in transactional datasets), classifying
new data, predicting numeric values, grouping
similar items in a database, analyzing outliers for
fraud detection and analyzing the trends of objects
whose behavior changes over time.
An important step in the data mining process is
selecting the right learning algorithm for the
analyzed data. This initial assessment is time
consuming since one has to decide which of the
learning strategies is most suited given the context.
No definite way of discovering the best learning
algorithm for a new problem has been devised yet,
but many proposals for selecting a good technique
exist in the literature. Selecting a suitable learning
algorithm for a new data set is a complex task even
for an experienced data analyst. Moreover, some
hidden knowledge could be present in data. Such
knowledge can sometimes be surmised by the
domain experts, yet not so often by the data analyst.
Therefore, an initial assessment should always be
performed, in order to identify the most promising
knowledge extraction methodology for the given
problem. The process usually involves training
several models with different learning algorithms
whose parameters settings vary and evaluating their
performance (perhaps under several metrics) with
respect to the requirements of the problem. The
analyst can then choose the learning algorithm and
the settings which register the best performance. The
time required to build a model increases with the
complexity of the model and with the size of the
input data. Running and evaluating all known
learning algorithms is therefore unfeasible.
A more suitable approach involves comparing
the new problem with a set of problems for which
the learning algorithm performance is already
known (Aha,1992) (Bensusan,2000) (Giraud-
Carrier,2000) (Vilalta,2004). The analyst must
identify the problem which resembles the analyzed
data the most. Consequently, the same learning
algorithm and settings that obtained the best results
on the former problem is expected to achieve similar
performance on the new problem. To make use of
this approach, the expert must have access to a large
set of problems that have already been evaluated
with various techniques. Also, the success of the
selected learning algorithm on the new problem
depends on the expert’s strategy for selecting similar
problems.
Creating a framework which brings together all
the tools necessary to analyze new problems and
make predictions related to the learning algorithms’
performance would automate the analyst’s work.
This would result in a significant speed-up and an
increased reliability of the learning algorithm
selection process. Such a framework would prove to
be valuable for an experienced data analyst and
could also help new users discover a good classifier,
regardless of their knowledge of the data mining
domain and the problem context (for users that do
not know if maximizing the accuracy or minimizing
the false-positive rate is preferable in the given
problem context). We have already proposed such a
tool in (Cacoveanu, 2009) and developed an
implementation based on classifiers provided by the
Weka framework (Witten,2005). Our initial focus
was on selecting a wide range of dataset features and
improving the classifier prediction time. We also
wanted to facilitate the addition of new datasets such
that our system continuously improves its
performance.
This paper presents several enhancements to a
meta-learning framework we have previously
designed and implemented. Its main goal is to
automatically identify the most reliable learning
schemes for a particular problem, based on the
knowledge acquired about existing data sets, while
minimizing the work done by the user but still
offering flexibility.
The rest of the paper is organized as follows: In
section 2 we review the meta-learning frameworks
presented in the literature. Section 3 presents a formal
model of our tool, the characteristics we use to
describe datasets and the metrics implemented in the
system. Section 4 describes the prediction strategies
we tested with our tool and the results. We conclude
the paper by restating the improvements added to
our framework and proposals for future
development.
2 RELATED WORK
Aha (Aha,1992) proposes a system that constructs
rules which describe how the performance of
classification algorithms is determined by the
characteristics of the dataset.
Rendell et al. describe in (Rendell,1987) a system
called VBMS. Their system tries to predict which
algorithms will perform better for a given
classification problem using the problem
characteristics (number of examples and number of
attributes). The main disadvantage of VBMS is that
it is trained as new classification tasks are presented
to it, which makes it slow.
The approach applied in the Consultant expert
system (Sleeman et al. (Sleeman,1995)) relies
heavily on a close interaction with the user.
Consultant poses questions to the user and tries to
determine the nature of their problem from the
answers. It does not examine the user’s data.
Schaffer (Schaffer,1993) proposes a brute force
method for selecting the appropriate learner: execute
all available learners for the problem at hand and
estimate their accuracy using cross validation. His
system selects the learner that achieves the highest
score. This method has high computational
resource demands.
STATLOG (Michie,1994) extracts several
characteristics from datasets and uses them together
with the performance of inducers (estimated as the
predictive accuracy) on the datasets to create a meta-
learning problem. It then employs machine learning
techniques to derive rules that map dataset
characteristics to inducer performance. The
limitations of the system include the fact that it
considers a limited number of datasets; it
incorporates a limited set of data characteristics and
uses accuracy as the sole performance measure.
P. B. Brazdil et al. propose in (Brazdil,2003) an
instance-based learner for obtaining an extensible
system which ranks learning algorithms based on a
combination of accuracy and time.
3 A FRAMEWORK FOR
SEARCHING FOR SUITABLE
LEARNERS
This section starts by presenting a formal model for
a classifier prediction framework. The framework is
built over a database containing stored problems
already analyzed and classifiers used in the analysis
process. The IO part of the framework consists of
the problem and requirements provided by the user
and the result, a selection of suitable classifiers,
returned by the framework. The second part of the
section describes the features we used to describe
data mining problems. The last part of this section
presents the metrics we use to evaluate the
performance of a classification model.
Figure 1: System diagram.
3.1 A Formal Model of our Tool
In this section we present a formal model for an
automated learner selection framework (Figure 1).
The basic functionality of a learner selection system
is to evaluate, rank and suggest an accurate learning
algorithm for a new problem submitted to the
system, together with the user’s requirements. The
suggested algorithm can then be used on the
problem to induce a model which achieves the best
performance while meeting the user’s requirements.
The user involvement is limited to providing the
input problem (i.e. a data set which the user needs to
classify) and specifying the requirements. The result
is presented to the user as an ordering of learning
algorithms, arranged in decreasing performance
order. Recommending more than a single learning
algorithm is important as it allows the user to decide
on the specific strategy they want to follow (they
might prefer a faster but less accurate algorithm to a
very accurate one, or an algorithm they are
accustomed to). The system should also minimize
the time it takes to provide the results.
The process of obtaining the predictions is
roughly divided into selecting the similar problems
and obtaining predictions from similar problems. In
order to be able to provide accurate predictions for
new datasets the system relies on a database
containing the problems and the solutions (classifier
+ performance) obtained on those problems. The
system must have the ability to increase its
knowledge by adding new problems and the
corresponding solutions. To do this, the first
important functionality of the system is the ability to
run learning algorithms and evaluate those
algorithms. This occurs mainly in the initialization
phase of the system.
After a significant collection of problems has
been stored in the database, the system is ready for
the prediction phase. In this phase, a new problem is
submitted to the system, along with its requirements.
The system must now find similar problems in its
database. This is done by computing the distance
between the analyzed problem and the stored
problems. Once the distances have been evaluated, a
subset of the nearest stored problems is selected as
neighbors of the analyzed problem.
The results obtained by the learning algorithms
for every neighbor problem are already present in
the database. The performance score of each
classifier is obtained by evaluating the results
obtained by that classifier from the perspective of
the user requirements. The framework then predicts
the performance score for the classifier on the
analyzed problem as a combination of the
performance scores obtained by that classifier on the
neighboring problems. The final list of
recommended learning algorithms is ordered by
their predicted performance scores.
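As an illustration of how these three steps fit together, the following Python sketch outlines the prediction phase; the dataset objects and the helper names (distance_fn, select_neighbors, combine_scores, p.features, p.score, p.classifiers) are hypothetical stand-ins, not the actual implementation, which is built on Weka.

# Illustrative sketch of the prediction phase; every helper name here
# is a hypothetical stand-in for a framework component.
def recommend_classifiers(new_features, stored_problems, metric,
                          distance_fn, select_neighbors, combine_scores):
    # 1. Distance computation: compare the new problem with every stored one.
    distances = {p: distance_fn(new_features, p.features)
                 for p in stored_problems}
    # 2. Neighbor selection: keep a subset of the closest stored problems.
    neighbors = select_neighbors(distances)
    # 3. Voting: combine, for each classifier, the scores it obtained on the
    #    neighbors, evaluated under the user-selected metric.
    classifiers = {c for p in neighbors for c in p.classifiers}
    predictions = {
        c: combine_scores([p.score(c, metric) for p in neighbors],
                          [distances[p] for p in neighbors])
        for c in classifiers
    }
    # 4. Return the learning algorithms ordered by predicted performance.
    return sorted(predictions.items(), key=lambda kv: kv[1], reverse=True)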
After a new prediction is performed, during the
time the system is idle (it does not have to perform
another prediction), it continues with an extension of
the initialization phase. More specifically, it trains
and evaluates models on each new problem added to
the system and saves the new data to the database.
This way, the system’s knowledge increases and its
prediction capabilities improve.
3.2 Problem Characterization – Meta-features
In order to estimate the similarity (i.e. compute the
distance) between problems (i.e. data sets), a series
of meta-features are extracted from the data sets.
The meta-features we employ in this framework can
be divided into four categories. One of these
categories is focused on the type of the attributes in
the data sets. It contains the total number of
attributes of a data set (Michie,1994), the number of
nominal attributes (Kalousis,2002), the number of
boolean attributes (Kalousis,2002) (Michie,1994)
and the number of continuous (numeric) attributes
(Kalousis,2002). Another category is focused on
analyzing the properties of the nominal and binary
attributes of the data sets. This category contains the
maximum number of distinct values for nominal
attributes (Kalousis,2002) (Linder,1999), the
minimum number of distinct values for nominal
attributes (Kalousis,2002) (Linder,1999), the mean
of distinct values for nominal attributes
(Kalousis,2002) (Linder,1999), the standard
deviation of distinct values for nominal attributes
(Kalousis,2002) (Linder,1999) and the mean entropy
of discrete variables (1) (Kalousis,2002).

$$\overline{H} = \frac{1}{d}\sum_{i=1}^{d}\Big(-\sum_{v \in V_i} p_{i,v}\log_2 p_{i,v}\Big) \qquad (1)$$

where d is the number of discrete attributes, V_i the set of values of attribute X_i and p_{i,v} the relative frequency of value v.
Similar to the previous category, the next category
focuses on the properties of the continuous attributes
the data sets have. It includes the mean skewness of
continuous variables (2) (Kalousis,2002)
(Michie,1994), which measures the asymmetry of
the probability distribution, and the mean kurtosis of
continuous variables (3) (Kalousis,2002)
(Michie,1994), which characterizes the peakedness of the
probability distribution.
$$\overline{skew} = \frac{1}{d_c}\sum_{i=1}^{d_c} \frac{E\big[(X_i - \mu_i)^3\big]}{\sigma_i^3} \qquad (2)$$

$$\overline{kurt} = \frac{1}{d_c}\sum_{i=1}^{d_c} \frac{E\big[(X_i - \mu_i)^4\big]}{\sigma_i^4} \qquad (3)$$

where d_c is the number of continuous attributes and μ_i, σ_i are the mean and standard deviation of attribute X_i.
A final category gives a characterization of the
dimensionality of the data set. It contains the overall
size, represented by the number of instances, and
imbalance rate information (relative size)
(Japkowicz,2002). The mean (5) and maximum (4)
imbalance rates of the classes in the dataset are
computed (in case there are only 2 classes, the mean
and maximum imbalance rates are equal).

$$IR_{max} = \max\Big\{\Big|\frac{I_i}{I} - \frac{1}{c}\Big| \;:\; i = 1,\dots,c\Big\} \qquad (4)$$

$$IR_{mean} = \frac{1}{c}\sum_{i=1}^{c}\Big|\frac{I_i}{I} - \frac{1}{c}\Big| \qquad (5)$$

I = number of instances, I_i = number of instances belonging to class i, c = number of classes
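For concreteness, the sketch below shows how these meta-features could be computed from a dataset given as lists of attribute columns. This is a simplified illustration: the column-based representation is an assumption, the boolean-attribute count is omitted, and the imbalance rates follow the reconstruction of equations (4) and (5) above, not the framework's actual data model.

import math
from statistics import mean, pstdev

def entropy(values):
    # Shannon entropy of one discrete attribute (used by equation 1).
    n = len(values)
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def skewness(values):
    # Third standardized moment of one continuous attribute (equation 2).
    m, s = mean(values), pstdev(values)
    return mean((x - m) ** 3 for x in values) / s ** 3 if s else 0.0

def kurtosis(values):
    # Fourth standardized moment of one continuous attribute (equation 3).
    m, s = mean(values), pstdev(values)
    return mean((x - m) ** 4 for x in values) / s ** 4 if s else 0.0

def meta_features(nominal_cols, numeric_cols, class_col):
    # Meta-features of one dataset; imbalance rates are the deviation of
    # each class's relative size from the balanced share 1/c (equations 4-5).
    n, classes = len(class_col), sorted(set(class_col))
    c = len(classes)
    rates = [abs(class_col.count(k) / n - 1 / c) for k in classes]
    distinct = [len(set(col)) for col in nominal_cols]
    return {
        "attributes": len(nominal_cols) + len(numeric_cols),
        "nominal_attributes": len(nominal_cols),
        "numeric_attributes": len(numeric_cols),
        "max_distinct": max(distinct, default=0),
        "min_distinct": min(distinct, default=0),
        "mean_distinct": mean(distinct) if distinct else 0.0,
        "stddev_distinct": pstdev(distinct) if distinct else 0.0,
        "mean_entropy": mean(entropy(col) for col in nominal_cols) if nominal_cols else 0.0,
        "mean_skewness": mean(skewness(col) for col in numeric_cols) if numeric_cols else 0.0,
        "mean_kurtosis": mean(kurtosis(col) for col in numeric_cols) if numeric_cols else 0.0,
        "instances": n,
        "max_imbalance": max(rates),
        "mean_imbalance": mean(rates),
    }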
3.3 Classifier Performance Metrics
The performance score of a classifier depends on the
problem requirements provided by the user. When
designing the system, we have focused on
minimizing its dependence on user input. We still
need to provide the user with a method of guiding
the search for a suitable learning algorithm. For this
purpose, we employ nine metrics, divided into three
categories, as proposed in (Caruana,2004). The
metrics in Table 1, along with a general purpose
metric are described in this section.
Table 1: Performance metrics.

Threshold: Accuracy, Recall, False Positive Rate, True Negative Rate, False Negative Rate
Rank: Area Under ROC, Precision
Probability: Geometric Mean, Generalized Geometric Mean
Most classifier performance metrics are generated
from the confusion matrix produced by the induced
model on a test sample. The confusion matrix is the
most general indicator of the way a model identifies
the right label of instances. An example of a
confusion matrix for c classes is provided in Figure
2. The entry from the i-th row and j-th column, m_{i,j},
represents the number of instances from class i that were
labeled by the model as belonging to class j.

$$M = \begin{pmatrix} m_{1,1} & \cdots & m_{1,c} \\ \vdots & \ddots & \vdots \\ m_{c,1} & \cdots & m_{c,c} \end{pmatrix}$$

Figure 2: Confusion matrix for a dataset with c classes.
While most users will be concerned with the
accuracy of the generated models, in some cases
they might prefer to focus on improving a different
performance criterion (for instance to maximize the
sensitivity or specificity). This difference usually
comes from the different costs associated to specific
errors. As an example, the cost of an incorrect
medical diagnosis is different for false positive and
false negative errors.
Problem requirements are the way users set the
focus of the search for a learning algorithm. Users
provide problem requirements in terms of
performance metrics. When a classifier’s
performance is measured, the resulting score is
computed from the model’s confusion matrix or area
under the ROC curve, based on the user
requirements. Both the confusion matrix of a
classifier and the area under ROC are computed in
the initialization phase, when evaluating the
performance of a trained model (using 10-fold cross
validation).
The Accuracy (6) of a classifier is the percentage
of test instances that are correctly classified by the
model (also referred to as the recognition rate).
Accuracy selects learning algorithms with the
highest rate of correct classification but it is a weak
metric for imbalanced problems. For example, if the
number of negative cases in the test sample is much
larger than the positive cases, even if all positive
cases are misclassified, the accuracy will still be
very high.
The Recall (true positive rate) (7) is the
proportion of positive cases that were correctly
identified. This metric favors models which focus on
identifying the positive cases, even if this leads them
to misclassifying a number of negative cases as
positive. The system also provides as selection criteria
the false positive rate (8), the true negative rate (9)
and the false negative rate (10).

$$ACC = \frac{\sum_{i=1}^{c} m_{i,i}}{\sum_{i=1}^{c}\sum_{j=1}^{c} m_{i,j}} \qquad (6)$$

$$TPR = \frac{m_{p,p}}{\sum_{j=1}^{c} m_{p,j}} \qquad (7)$$

$$FPR = \frac{\sum_{i \in \{1,\dots,c\}\setminus\{p\}} m_{i,p}}{\sum_{i \in \{1,\dots,c\}\setminus\{p\}}\sum_{j=1}^{c} m_{i,j}} \qquad (8)$$

$$TNR = \frac{\sum_{i \in \{1,\dots,c\}\setminus\{p\}}\sum_{j \in \{1,\dots,c\}\setminus\{p\}} m_{i,j}}{\sum_{i \in \{1,\dots,c\}\setminus\{p\}}\sum_{j=1}^{c} m_{i,j}} \qquad (9)$$

$$FNR = \frac{\sum_{j \in \{1,\dots,c\}\setminus\{p\}} m_{p,j}}{\sum_{j=1}^{c} m_{p,j}} \qquad (10)$$

where p denotes the index of the positive class.
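A minimal sketch of the threshold metrics (6)-(10) computed from a c x c confusion matrix, assuming rows hold the true classes, columns the predicted classes, and pos indexes the positive class:

def threshold_metrics(m, pos=0):
    # m[i][j] counts instances of class i predicted as class j;
    # pos is the index of the positive class.
    c = len(m)
    total = sum(sum(row) for row in m)
    pos_total = sum(m[pos])                            # actual positives
    neg_total = sum(sum(m[i]) for i in range(c) if i != pos)
    tp = m[pos][pos]
    fp = sum(m[i][pos] for i in range(c) if i != pos)  # negatives labeled positive
    fn = pos_total - tp                                # positives labeled negative
    tn = neg_total - fp
    return {
        "accuracy": sum(m[i][i] for i in range(c)) / total,  # (6)
        "recall": tp / pos_total,                            # (7)
        "false_positive_rate": fp / neg_total,               # (8)
        "true_negative_rate": tn / neg_total,                # (9)
        "false_negative_rate": fn / pos_total,               # (10)
    }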
The precision is the proportion of the predicted
positive cases that are correctly classified. By
maximizing precision, a model rarely classifies
negative cases as being positive, but may still
misclassify positive cases as negative.
The ROC graph is a representation of the true
positive rate (sensitivity) as a function of the false
positive rate (1-specificity). In case of a perfect
classifier, the area under the ROC curve is 1,
because the false positive rate is 0 while the true
positive rate is 1. This metric is usually employed
when searching for a classifier that maximizes the
number of correctly classified instances.
Besides the fundamental metrics the system
evaluates for a data set, some combined metrics are
available as well. The geometric mean metric (11) is
used to maximize the true positive rate and the true
negative rate at the same time.

$$GM = \sqrt{TPR \cdot TNR} \qquad (11)$$
We also propose the generalized geometric mean – a
generalization of the geometric mean metric for
datasets with more than two classes (12).

$$GGM = \sqrt[c]{\prod_{i=1}^{c} \frac{m_{i,i}}{\sum_{j=1}^{c} m_{i,j}}} \qquad (12)$$
To accommodate users from different areas of
expertise who are interested in discovering a
generally good classifier, we propose a metric which
combines the accuracy, the geometric mean and the area
under the ROC curve (13). This metric is obtained by
computing the average of three other metrics, each
being the best in its category, as observed in
(Caruana,2004).

$$GP = \frac{ACC + GM + AUC}{3} \qquad (13)$$
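The combined metrics can be sketched as follows, reusing the threshold metrics above; the auc value is assumed to be supplied by the model evaluation step (e.g. obtained during 10-fold cross validation):

def geometric_mean(metrics):
    # Equation (11): sqrt(TPR * TNR), using the dictionary returned
    # by threshold_metrics above.
    return (metrics["recall"] * metrics["true_negative_rate"]) ** 0.5

def generalized_geometric_mean(m):
    # Equation (12): c-th root of the product of the per-class recalls
    # of the confusion matrix m.
    c = len(m)
    product = 1.0
    for i in range(c):
        product *= m[i][i] / sum(m[i])
    return product ** (1.0 / c)

def general_purpose_metric(accuracy, gmean, auc):
    # Equation (13): average of accuracy, geometric mean and area under ROC.
    return (accuracy + gmean + auc) / 3.0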
4 EXPERIMENTS ON AN
ENHANCED VERSION OF THE
FRAMEWORK
When first proposing a framework for classifier
selection in (Cacoveanu, 2009) we focused on
selecting a wide range of dataset features and
improving classifier prediction time. In the attempt
to improve the prediction capabilities of our
framework, we automated the voluntary addition of
new data sets. The prediction was performed by
using a KNN classifier which computed the
distances between the analyzed data set and all data
sets stored in the system. It then selected the three
closest data sets as neighbors and estimated the
predicted performance as the mean of the
performances obtained by a classifier on the
neighboring data sets.
As an improvement to our data set classification
system, we considered different strategies for
selecting the problems similar to the one being
analyzed and for determining how the neighboring
problems affect the final performance
predictions. We are also trying to answer the
question of how different strategies behave when
making performance predictions for different
metrics. An ideal solution would be to find a set of
approaches that gives the best results for all metrics,
and always use them. However, if the selection of a
given tactic influences the predictions for different
metrics generated by the framework, we may
consider always using the best strategies for the
metric selected by the user.
We have divided the classifier accuracy
prediction process into three phases: distance
computation, neighbor selection and prediction
computation (or voting). For each of these phases we
propose several strategies (detailed in the next sub-
sections).
4.1 Distance Computation
The distance computation phase consists in
computing the distance between the currently
analyzed dataset and all the datasets stored in the
system database. The distance is computed by using
the data set meta-features (all having numeric
values) as coordinates of the data set. By
representing a data set as a point in a vector space,
the distance can be evaluated using any metric
defined on a vector space (14).

$$D = (f_1, f_2, \dots, f_m), \quad m = \text{number of features} \qquad (14)$$
The first distance computation strategy implemented
is the normalized Euclidean distance (E). The
Euclidean distance is the ordinary distance between
two points in space, as given by the Pythagorean
formula (15). The obtained distances are normalized
to the [0,1] interval.

$$d_E(A, B) = \sqrt{\sum_{k=1}^{m} (a_k - b_k)^2} \qquad (15)$$
Another distance evaluation available in our
framework is the Chebyshev distance (C). The
Chebyshev distance is a metric defined on a vector
space where the distance between two vectors is the
greatest of their differences along any coordinate
dimension. In our system, the distance between two
datasets is the largest difference between their
corresponding features (16). These distances are also
normalized.

$$d_C(A, B) = \max_{k=1,\dots,m} |a_k - b_k| \qquad (16)$$
The advantage of the Chebyshev distance
computation strategy is that it takes less time to
compute the distances between datasets. A possible
problem with the Chebyshev distance is that it allows
a single feature to represent a dataset. The single
largest feature difference might not describe the
dataset well enough to lead to accurate neighbor
selection and final predictions. As a solution we
propose a variation of the Chebyshev distance (C3),
in which the three largest differences between
features are selected and their mean is computed (17).
With this strategy, more of the dataset information is
used in the distance computation, while the
computation remains more efficient than the
Euclidean distance.

$$d_{C3}(A, B) = \frac{1}{3}\sum_{k \in K_3} |a_k - b_k|, \qquad K_3 = \text{indices of the three largest differences} \qquad (17)$$
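A sketch of the three distance strategies over meta-feature vectors; normalization to [0,1] is assumed to happen afterwards (for instance by dividing by the largest distance stored in the database, a detail not specified here):

def euclidean(a, b):
    # Equation (15): ordinary Euclidean distance between meta-feature vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def chebyshev(a, b):
    # Equation (16): largest absolute difference along any single feature.
    return max(abs(x - y) for x, y in zip(a, b))

def chebyshev3(a, b):
    # Equation (17): mean of the three largest feature differences (C3).
    diffs = sorted((abs(x - y) for x, y in zip(a, b)), reverse=True)
    return sum(diffs[:3]) / min(3, len(diffs))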
4.2 Neighbor Selection
Neighbor selection decides which datasets will
influence the performance predictions for the
analyzed dataset. While considering as neighbors the
datasets closer to the analyzed dataset is justified,
how many datasets should be considered is an open
problem. If we choose a fixed number of neighbors
we make sure the prediction computation will
always take the same amount of time. We implement
the Top 3 neighbor selection strategy (T3) which
selects the 3 closest datasets as neighbors. Selecting
the closest n neighbors is a sensible decision for a
strategy, but there could be cases in which the
closest n datasets are still quite far. Another
approach would be setting a threshold distance after
which a dataset is not considered a neighbor
anymore. By using a fixed threshold value we risk
situations in which no dataset is considered a
neighbor and we cannot
compute the performance predictions. A solution
would be a variable threshold that depends on
the average distance between every two datasets in
the system at the moment a new dataset arrives, but
this means updating the average distance
every time a new dataset arrives in the system and
maintaining a separate threshold value for each
distance computation strategy. We
chose to implement a strategy similar to the variable
threshold considered above, only this time we
compute the average distance between the analyzed
dataset and all the datasets in the system and select
as neighbors only datasets closer than this average
distance. We call this strategy Above-Mean
neighbor selection (AM). This strategy selects a
variable number of neighbors each time, so the
performance prediction computation time can
increase considerably.
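The two neighbor selection strategies can be sketched as follows, given a mapping from stored datasets to their (already normalized) distances from the analyzed dataset:

def top3(distances):
    # T3: the 3 closest stored datasets become neighbors.
    return sorted(distances, key=distances.get)[:3]

def above_mean(distances):
    # AM: every dataset closer than the average distance to the analyzed
    # dataset becomes a neighbor (a variable number of neighbors).
    threshold = sum(distances.values()) / len(distances)
    return [d for d, dist in distances.items() if dist < threshold]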
4.3 Prediction Computation (Voting)
In this last phase of the prediction process the
classifier performance predictions are generated.
Voting strategies define the way neighboring
datasets influence the final predictions. Each
neighbor casts a performance score to each classifier
in the system. The performance score for a classifier
depends on the performance of its model on the
dataset (the confusion matrix, the area under ROC)
evaluated from the point of view of the metric
selected by the user. These scores are combined to
obtain the final score for each classifier. This final
score is the actual predicted performance for that
classifier. Classifiers are then ordered based on the
predicted performances.
We have implemented two voting strategies in
the system, the first one of them being equal, or
democratic, voting (D). Each neighbor selected in
the previous phase predicts a performance for the
current classifier. We sum all the performances and
divide them by the number of neighbors (18). The
result is the predicted performance of the current
classifier. Each neighbor has the same influence in
deciding the final performance of a classifier.
$$P_c = \frac{1}{n}\sum_{i=1}^{n} p_{i,c} \qquad (18)$$

n = number of neighbors, p_{i,c} = performance obtained on dataset i
with classifier c, P_c = predicted performance for classifier c
The second voting strategy is weighted voting (W).
For this, the distances between the analyzed dataset
and its neighbors act as the weight of the neighbor
vote – a closer dataset will have a higher weight and
more influence on the final prediction (19).
$$P_c = \frac{\sum_{i=1}^{n} \frac{1}{d(A, N_i)}\, p_{i,c}}{\sum_{i=1}^{n} \frac{1}{d(A, N_i)}} \qquad (19)$$

A = analyzed dataset, N_i = neighbor i
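A sketch of the two voting strategies; scores holds the performance each neighbor recorded for the classifier under the selected metric, and dists the corresponding distances (the small epsilon guarding against zero distances is an assumption, as this edge case is not discussed in the text):

def democratic_vote(scores, dists=None):
    # Equation (18): every neighbor has the same influence.
    return sum(scores) / len(scores)

def weighted_vote(scores, dists, eps=1e-9):
    # Equation (19): closer neighbors weigh more (inverse-distance weights);
    # eps avoids division by zero for identical datasets (an assumption).
    weights = [1.0 / (d + eps) for d in dists]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)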
4.4 Experimental Results
This section presents the results of the evaluations
performed with our framework to find the
combination of strategies that works best in
predicting performance scores for the learning
algorithms in the system.
We initialized our system with 26 benchmark
datasets that range from very small to medium sizes
(up to 4000 instances) (UCI, 2010). Also, the
following classifiers are available: BayesNet, J48,
MultilayerPerceptron, AdaBoost, NaiveBayes,
SMO, PART, libSVM.
We have performed evaluations with all the
possible combinations of the implemented strategies
for distance computation, neighbor selection and
performance score prediction (Table 2).
Table 2: Strategy combinations.
Distance computation   Neighbor selection   Voting   Notation
E T3 D E-T3-D
E T3 W E-T3-W
E AM D E-AM-D
E AM W E-AM-W
C T3 D C-T3-D
C T3 W C-T3-W
C AM D C-AM-D
C AM W C-AM-W
C3 T3 D C3-T3-D
C3 T3 W C3-T3-W
C3 AM D C3-AM-D
C3 AM W C3-AM-W
E = Normalized Euclidean distance
C = Normalized Chebyshev distance
C3 = Normalized Top 3 Chebyshev distance
T3 = Top 3 neighbor selection
AM = Above Mean neighbor selection
D = Equal (Democratic) voting
W = Weighted voting
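The twelve notations in Table 2 are simply the cross-product of the three strategy families, which could be enumerated as in the following sketch (illustrative only):

from itertools import product

DISTANCES = ["E", "C", "C3"]   # Euclidean, Chebyshev, Top 3 Chebyshev
NEIGHBORS = ["T3", "AM"]       # Top 3, Above Mean
VOTING    = ["D", "W"]         # Democratic, Weighted

# The 12 strategy combinations evaluated in the experiments.
combinations = ["-".join(c) for c in product(DISTANCES, NEIGHBORS, VOTING)]
print(combinations)            # ['E-T3-D', 'E-T3-W', ..., 'C3-AM-W']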
For a test, we selected a performance metric and
executed the following steps:
1. Select a strategy combination
   a. Select a dataset and use it as the analyzed
      dataset
      i. Use the remaining 25 datasets as the datasets
         stored in the system
      ii. Use the selected strategy combination to
          predict performances
      iii. Compare the predicted performances with
           the actual performances obtained in the
           initialization stage on the selected dataset
   b. Select the next dataset
2. Compute the deviation mean and the absolute
   deviation mean over all datasets and classifiers for
   this strategy combination
3. Select the next strategy combination
We have applied the above procedure for the
following metrics: accuracy, geometric mean,
generalized geometric mean, area under ROC,
general purpose metric.
In total, we ran 312 prediction tests for each
selected metric.
We have computed the deviation between the
predicted and true performance as the difference
between the actual performance and the predicted
performance (20). If the system predicted that a
classifier will obtain a higher performance than it
actually obtains, this value is negative.

$$dev = p_a - p_p, \qquad |dev| = |p_a - p_p| \qquad (20)$$

p_p = performance prediction, p_a = actual performance
The absolute deviation between a performance
prediction and the actual performance is the absolute
value of the difference between the two (20).
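Putting the procedure and equation (20) together, the leave-one-out evaluation of one strategy combination could be sketched as follows; predict and target.score are hypothetical stand-ins for the framework's prediction step and its stored evaluation results:

def evaluate_strategy(datasets, classifiers, metric, predict):
    # Leave-one-out test of one strategy combination: predict each
    # classifier's score on every dataset from the remaining datasets,
    # then compare with the score measured in the initialization phase.
    deviations = []
    for target in datasets:
        stored = [d for d in datasets if d is not target]
        predicted = predict(target, stored, classifiers, metric)
        for clf in classifiers:
            actual = target.score(clf, metric)          # from 10-fold CV
            deviations.append(actual - predicted[clf])  # equation (20)
    n = len(deviations)
    deviation_mean = sum(deviations) / n
    absolute_deviation_mean = sum(abs(d) for d in deviations) / n
    return deviation_mean, absolute_deviation_mean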
Figure 3: Absolute deviation mean.
Our interest is to select the strategies that minimize
the deviation means for all the metrics. By studying
the absolute deviation mean results (Figure 3), we
observe that voting strategies do not influence the
final predictions very much. Weighted voting (W)
obtains better results than democratic voting (D), but
the difference is so small that it does not justify the
additional resources needed to compute the weight
of each neighbor. Moreover, we observe that the
distance computation and neighbor selection
strategies that obtain the smallest absolute deviations
are, in order: (1) Top 3 Chebyshev distance with
Top 3 neighbor selection, (2) Chebyshev distance
with Top 3 neighbor selection and (3) Chebyshev
distance with Above Mean neighbor selection.
By analyzing the deviation mean results table
(Figure 4) we can deduce more details on how each
of the three selected strategy combinations works.
Our first choice, Top 3 Chebyshev distance with Top
3 neighbor selection, has negative deviation means
for all the metrics. From this we deduce that the
strategy combination is overly optimistic – most of
the time it will predict performances that are not met
when we derive the model and evaluate it. We
would prefer that our system slightly underestimates
the performance of a model on a new dataset. We
can observe that the second strategy combination,
Chebyshev distance with Top 3 neighbor selection,
makes optimistic predictions for all metrics except
accuracy. This strategy combination is the best choice
when predicting classifier accuracy, but is not
appropriate for the other metrics in the system.

Figure 4: Deviation mean.
combination, Chebyshev distance with Above Mean
neighbor selection, obtains positive deviation means
for all metrics. This is the preferred behavior for our
system and we can conclude that this is the best
combination of strategies.
We can observe from both Figure 4 and Figure
3 that the deviation mean of the general purpose
metric is close to the average deviation means of the
other metrics. Therefore, we can confirm the
conclusion in (Caruana,2004) that a general-purpose
metric has the best correlation with the other
metrics.
5 CONCLUSIONS AND FUTURE
WORK
This paper describes the architecture of an
automated learner selection framework. We focus on
the enhancements considered and tested for the
system described in (Cacoveanu, 2009). These
enhancements consist in increasing the dataset pool,
adding new performance metrics and meta-features
and improving the prediction accuracy of the
system. The increase in metrics widens the
evaluation criteria and allows a more problem-
specific assessment of classifiers. Two of the newly
added metrics, the generalised geometric mean and
the general purpose metric, both of them
representing original proposals. Moreover, the
general-purpose metric proposed has suggested a
new approach in dealing with data sets inputs with
no associated metrics. Another enhancement was the
addition of new benchmark data sets. The increase in
the data available to the system improves the
outcome of the neighbor estimation step. We also
implemented the context for adding complex
prediction strategies. We implemented and evaluated
12 strategy combinations for computing the final
performance predictions for classifiers. The analysis
of the results suggests as the best strategy the
Chebyshev distance with Above Mean neighbor
selection and Democratic voting. This strategy will
predict performances close to the actual
performances, without surpassing them. The tests
also reveal that the voting strategies do not
significantly influence the final results. Moreover, in
the case of the accuracy metric we can improve the
performance of our system by using Chebyshev
distance with Top 3 neighbor selection and
Democratic voting.
Since the Chebyshev distance computation
strategy obtains the best results in our system, our
present focus is on discovering the most relevant
data set features, by performing feature selection on
the meta-data set or deriving new and more relevant
features (Niculescu-Mizil, 2009). We attempt to
improve the Above Mean neighbor selection
strategy by computing and constantly updating a
mean distance between every two datasets in our
database. Limiting the neighbor selection strategy as
the number of problems in the system increases is
another present concern. We also want to improve
the system by generating a “best-possible” model.
For this we intend to use different classifiers, each
classifier optimized to increase the true positive rate
on its class, thus maximizing the prediction power of
the model.
ACKNOWLEDGEMENTS
Research described in this paper was supported by
the IBM Faculty Award received in 2009 by Rodica
Potolea from the Computer Science Department of
the Faculty of Computer Science, Technical
University of Cluj-Napoca, Romania.
REFERENCES
Aha D. W., 1992, Generalizing from Case Studies: A Case
Study, Proceedings of the Ninth International
Conference on Machine Learning, pp. 1 -10
Bensusan H., Giraud-Carrier C., Kennedy C. J., 2000, A
Higher-Order Approach to Meta-Learning,
proceedings of the ECML-2000 Workshop on Meta-
Learning: Building Automatic Advice Strategies for
Model Selection and Method Combination, pp. 33-42
Brazdil P. B., Soares C., Da Costa J. P., 2003, Ranking
Learning Algorithms: Using IBL and Meta-Learning
on Accuracy and Time Results, Machine Learning 50,
pp. 251-277
Cacoveanu S., Vidrighin C., Potolea R., 2009, Evolutional
Meta-Learning Framework for Automatic Classifier
Selection, proceedings of the 5th International
Conference on Intelligent Computer Communication
and Processing, Cluj-Napoca, pp. 27-30
Caruana R., Niculescu-Mizil A., 2004, Data Mining in
Metric Space: An Empirical Analysis of Supervised
Learning Performance Criteria, Proceedings of the
Tenth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pp. 69-78
Giraud-Carrier C., Bensusan H., 2000, Discovering Task
Neighbourhoods Through Landmark Learning
Performances, Proceedings of the Fourth European
Conference of Principles and Practice of Knowledge
Discovery in Databases, pp. 325-330
Japkowicz N., Stephen S., 2002, The Class Imbalance
Problem: A Systematic Study, Intelligent Data
Analysis, Volume 6, Number 5, pp. 429-450
Kalousis A., 2002, Algorithm Selection via
Meta-Learning, PhD Thesis, Faculté des sciences de
l'Université de Genève
Linder C., Studer R., 1999, Support for Algorithm
Selection with a CBR Approach, Proceedings of the
16th International Conference on Machine Learning,
pp. 418-423
Michie D., Spiegelhalter D. J., Taylor C. C., 1994,
Machine Learning. Neural and Statistical
Classification, Ellis Horwood Series in Artificial
Intelligence
Niculescu-Mizil A., et al, 2009, Winning the KDD Cup
Orange Challenge with Ensemble Selection, JMLR
Workshop and Conference Proceedings 7
Rendell L., Seshu R., Tcheng D., 1987, Layered concept
learning and dynamically variable bias management,
10th International Joint Conf. on AI, pp. 308-314
Schaffer C., 1993, Selecting a classification method by
cross validation, Machine Learning 13, pp. 135-143
Sleeman D., Rissakis M., Craw S., Graner N., Sharma S.,
1995, Consultant-2: Pre and post-processing of
machine learning applications, International Journal
of Human Computer Studies, pp. 43-63
UCI Machine Learning Data Repository,
http://archive.ics.uci.edu/ml/, last accessed Jan. 2010
Vilalta R., Giraud-Carrier C., Brazdil P., Soares C., 2004,
Using Meta-Learning to Support Data Mining,
International Journal of Computer Science &
Applications, pp. 31-45
Witten I. H., Frank E., 2005, Data Mining: Practical
Machine Learning Tools and Techniques, 2nd edition,
Morgan Kaufmann Publishers, Elsevier Inc.