Experiments with Lazy Evaluation of Classification Decision Trees Made
with Genetic Programming
Sašo Karakatič, Marjan Heričko and Vili Podgorelec
Institute of Informatics FERI, University of Maribor, Smetanova 17, 2000, Maribor, Slovenia
Keywords:
Classification, Machine Learning, Genetic Programming, Lazy Evaluation, Dynamic Weighting.
Abstract:
In this paper, we present a lazy evaluation approach for classification decision trees built with genetic programming.
We describe and experiment with a lazy evaluation that does not evaluate the whole population, but evaluates
only the individuals chosen to participate in the tournament selection method. Furthermore, we use
dynamic weights for the classification instances, which are linked to the chance of an instance getting picked
for the evaluation process. These weights change based on the misclassification rate of the instance. We test
our lazy evaluation approach on 10 standard classification benchmark datasets and show that the lazy
evaluation approach not only uses less time to evolve a good solution, but can even produce a better solution, as the
changing instance weights prevent the overfitting of the solutions.
1 INTRODUCTION
A genetic algorithm (GA) is a metaheuristic process
that uses the theory of evolution to generate the
final solution to a problem through the gradual evolution
of solutions, inspired by the natural selection process
observed in nature. The basic evolution loop of all
variations of GA is the following. The loop starts
with the evaluation of the solutions in the
whole population, then continues with the selection
of the evaluated solutions. Next, the mating process
(crossover) between various solutions generates
new offspring solutions, and then other evolutionary
operators can be applied (such as mutation and elite
selection) (Espejo et al., 2010). The whole process
of the genetic algorithm is computationally intensive,
and every speed-up of the evolution that does not sacrifice
the quality of the final solution is welcome.
The processing time of every genetic operator during
the evolution process depends on the type of
problem we are solving. In this paper, we focus on the
classification problem, one of the supervised
methods of the machine learning field.
One way to solve the classification problem with a GA
is by using genetic programming (GP), which is a
variation of GA where solutions are programs rep-
resented in a hierarchical decision tree structure (Es-
pejo et al., 2010). The classical techniques of building
these decision trees used in the industry and academia
are CART (Breiman et al., 1984), C4.5 (Quinlan,
2014), ID3 (Cheng et al., 1988) and ensembles of
these methods (Liaw et al., 2002) (Ganjisaffar et al.,
2011). With GP, we can utilize the power of evolution
to build these decision trees, as has been done numerous
times before. In the evolutionary process of
constructing classification decision trees, the computationally
most intensive part is the evaluation of the
decision trees (Zhang and Cho, 1998).
In this paper, we propose a lazy evaluation of the
classification decision trees made with GP, where so-
lutions from the population are evaluated only when
needed, and only on a limited number of classification
instances. We extend our previous work in this
field (Podgorelec and Zorman, 2015), (Podgorelec
et al., 2013), and add a dynamic evaluation process
that can be expanded if the evaluation does
not differentiate between the quality of different
solutions. The process of lazy dynamic evaluation
also changes the importance of classification instances
through the evolution, giving more importance
to the instances that are more often misclassified
and less importance to those that are more often
correctly classified.
Not much similar work has been done before, but
we definitely build on ideas from the existing literature.
The most notable influence on our work was
the paper by Gathercole and Ross (Gathercole and
Ross, 1994), in which they proposed dynamic
training subset selection for supervised problems
(such as classification), with three dif-
ferent subset selection processes where they heuristi-
cally change the testing classification set in each gen-
eration. Our proposed method of lazy and dynamic
evaluation builds on their idea, where we weight instances
through the evolution, but we expand this with
lazy evaluation, where an individual is tested only
when needed and only on its own testing subset. Zhang
and Cho introduced the idea that incrementally se-
lected testing subsets can reduce evaluation time with-
out sacrificing generalization accuracy of evolved so-
lutions (Zhang and Cho, 1998).
Šprogar introduced
the idea that even excluding the fitness of the genetic
solutions can improve the robustness of the evolution
process (Šprogar, 2005). His proposal eliminates the
evaluation operator and can also speed up the process
of evolution.
The rest of the paper is organized as follows. We
start with a section where we analyze the processing
time of the genetic programming method for classification
purposes. Next is the section where we present
the idea of the lazy evaluation method and describe
it in detail. In the following section, we describe the
layout of the experiment and present the results of the
implemented lazy evaluation method. We conclude
with final remarks, an interpretation of the results,
and our plans for future research.
2 PROCESSING TIME ANALYSIS
OF GP FOR CLASSIFICATION
Let us analyze the individual processing times of each
genetic operator in the GP process. Figure 1 shows
the standard evolutionary process in the GP, where the
evolution loop is highlighted with the gray rectangle.
As is evident from the figure, most of the processing
time is spent in the evolution loop; the
amount of this time is mostly dependent on the stopping
criterion. With a dynamic stopping criterion,
such as a number of stagnating generations (generations
without improvement), the total processing time
varies from run to run and is largely a matter of chance.
With a fixed number of generations, we can
approximate the processing time more precisely.
Using the big Omicron (big O) notation, the processing
time can be approximated as presented
in the following. The big O notation is used to express
the upper bound of the growth rate of the processing time of
a process, or, in other words, the time complexity of
an algorithm.
The evolution loop starts with the selection of the
individuals that are chosen to go through the mating
process. There are many different selection methods,
but here we explore only the tournament selection method.

Figure 1: Flowchart of genetic programming with the evolution loop highlighted in the light gray background.

Tournament selection chooses k random
individuals from the population, and the best individual
(fitness-wise) wins the tournament and is se-
lected to participate in the mating process. The order
of growth for one generation is O(mk), where m is
the number of parents chosen (usually the same as the
population size, when parents produce two offspring),
and k is the tournament size (usually between 2 and 10).
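As an illustration, the following is a minimal Python sketch of this selection scheme; the Individual class and its fitness attribute are hypothetical stand-ins, not our actual implementation.

```python
import random
from dataclasses import dataclass

@dataclass
class Individual:
    fitness: float  # hypothetical stand-in for an evolved decision tree

def tournament_select(population, k, num_parents):
    """Pick num_parents winners; each wins a tournament of k randomly
    sampled individuals (lower fitness is better, matching the
    minimised fitness function in Section 4). Cost: O(m * k)."""
    return [min(random.sample(population, k), key=lambda i: i.fitness)
            for _ in range(num_parents)]

# Usage: select 150 parents from a population of 150 with k = 3.
population = [Individual(random.random()) for _ in range(150)]
parents = tournament_select(population, k=3, num_parents=len(population))
```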
The mating process is heavily dependent on the
genotype representation. In GP, we represent one individual
in the form of a tree; more specifically, here
one individual is a classification decision tree. So, naturally,
the processing time of the crossover process
is also dependent on the representation. In our implementation,
we set the chance of crossover happening
to 100%. In the regular form of GP (without heuristic
crossover), the crossover process chooses a random
node in the tree from the first parent and a random
node from the second parent, and exchanges the subtrees,
creating one or two offspring. The choosing of
nodes happens twice (once per parent tree), and
the node-choosing loop runs from a minimum of 1 up to the
maximum depth of the tree. The maximum depth of
the tree is again heavily dependent, this time on the
classification problem: simpler classification problems
permit shallower trees, while more complex ones
demand bigger decision trees. So the order of growth
for the crossover process in one generation is O(2d)
(2 because of the two parent trees), or simpler, just O(d),
where d is the maximum depth of the tree.
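A minimal sketch of such a subtree crossover, assuming a simple hypothetical node class (not our exact tree representation), could look as follows.

```python
import random
from copy import deepcopy

class Node:
    """Hypothetical decision-tree node: a decision rule plus children
    (leaves carry a class label as their rule and no children)."""
    def __init__(self, rule, children=()):
        self.rule, self.children = rule, list(children)

def all_nodes(tree):
    # Walk the whole tree; a single root-to-leaf path is at most d long.
    yield tree
    for child in tree.children:
        yield from all_nodes(child)

def subtree_crossover(parent_a, parent_b):
    """Pick one random node in each (copied) parent and swap the
    subtrees rooted there, yielding two offspring. Node picking is
    done twice, hence the O(2d) ~ O(d) cost per crossover."""
    child_a, child_b = deepcopy(parent_a), deepcopy(parent_b)
    node_a = random.choice(list(all_nodes(child_a)))
    node_b = random.choice(list(all_nodes(child_b)))
    # Swapping rule and children of the two chosen nodes swaps the
    # subtrees in place.
    node_a.rule, node_b.rule = node_b.rule, node_a.rule
    node_a.children, node_b.children = node_b.children, node_a.children
    return child_a, child_b
```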
After the crossover, the mutation process is next.
Here we must consider the chance of the mutation
happening; standard chances vary from 1% all
the way to 50%. The time complexity of the mutation
operator is similar to that of crossover. When a new
child is created and it is determined by chance that
it goes through mutation, a node is
chosen at random and is either replaced by a random
subtree or its content (the decision rule)
is changed. The node picking itself has an order of
growth of O(d), as does the creation of the random subtree
(if the maximum depth of the tree is set to
d). So the time complexity is O(2d), or simpler, just
O(d), but keep in mind that this does not happen to
every new individual.
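A sketch of this operator follows, using a dict-based tree for brevity; the rules and classes pools are hypothetical placeholders, the caller applies the mutation probability (10% in Section 4), and the content-change variant is omitted.

```python
import random

def random_subtree(max_depth, rules, classes):
    """Grow a random subtree of at most max_depth levels; O(d) work
    per root-to-leaf path. `rules` and `classes` are hypothetical
    pools of decision rules and class labels."""
    if max_depth == 0 or random.random() < 0.3:
        return {"rule": random.choice(classes), "children": []}  # leaf
    return {"rule": random.choice(rules),
            "children": [random_subtree(max_depth - 1, rules, classes)
                         for _ in range(2)]}

def mutate(tree, rules, classes, max_depth):
    """Pick a random node and replace it with a fresh random subtree."""
    nodes, stack = [], [tree]
    while stack:                      # collect all nodes of the tree
        node = stack.pop()
        nodes.append(node)
        stack.extend(node["children"])
    target = random.choice(nodes)
    replacement = random_subtree(max_depth, rules, classes)
    target.clear()                    # overwrite the chosen node in place
    target.update(replacement)
```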
Now for the most time-consuming process in the GP
loop: the evaluation process. Here, we take every individual
and use it to classify every classification instance in the
training set. From these classification
results, the classification metrics can be calculated
(accuracy, F-score, recall, precision, AUC,
and others), which then form part of the fitness of
that individual. Calculating the classification
metrics is itself time-consuming.
The calculation of accuracy is straightforward, just
counting the correctly classified instances, but calculating
the F-score takes longer
(as we have to calculate recall and precision for every
class, then calculate an individual F-score for each
class, and then aggregate them to get the final F-score).
Let us assume we have n new offspring individuals,
the maximum depth of each individual tree
is d, we have t training classification instances
that are to be classified by each tree, and we
calculate only the accuracy for the fitness. The
time complexity of evaluation for all individuals in
one generation is O(ndt).
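The following sketch illustrates this O(ndt) evaluation under the same dict-based tree assumption as in the mutation sketch above, computing only the accuracy; each internal node's rule is assumed to map an instance to a child index.

```python
def classify(tree, instance):
    """Route one instance down the tree: O(d) rule tests."""
    node = tree
    while node["children"]:
        node = node["children"][node["rule"](instance)]
    return node["rule"]  # at a leaf, the "rule" slot holds the class label

def evaluate_accuracy(offspring, training_set):
    """Classic (non-lazy) evaluation: every one of the n offspring
    classifies all t training instances, each in O(d) -> O(ndt)."""
    return [sum(classify(tree, x) == y for x, y in training_set)
            / len(training_set)
            for tree in offspring]
```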
If we combine the time complexities of individual
operators we get the following time complexity of an
evolution loop for one generation:
O(mk) + O(d) + O(d) + O(ndt)

where:
m = number of parents from selection
k = tournament size
d = maximum depth of the tree
n = number of offspring individuals
t = number of classification instances in the training set
It is evident that all of the terms are linear, and
thus the total time should be mostly dependent on the
largest factor in the expression. We also ran the GP
and timed each genetic operator in the evolution loop
multiple times. Figure 2 shows a pie chart with
the proportions of the processing times in one generation,
averaged over 100 independent runs.
Figure 2: Pie chart showing the average processing times
of genetic operators in one generation of GP for classifica-
tion decision tree construction. The experiment was made
on the car dataset with 1382 instances in the training set, 150
solutions in the population, 2000 generations and 100 inde-
pendent runs of GP.
As is evident from the pie chart in Figure 2, 94%
of the total processing time in the evolution loop is
spent in the evaluation process. Note that this cannot
be generalized to all GAs or even all GPs and is specific
to GP constructing classification decision trees.
Even using a different dataset (we used the car dataset)
could produce slightly different results. Despite this,
we see that shortening the evaluation time should significantly
impact the running time of the GP in general.
3 LAZY EVALUATION METHOD
As we found out in the previous section, the most
time-consuming process in the evolution loop is the evaluation
process; thus, we propose an approach which
shortens this time.
In our proposed approach, which we named lazy
evaluation, we do not evaluate the whole population
on all of the classification instances. Instead, we evaluate
only the decision trees chosen to participate in the
tournament of the selection operator, and only on some
of the classification instances.
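A minimal sketch of one such lazy tournament follows; the fitness callback is a placeholder (e.g. the minimised fitness from Section 4), not our exact implementation.

```python
import random

def lazy_tournament(population, instances, k, sample_size, fitness):
    """One tournament under lazy evaluation: only the k contestants
    are evaluated, and each only on a small random sample of the
    classification instances, so the tournament costs
    k * sample_size classifications instead of
    |population| * |instances|."""
    contestants = random.sample(population, k)
    sample = random.sample(instances, sample_size)
    return min(contestants, key=lambda tree: fitness(tree, sample))
```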
Figure 3 shows two separate plots. On the
top is the probability of one decision tree from the
population being chosen to be evaluated in one tournament,
dependent on the size of the tournament and
the population size. On the bottom is the number
of evaluations (number of classification instances
× population size) in one generation, dependent on
the lazy evaluation tournament size (and standard GP)
and population size.
As is evident from Figure 3, the number of evaluations
per generation is drastically lower than in
traditional evaluation (all decision trees on all classification
instances).

Figure 3: Exploring the number of evaluations.
Top: Line chart showing the probability of one decision tree
being chosen for evaluation, dependent on the tournament
size and the population size.
Bottom: Number of evaluations in a generation, dependent on
the number of classification instances and the population size.
Standard GP denotes the standard GP without extensions.

Let us take an example where
the population size is 100 and the number of classification
instances is 1000. If we use lazy evaluation
with a tournament size of 2, we have 400 evaluations
(2 × 100 × 2); with a tournament size of 5, we get
1000 evaluations (5 × 100 × 2); and with a tournament
size of 10, we have 2000 evaluations (10 × 100 × 2).
The standard GP, in contrast, always has 100,000
evaluations (100 × 1000), far more than any of the
lazy evaluations.
Of course, this is only in theory, so the time saved
with this approach will not be in exactly the same
proportion in practice. We test the real time saved
in the experiment in the next section.
3.1 Weighting the Classification
Instances
Based on the proposition and results from the paper
(Gathercole and Ross, 1994), we decided to include
the weighting of classification instances through the
evolution process.

Figure 4: The line in the charts shows the number of times
a particular instance was chosen for the evaluation process in
the 2500th generation (top) and 5000th generation (bottom).
The area in the bottom part of the charts shows the weight of
that particular instance. Note that the number of picks of
any particular instance for evaluation is linked to its weight.

The weights of the classification instances determine
the probability of an instance getting chosen
for the evaluation process: the higher the weight, the
more chance that instance has of getting picked. The
weights change based on the difficulty of classifying
that particular instance: the more times an instance is
misclassified, the higher its weight becomes and the
more chance it has of being chosen again for evaluation.
This forces the GP to focus on the more difficult
instances and compensates for the small number of
instances used in the evaluation.
In contrast to the paper by Gathercole and Ross
(Gathercole and Ross, 1994), we choose different in-
stances for every tournament in one generation and
not the same instances for every evaluation in one
generation. This raises the chance of any one instance
getting chosen for evaluation and further diversifies
the search without raising the total number of
evaluations in one generation.
Our weighting strategy is as follows. In the begin-
ning, all of the instances have the same initial weight
of 1. For every misclassification of an instance, its
weight is increased by 1/n, where n is the
number of classification instances in the test set.
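A minimal sketch of this weighting strategy follows; whether instances are sampled with or without replacement is an implementation choice we leave open, and random.choices (with replacement) is used here for brevity.

```python
import random

def init_weights(n):
    """All n instances start with the same initial weight of 1."""
    return [1.0] * n

def pick_instances(instances, weights, sample_size):
    """Sample instances with probability proportional to their
    weights (with replacement in this sketch)."""
    return random.choices(instances, weights=weights, k=sample_size)

def update_weights(weights, misclassified):
    """Bump the weight of each misclassified instance (by index)
    by 1/n, where n is the number of instances, as described above."""
    n = len(weights)
    for i in misclassified:
        weights[i] += 1.0 / n
```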
Figure 4 shows the instance weights and numbers
of picks: the first chart in the middle of the evolution
(2500th generation) and the second chart for the last
(5000th) generation. As we can see from the charts,
some of the weights (and consequently the numbers of
picks) stay the same through the evolution, while others
increase throughout the evolution.

Figure 5: Classification metrics of the resulting classification decision trees on all 10 datasets with 5-fold cross-validation.
LE = Lazy evaluation; inst = number of instances used in the evaluation process.
4 FIRST EXPERIMENTS WITH
LAZY EVALUATION OF GP
FOR CLASSIFICATION
We conducted a first set of experiments with our proposed
approach with lazy evaluation of evolutionary
classification decision trees. We used 10 classification
benchmark datasets from the UCI repository (Lichman,
2013), and we measured the following metrics: total
accuracy, average F-score (β = 1), and the total running
time of the whole process from start to finish. All of
the tests were done using 5-fold cross-validation.
The datasets used in the experiments were the following:
autos, balance-scale, breast-cancer, breast-w,
car, credit-a, diabetes, heart-c, iris and vehicle.
The GP settings were set to the following values:
selection method: tournament
fitness function: (1 − accuracy) + (0.02 × numberOfNodes) (see the sketch after this list)
population size: 150
elite size: 1
number of generations: 2000
crossover probability: 100%
mutation probability: 10%
number of runs: 10
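A sketch of this fitness function follows; count_nodes and the classify routing function refer to the hypothetical dict-based trees sketched in Section 2, not our exact implementation.

```python
def count_nodes(tree):
    """Total number of nodes in a dict-based tree (cf. Section 2)."""
    return 1 + sum(count_nodes(c) for c in tree["children"])

def fitness(tree, sample, classify):
    """The minimised fitness from the list above:
    (1 - accuracy) + 0.02 * numberOfNodes."""
    hits = sum(classify(tree, x) == y for x, y in sample)
    return (1.0 - hits / len(sample)) + 0.02 * count_nodes(tree)
```

The size penalty rewards smaller trees, so two trees with equal accuracy are ranked by compactness.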
Although some operators in GP can be parallelized,
we compared the non-parallelized versions in our
experiments; parallelizing the lazy evaluation is
one of our goals for the future.
4.1 Classification Results
Figure 5 shows two classification metrics: overall accuracy
and average F-score, for all of the 10 datasets.
As is evident from the numbers in the table in Figure 5,
there are slight differences in both metrics between
the different settings. The best-performing GP, accuracy-wise,
is the GP with lazy evaluation (0.70) where
we used 10 instances in the evaluation process, followed
by the standard GP with no lazy evaluation.
The same lead shows in the F-score metric, where the
GP with lazy evaluation scored 0.54.
The Kruskal-Wallis test for multiple groups showed
that there are statistically significant differences between
groups for accuracy (χ² = 370.465, p < 0.001).
A post-hoc pairwise comparison for accuracy
with Holm-Bonferroni correction for multiple comparisons
shows that there is no statistically significant
difference between the two best-performing
GPs (between the standard GP and the GP with lazy
evaluation with 10 instances, p = 0.852).
The results of the Kruskal-Wallis test for the average
F-score metric are similar: there are statistically
significant differences between the different settings
(χ² = 191.558, p < 0.001). Here, the post-hoc test
with the correction shows that there is a statistically
significant difference between the standard GP and the GP
with lazy evaluation with 10 instances (p < 0.001).
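Such an analysis can be sketched in Python with SciPy and statsmodels as follows; since the exact pairwise post-hoc test is not spelled out above, the Mann-Whitney U test is an assumption here.

```python
from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu
from statsmodels.stats.multitest import multipletests

def compare_settings(results):
    """results maps a setting name (e.g. 'Standard GP', 'LE 10 inst')
    to its list of per-run accuracies."""
    stat, p = kruskal(*results.values())      # omnibus test over groups
    print(f"Kruskal-Wallis: chi2 = {stat:.3f}, p = {p:.4g}")
    pairs = list(combinations(results, 2))
    raw = [mannwhitneyu(results[a], results[b]).pvalue for a, b in pairs]
    adjusted = multipletests(raw, method="holm")[1]  # Holm correction
    for (a, b), p_adj in zip(pairs, adjusted):
        print(f"{a} vs {b}: adjusted p = {p_adj:.4g}")
```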
4.2 Evolution Time Analysis
Figure 6: Average total running times of the whole evolu-
tion process in milliseconds for all settings.
LE = Lazy evaluation; inst = number of instances used in
the evaluation process.
Looking at the running times shows a clear lead
of the lazy evaluation GPs in comparison to the standard
GP. The slowest lazy evaluation GP is the one with
10 instances used in the evaluation process, which took
4495.92 milliseconds on average; this is just 62.6%
of the average total running time of the standard GP,
which took 7182.97 milliseconds.
We see that the running times are not smaller in
proportion to the theoretical savings from fewer
evaluations, but they are still smaller.
5 CONCLUSIONS
We proposed a lazy evaluation approach for the genetic
programming process of creating classification
decision trees, which uses dynamic choosing of the
evaluation instances.
The results of the first experiments show that this approach
has great potential and should be explored further.
Not only did all of the lazy evaluation GPs
take less processing time to finish the whole evolution
process in comparison to the standard GP, but some settings
(with more instances in the evaluation process) returned
comparable results (in accuracy and average F-score).
One of the lazy evaluation settings in the experiment
(with 10 instances in evaluation) even returned
better results than the standard GP. This can be
attributed to the changing environment of the GP, which
prevents the solutions from overfitting, and to the weighting
process that gives more importance (more chance
to be involved in the evaluation process) to harder-to-classify
instances.
We are planning to research lazy evaluation further,
to test the importance of the tournament size,
and to explore the number of evaluation instances.
Beyond that, there is already an ongoing
implementation of parallel lazy evaluation that should
be directly comparable to parallel GP for decision tree
creation.
REFERENCES
Breiman, L., Friedman, J., Stone, C. J., and Olshen, R. A.
(1984). Classification and regression trees. CRC
press.
Cheng, J., Fayyad, U. M., Irani, K. B., and Qian, Z. (1988).
Improved decision trees: a generalized version of id3.
In Proc. Fifth Int. Conf. Machine Learning, pages
100–107.
Espejo, P. G., Ventura, S., and Herrera, F. (2010). A
survey on the application of genetic programming to
classification. IEEE Transactions on Systems, Man
and Cybernetics, Part C: Applications and Reviews,
40(2):121–144.
Ganjisaffar, Y., Caruana, R., and Lopes, C. V. (2011). Bag-
ging gradient-boosted trees for high precision, low
variance ranking models. In Proceedings of the 34th
international ACM SIGIR conference on Research and
development in Information Retrieval, pages 85–94.
ACM.
Gathercole, C. and Ross, P. (1994). Dynamic training subset
selection for supervised learning in genetic program-
ming. Parallel Problem Solving from Nature (PPSN III),
pages 312–321.
Liaw, A., Wiener, M., et al. (2002). Classification and re-
gression by randomforest. R news, 2(3):18–22.
Lichman, M. (2013). UCI machine learning repository.
Podgorelec, V., Šprogar, M., and Pohorec, S. (2013). Evo-
lutionary design of decision trees. Wiley Interdisci-
plinary Reviews: Data Mining and Knowledge Dis-
covery, 3(2):63–82.
Podgorelec, V. and Zorman, M. (2015). Decision tree learn-
ing. In Encyclopedia of Complexity and Systems Sci-
ence, pages 1–28. Springer.
Quinlan, J. R. (2014). C4.5: Programs for machine learn-
ing. Elsevier.
Šprogar, M. (2005). Excluding fitness helps improve ro-
bustness of evolutionary algorithms. In Knowledge-
Based Intelligent Information and Engineering Sys-
tems, pages 905–905. Springer.
Zhang, B.-T. and Cho, D.-Y. (1998). Genetic programming
with active data selection. In Asia-Pacific Conference
on Simulated Evolution and Learning, pages 146–153.
Springer.