Towards Automatic Grammatical Evolution for Real-world Symbolic Regression

Muhammad Sarmad Ali, Meghana Kshirsagar, Enrique Naredo and Conor Ryan

Biocomputing and Developmental Systems Lab, University of Limerick, Ireland

Keywords:

Grammatical Evolution, Grammar Pruning, Effective Genome Length.

Abstract:

AutoGE (Automatic Grammatical Evolution) is a tool designed to aid users of Grammatical Evolution (GE) in the automatic estimation

of GE parameters, a key one being the grammar. The tool comprises a rich

suite of algorithms to assist in ﬁne tuning a BNF (Backus-Naur Form) grammar to make it adaptable across

a wide range of problems. It primarily facilitates the identiﬁcation of better grammar structures and the

choice of function sets to enhance existing ﬁtness scores at a lower computational overhead. This research

work discusses and reports experimental results for our Production Rule Pruning algorithm from AutoGE

which employs a simple frequency-based approach for eliminating less useful productions. It captures the

relationship between production rules and function sets involved in the problem domain to identify better

grammar. The experimental study incorporates an extended function set and common grammar structures

for grammar deﬁnition. Preliminary results based on ten popular real-world regression datasets demonstrate

that the proposed algorithm not only identiﬁes suitable grammar structures, but also prunes the grammar which

results in shorter genome length for every problem, thus optimizing memory usage. Despite spending a fraction

of the budget on pruning, AutoGE was able to significantly enhance test scores for 3 problems.

1 INTRODUCTION

Grammatical Evolution (GE), since its inception

twenty years ago, has found wide acceptance in the

research communities (Ryan et al., 2018). It is a bio-inspired, population-based methodology from the domain of evolutionary computation which relies heavily on one core aspect for its implementation: the definition of a context-free grammar (CFG). By defining grammars in any language of choice, GE can evolve valid programs of arbitrary length. This flexibility

makes GE a powerful tool in genetic programming

(GP) and it has gained a wide-scale appeal.

Grammar is a key input to grammatical evolution

and it has been known that the performance of GE

is signiﬁcantly inﬂuenced by the design and structure

of the grammar (Nicolau and Agapitos, 2018). However, when it comes to defining a grammar, there is little guidance in the literature. This task is generally performed by the users of GE, solution developers, or domain experts, and the grammar is hand-crafted.

Choice of terminals and non-terminals, and their com-

position to form production rules is largely based on

expertise. For a novice user, there is no tool or frame-

work which can assist them in deﬁning the grammar.

A related problem, faced even by the experienced

users, is that of the choice of function set. Func-

tions or operators are represented as productions in

the grammar. Choosing an appropriate function set is

a key decision in applying GP as it can have a vital

impact on the performance of GP (Gang and Soule,

2004; Uy et al., 2013). However, there is not enough

guidance in selecting a function set and no system-

atic approach exists (Nicolau and Agapitos, 2021). To

date, it is also largely considered a decision made by

domain experts.

Automatic Grammatical Evolution (AutoGE) (Ali

et al., 2021) is a system that can aid users of GE to

explore and identify grammar structures to smoothly

adapt according to the underlying problem domain. It

can aid users in identifying appropriate terminals in-

volved in forming production rules. It is being de-

veloped with a rich suite of algorithms which can

adapt (prune or extend) user provided grammar, or

even generate an appropriate grammar from scratch if

certain pieces of information about the problem at hand are known. Besides definition and/or fine-tuning of the grammar, AutoGE will facilitate adapting other

evolutionary parameters such as mutation/crossover

probabilities and tree depths. Depending upon the

Ali, M., Kshirsagar, M., Naredo, E. and Ryan, C.

Towards Automatic Grammatical Evolution for Real-world Symbolic Regression.

DOI: 10.5220/0010691500003063

In Proceedings of the 13th International Joint Conference on Computational Intelligence (IJCCI 2021), pages 68-78

ISBN: 978-989-758-534-0; ISSN: 2184-3236

nature of the problem and its complexity, it can assist in the selection and definition of the correct fitness function, which can be composed of single, multiple, or many objectives, and can be hierarchical in nature

(Ryan et al., 2020).

This work reports preliminary results with our

Production Rule Pruning approach applied to real-

world symbolic regression problems. For a given

grammar structure and a generic larger function set, it

reduces the grammar by pruning useless productions.

It helps in evolving individuals of shorter lengths

thereby optimizing memory usage (Kshirsagar et al.,

2020). The algorithm and the related production ranking scheme are discussed in Section 4. Section 5 describes our experimental setup and Section 6 presents and discusses the results. In Sections 2 and 3, we briefly outline the theoretical background and the related work.

2 BACKGROUND

2.1 Grammatical Evolution

Grammatical Evolution is a variant of Genetic Pro-

gramming (GP) in which the space of possible solu-

tions is speciﬁed through a grammar. Although differ-

ent types of grammars have been used (Ortega et al.,

2007; Patten and Ryan, 2015), the most commonly

used is Context Free Grammar (CFG), generally writ-

ten in Backus-Naur Form (BNF). GE facilitates a

modular design, which means that any search engine

can be used, although typically a variable-length Ge-

netic Algorithm (GA) is employed to evolve a popu-

lation of binary strings.

In GE, each population individual has a dual rep-

resentation, a genotype and a phenotype. When the

underlying search engine is a genetic algorithm, the

genotype is a sequence of codons (usually a group

of 8-bit substrings), while the phenotype expresses

an individual’s representation in the solution space.

Mapping, a key process in GE, maps a given genotype

to the phenotype. Consuming each codon in turn, it selects a production from the available set of alternative productions in a rule through mod

operations and builds the derivation tree (Ryan et al.,

1998). Although there are other mapping schemes

(Fagan and Murphy, 2018), the conventional scheme

follows left-most derivation. An important measure

in the mapping process is the effective genome length,

which is equal to the number of codons consumed

to generate a fully mapped individual (the one which

does not contain any non-terminals in its phenotype).

The actual genome length is the total number of codons in the genome, some of which may remain unused.

Figure 1: Schematic of Evolutionary Process in GE.
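The mapping process described above can be sketched in a few lines of Python. This is a minimal illustration, not the libGE implementation: the function name `map_genotype`, the `max_wraps` parameter, the toy grammar, and the rule that no codon is read when a rule has a single production are all assumptions made for the example.

```python
def map_genotype(codons, grammar, start, max_wraps=0):
    """Map a codon list to a phenotype by left-most derivation.

    Returns (phenotype, effective_length, production_list), or None when the
    codons run out before every non-terminal has been expanded.
    """
    pending = [start]            # symbols still to process, left-most first
    phenotype = []
    production_list = []         # (rule, chosen production index) in derivation order
    used = 0                     # codons consumed so far = effective genome length
    limit = len(codons) * (max_wraps + 1)
    while pending:
        symbol = pending.pop(0)
        if symbol not in grammar:            # terminal: emit it
            phenotype.append(symbol)
            continue
        choices = grammar[symbol]
        if len(choices) == 1:                # assumption: no codon is read without a choice
            index = 0
        else:
            if used >= limit:
                return None                  # mapping failed: invalid individual
            index = codons[used % len(codons)] % len(choices)   # the mod rule
            used += 1
        production_list.append((symbol, index))
        pending = list(choices[index]) + pending
    return " ".join(phenotype), used, production_list


# Hypothetical toy grammar, not one of the grammars used in the paper.
TOY_GRAMMAR = {
    "<expr>": [["<expr>", "<op>", "<expr>"], ["<var>"]],
    "<op>": [["+"], ["*"]],
    "<var>": [["x"], ["y"]],
}

genome = [4, 7, 2, 9, 5, 1, 3]               # actual genome length: 7 codons
result = map_genotype(genome, TOY_GRAMMAR, "<expr>")
```

For this genome the sketch yields the phenotype `x * y` after consuming six of the seven codons, so the effective genome length (6) is smaller than the actual genome length (7).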

2.2 Grammar Design

Since GE exploits the expressive power of grammars,

it can be applied to a multitude of problem domains,

for instance in Symbolic Regression (SR) where the

purpose is to search the space of mathematical ex-

pressions to ﬁnd a model that best ﬁts a given dataset

(Koza, 1993). To construct valid and useful mathe-

matical expressions in GE, the grammar needs to be

well designed.

A grammar is formally deﬁned as the tuple (T, N,

P, S) where T is a set of terminal symbols, N is a set of

non-terminal symbols, P is a set of production rules,

and S is the start symbol. While the set of terminals outlines the building blocks of a solution, choosing non-terminals and deciding exactly how to organize them into a set of rules and productions is a design task. By designing an 'appropriate' grammar, one

speciﬁes the syntactic space of possible solutions (it

is worth noting that there are an inﬁnite number of

possible grammars which specify the same syntax).

Although grammar design is an important consideration, the majority of research works provide little to no justification for the design decisions related to the choice of (non-)terminals and the formation of production rules.

2.3 Grammar Structures

Instead of designing a grammar from scratch, a common approach is to utilize and adapt existing grammar designs for the domain. For example, in grammatical evolution based symbolic regression (GESR),

typical grammar structures are shown in Table 1. The most widely used structure, which we call mixed-arity grammar, combines operations of multiple arities in a single rule. A contrasting structure is that of arity-based grammars, where productions relevant to arity-1 and arity-2 operations are grouped in separate rules. A balanced grammar version balances the probabilities of selecting recursive (non-terminating) productions and terminating productions (Nicolau and Agapitos, 2018).

Table 1: Grammar Structures.

It is important to note how operators and functions

are represented as productions in the grammar. Be-

sides embodying arithmetic operators, a number of

common mathematical functions are represented as

alternative recursive productions.
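To make the idea of balancing recursive and terminating productions concrete, the sketch below counts what fraction of a rule's productions are recursive. Under the mod rule, each production of a rule is chosen with roughly equal probability, so this fraction approximates the chance of a recursive expansion. The two rules shown are hypothetical illustrations and are not the grammars of Table 1; `recursive_fraction` is a name introduced here.

```python
def recursive_fraction(rule, productions):
    """Fraction of a rule's productions that refer back to the rule itself."""
    recursive = sum(1 for production in productions if rule in production)
    return recursive / len(productions)


# Hypothetical mixed-arity style rule: operators of several arities in one rule.
MIXED = [
    ["<e>", "+", "<e>"],
    ["<e>", "*", "<e>"],
    ["sin", "(", "<e>", ")"],
    ["<v>"],
]

# Hypothetical balanced variant: operators factored out, terminating production duplicated.
BALANCED = [
    ["<e>", "<op2>", "<e>"],
    ["<op1>", "(", "<e>", ")"],
    ["<v>"],
    ["<v>"],
]

mixed_ratio = recursive_fraction("<e>", MIXED)        # 3 of 4 productions recurse
balanced_ratio = recursive_fraction("<e>", BALANCED)  # 2 of 4 productions recurse
```

In the mixed rule three of four productions recurse, whereas the balanced variant recurses half of the time, which is the property the balanced structure aims for.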

3 RELATED WORK

In this work, we exercised our approach on problems

in symbolic regression which is the most common ap-

plication domain for GP-like systems. Since there is

a large amount of work in relation to symbolic regres-

sion, we skip that discussion due to space limitation

and rather focus on the following relevant research di-

rections:

3.1 Function Set Selection

It is important to select an appropriate function set in order to achieve good performance in GE/GP. Few works specifically address the problem of function set selection. (Gang and Soule, 2004) experimented with various function sets and highlighted that function groups exist and that functions in the same group have the same effect on performance.

(Uy et al., 2013) examined characteristics of the ﬁt-

ness landscapes generated by various function sets

and the performance of GP. They concluded that the

autocorrelation function can be used as an indicator to

select a function set. Recently, (Nicolau and Agapi-

tos, 2021) also studied the effect of various groups

of function sets on generalisation performance of GP

and GE. With a detailed review and experimentation

over a large set of symbolic regression problems, they

concluded that protected functions should be avoided.

They also indicate that the full set (comprising all considered function primitives) performed consistently well in training across all problems. Our earlier study (Ali et al., 2021) supports their finding, while we use a larger generic function set at the start of the evolutionary process.

3.2 Encapsulation

In a normal setup of canonical GE, grammar is a static

artefact which never changes during the execution.

However, in this work, we modify the grammar by removing productions from it, which we term pruning. Several strands of research modify the

grammar dynamically during the evolutionary pro-

cess. The idea of automatically deﬁned functions, in-

stead of striving to choose an optimal set in advance,

is about identifying, encapsulating, and reusing useful

functionality discovered during the evolution (Koza,

1994). (O'Neill and Ryan, 2000) used a grammar-based approach to automatically define a new function for the Santa Fe trail problem. (Harper and Blair,

2006) introduced a meta-grammar into grammatical

evolution allowing the grammar to dynamically de-

ﬁne functions without the need for special purpose

operators or constraints. More recently, (Murphy

and Ryan, 2020) utilized covariance between traits to

identify useful modules which are added to the gram-

mar.

3.3 Probabilistic GE

In our work, we assign a weight, called rank, to each production, which at a given stage is used to decide its fate: whether or not it stays in the grammar. Although our ranks do not bias the selection of a production during the mapping process, the probabilistic approach to GE does. It uses a probabilistic grammar, also known as a stochastic context-free grammar (SCFG), to assign selection probabilities to each production. Although a large body of research explores probabilistic grammars in connection with GP and Estimation of Distribution Algorithms (EDA), there have been no attempts to utilize SCFG in GE with genetic operations, except the recent work of (Megane et al., 2021). This paper does not compare our ranking approach with SCFG; that comparison is left for future work.

ECTA 2021 - 13th International Conference on Evolutionary Computation Theory and Applications

4 METHODOLOGY

We discuss our approach to rank grammar produc-

tions and subsequent pruning of unworthy produc-

tions in this section. Prior to that, we present our hy-

pothesis underlying this approach.

4.1 Hypothesis

It is well known that with the correct conﬁguration

and ﬁtness criteria, an evolutionary process is geared

towards convergence. Increasingly, the evolved solu-

tions contain more and more of the right ingredients

or building blocks (in our case, grammar productions)

(Koza, 1993). We hypothesize that the structural com-

position of evolved solutions carries information that

can be useful in identifying the right ingredients.

In GE, each individual in the population is com-

posed of terminals, which appear in an order deﬁned

by the derivation tree constructed during genotype to

phenotype mapping. By traversing the derivation tree,

it is possible to obtain a list of grammar productions

used in the mapping process to generate an individ-

ual. Such a list is termed as the production-list. Once

identiﬁed, the frequency of usage of each production

in the production-list can be easily determined.

Productions can be weighed or ranked based on

how frequently they are used in the construction of

individuals in the population. As evolution proceeds,

fitter individuals survive, and the productions which more frequently shape their structures are the ones considered worthy of being part of the grammar. Such productions should be assigned a high rank. Conversely, productions which harm an individual's fitness to the extent that the individual becomes extinct generally do not enjoy a high usage frequency (although rarely zero, due to hitch-hiking effects) in the population.

To test our hypothesis, we devised a simple

frequency-based approach to rank productions, which

we present next.

4.2 Production Ranking

Figure 2 describes the overall process of production ranking. At the end of a generation, individuals in the population are structurally analyzed and assigned ranking scores based on the frequency count of productions in the production-list.

Figure 2: Schematic of production ranking at each Stage.

Let $P$ be the set of productions in the grammar $G$, and let $P_i \subset P$ be the set of productions in the production-list of the $i$th individual. If $n = |P|$ is the number of productions in $P$, and $k = |P_i|$, then $k < n$ for practically all individuals. The ranking score assigned to the $j$th production in the production-list is given by:

$$(nfr)_j^i = \frac{\phi_j}{l_i} \qquad (1)$$

$$(fpr)_j^i = (nfr)_j^i \times \rho_i \qquad (2)$$

where $\phi_j$ is the frequency of the $j$th production, $l_i$ is the effective codon length, and $\rho_i$ is the fitness of the $i$th individual. Equation 1 defines the normalized frequency rank (nfr) of a production, while Equation 2 computes the fitness-proportionate rank (fpr). As a consequence of the above two definitions, the following two properties hold for the $i$th individual:

$$\sum_{j=1}^{k} (nfr)_j^i = 1, \quad \text{and} \quad \sum_{j=1}^{k} (fpr)_j^i = \rho_i$$

Once individual ranking scores have been computed, we accumulate the scores of all $u$ individuals in the population to compute the generation worth (gw) of the $j$th production in the production-list for the $m$th generation, and then across all $g$ generations to compute the overall run worth (rw).

$$(gw)_j^m = \sum_{i=1}^{u} (fpr)_j^i \qquad (3)$$

$$(rw)_j = \sum_{m=1}^{g} (gw)_j^m \qquad (4)$$
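The ranking scores defined above amount to a few dictionary operations. The sketch below is one possible reading, assuming each entry in the production-list corresponds to one consumed codon (so the list length equals the effective codon length) and that the fitness value is used as given. The function names and the sample data are hypothetical.

```python
from collections import Counter


def individual_ranks(production_list, fitness):
    """Normalized frequency rank and fitness-proportionate rank for one individual."""
    counts = Counter(production_list)        # frequency of each production
    length = len(production_list)            # assumed equal to the effective codon length
    nfr = {p: c / length for p, c in counts.items()}
    fpr = {p: r * fitness for p, r in nfr.items()}
    return nfr, fpr


def generation_worth(population):
    """Sum fpr over all individuals; population is a list of (production_list, fitness)."""
    gw = Counter()
    for production_list, fitness in population:
        _, fpr = individual_ranks(production_list, fitness)
        gw.update(fpr)                       # Counter.update adds the values per key
    return gw


def run_worth(generations):
    """Sum generation worth over all generations of a run."""
    rw = Counter()
    for population in generations:
        rw.update(generation_worth(population))
    return rw


# Hypothetical individual that used 'add' twice, 'sin' once and 'x' once.
nfr, fpr = individual_ranks(["add", "add", "sin", "x"], fitness=2.0)
gw = generation_worth([(["add", "add", "sin", "x"], 2.0), (["add", "x"], 4.0)])
```

For the sample individual the nfr values sum to 1 and the fpr values sum to the fitness, matching the two properties stated above.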

To minimize the computational cost of production

ranking, we track the production-list during the map-

ping process, so it does incur a small memory over-

head. However, since the ranking scores are com-

puted at the end of a small number of evolutionary


Figure 3: Fitness-proportionate production ranks across

runs for redwine dataset (runs: 30, ngen: 5, popsize 250).

runs (called a stage), and the operations deﬁned by

the above equations are trivial, those can be efﬁciently

performed with minimal overhead.

An important consideration is to decide how much

of the population to select for ranking. We experi-

mented with three possible choices: 1) the whole pop-

ulation, 2) only unique individuals, 3) top X% of the

population (we use X=20). The second option turned

out to be the best choice based on our empirical evaluations. A potential issue with the first option is that the rankings can be biased due to redundancy, and with the third option there is a chance of pruning an important production which has not yet been picked up because of the small number of evolutionary iterations.
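The three selection choices can be written as a small helper. This is a sketch under stated assumptions: individuals are represented as hypothetical `(phenotype, fitness)` pairs, uniqueness is judged by phenotype, and lower fitness is better (as with an error metric).

```python
def select_for_ranking(population, mode="unique", top_percent=20):
    """Choose which individuals contribute to production ranking."""
    if mode == "all":                        # option 1: the whole population
        return list(population)
    if mode == "unique":                     # option 2: one copy of each distinct individual
        seen = set()
        unique = []
        for individual in population:
            phenotype = individual[0]
            if phenotype not in seen:
                seen.add(phenotype)
                unique.append(individual)
        return unique
    if mode == "top":                        # option 3: best X% of the population
        keep = max(1, len(population) * top_percent // 100)
        return sorted(population, key=lambda individual: individual[1])[:keep]
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical population with one duplicated individual.
population = [("x+y", 3.0), ("x+y", 3.0), ("sin(x)", 1.0), ("x*y", 2.0), ("x", 5.0)]
unique_individuals = select_for_ranking(population, "unique")
top_individuals = select_for_ranking(population, "top")
```

With duplicates removed, a production that appears in many copies of the same individual is counted once, which is the redundancy bias the second option avoids.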

Figure 3 shows a sample box and whisker plot of

fpr ranking for the redwine dataset. It gives a nice

picture of the utility of each production in the evolu-

tionary cycle.

4.3 Grammar Pruning

According to Occam’s razor, “no more things should

be presumed to exist than are absolutely necessary.”

Following this principle, we try limiting the complex-

ity of the models and favour simpler ones to take part

in the evolution. Grammar is a key model of the so-

lution space, so the idea is to remove unnecessary or

less worthy productions (or functions) from the gram-

mar to tune the grammar design.

Algorithm 1: Production Rule Pruning (PRP).

input : grammar G, number of trials/runs T, available budget B, number of generations for pruning runs gen
output: pruned grammar G_p

initialize MB_best to a high value;
G_p ← G;
while B > gen do
    do T runs for gen generations with grammar G_p;
    MB_curr ← get current mean-best;
    S_prune ← PRUNABLES();
    decrement B;
    if MB_curr < MB_best then
        G_p ← PRUNE();
        MB_best ← MB_curr;
    else
        REVERT();
        increment gen;
end

One of the key drivers in grammar tuning is the pruning strategy and algorithm. (It is worth mentioning that in AutoGE, tuning may involve pruning as well as extension of the grammar, although this work only reports on the pruning approach.) There can be a number of strategies for pruning and we look at two here.

The core idea they have in common is a staged ap-

proach; that is, at each stage solutions are evolved

over a small number of generations, then one or more

productions are pruned, and then subsequent stages

(if any) are conducted. Each subsequent stage is a complete restart with the newly modified grammar (more

on this in section 6.3). The number of stages may vary

depending upon the strategy being employed. We ex-

perimented with the following two strategies:

• Strategy 1: Prune for the maximum pruning bud-

get (20%). The remaining runs will verify if it was

fruitful.

• Strategy 2: Only proceed with pruning if it results

in improving mean training score at each stage. If

it degrades performance, stop.

Strategy 1 has slightly less overhead as it is com-

posed of only one stage, but suffers from blind prun-

ing which in many cases fails to reap any beneﬁts.

Strategy 2 incorporates a feedback loop which indicates whether pruning was useful. In our preliminary experiments, we observed it to yield a much better overall outcome. Coupled with the pruning policies defined in Section 5.4, our Production Rule Pruning algorithm achieved good results.

The pseudocode listed in Algorithm 1 outlines the Production Rule Pruning (PRP) algorithm. The choice of pruning budget B and the number of generations for pruning runs gen determine the maximum possible number of stages. In our experiments, we set B to 20 and gen to 5. PRUNE is the key procedure in the algorithm. It performs two important functions:


1. It analyses production ranking scores and identi-

ﬁes the least worthy productions. Based on the

pruning policy, it identiﬁes how many productions

to prune at a given stage and returns that many

productions as candidates to be pruned.

2. It removes productions from the grammar and adds them to S_prune, which is implemented as a stack. At each stage, pruned productions are pushed to the stack.

The REVERT function undoes the last pruning action by popping the most recently pushed productions from S_prune and adding them back to the grammar. When a pruning stage reverts and budget still remains, gen is incremented, in our case from 5 to 10.

The output of the PRUNABLES function is a set of pruning suggestions. In our earlier work (Ali et al., 2021), we empirically evaluated the consistency of the first and second pruning suggestions (we prune at most 2 productions in a stage) and found them to be 99% and 95% consistent, respectively, over 100 experimental runs for certain randomly chosen problems.
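The staged loop of Algorithm 1, with its stack-based PRUNE and REVERT steps, can be sketched as follows. This is an illustrative reading of the pseudocode, not the AutoGE code: `run_stage` is a hypothetical callback standing in for the T evolutionary runs, the budget is assumed to be decremented by the generations just spent, and `fake_stage` is an invented stand-in used only to exercise the loop.

```python
def production_rule_pruning(grammar, budget, gen, run_stage, per_stage=2, gen_step=5):
    """Staged pruning with a feedback loop; returns the pruned production list.

    run_stage(productions, gen) must return (mean_best, worth), where worth
    maps each prunable production to its ranking score (higher = more useful).
    """
    current = list(grammar)
    pruned_stack = []                        # S_prune: one entry per pruning stage
    best = float("inf")                      # mean-best so far, lower is better
    while budget > gen:                      # loop condition as written in the listing
        mean_best, worth = run_stage(current, gen)
        budget -= gen                        # assumption: a stage costs `gen` generations
        if mean_best < best:
            # PRUNE: remove the least worthy productions and remember them.
            candidates = sorted(worth, key=worth.get)[:per_stage]
            current = [p for p in current if p not in candidates]
            pruned_stack.append(candidates)
            best = mean_best
        else:
            # REVERT: restore the most recently pruned productions, then run longer.
            if pruned_stack:
                current.extend(pruned_stack.pop())
            gen += gen_step
    return current


def fake_stage(productions, gen):
    """Invented scoring: losing sinh/cosh helps, losing anything else hurts."""
    worth_all = {"sinh": 0.1, "cosh": 0.2, "tanh": 0.3, "cos": 0.8, "sin": 0.9}
    missing_harmful = len({"sinh", "cosh"} - set(productions))
    missing_useful = len({"sin", "cos", "tanh"} - set(productions))
    mean_best = 10 - 2 * missing_harmful + 5 * missing_useful
    return mean_best, {p: worth_all[p] for p in productions}


pruned = production_rule_pruning(["sin", "cos", "tanh", "sinh", "cosh"], 20, 5, fake_stage)
```

In this toy run the first two stages prune, the third stage finds that the second pruning hurt the mean-best score, and REVERT restores those productions, leaving only `sinh` and `cosh` removed.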

5 EXPERIMENTAL SETUP

5.1 Dataset

Table 2 lists the problems considered in this work. All the datasets correspond to real-world symbolic regression benchmark problems which have been widely studied in several esteemed publication venues, as also noted by (Oliveira et al., 2018; Raymond et al., 2020). Except for the Dow Chemical dataset, which was sourced from the gpbenchmarks.org website (http://gpbenchmarks.org/?page_id=30), all other datasets were obtained from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets.php) and the CMU StatLib Archive (http://lib.stat.cmu.edu/datasets/).

The collection of datasets is diverse, including problems having 5 to 57 input features, with sample sizes varying from 60 to nearly 4900. There are no missing values in the datasets, and we utilize the raw values without any normalization. Each dataset is referred to with a short name (in distinct font), which will be used in the rest of the paper.

Table 2: List of Datasets.

Dataset                       Short name   Features   Instances
Airfoil Self-Noise            airfoil      5          1503
Energy Efficiency - Heating   heating      8          768
Energy Efficiency - Cooling   cooling      8          768
Concrete Strength             concrete     8          1030
Diabetes                      diabetes     10         442
Wine Quality - Red Wine       redwine      11         1599
Wine Quality - White Wine     whitewine    11         4898
Boston Housing                housing      13         506
Air Pollution                 pollution    15         60
Dow Chemical                  dowchem      57         1066

5.2 Parameters

Table 3 presents the evolutionary parameters used in all experimental runs. Note that we utilized repeated k-fold cross validation, with k = 10 and repeat factor r = 3. For each repetition we used a different seed to prepare a different training-test data split. The repetition ensures that we further minimize the chances of overfitting (Wong and Yeh, 2020).

Table 3: Parameter Settings.

Parameter               Value
Number of Runs          30
Population Size         250
Number of Generations   100
Search Engine           Steady-State GA
Cross Validation        10-fold (r=3)
Crossover Type          Effective Crossover
Crossover Probability   0.9
Mutation Probability    0.01
Selection Type          Tournament
Initialization Method   Sensible Initialization
Max Pruning Budget      20% (4 stages at max)
Extended function set   + − × / −x x x sin cos tan sinh cosh tanh e^(−x) ln|x| |x|
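The repeated k-fold scheme described above (k = 10, r = 3, a different seed per repetition) can be sketched with the standard library alone. The generator name `repeated_kfold` and the `base_seed` parameter are assumptions for illustration; the paper does not state how the seeds were chosen.

```python
import random


def repeated_kfold(n_samples, k=10, repeats=3, base_seed=0):
    """Yield (repeat, fold, train_indices, test_indices) for repeated k-fold CV."""
    for repeat in range(repeats):
        indices = list(range(n_samples))
        # A different seed per repetition gives a different shuffle, hence different splits.
        random.Random(base_seed + repeat).shuffle(indices)
        folds = [indices[i::k] for i in range(k)]          # k near-equal folds
        for fold in range(k):
            test = folds[fold]
            train = [i for j, other in enumerate(folds) if j != fold for i in other]
            yield repeat, fold, train, test


# Example with 60 samples, the size of the smallest dataset (pollution).
splits = list(repeated_kfold(60))
```

With k = 10 and r = 3 this produces 30 train/test splits, and within each repetition every sample appears in exactly one test fold.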

We ran all experiments on the libGE system (http://bds.ul.ie/grammatical-evolution/), which is an efficient C/C++ implementation of canonical GE and provides capabilities to effectively examine grammar productions.

The purpose of the fitness function is to measure the performance of the algorithm against a predefined objective. A common fitness function in symbolic regression is the Root Mean Squared Error (RMSE), which is defined as:

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}$$

where $n$ is the number of data points, $y_i$ is the target value, and $\hat{y}_i$ is the predicted value. RMSE assesses the mean extent of deviation from the desired value, so the goal of evolution is to minimize this error metric across generations.
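As a concrete reference for the fitness measure, a minimal RMSE function might look as follows. The function name and the input validation are choices made for this sketch.

```python
import math


def rmse(targets, predictions):
    """Root Mean Squared Error between target and predicted values."""
    if len(targets) != len(predictions) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum((p - t) ** 2 for t, p in zip(targets, predictions))
    return math.sqrt(squared_error / len(targets))


perfect = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])   # no deviation at all
off = rmse([0.0, 0.0], [3.0, 4.0])                 # sqrt((9 + 16) / 2)
```

A perfect model scores 0, and larger values indicate larger average deviation, which is why evolution minimizes this metric.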


5.3 Grammars and Function Set

We defined an extended function set (see Table 3), which is a superset of the mathematical functions commonly used in symbolic regression. It includes arithmetic operators, trigonometric functions, and exponential and power functions. We do not use protected division. However, we did include functions such as $e^{-x}$, tan, sinh, and cosh. These functions grow exponentially and are usually avoided (Nicolau and Agapitos, 2021). However, we kept them in our function set in order to validate whether our approach of production ranking and grammar pruning was able to remove such functions from the grammar.

The grammar which contains productions embodying the whole extended function set is called the Extended Grammar in this work (referenced with the letter 'E' in the results). The three grammar structures considered are shown in Table 1. The <var> rule includes as many alternative terminal productions as there are input variables in the dataset.

5.4 Pruning Policies

For the problems we examined, the usage frequency of arithmetic operators, input features, and constant terminals was low, irrespective of the grammar structure. We therefore do not consider their corresponding productions for pruning, nor the productions whose right-hand side is composed only of non-terminals (for example, productions in the start rule of the arity-based grammar in Table 1). This resulted in 14 prunable productions (excluding productions embodying arithmetic operators).

It is important to highlight a few other policies adopted while pruning, which serve as parameters to the PRP algorithm:

• We do not consume more than 20% of the compu-

tational budget on pruning. In our case, it meant

consuming at most 20 generations;

• Pruning takes place in stages. At each stage, we prune only 10% of the productions;

• In pruning runs, we evolve for 5 generations to

maximize pruning. If 5 generation runs terminate

with a positive REVERT decision and part of the

pruning budget is remaining, we proceed with 10

generations.
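The arithmetic behind these policies can be checked with a short calculation. The sketch assumes that the 10% share of productions is rounded up to a whole production, which agrees with the limit of at most 2 productions per stage mentioned in Section 4.3; the function and parameter names are hypothetical.

```python
import math


def pruning_limits(total_generations=100, budget_percent=20,
                   stage_generations=5, prunable=14, prune_percent=10):
    """Derive the pruning budget, stage count and per-stage pruning limit."""
    budget = total_generations * budget_percent // 100       # generations spent on pruning
    max_stages = budget // stage_generations                 # stages of 5 generations each
    per_stage = math.ceil(prunable * prune_percent / 100)    # assumption: round up
    remaining = total_generations - budget                   # generations left afterwards
    return budget, max_stages, per_stage, remaining


budget, max_stages, per_stage, remaining = pruning_limits()
```

With the settings of Table 3 this gives a budget of 20 generations, at most 4 stages of 5 generations, at most 2 productions pruned per stage, and 80 generations left for the run with the pruned grammar.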

Note that the grammar which results after pruning

is termed Pruned Grammar in this work and is refer-

enced with the letter ‘P’ in Table 4.

Figure 4: Productions pruned in ﬁrst two stages of pruning

across all datasets.

6 RESULTS

We conducted an extensive set of experiments over the 10 datasets. A single experiment comprised 30 runs for each given problem and grammar structure. Table 4 presents a summarized view of the results. It shows the impact of using various grammar structures, alongside the extended function set and the pruning approach, on training performance, test performance, and mean effective genome length for the best-of-run solution.

The letter on the right of each cell indicates which

grammar (‘E’ for the grammar with extended function

set, ‘P’ for pruned) achieved better results. Results

in italics indicate which grammar structure scored the

highest. When the numbers are in boldface, the differ-

ences are statistically signiﬁcant. The cells in yellow

indicate that pruning results in better scores, regard-

less of signiﬁcance.

6.1 Statistical Comparisons

Since the assumptions of normality and independence, as required by parametric tests, do not hold in general for experimental results of evolutionary computing approaches, we decided to use non-parametric tests, for which we followed the guidelines presented by (Derrac et al., 2011). In our work, we perform pairwise as well as multiple comparisons. All results were compared at the 0.05 statistical significance level.

For each problem, and each grammar structure,

we compared the best training, test, and effective

genome length scores (in 30 runs) among extended

and pruned grammars using a non-parametric Mann-

Whitney U-test (2-tailed version). Scores in boldface

indicate that the p-value, while comparing outcomes

resulting from extended vs. pruned grammar for a


Table 4: Effective genome length, test, and training performance comparisons. ‘E’ stands for extended, ‘P’ for pruned;

numbers in italics indicate which grammar structure scored best; bold indicates statistical signiﬁcance; yellow highlights

indicate improvement due to pruning.

ﬁxed grammar structure, was less than 0.05 and the

null hypothesis was therefore rejected.
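For readers unfamiliar with the Mann-Whitney U-test, the statistic itself is simple to compute from ranks. The sketch below computes only the two U values (with average ranks for ties) and leaves out the p-value, which in practice would come from a statistics library; the function name and sample data are hypothetical.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) for two independent samples, using average ranks for ties."""
    combined = sorted([(v, 0) for v in sample_a] + [(v, 1) for v in sample_b])
    rank_sum_a = 0.0
    pos = 0
    total = len(combined)
    while pos < total:
        end = pos
        while end + 1 < total and combined[end + 1][0] == combined[pos][0]:
            end += 1                                  # extend over a run of tied values
        average_rank = (pos + end) / 2 + 1            # ranks are 1-based
        rank_sum_a += average_rank * sum(
            1 for t in range(pos, end + 1) if combined[t][1] == 0
        )
        pos = end + 1
    n_a, n_b = len(sample_a), len(sample_b)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    return u_a, n_a * n_b - u_a


# Hypothetical scores from two grammars: fully separated samples.
u_extended, u_pruned = mann_whitney_u([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

When the two samples are completely separated, one of the U values is 0 and the other equals the product of the sample sizes, which is the most extreme evidence against the null hypothesis.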

To compare among the three grammar structures (mixed, arity-based, balanced), we utilized the Friedman test, where the null hypothesis assumes no effect or difference from adopting any of the grammar structures with respect to training performance, test performance, or genome length. Where the null hypothesis was rejected, the post-hoc analysis was carried out using Shaffer's static procedure (Derrac et al., 2011) to identify which pair of grammars yielded statistically different results. The outcome is discussed in the next section.

6.2 Effect of Grammar Structures

It is evident that some grammar structures are more

appropriate for certain types of improvement or for

certain problems. When comparing raw numbers

from Table 4:

• Mixed-arity grammar results in lower genome length for all problems except whitewine, for which the arity-based structure was the best choice, achieving significantly better training as well as test performance;

• Balanced arity-based grammar achieves better results in 7 problems (in training) and 5 problems (in test) out of 10 problems;

• An interesting set of results appears for the pollution and airfoil datasets, where mixed-arity grammar achieved much better results compared to the other two structures. This trend was observable from very early on in the evolution.

For both extended and pruned grammars, we compared three groups of results (named M for Mixed-arity, A for Arity-based, and B for Balanced arity-based) for each of the three outcomes (training, test, and effective size) using the Friedman test. For instance, considering the extended grammar, we ran a test which compared mean training scores for all 10 problems under the mixed-arity, arity-based, and balanced-arity grammar structures. In this way, we ran 6 Friedman tests, each comparing the results of the 3 grammar structures. For the training and test groups, no significant difference was found. However, for the effective size group, the null hypothesis was rejected (p < 0.05), which indicates that there are significant differences among the outcomes of the three grammar structures.
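To illustrate how such a comparison is set up, the sketch below computes the Friedman chi-square statistic for a table with one row per problem and one column per grammar structure. It uses the basic formula without tie correction and does not compute a p-value; the data are invented and the function name is hypothetical.

```python
def friedman_statistic(blocks):
    """Friedman chi-square statistic for rows of k related scores (no tie correction)."""
    n = len(blocks)                   # number of problems (blocks)
    k = len(blocks[0])                # number of grammar structures (treatments)
    rank_sums = [0.0] * k
    for row in blocks:
        order = sorted(range(k), key=lambda j: row[j])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1              # tied scores share the average rank
            average_rank = (pos + end) / 2 + 1
            for t in range(pos, end + 1):
                rank_sums[order[t]] += average_rank
            pos = end + 1
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


# Invented example: on all 10 problems structure M scores lowest, A middle, B highest.
q = friedman_statistic([[1.0, 2.0, 3.0]] * 10)
```

When one structure is ranked best on every problem the statistic reaches its maximum of n(k − 1), here 20; when all structures tie on every problem it is 0.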

For post-hoc analysis, we utilized Shaffer's static procedure, as recommended in (Derrac et al., 2011). With the pair-wise multiple comparisons among grammar structures, the following two pairs were identified as producing significant differences:

• Mixed-arity vs. Arity-based for extended gram-

mar with adjusted p-value (APV) of 0.02534;

• Mixed-arity vs. Balanced for pruned grammar

with APV of 0.00346.

Since in both pairs mixed-arity achieved better results, it was concluded that the mixed-arity structure significantly improves effective genome length with or without pruning. Many symbolic regression studies using GE or Grammar Guided Genetic Programming (GGGP) use the mixed-arity grammar structure, for instance (Nicolau et al., 2015). Based on our findings, we state that the choice is fruitful in exploring small-sized, less bloated solutions, but it does not guarantee gains in generalization or approximation.

6.3 Effect of Grammar Pruning

Pruning was applied in all experiments. The number of productions pruned varied from 2 to 6, out of 14 prunable productions. Figure 4 shows, for each production, the number of problems in which it was pruned during the first two stages of pruning with the different grammar structures. sinh, cosh, and x were recognized as unhelpful productions even in the initial generations. It would be interesting to explore why mixed-arity grammars rank log and −x so low. It is evident from the highlighted cells in Table 4 that pruning further reduces genome length, especially with the mixed-arity or balanced arity-based structures. Figure 6 shows the same effect as box plots for some of the problems.

Figure 5: Impact of grammar pruning on test performance of airfoil.
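To make the idea of frequency-based pruning concrete, the following is a simplified sketch and not the Production Rule Pruning algorithm itself. All names are hypothetical, the input data is invented for illustration, and it assumes that each individual is summarized as the list of productions used in its derivation, that usage is counted per occurrence, and that a fixed number of the least-used prunable productions is removed in one stage.

```python
def rank_productions(used_by_individual, prunable):
    """Return the prunable productions ordered from least to most used."""
    counts = {p: 0 for p in prunable}
    for used in used_by_individual:
        for production in used:
            if production in counts:
                counts[production] += 1
    # sorted() is stable, so ties keep their original order in `prunable`.
    return sorted(prunable, key=lambda p: counts[p])


def prune_least_used(used_by_individual, prunable, n_prune):
    """Drop the n_prune least-used productions; return (kept, removed)."""
    ranked = rank_productions(used_by_individual, prunable)
    removed = ranked[:n_prune]
    kept = [p for p in prunable if p not in removed]
    return kept, removed


# Invented example: productions used by three well-performing individuals.
prunable = ["sin", "cos", "sinh", "cosh", "log"]
usage = [["sin", "log", "sin"], ["cos", "sin"], ["log", "cos"]]
kept, removed = prune_least_used(usage, prunable, n_prune=2)
print(kept, removed)
```

In this toy run the two productions that never appear are the ones removed, which mirrors the observation that rarely used productions can be identified within the first few generations.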

Table 5 shows the percentage improvement obtained with the grammar pruning approach, comparing the extended grammar and pruned grammar results. Again, these differences are mostly significant for the effective size evaluations. In addition, the following inferences can be drawn from these results:

• Genome length improvements due to pruning are more prominent with the mixed-arity grammar.

• Pruning resulted in improved test performance on the housing, diabetes, redwine, and airfoil datasets, though the gains are not significant.

• For three datasets (cooling, heating, and whitewine), pruning, with the strategy tested here, could not improve test performance. Also, for the rest of the problems, the drop in test performance was not significant.

However, it is worth noting that, since we keep the same computational budget, trials with the pruned grammar lasted for 80 (in some cases 85) generations. Had the pruned grammar also been run for 100 generations, it is likely that it would have achieved better performance. Figure 5 shows a sample convergence plot (for the airfoil problem) where grammar pruning enhanced generalization performance compared with the extended grammar. The three spikes in the curve for the pruned grammar reflect that the pruning runs were carried out in three stages within the first 15 generations.

Table 5: Percentage Improvement with Pruning.

             Mixed-arity           Balanced
Dataset      Test      Eff. Size   Test      Eff. Size
housing       4.32%    39.73%      -5.28%     2.09%
diabetes      1.09%    37.04%       0.94%    22.05%
redwine       0.08%    29.33%       0.20%    18.62%
airfoil      -0.53%    39.42%       3.13%    25.44%
dowchem      -1.56%    30.90%      -0.15%    -4.06%
concrete     -3.96%    31.01%      -0.45%     5.26%
cooling      -4.53%    39.37%      -4.69%    -3.88%
heating      -8.29%    39.77%      -2.16%    18.50%
pollution   -12.59%    19.74%      -6.81%    28.43%
whitewine    -1.99%    -7.25%       1.68%    47.20%
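The entries in Table 5 compare each pruned-grammar result with its extended-grammar baseline. The exact formula is not restated here, so the sketch below assumes a plain relative change with respect to the baseline, signed so that a positive value means the pruned grammar did better. The function name and the `lower_is_better` flag are hypothetical, and the numbers in the usage lines are invented.

```python
def percent_improvement(baseline, pruned, lower_is_better=True):
    """Relative improvement of `pruned` over `baseline`, in percent.

    Assumption: improvement is measured relative to the baseline value.
    With lower_is_better=True (e.g. effective genome length or an error
    measure), a smaller pruned value yields a positive percentage.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    change = (baseline - pruned) if lower_is_better else (pruned - baseline)
    return 100.0 * change / abs(baseline)


# Invented values: an effective size shrinking from 100 to 60 codons.
print(percent_improvement(100.0, 60.0))   # positive: pruned genome is shorter
print(percent_improvement(100.0, 110.0))  # negative: pruned genome is longer
```

Under this reading, a negative entry in the table indicates that the pruned grammar performed worse than the extended grammar on that measure.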

7 CONCLUSION

We propose a new algorithm as part of the AutoGE tool suite under development. The proposed Production Rule Pruning (PRP) algorithm combines an extended function set with a frequency counting mechanism for ranking production rules. AutoGE achieved significantly better genome length in 7 out of 10 problems (with improvements in the other three as well), without significantly compromising test performance on any of them, while in three of the problems AutoGE shows a significant improvement in test performance. Our results highlight that a mixed-arity or balanced arity-based grammar structure can be a better choice for real-world symbolic regression problems.

7.1 Future Directions

An immediate extension to the current work is to improve the grammar pruning approach and trial it for feature selection, examining how this impacts test performance. A dynamic approach to pruning, improved ranking schemes, and further pruning strategies are also planned. We also intend to explore other problem domains, for instance program synthesis and Boolean logic. The performance of the PRP algorithm may be further enhanced by investigating other search mechanisms, for example particle swarm optimization or ant colony optimization. We aim to extend AutoGE’s suite of algorithms and to make it more robust by exploring approaches such as grammar-based EDAs.

ACKNOWLEDGEMENTS

This work was supported with the financial support of the Science Foundation Ireland grant 13/RC/2094_2 and co-funded under the European Regional Development Fund through the Southern & Eastern Regional Operational Programme to Lero - the Science Foundation Ireland Research Centre for Software (www.lero.ie).

ECTA 2021 - 13th International Conference on Evolutionary Computation Theory and Applications

Figure 6: Impact of Grammar Pruning on Effective Size.

REFERENCES

Ali, M. S., Kshirsagar, M., Naredo, E., and Ryan, C. (2021).

AutoGE: A Tool for Estimation of Grammatical Evo-

lution Models. In Proceedings of the 13th Inter-

national Conference on Agents and Artiﬁcial Intelli-

gence, pages 1274–1281. SCITEPRESS - Science and

Technology Publications.

Derrac, J., García, S., Molina, D., and Herrera, F. (2011). A

practical tutorial on the use of nonparametric statisti-

cal tests as a methodology for comparing evolutionary

and swarm intelligence algorithms. Swarm and Evo-

lutionary Computation, 1(1):3–18.

Fagan, D. and Murphy, E. (2018). Mapping in Grammatical Evolution. In Ryan, C., O’Neill, M., and Collins, J. J., editors,

Handbook of Grammatical Evolution, pages 79–108.

Springer International Publishing, Cham.

Gang, W. and Soule, T. (2004). How to Choose Appropri-

ate Function Sets for Genetic Programming. In Kei-

jzer, M., O’Reilly, U.-M., Lucas, S. M., Costa, E.,

and Soule, T., editors, Genetic Programming 7th Eu-

ropean Conference, EuroGP 2004, Proceedings, vol-

ume 3003 of LNCS, pages 198–207, Coimbra, Portu-

gal. Springer-Verlag.

Harper, R. and Blair, A. (2006). Dynamically Deﬁned Func-

tions In Grammatical Evolution. In Proceedings of the

2006 IEEE Congress on Evolutionary Computation,

pages 9188–9195, Vancouver. IEEE Press.

Koza, J. R. (1993). Genetic Programming: On the Pro-

gramming of Computers by Means of Natural Selec-

tion. MIT Press.

Koza, J. R. (1994). Genetic Programming II: Automatic

Discovery of Reusable Programs. MIT Press, Cam-

bridge Massachusetts.

Kshirsagar, M., Jachak, R., Chaudhari, P., and Ryan, C.

(2020). GEMO: Grammatical Evolution Memory Op-

timization System. In Proceedings of the 12th Inter-

national Joint Conference on Computational Intelli-

gence, pages 184–191. SCITEPRESS - Science and

Technology Publications.

Megane, J., Lourenco, N., and Machado, P. (2021). Proba-

bilistic Grammatical Evolution. In Hu, T., Lourenco,

N., and Medvet, E., editors, EuroGP 2021: Proceed-

ings of the 24th European Conference on Genetic Pro-

gramming, volume 12691 of LNCS, pages 198–213,

Virtual Event. Springer Verlag.

Murphy, A. and Ryan, C. (2020). Improving module iden-

tiﬁcation and use in grammatical evolution. In 2020

IEEE Congress on Evolutionary Computation (CEC),

pages 1–7.

Nicolau, M. and Agapitos, A. (2018). Understanding Gram-

matical Evolution: Grammar Design. In Ryan, C.,

O’Neill, M., and Collins, J. J., editors, Handbook of

Grammatical Evolution, pages 23–53. Springer Inter-

national Publishing, Cham.

Nicolau, M. and Agapitos, A. (2021). Choosing function

sets with better generalisation performance for symbolic regression models. Genetic Programming and

Evolvable Machines, 22(1):73–100.

Nicolau, M., Agapitos, A., O’Neill, M., and Brabazon, A.

(2015). Guidelines for deﬁning benchmark problems

in Genetic Programming. In 2015 IEEE Congress on

Evolutionary Computation (CEC), pages 1152–1159.

IEEE.

Oliveira, L. O. V. B., Martins, J. F. B. S., Miranda, L. F., and

Pappa, G. L. (2018). Analysing symbolic regression

benchmarks under a meta-learning approach. In Pro-

ceedings of the Genetic and Evolutionary Computa-

tion Conference Companion, pages 1342–1349, New

York, NY, USA. ACM.

O’Neill, M. and Ryan, C. (2000). Grammar based func-

tion deﬁnition in Grammatical Evolution. In Whit-

ley, D., Goldberg, D., Cantu-Paz, E., Spector, L.,

Parmee, I., and Beyer, H.-G., editors, Proceedings

of the Genetic and Evolutionary Computation Con-

ference (GECCO-2000), pages 485–490, Las Vegas,

Nevada, USA. Morgan Kaufmann.

Ortega, A., de la Cruz, M., and Alfonseca, M. (2007).

Christiansen grammar evolution: Grammatical evolu-

tion with semantics. IEEE Transactions on Evolution-

ary Computation, 11(1):77–90.

Patten, J. V. and Ryan, C. (2015). Attributed Gram-

matical Evolution using Shared Memory Spaces and

Dynamically Typed Semantic Function Speciﬁcation.

In Machado, P., Heywood, M. I., McDermott, J.,

Castelli, M., Garcia-Sanchez, P., Burelli, P., Risi, S.,

and Sim, K., editors, 18th European Conference on

Genetic Programming, volume 9025 of LNCS, pages

105–112, Copenhagen. Springer.

Raymond, C., Chen, Q., Xue, B., and Zhang, M. (2020).

Adaptive weighted splines: a new representation to

genetic programming for symbolic regression. In Pro-

ceedings of the 2020 Genetic and Evolutionary Com-

putation Conference, pages 1003–1011, New York,

NY, USA. ACM.

Ryan, C., Collins, J., and O’Neill, M. (1998). Grammati-

cal evolution: Evolving programs for an arbitrary lan-

guage. In Banzhaf, W., Poli, R., Schoenauer, M.,

and Fogarty, T., editors, Lecture Notes in Computer

Science (including subseries Lecture Notes in Artiﬁ-

cial Intelligence and Lecture Notes in Bioinformatics),

volume 1391, pages 83–96. Springer, Berlin, Heidel-

berg.

Ryan, C., O’Neill, M., and Collins, J. (2018). Introduction

to 20 Years of Grammatical Evolution. In Handbook

of Grammatical Evolution, pages 1–21. Springer In-

ternational Publishing, Cham.

Ryan, C., Raﬁq, A., and Naredo, E. (2020). Pyramid: A hi-

erarchical approach to scaling down population size in

genetic algorithms. In 2020 IEEE Congress on Evolu-

tionary Computation (CEC), pages 1–8.

Uy, N. Q., Doan, T. C., Hoai, N. X., and O’Neill, M.

(2013). Guiding Function Set Selection in Genetic

Programming based on Fitness Landscape Analysis.

In GECCO 2013 - Companion Publication of the 2013

Genetic and Evolutionary Computation Conference,

number 2, pages 149–150.

Wong, T. and Yeh, P. (2020). Reliable Accuracy Estimates

from k-Fold Cross Validation. IEEE Transactions on

Knowledge and Data Engineering, 32(8):1586–1594.
