Deriving Realistic Mathematical Models from Support Vector
Machines for Scientific Applications
Andrea Murari¹, Emmanuele Peluso², Saeed Talebzadeh², Pasqualino Gaudio², Michele Lungaroni²,
Ondrej Mikulin³, Jesus Vega⁴ and Michela Gelfusa²
¹Consorzio RFX (CNR, ENEA, INFN, Università di Padova, Acciaierie Venete SpA),
Corso Stati Uniti 4, 35127 Padova, Italy
²Associazione EURATOM-ENEA, University of Rome “Tor Vergata”, Roma, Italy
³Institute of Plasma Physics AS CR, Za Slovankou 3, Prague, Czech Republic
⁴Asociación EURATOM/CIEMAT para Fusión, Avda. Complutense, 22, 28040 Madrid, Spain
Keywords: Machine Learning Tools, Support Vector Machines, Symbolic Regression, Genetic Programming.
Abstract: In many scientific applications, it is necessary to perform classification, which means discrimination
between examples belonging to different classes. Machine Learning Tools have proved to perform very
well in this task and can achieve very high success rates. On the other hand, the “realism” and
interpretability of their results are very low, limiting their applicability. In this paper, a method to derive
manageable equations for the hypersurface between classes is presented. The main objective consists of
formulating the results of machine learning tools in a way representing the actual “physics” behind the
phenomena under investigation. The proposed approach is based on a suitable combination of Support
Vector Machines and Symbolic Regression via Genetic Programming; it has been investigated with a series
of systematic numerical tests, for different types of equations and classification problems, and tested with
various experimental databases. The obtained results indicate that the proposed method permits finding a
good trade-off between the accuracy of the classification and the complexity of the derived mathematical equations.
Moreover, the derived models can be tuned to reflect the actual phenomena, providing a very useful tool to
bridge the gap between data, machine learning tools and scientific theories.
1 THE NEED FOR DATA MINING
TOOLS IN BIG PHYSICS
EXPERIMENTS
In many fields of science, the complexity of the
problems investigated is such that it can become
difficult, if not impossible, to describe the
phenomena to be studied with theoretical models
based on first principles. A typical example in
physics is the case of magnetic confinement
thermonuclear fusion, whose plasmas are so
complex that various levels of modelling (particle,
fluid, kinetic, etc.) coexist without providing a
satisfactory description of many aspects of the
physics (Wesson, 2004). On the other hand, in the
last decades much more data have become available,
due to the diffusion of cheap sensors and powerful
computers. For example, the Big Physics European
experiments are affected by a data deluge. At
CERN, the ATLAS detector can produce Petabytes
of data per year. The Hubble space telescope
managed to send to earth Gigabytes of data per day
and the data warehouse of the Joint European Torus
exceeds 350 Terabytes. Therefore the inadequacies
of theoretical models and the vast amounts of
information available have motivated the
development of data driven tools, to complement
hypothesis driven theories. In this perspective,
various machine learning methods have been
developed. They range from Neural Networks and
Support Vector Machines to Fuzzy Logic classifiers;
a series of examples from the field of thermonuclear
fusion can be found in (Rattà, 2010; Vega, 2014;
Murari, 2009). Manifold learning tools, such as Self
Organising Maps and Generative Topographic
Maps, and simple classifiers based on the Geodesic
distance on Gaussian manifolds, have provided very
good results also in terms of describing the space in
which the relevant physics takes place (Cannas,
2013; Murari (A), 2013; Vega, 2009).
Even if these data driven tools are providing
quite impressive performance, their main problem is
the mathematical formulation of their models. They
have shown the potential to learn very efficiently
from the provided examples but their results are
expressed in a way that does not allow an easy
interpretation of the physics behind the phenomena
under study. This aspect is quite worrying and has
hampered the penetration of many machine learning
tools in scientific disciplines such as physics. Some
of the main problems are: a) poor “physics fidelity”,
i.e. an excessive difference between the mathematical
form of the models and the physical reality of the
phenomena investigated; b) difficulty in interpreting
the results in terms of traditional mathematical
formulations; c) the consequent impossibility of comparing
the obtained results with traditional mathematical
models and theories; d) lack of extrapolability of the
results.
In order to overcome these limitations, a new
methodology has been developed to profit from the
knowledge acquired by the machine learning tools,
but presenting it in a more traditional format, in
terms of manageable formulas. The techniques,
developed in the framework of the activities
presented in this paper, address the basic goal of
classification. This is a very important task in many
scientific applications, both “per se” and as a
preliminary step to more sophisticated
investigations. The main idea behind this work
consists therefore of combining the learning
capabilities of the machine learning tools with the
“realism” and interpretability of more traditional
mathematical formulations.
The sequence of steps required to implement the
proposed technique is the following (a schematic
sketch is given after the list):
1. Training of the SVM with the available data;
2. Populating the hypersurface on a suitable grid of points;
3. Identification of the hypersurface equation with Symbolic Regression via Genetic Programming;
4. Double checking of the obtained equation using the SVM.
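As an illustration of how the four steps fit together, a minimal sketch in Python could look as follows (scikit-learn is assumed for the SVM step, since the paper does not specify an implementation; find_boundary_points and fit_symbolic_regression are placeholders for the procedures detailed in Sections 3 and 4):

```python
from sklearn.svm import SVC

def derive_boundary_equation(X, y, find_boundary_points, fit_symbolic_regression, tol):
    # 1. Train the SVM on the available data (X: features, y: labels in {+1, -1}).
    svm = SVC(kernel="rbf").fit(X, y)
    # 2. Populate the hypersurface on a suitable grid of points.
    boundary_points = find_boundary_points(svm, X)
    # 3. Identify the hypersurface equation with SR via GP.
    equation = fit_symbolic_regression(boundary_points)
    # 4. Double check: points sampled from the candidate equation must lie
    #    within the tolerance of the SVM boundary (|D(x)| < tol).
    check_points = equation.sample_points()
    is_valid = all(abs(d) < tol for d in svm.decision_function(check_points))
    return equation, is_valid
```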
This approach reconciles the prediction and
knowledge discovery capability of machine learning
tools with the need to formulate the results in such a
way that they can be related to scientific theories and
models. It is worth emphasizing that the objective of
the present work is not simply improving
interpretability of machine learning tools, on which
significant work has already been done. The most
important aspect indeed is “physics fidelity” i.e. the
formulation of the results in mathematical terms,
which can be compared with theories and models of
the various scientific disciplines. Therefore, the
proposed method must have the potential to derive
mathematical expressions, which reflect the
underlying dynamics of the phenomena to be
investigated.
In the presented approach, the first knowledge
discovery step is based on Support Vector Machines
(SVM), whose mathematical background is
summarized in Section 2. The choice of SVM is
mainly due to their structural stability, their
capability to maximize the safety margins in the
classification. Indeed in many applications, SVM
can classify with a success rate well in excess of
95%; therefore their hyperplane can be considered a
good approximation of the boundary between the
classes (Murari (A), 2013). As a consequence, the
equation of their hypersurface in the original space
can be considered an excellent approximation of the
boundary. On the other hand, their mathematical
representation of the boundary is extremely non
intuitive (see Section 2). Again referring to a
system of the complexity of the Joint
European Torus, in the case of disruptions, the
hypersurface can be determined by hundreds
of support vectors and therefore its equation
contains an equal number of addends.
More importantly, in addition to presenting serious
problems for human understanding, the SVM
models do not reflect the actual dynamics of the
phenomena under study. It has indeed been shown,
also with many numerical examples (see Section 5),
that the models provided by SVM have absolutely
no relation with the ones generating the data. A
simple methodology has already been proposed and
applied to complex problems, to recover the
equation of the boundary in the case of linear kernels
(Gaudio, 2014). In this paper, a new technique is
developed, which is fully general. Indeed the
proposed method can be applied to SVM with any
type of kernel and therefore it has a much wider
range of applications than the more traditional
techniques. This aspect is very important in many
scientific fields, whose phenomena cannot be simply
modelled by linear tools or logistic regression.
To formulate the outputs of SVM in a way suitable
for scientific investigations, extensive use is made
of Symbolic Regression (SR) via Genetic
Programming (GP); these tools are therefore
described in Section 3. Symbolic regression is
basically used to fit points on the hypersurface found
by the SVM, which is the boundary between the
classes.
The actual combination of the various tools, to
provide the equation of the boundary between two
regions of the operational space in a physically
relevant form, is described in detail in Section 4. The
results of a systematic series of numerical tests,
proving the potential of the proposed methodology,
are the subject of Section 5. Some examples of
application to experimental databases, covering
different scientific disciplines, are provided in
Section 6. Discussions and lines of future
developments are the subject of the last Section 7.
2 INTRODUCTION TO SVM FOR
CLASSIFICATION
SVM are mathematical tools which can perform
more general tasks, such as regression, but which are
used as classifiers for the studies described in this
paper. In intuitive terms, given a set of input
examples, which belong to two different classes,
SVM map the inputs into a high-dimensional space
through some suitable non-linear mapping. In this
high dimensional feature space, an optimal
separating hyperplane is constructed in order to
minimize the risk of misclassification. The
minimization of the error risk is obtained by
maximizing the margins between the hyperplane and
the closest points, the support vectors, of each class.
This is achieved by a careful selection of the
constraints of a suitable functional to minimize. In
the case of non-separable problems, the points to
classify are projected into a higher dimensional
space with the help of suitable kernels. The
minimization of the error risk and the maximization
of the margins is then performed in this projected
space. The hyperplane is identified by a subset of
points of the two classes, named Support Vectors
(SV).
In mathematical terms, given a training set of
samples $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in \mathcal{R}^d$, for a
binary classification problem (i.e. $y_i \in \{+1, -1\}$),
the SVM estimates the following decision function:

$$D(x) = \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b \qquad (1)$$

where $K(x_i, x)$ is a kernel function and the
parameters $\alpha_i$ and $b$ are the solutions of the
following quadratic optimization with linear
constraints.
Maximization of the functional:

$$W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \qquad (2)$$

subject to the constraints:

$$\sum_{i=1}^{n} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C, \quad \forall i = 1, \ldots, n \qquad (3)$$

where $C$ is a regularization parameter (Vapnik,
2013).
The data points $x_i$ associated with nonzero
values of the coefficients $\alpha_i$ are called support
vectors, which give the name to the technique. Once
the support vectors have been determined, the SVM
boundary between the two classes can be expressed
in the form

$$D(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b = 0 \qquad (4)$$

$D(x)$ is the distance (with sign) from the input $x$
to the hyper-plane that separates the two classes and,
hence, the hyper-plane points satisfy $D(x) = 0$.
The rule to classify a feature vector $u$ as class $C_1$
or class $C_2$ is given by:

if $\mathrm{sgn}(D(u)) > 0$ then $u \in C_1$, otherwise $u \in C_2$,

where $\mathrm{sgn}(t)$ is the sign function.
A comment on the nomenclature is in order at
this point. The SVM operates in the transformed
space and finds a separating hyperplane in that
space. On the other hand, the hyperplane is
expressed in terms of Support Vectors in the original
space, in which the boundary is a hypersurface.
Since typically in scientific applications scientists
are interested in equations in the original space, and
not in the transformed one, the boundary between
the two classes will be indicated with the term
hypersurface and not hyperplane in the following.
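To make equations (1)-(4) concrete, the following minimal sketch (Python with scikit-learn, an assumed implementation) trains an RBF-kernel SVM on a toy two-class problem and uses its decision function as the signed distance whose zero level set is the separating hypersurface:

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class problem: class +1 above the curve x2 = sin(x1), class -1 below it.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(4000, 2))
y = np.where(X[:, 1] > np.sin(X[:, 0]), 1, -1)

svm = SVC(kernel="rbf", C=10.0).fit(X, y)      # eqs. (2)-(3) are solved internally
print("number of support vectors:", svm.support_vectors_.shape[0])

# D(x) of eq. (1): signed distance-like quantity; D(x) = 0 is the boundary of eq. (4).
D = svm.decision_function(X[:5])
predicted_class = np.sign(D)                   # classification rule sgn(D(u))
```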
3 SYMBOLIC REGRESSION VIA
GENETIC PROGRAMMING
As mentioned in the first Section, this paper
describes a technique to present the results of
machine learning tools in a mathematical form
describing realistically the actual phenomena to be
studied. In the case of classification with SVM, this
task consists of representing the hypersurface
separating the classes in a more meaningful way
than the sum of hundreds of terms as in (4). To this
end, the main tool used is Symbolic regression via
Genetic Programming. The methods developed, on
the one hand, allow identifying the most appropriate
mathematical expression for the hypersurface
without “a priori” hypotheses. In this way therefore
the potential of SVM is fully exploited and no
unnecessary restrictions are imposed on the form of
the solutions. On the other hand, the complexity of
the obtained solutions can be controlled, allowing to
find the best trade-off between complexity, success
rate of classification and realism of the final models,
depending on the objectives of the study.
The method of SR via GP consists of testing
various mathematical expressions to fit a given
database. The main steps to perform such a task are:
1. Identification of the best mathematical form for the model with SR via GP;
2. Optimization of the models with nonlinear fitting;
3. Qualification and selection of the best model with the Pareto Frontier implemented with statistical criteria (e.g. BIC, KLD).
First of all, the various candidate formulas are
expressed as trees, composed of functions and
terminal nodes. The function nodes can be standard
arithmetic operations and/or any mathematical
functions, squashing terms as well as user-defined
operators (Schmidt, 2009; Koza, 1992). This
representation of the formulas allows an easy
implementation of the next step, symbolic regression
with Genetic Programming (GP). Genetic Programs
are computational methods able to solve complex
optimization problems (Schmidt, 2009; Koza, 1992).
They have been inspired by the genetic processes of
living organisms. They work with a population of
individuals, e.g. mathematical expressions in our
case. Each individual represents a possible solution,
a potential boundary equation in our case. An
appropriate fitness function (FF) is used to measure
how good an individual is with respect to the
database. Genetic operators (Reproduction,
Crossover and Mutation) are applied to individuals
that are probabilistically selected on the basis of the
FF, in order to generate the new population. That is,
better individuals are more likely to have more
children than inferior individuals. When a stable and
acceptable solution, in terms of complexity, is found
or some other stopping condition is met (e.g., a
maximum number of generations or acceptable error
limits are reached), the algorithm provides the
solution with best performance in terms of the FF
(Murari (D), 2015; Murari (C), 2015; Peluso, 2014;
Murari, 2016). It is worth emphasizing that the AIC,
introduced below as the fitness function, is a
criterion to be minimised; the lower the AIC, the
better the model.
The fitness function is a crucial element of the
genetic programming approach and it can be
implemented in many ways. To derive the results
presented in this paper, the AIC criterion (Akaike
Information Criterion) has been adopted (Hirotugo,
1974) for the FF. The form of the AIC indicator
defined here is:

$$\mathrm{AIC} = 2k + n \cdot \ln(\mathrm{MSE}) \qquad (5)$$

In equation (5), MSE is the Mean Square Error
between the data and the model predictions, $k$ is the
number of nodes used for the model and $n$ the
number of $y_{data}$ provided, i.e. the number of entries
in the database (DB). The FF parameterized above
takes into account the goodness of the models,
thanks to the MSE, and at the same time their
complexity is penalized by the dependence on the
number of nodes.
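A minimal numerical rendering of the fitness function of equation (5), assuming the AIC is computed directly from the residuals of a candidate model, could be:

```python
import numpy as np

def aic_fitness(y_data, y_model, k):
    """AIC-like fitness of eq. (5): 2*k penalizes the number of nodes k of the
    candidate tree, n*log(MSE) rewards goodness of fit; lower is better."""
    y_data, y_model = np.asarray(y_data), np.asarray(y_model)
    n = y_data.size
    mse = np.mean((y_data - y_model) ** 2)
    return 2 * k + n * np.log(mse)
```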
To assess the quality of the various equations and
to select the final model, for each level of
complexity the three best models are retained. This
subset of well-performing equations is used to build
a Pareto Frontier, a plot of the quality of the
equation versus its complexity (Lotov 2009). To
quantify the quality of the various equations the
well-known criteria of BIC (Bayesian Information
Criterion) and Kullback-Leibler (KLD) divergence
have been implemented. The Pareto Frontier, made
up using the BIC and the complexity of the models,
typically shows a trend resembling an “L”; above a
certain level, increasing the complexity does not
increase the accuracy of the models significantly.
Therefore, near the inflection point the models
achieving the best trade-off between complexity and
accuracy can be found. Once selected, the
KLD is used to perform the final sifting. The model
with the lowest value of the KLD is in fact the one
finally chosen.
Coming to the indicators used to build the Pareto
Frontier, in practice the BIC criterion (Hirotugo,
1974) is typically defined as:

$$\mathrm{BIC} = n \cdot \ln\left(\sigma^2(res)\right) + k \cdot \ln(n) \qquad (6)$$

where $res = y_{data} - y_{model}$ are the residuals,
$\sigma^2(res)$ their variance and the other symbols are
defined in analogy with the AIC expression. Again
the better the model, the lower its BIC.
The aim of the KLD is to quantify the difference
between the computed probability distribution
functions, in other words to quantify the information
lost when $q(y_{model})$ is used to approximate
$p(y_{data})$ (Murari (D), 2015). The KLD is
defined as:

$$\mathrm{KLD}(p\,||\,q) = \int p(x) \cdot \ln\frac{p(x)}{q(x)}\, dx \qquad (7)$$

where the symbols are defined as above. The
Kullback Leibler Divergence assumes positive
values and is zero only when the two probability
distribution functions (pdfs), p and q, are exactly the
same. In our application p is the pdf of the data,
considered the reference, and q the pdf of the model
estimates. Therefore the smaller the KLD is, the
better the model approximates the data, i.e. the less
information is lost by representing the data with the
model. A detailed overview of SR via GP for
scientific applications is provided in (Kenneth,
2002).
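The indicators of equations (6) and (7) can be sketched as follows; a discretized KLD over binned probability estimates is assumed here, since the paper does not prescribe how the two pdfs are computed:

```python
import numpy as np

def bic(y_data, y_model, k):
    """BIC of eq. (6): n*log(variance of the residuals) + k*log(n); lower is better."""
    res = np.asarray(y_data) - np.asarray(y_model)
    n = res.size
    return n * np.log(np.var(res)) + k * np.log(n)

def kld(p, q, eps=1e-12):
    """Discrete Kullback-Leibler divergence of eq. (7) between two pdfs given
    on the same bins; it is non-negative and zero only when p and q coincide."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))
```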
4 SVM AND SYMBOLIC
REGRESSION FOR
BOUNDARY EQUATIONS
This Section describes in detail the combination of
SVM technology with SR via GP to obtain the
equations of the boundary between classes in a form
appropriate for scientific investigations. Subsection
4.1 introduces the proposed way to find points on
the hypersurface identified by the SVM. Subsection
4.2 describes the use of symbolic regression for the
derivation of the actual formula of the boundary
between the classes.
4.1 How to Find Points on the SVM
Hypersurface
In order to interpret the results produced by the
SVM, the first step consists of determining a
sufficient number of points on the hypersurface
separating the two classes. These points can be then
given as inputs to the SR to obtain a more
manageable equation for the hypersurface. To obtain
the SVM hypersurface points, a mesh is built first,
with resolution equal or better than the error bars of
the measurements used as inputs to the SVM. In this
step, a suitable mesh throughout the domain defined
by the ranges of variables is generated; therefore, if
the problem presents n dimensions and m grid points
are generated for each dimension, the grid will
consist of $m^n$ grid points.
After building the grid, the algorithm starts from
the closest points to the SVs on the positive side of
the hypersurface and moves towards the closest
points of the grid to the SVs on the other side, one
point of the mesh at a time. At each step, the
distance to the hypersurface is computed using the
already trained SVM. If the distance remains
positive, the process is repeated since the new point
remains on the same side of the hypersurface. When
the distance of a new point changes sign, the two
points with different signs are considered points on
the hypersurface. This assumption is more than
reasonable because, by construction of the mesh,
these points, for which the distance changes sign, are
within a distance from the hypersurface equal or
smaller than the error bar of the features (typically
measurements).
Figure 1: Illustrative example of the methodology to find
the SVM hypersurface points. The red and blue points are
the two classes (either mesh points or points of the classes
themselves, classified according to the SVM). The black
points are those found on the hypersurface.
Therefore, for all practical purposes, the points
found as previously described are sufficiently close
to the hypersurface to be considered on it. This way
to obtain SVM hypersurface points for synthetic data
is shown pictorially in Figure 1.
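A simplified rendering of this search is sketched below: a regular mesh is built over the variable ranges, the trained SVM evaluates the signed distance at every mesh point, and a sign change between two neighbouring points flags a location within one mesh step of the hypersurface. The walk along each grid axis is a simplifying assumption of the sketch; the actual implementation moves from the support vectors on one side towards those on the other.

```python
import numpy as np

def boundary_points_on_grid(svm, ranges, m=100):
    """Scan a regular m**n mesh and collect the mesh points where the SVM
    decision function changes sign along some axis, i.e. points lying within
    one mesh step of the separating hypersurface."""
    axes = [np.linspace(lo, hi, m) for lo, hi in ranges]
    mesh = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)   # shape (m, ..., m, n)
    # Signed distance of every mesh point (memory permitting; m**n points).
    d = svm.decision_function(mesh.reshape(-1, len(ranges)))
    d = d.reshape([m] * len(ranges))
    points = []
    for axis in range(len(ranges)):
        # Sign change between neighbouring mesh points along this axis.
        change = np.diff(np.sign(d), axis=axis) != 0
        for idx in np.argwhere(change):
            points.append(mesh[tuple(idx)])
    return np.unique(np.array(points), axis=0)
```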
It is good practice to repeat the process also
starting from the other side of the hypersurface, in
order to avoid possible bias in the selection of the
points on the hypersurface. An adequate number of
points is typically a multiple of the support vectors.
One order of magnitude more points than SVs is a
safe choice, in the sense that all the numerical tests
performed have always provided more than
satisfactory results with this number of points or
higher. If a lower number of points on the
hypersurface is considered, the final equation can be
too smooth and might not fully represent the
complexity of the boundary between the classes.
Attention can also be usefully paid to the fact that the
density of the points reflects the density of the SVs in
the feature space. In any case, it is easy to increase the
number of points up to the number necessary. The
main limitation here is computational time
(see Section 5.4), not any difficulty of principle.
Once the candidate points sufficiently near to the
hypersurface have been obtained, it is in any case
good practice to perform some checks before proceeding.
This can be easily achieved using again the already
trained SVM. It is sufficient to input to the SVM the
candidate points, obtained with the previously
described procedure, and verify that the distance to
the hypersurface is smaller than the error bars.
4.2 How to Derive the Equation of the
Hypersurface via SR
Once it has been verified that sufficient points close
to the hypersurface have been found, the equation of
the hypersurface itself can be estimated using SR via
GP. Indeed the points identified with the procedure
described in the previous subsection are on the
boundary between the two classes. Therefore the
equation of that surface is the equation of the
boundary between the two classes.
An efficient way of retrieving the equation of the
hypersurface from the points consists of regressing
them with SR, using the quantity with the largest
dynamic range as the independent variable. The
quality of the obtained equation can be assessed first
with the statistical indicators described in Section 3.
Moreover, an additional and more conclusive test
can be performed, exploiting again the trained SVM.
In this case, it is indeed possible to generate a series
of points from the candidate formula and insert them
in the SVM. If the distance from these points and the
hypersurface is sufficiently close to zero, it can be
confirmed that indeed the equation is a good
representation of the boundary between the two
classes. As a criterion of closeness to the boundary,
typically the value of the error bars of the
measurements can be taken: if the points generated
by the equation are at a distance from the
hypersurface smaller than the error bars, for all
practical purposes the obtained equation can be
considered a sufficient approximation of the
boundary between the two classes.
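This final check can be expressed compactly as below; candidate_formula, the sampling of the independent variable and the use of the error bar as tolerance are problem-specific assumptions:

```python
import numpy as np

def equation_is_valid(svm, candidate_formula, x1_samples, err_bar):
    """Accept the SR equation if every point it generates lies within err_bar
    of the SVM hypersurface, i.e. |D(x)| < err_bar for all generated points.
    Assumes a two-dimensional problem with x2 = candidate_formula(x1)."""
    pts = np.column_stack([x1_samples, candidate_formula(x1_samples)])
    return bool(np.all(np.abs(svm.decision_function(pts)) < err_bar))
```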
5 NUMERICAL TESTS AND
RELATED RESULTS
The procedure described in the previous section has
been subjected to a systematic series of numerical
tests. The results have always been positive and the
proposed technique has always allowed recovering
the original equations describing the boundary
between the two classes. In the following, the
detailed procedures for these numerical tests are
described and some results presented. For clarity’s
sake, mainly low dimensional cases are illustrated in
the following, but it has been verified that the
approach is equally valid for high dimensional cases
(up to 8 or 9 independent variables), provided of
course a sufficient number of examples and adequate
computational resources are available.
5.1 Overall Procedure for Producing
Synthetic Data
The main technique to produce synthetic data and to
test the methodology consists of the following 6
steps:
1- Definition of an initial function for the
boundary;
2- Generating samples of the two classes from
the function
3- Training the SVM for classification
4- Building an appropriate mesh on the domain
5- Determining a sufficient number of points on
the hyper-surface identified by the SVM
6- Deploying SR to identify the equation of the
hypersurface from the points previously
obtained
In the following more details about this
procedure are provided. To fix the ideas, the
discussion is particularized for the case of two
independent variables $x_1$ and $x_2$.
In the first step, an initial function as a
combination of arithmetic, trigonometric, and
exponential operators of the independent variables $x_i$ is
defined. In general, this function can be written as
follows:

$$y = f(x_1, x_2, \ldots, x_n), \qquad x_i \in (a_i, b_i) \qquad (8)$$
In the second step, for the case of two
independent variables, it is typically sufficient to
generate about 4000 random points in the range of
variables for the $x_i$ and to calculate $y$ for them. Then,
a positive offset and some random values are added
to the $y$ for half of the data to produce the first class;
a negative offset and some random values are added
to $y$ for the other half to produce the second class.
The equations for producing the two classes $(C_1, C_2)$
can be summarized as follows:

$$C_1 = y + U(0, L) + c, \qquad C_2 = y + U(0, L) - c \qquad (9)$$

where $c$ stands for the arbitrary offset and
$U(0, L)$ stands for a random uniform distribution
between 0 and the bulk thickness of the data $L$.
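A compact sketch of this generation step, for two independent variables and the offsets of equation (9), could be the following (the boundary function f, the ranges, the offset c and the noise thickness L are placeholders):

```python
import numpy as np

def make_classes(f, ranges, n_points=4000, c=1.0, L=0.5, seed=0):
    """Generate the two classes of eq. (9) around the boundary y = f(x1, x2):
    C1 = y + U(0, L) + c (above the surface), C2 = y + U(0, L) - c (below it)."""
    rng = np.random.default_rng(seed)
    x = np.column_stack([rng.uniform(lo, hi, n_points) for lo, hi in ranges])
    y = f(x[:, 0], x[:, 1])
    half = n_points // 2
    y_shifted = np.concatenate([y[:half] + rng.uniform(0, L, half) + c,   # class C1
                                y[half:] + rng.uniform(0, L, half) - c])  # class C2
    points = np.column_stack([x, y_shifted])          # (x1, x2, y) samples
    labels = np.r_[np.ones(half), -np.ones(half)]     # +1 for C1, -1 for C2
    return points, labels
```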
Table 1: General GP parameters for the calculation of the
boundary equations.
GP Parameters Value(s)
Population size 500
Selection method Ranking and Tournament
Fitness function AIC
Constant range Integers between -10 and 10
Maximum depth of trees 7
Genetic operators (Probability) Crossover (45 %), Mutation (45 %), Reproduction (10 %)
In the third step, an SVM with "Gaussian Radial
Basis Function kernel" is trained. The method used
to find the separating hyperplane is "Sequential
Minimal Optimization". Depending on the level of
random noise, different success rates can be
obtained. For the numerical tests presented in the
following, the success rate in the classification of the
SVM is always very close to 100%.
In the fourth step, a mesh on the domain has to
be built in order to identify points sufficiently close
to the hypersurface. For this reason, each dimension
of the domain has been subdivided into one hundred
intervals, producing one million mesh points ($100^3$).
The fifth step consists of the identification of the
points sufficiently close to the hypersurface, with the
algorithm described in Section 4.
In the sixth step, the selected hypersurface points
are used as inputs to the symbolic regression code,
to find the appropriate formula for describing the
hypersurface. The settings adopted to run the GP
implementing the SR are reported in Table 1.
In the next sections, some examples are provided
to illustrate the applicability and capability of the
presented methodology for systems of increasing
dimensionality and complexity.
5.2 Examples for Two Independent
Variables
As a first test, a quite complex function comprising
exponential, arithmetic, and power operators has
been assumed for the boundary between the two
classes. The function and ranges of the variables are
reported in equation (10):

$$y = \cdots \cdot \cdots, \qquad x_1 \in (0, 1), \quad x_2 \in (1, 3) \qquad (10)$$
After carrying out the six-step procedure
previously described, the expression in equation (11)
has been obtained:
$$y = 0.974 \cdot \cdots \qquad (11)$$
SR via GP converges on a final expression that is
in excellent agreement with the initial function
describing the boundary between the two classes,
even without making recourse to the non-linear
fitting step.
As an additional test, a more complex function
comprising trigonometric and arithmetic operators
has been defined and 4% noise was added to the
database. The function and ranges for the variables
are reported in equation (12):
$$y = \sin(\cdots) + \cdots, \qquad x_1 \in (-3, 3), \quad x_2 \in (-2, 2) \qquad (12)$$
After carrying out the six-step procedure
previously described, the expression in equation (13)
has been obtained:
$$y = 0.985 \cdot (\sin(\cdots) + \cdots) \qquad (13)$$
Again SR via GP converges on a final expression
that is in excellent agreement with the initial
function describing the boundary between the two
classes, even without making recourse to the non-
linear fitting step. Figure 2 presents the results of
this example in pictorial form.
Figure 2: Points and surfaces of the example of equation
(12). Green rectangles are points generated from the initial
function, Cyan points are the points belonging to the first
class, Magenta points are the points belonging to the
second class, and the Yellow surface identifies the hyper-
surface obtained with the SR via GP.
5.3 Effect of Noise and High
Dimensional Data
The numerical examples presented previously
include cases where the success rate of the SVM
classification is close to 100%. This is certainly an
interesting situation from a scientific point of view;
the SVM has learned almost perfectly the boundaries
between the classes and therefore the main issue
remaining consists of formulating the equations of
these boundaries in a mathematical form appropriate
for understanding the phenomena.
On the other hand, it has been checked with
extensive numerical tests that, if the success rate of
classification of the SVM is significantly lower than
100%, the proposed method works well anyway,
since its objective is the reformulation of the
boundary equation found by the SVM. The success
rate required for the SVM and the interpretation of
the results is an issue which depends on the
application and the objectives of the analysis but
does not impact on the validity of the developed
technique.
Table 2: The ranges of the variables used to generate the
data with function (14).
Steps: Values:
Initial Function: Eq. (14)
Ranges of Variables: (0, 2), (1.5, 3), (−2, 4), (0, 6), (4, 12), (1, 4)
Number of Nodes for Each Class: 2000
It is worth also emphasizing that the task of SR
in this context is not to improve the success rate of
the SVM classification. The real goal consists of
representing the equations of the boundary between
the classes in more realistic and interpretable
mathematical forms, so that they can be used by the
scientists for actual understanding (for example for
comparison with theories and first principle models).
To achieve this, a reasonable degradation of
classification success rate is tolerable and typically
not a major issue. In any case, with an appropriate
implementation of the proposed method, typically
the performance of SVM can be preserved by the
final equations obtained with symbolic regression.
As mentioned, there is no conceptual difficulty in
applying the proposed methodology to higher
dimensional problems. Of course, the computational
resources required increase exponentially with the
number of independent variables (the so called curse
of dimensionality). Also the number and quality of
the examples must be adequate. But these are
problems related to the available computational
power and/or the quality of the data; in no way do they
affect the applicability of the proposed technique.
Indeed it has been verified with a series of
systematic tests that, with adequate level of
computer time, problems in higher dimensions can
also be solved. A quite demanding example is
reported in the following, for an equation involving
7 variables. The equation used to generate the data
is:

$$y = \cdots + \sin(\cdots) + \cos(\cdots) \qquad (14)$$
It is worth mentioning that in many applications
in physics and chemistry one has to deal with
problems of a dimensionality not higher than 7.
Equation (14) is therefore of realistic complexity for
many applications. A total of 4000 points, 2000 per
class, has been generated starting from equation
(14); more details about the synthetic data are
provided in Table 2. After generating the grid,
training the SVM and finding the hyper-surface
points, SR via GP has been applied and the
expression for the obtained hyper-surface is reported
in equation (15):
$$y = 0.9 \cdot (\cdots) + \sin(\cdots) + \cos(\cdots) \qquad (15)$$
The equation found by the method is practically
the original one. The slightly different multiplicative
factor in front is not to be ascribed to a weakness of
the method but to the dataset provided as input,
since the accuracies of both the SVM and the
mathematical equation obtained are equal to 100%.
Again, this example proves that, provided the
surface of the boundary between the cases is
sufficiently regular, the dimensionality is not an
insurmountable issue, if enough computational
power is available.
5.4 Computational Requirements
As an indication about the computational resources
required for the application of the proposed
technique, the run time for an example of 5 variables
has been calculated. Using a computer with 8 cores
and 24 gigabytes of RAM (an Intel Xeon E5520, 2.27
GHz, 2 processors), with Windows 64 bit operating
system, finding the hyper-surface points takes 3
hours and the SR calculation 48 hours. The number
of points on the grid is $16^4 \cdot 51$: 16 for each of the four
independent variables and 51 for the dependent one.
In this respect, the run time to train the SVM is not a
major problem, since it is typically of the order of
minutes and therefore negligible compared to the
other steps of the procedure. Moreover, the
calculation of the grid is also not a major issue since
the step requiring by far most of the computational
resources is the SR. On the other hand, it should be
mentioned that the codes used to obtain these results
had not been parallelized. Therefore, since both the
building of the grid and the Genetic Programs can be
easily parallelized, reduction of the computational
resources of orders of magnitude could be easily
achievable.
6 REAL WORLD EXAMPLES
To show the potential of the proposed methodology
to attack real life problems, in this section its
application to some experimental databases is
reported. The data have been collected in the
framework of various disciplines but the original
measurements have all been obtained via remote
sensing. The term remote sensing indicates the set of
techniques aimed at obtaining information about
objects without being in contact with them. These
techniques can be used to monitor various aspects of
the atmosphere and also the effects of human
activities on the environment.
6.1 Botany: “Wilt” Database
As an example of application to a real-world
problem, first a database related to botany named
“wilt” has been selected. This database was prepared
by Brian Johnson from the Institute of Global
Environmental strategies in Japan in 2013 and
contains the results of a remote sensing study about
detecting diseased trees with Quickbird imagery
(Johnson, 2013). The data set consists of image
parts, generated by segmenting the pansharpened
pictures. The segments contain spectral information
from the Quickbird multispectral image bands and
texture information from the panchromatic (Pan)
image band. In the following, the entries of this
database are listed:
Class: “w” (diseased trees) or “n” (all other
land cover)
GLCM_Pan: GLCM mean texture (Pan
band)
Mean_G: Mean green value
Mean_R: Mean red value
SD_Pan: Standard deviation (Pan band)
This database contains 4339 samples: 74 of them
related to diseased trees and the rest related to all
other land cover. The new proposed methodology
has been applied to this database for finding the
classification hyper-surface between the two
mentioned classes. The entries have been classified
first with the SVM (with the RBF kernel). The
subsequent application of our technique, grid plus
SR, has allowed the following equation to be found:

$$\cdots = 22.39 \cdot (\cdots)^{\cdots} \qquad (16)$$
The previous equation provides a Train Accuracy
equal to 99.4% and a Test Accuracy of 99.5%,
which are practically the same as the SVM, not only
in terms of global statistics but also with regard to
the individual cases properly or improperly
classified. Given the success rate in excess of 99%,
the derived equation (16) indicates that the important
attributes for classifying this database are the Mean
green values and the Mean red values. Figure 3
reports the entries of the database projected on the
plane of these two variables, together with the
hyper-surface obtained with equation (16).
Figure 3: Distribution of data in the “wilt” database. The
red points are diseased trees and the blue points indicate
all other types of land cover. The black line indicates the
equation obtained for the hyper-surface.
Figure 4: Examples of LIDAR back scattered signals: a)
Clear atmosphere (blue line) b) strong smoke plume
(green line) c) widespread smoke (red line).
It is also worth mentioning that, to obtain the
same success rate, the SVM has to utilize 1299
support vectors. Therefore the application of the
proposed methodology results in a simplification of
orders of magnitude in the complexity of the
equation, without any significant loss in terms of
classification accuracy. Moreover, the obtained
formula is susceptible of comparison with models
and theoretical considerations, whereas the SVM
model is practically intractable from this point of
view. It is worth noting that a metric taking into
account the unbalance in the data, such as positive-
predictive ratio, could in principle be considered,
given the fact that examples of diseased trees are
about two orders of magnitude fewer than the
healthy ones. On the other hand, given the accuracy
already achieved, this would not add much to the
present treatment.
6.2 Remote Sensing of the Atmosphere:
Detection of Widespread Smoke
with LIDAR
One of the remote sensing techniques, which is
gaining increasing importance, is LIDAR, an
acronym for Light Detection And Ranging. Lidar
originated in the early 1960s, shortly after the
invention of the laser, and combines laser-focused
imaging with radar's ability to calculate distances by
measuring the time for a signal to return. Its first
deployment was in meteorology and now it is
popularly used as a technology to make high-
resolution maps, with applications in geomatics,
archaeology, geography, geology, geomorphology,
seismology, forestry, remote sensing, atmospheric
physics, laser altimetry and contour mapping.
Wild fires have become a very serious problem
in various parts of the world. The LIDAR technique
has been successfully applied to the detection of the
smoke plume emitted by wild fires, allowing the
reliable survey of large areas (Fiocco, 1963;
Andreucci, 1993; Bellecci, 2007; Bellecci, 2010;
Vega, 2010; Gelfusa, 2014; Gelfusa, 2015).
Recently, mobile compact systems have been
successfully deployed in various environments. Up
to now, the attention has been devoted to early
detection of quite concentrated smoke plumes,
characterizing the first stage of fires, as soon as
possible. The main operational approach consists of
continuously monitoring the area to be surveyed
with a suitable laser and, when a significant peak in
the backscattered signal is detected, an alarm is
triggered. In these applications, the backscattered
signal presents strong peaks, which are detected with
various techniques. In other applications, it would be
interesting also to detect the non concentrated,
widespread smoke, which can be the consequence of
strong wind dispersion or non concentrated sources
(Marrelli, 1998). In this case, the signature of the
presence of the smoke is not a strong peak in the
detected power but an overall increase of large
regions of the curve. Typical examples of
backscattered signals for the alternatives of no
smoke, strong smoke plume and widespread smoke
are shown in Figure 4. Starting from the typical
Lidar equation (Andreucci 1993), it has been
decided to fit the backscattered signal intensity with
a mathematical expression of the form:

$$P(R) = K_1 \cdot \frac{e^{-K_2 \cdot R}}{R^2} \qquad (17)$$

where $K_1$ and $K_2$ are constants and $R$ is the
range. The data of Figure 4 have been fitted with this
formula. The result of the non linear fit, for the
widespread smoke is reported in equation (18) and
for clear atmosphere in equation (19):

$$P(R) = 2.648 \times 10^{\cdots} \cdot \frac{e^{-(\cdots)\cdot R}}{R^2} \qquad (18)$$

$$P(R) = 1.734 \times 10^{\cdots} \cdot \frac{e^{-(\cdots)\cdot R}}{R^2} \qquad (19)$$
The results of the fit, equations (18) and (19),
indicate quite clearly that the values of the parameter $K_2$ are
very similar for both the case of widespread smoke
and clear atmosphere. On the other hand, there is a
clear difference, of the order of 25%, in the constants
$K_1$. This is expected since $K_1$ includes the effect of
the coefficient β, which indeed quantifies the
backscattering properties of the atmosphere
(Andreucci, 1993; Bellecci, 2007; Bellecci, 2010;
Vega, 2010).
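The nonlinear fit of equation (17) can be reproduced, for instance, with a least-squares routine of the kind sketched below; the use of scipy's curve_fit and the initial guesses are assumptions, since the paper does not state which fitting tool was used:

```python
import numpy as np
from scipy.optimize import curve_fit

def lidar_model(R, K1, K2):
    """Backscattered power of eq. (17): P(R) = K1 * exp(-K2 * R) / R**2."""
    return K1 * np.exp(-K2 * R) / R**2

def fit_k1_k2(R, P):
    """Fit (K1, K2) for one acquisition; K1 tracks the backscattering
    coefficient beta, so it separates clear air from widespread smoke."""
    popt, _ = curve_fit(lidar_model, R, P, p0=(np.max(P) * R[0]**2, 1e-4))
    return popt
```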
Table 3: Main characteristics of the database used for the
LIDAR application.
Total number of data: 521
Number of non-smoke data: 312
Number of widespread smoke data: 209
Number of train data (~80%): 431
Number of test data (~20%): 90
Since the attempt to identify the presence of
widespread smoke is a quite pioneering application
of the LIDAR technique, it is important not only to
be able to discriminate between the two situations
but also to provide models for the interpretation of
the physics. In particular, the identification of the
boundary in the space of the parameters K
1
and K
2
for the two cases is considered an essential piece of
information for comparison with theories. The
proposed methodology has therefore been applied to
a quite substantial database summarized in Table 3.
For the SVM, a radial basis functions kernel has
been used. The best equation found with SR is:

$$\cdots = 0.1083 \cdot \sin(15.61 \cdot \cdots) + \cdots + \cos(1.59 \cdot \cdots) \qquad (20)$$
The above equation provides a Train Accuracy
of 89.33 % and a Test Accuracy of 91.11 %,
practically the same as the SVM success rate. The
equation of the boundary between clear atmosphere
and widespread smoke, in the space of the
parameters $K_1$ and $K_2$, is shown in Figure 5. To
understand the importance of the results obtained, it
should also be considered that the model of the SVM
consists of 154 support vectors. Therefore the level
of simplification obtained with equation (20) is
substantial.
Figure 5: Equation (20), describing the boundary between
the cases of clear atmosphere and widespread smoke, in
the space of the parameters $K_1$ and $K_2$.
7 CONCLUSIONS
An original methodology has been devised to obtain
the equation of the boundary between two classes,
starting from an SVM classifier. In this way, using
SR via GP, the power of machine learning tools is
combined with the realism, physics fidelity and
interpretability of equations expressed in the usual
formalism of typical scientific theories. In particular,
the choice of SVM ensures that their structural
stability, their capability to maximize the safety
margins in the classification, is fully retained in the
final result. On the other hand, symbolic regression
allows finding the best trade-off between accuracy
of the classification and complexity of the final
equations of the boundary, depending on the
application. Moreover, “a priori” information can
also be exploited in order to steer the solutions
towards mathematical expressions, which reflect the
actual dynamics of the phenomena under study. This
can be achieved for example by selecting properly
the basis functions or by constraining the structure
of the trees. Given the fact that the objectives of the
approach are realism and interpretability, a
reasonable reduction of the classification
performance is not a major issue and can be
tolerated. It is also true that symbolic regression via
genetic programming can reproduce the accuracy of
the classification by the SVM, provided a
sufficiently high number of mesh nodes and the
necessary complexity of the SR are allowed for.
The numerical tests shown have proved the
effectiveness of the proposed technique to identify
the real equation of the boundary between classes
even in relatively high dimensions, provided the
shape of the boundary is a sufficiently regular
surface. Again, this seems to be fully adequate since,
in the majority of the scientific applications, the
boundaries between the various classes are quite
regular functions. This has been confirmed by the
application of the technique to experimental
databases of different scientific disciplines.
On the other hand, the method is susceptible of
various improvements. First of all, the technique
should be extended to other machine learning tools,
such as neural networks. More fundamentally, the
approach is now limited to identifying the
mathematical expressions of boundaries which can
be expressed as functions. It is a topic of future
investigations to apply the method to the
investigation of more complex boundaries (for
example multiply connected hypersurfaces).
Moreover, the task of regression, and not only
classification, should also be tackled (Murari (D),
2015; Murari (C), 2015; Peluso, 2014; Murari,
2016). Also applications to various aspects of
tomography inversion and disruptions are envisaged
(Martin, 1997; Murari (B), 2013).
REFERENCES
Andreucci F and Arbolino M., 1993, Il Nuovo Cimento,
16, 1, 35 (1993).
Bellecci C et al, 2007, Appl. Phys. B, 87, 373.
Bellecci C et al., 2010, Optical Engineering, 49 (12),
124302.
Cannas B. et al, 2013, Nucl. Fusion 53 093023
Fiocco G. and Smullin L. D., 1963, Nature, 199, 1275.
Gaudio et al., 2014, Plasma Phys. Control. Fusion, 56
114002.
Gelfusa M. et al, 2014, Review Scientific Instr., 85,
063112
Gelfusa M. et al, 2015 “First attempts at measuring
widespread smoke with a mobile lidar system”, IEEE
Xplore, ISBN: 978-1-78561-068-4
Hirotugo A., 1974, IEEE Transactions on Automatic
Control 19 (6): 716–723, 1974
Johnson B.et al, 2013, International Journal of Remote
Sensing, 34 (20), 6969-6982.
Kenneth P. B and Anderson D. R., 2002, “Model Selection
and Multi-Model Inference: A Practical Information-
Theoretic Approach”, Springer (2nd ed)
Koza J.R., 1992, “Genetic Programming: On the
Programming of Computers by Means of Natural
Selection”, MIT Press, Cambridge, MA, USA.
Lotov A. V. et al, 2009, “Interactive Decision Maps:
Approximation and Visualization of Pareto Frontier”,
Springer, ISBN 978-1-4020-7631-2.
Marrelli L. et al, 1998, “Total radiation losses and
emissivity profiles in RFX”, Nucl. Fusion 38 (5), 649
Martin P. et al, 1997, Review of scientific instruments 68
(2), 1256-1260
Murari A., et al, 2009, Nucl. Fusion, 49 055028 (11pp)
Murari A., et al. (A), 2013, Nucl. Fusion 53 033006 (9pp)
Murari A., et al, (B), 2013 Nucl. Fusion, 53 043001
doi:10.1088/0029-5515/53/4/043001
Murari A. et al (C), 2015, Plasma Physics. Control.
Fusion, 57 014008, doi: http://dx.doi.org/10.1088/
0741-3335/57/1/014008
Murari A et al (D), 2015, Nucl. Fusion 55 073009 (14pp)
doi:10.1088/0029-5515/55/7/073009
Murari A. et al., 2016, Nucl. Fusion 56 026005,
doi:http://dx.doi.org/10.1088/0029-5515/56/2/026005
Peluso E. et al, 2014, Plasma Phys. Control. Fusion, 56
114001,doi:http://dx.doi.org/10.1088/0741-
3335/56/11/114001
Rattà G. A. et al., 2010, Nucl. Fusion. 50 025005 (10pp).
Schmidt M. and Lipson H., 2009 April, Science, Vol 324
Vapnik V., 2013 “The Nature of Statistical Learning
Theory”, Springer Science & Business Media, ISBN
1475724403, 9781475724400
Vega et al, 2010, Review of Scientific Instruments, 81 (2),
p. 023505
Vega J et al, 2014, Nucl. Fusion 54 123001
Vega J. et al, 2009, Nucl. Fusion. 49 085023 (11pp)
Wesson J., “Tokamaks”, Clarendon Press, Oxford, 2004.
Third edition.