Evaluation Platform for Artificial Intelligence Algorithms

Zoltan Czako, Gheorghe Sebestyen and Anca Hangan

Department of Computer Science, Technical University of Cluj-Napoca, Romania

Keywords:

Artificial Intelligence Algorithms, Evaluation Platform, Methodology.

Abstract:

Artificial intelligence (AI) algorithms are currently receiving a lot of attention from researchers as well as from commercial product developers. Hundreds of different AI algorithms, aimed at different kinds of real-life problems, produce qualitatively different results depending on the nature of the data, the nature of the problem and the context in which they are used. Choosing the most appropriate algorithm to solve a particular problem is not a trivial task. The goal of our research is to create a platform that can be used in the early stage of problem solving. With this platform, the user will be able to quickly train, test and evaluate several artificial intelligence algorithms and to find out which algorithm performs best for a specific problem. Moreover, the platform will help developers tune the parameters of the chosen algorithm in order to get better results on their problem. We demonstrate our approach by first running different types of algorithms on a sample breast cancer dataset and then using the platform to solve an anomaly detection problem.

1 INTRODUCTION

In the last decade, new data acquisition technologies (e.g. sensor networks, IoT, Industry 4.0) and the wide spread of Internet-based applications have made huge amounts of data available. Efficient, higher-level processing of such "big data" requires intelligent algorithms, not just for academic purposes, but also for real software products. More and more companies are trying to use artificial intelligence to solve complex problems that earlier required a human expert. The hardest thing in enterprise applications that use artificial intelligence is estimating the time needed to create a minimum viable product. The difficulty of the estimation arises from the fact that, before starting the project, it is very hard to know exactly which algorithm will be the best match for that particular problem and that particular dataset.

Algorithms can produce very different and surprising results depending on the nature of the problem and the nature of the training data. A good approach to overcome this difficulty is to take a small part of the dataset, implement multiple artificial intelligence algorithms and benchmark them on this small sample. This can be a very time-consuming process, and enterprise projects usually do not have that much time. Our intention is to create a platform that can be used to test multiple algorithms and evaluate them using appropriate metrics, thereby reducing the time to production.

Another problem that usually occurs is the tidiness of the data. Raw data can have multiple problems (e.g. noise, missing samples, etc.), which occur because of bugs in the data collection process. Preprocessing the data can be a challenging task that may require multiple subsequent steps; therefore our platform contains a set of preprocessing algorithms that can be applied by a domain specialist without specific programming knowledge. The data can also be visualized in multiple ways (2D and 3D charts, histograms, FFT, etc.), which can help in spotting problems within the raw data. Another important aspect of AI algorithms is that multiple algorithms can be combined to improve the quality of the results. In order to combine different algorithms, the platform should offer the possibility to pipeline the partial results in a reconfigurable manner.

As a practical result, in the following sections we describe our platform, which can be used to evaluate different AI techniques with different types of datasets and to configure the most effective pipeline in different contexts. We show results obtained with different algorithms with and without preprocessing, demonstrating in this way the effectiveness of the preprocessing step. We tune and evaluate multiple algorithms and compare the results. The purpose of this article is therefore to describe a platform that allows a user to select, tune and pipeline multiple AI algorithms as well as preprocessing and display tasks. All these operations may be done without writing any code and without programming skills. This tool is very useful in the early stage of a project, when the feasibility of a given approach must be measured or decided. The tool gives an estimate of the quality level that may be obtained using different AI algorithms. This reduces the time to production and increases the chance of a successful implementation. Furthermore, we introduce a step-by-step process for choosing the right sequence of algorithms.

Czako, Z., Sebestyen, G. and Hangan, A.
Evaluation Platform for Artificial Intelligence Algorithms.
DOI: 10.5220/0006888900390046
In Proceedings of the 10th International Joint Conference on Computational Intelligence (IJCCI 2018), pages 39-46
ISBN: 978-989-758-327-8

The rest of the paper is organized as follows. Section 2 gives a brief description of related work. Section 3 contains the theoretical classification of different algorithms and the decision tree that can be used to choose the proper algorithm for a given problem. The architecture of the platform is described in Section 4. Experiments can be found in Section 5, and Section 6 concludes the article.

2 RELATED WORK

Choosing the best algorithm for a specific problem domain, taking into consideration multiple factors (e.g. the type of the data involved, the nature of the problem, etc.), is a general problem in developing intelligent products. Multiple articles try to classify AI algorithms from different points of view.

In (Dasgupta and Nath, 2016) we can find a typical classification of machine learning algorithms, based on the nature of the training data, as follows:

- supervised, if the training data is labeled;

- unsupervised, if there are no labels;

- semi-supervised, if some of the class labels are missing.

The problem with this classification is that it considers only the nature of the training data, without taking into consideration the context of the problem. This approach reduces the search space, but each class still has too many algorithms to be considered, tested and evaluated. Other articles describe specific problems, but only in the context of classification (Ilias et al., 2007), regression (Gulden and Nese, 2013) or one specific algorithm. Many articles focus on a specific problem and on only one algorithm to solve it. In the case of anomaly detection, for example, there are plenty of articles discussing different scenarios and comparing the results of different algorithms.

In (Zareapoor et al., 2012) the authors make a comprehensive evaluation of the most popular algorithms used in the context of credit card fraud detection. The article contains a brief description of algorithms such as Bayesian Networks, Neural Networks, SVM, etc., and ends with a table comparing the results using metrics such as accuracy, speed and cost. Other articles take a different approach, surveying all the algorithms that can be used for a specific problem.

In (Varun et al., 2009) there is a survey of the algorithms that can be used to effectively detect anomalies. This article does not narrow its view to only supervised or only unsupervised algorithms. On the contrary, the authors describe an extensive list of algorithms that can be used in the larger context of anomaly detection. There are plenty of research papers comparing two or more different algorithms, as in (Juan et al., 2004), (Murad et al., 2013), (Agrawal and Agrawal, 2016) and (Gupta et al., 2014).

The papers mentioned above tried to help engineers find the best algorithm for specific problems, but none of them managed to create a step-by-step process, a useful methodology for choosing the right algorithm for any type of problem and any type of dataset. In this paper we propose a step-by-step methodology that can be applied to choose an algorithm. Moreover, we present our platform, which can be used to evaluate the chosen algorithms.

3 THE PROPOSED METHODOLOGY FOR CHOOSING AI ALGORITHMS

There are no universally good or bad algorithms; each algorithm is specific to a context, a problem or a type of dataset. This idea was demonstrated by David H. Wolpert and William G. Macready in (Wolpert and Macready, 1997), in the so-called "No Free Lunch Theorems for Optimization". The No Free Lunch Theorems (NFLT) are a set of mathematical proofs and a general framework that explores the connection between general-purpose algorithms that are considered "black-box" and the problems they solve. They state that no algorithm that searches for an optimal cost or fitness solution is universally superior to any other algorithm. Wolpert and Macready wrote in their paper that "If an algorithm performs better than random search on some class of problems then it must perform worse than random search on the remaining problems."
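For reference, the first theorem of (Wolpert and Macready, 1997) can be written as below. The notation is reproduced from memory of that paper and should be checked against it: a_1 and a_2 are any two search algorithms, f ranges over all cost functions on a finite search space, m is the number of distinct points evaluated so far, and d_m^y is the sequence of cost values observed at those points.

```latex
\[
  \sum_{f} P\left(d_m^{y} \mid f, m, a_1\right)
  \;=\;
  \sum_{f} P\left(d_m^{y} \mid f, m, a_2\right)
\]
```

Summed over all possible cost functions, the two algorithms are equally likely to observe any given sequence of cost values, so any advantage on one class of problems is paid for on another.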

In the real world, we need to decide on engineering solutions to build practical models that solve real problems. For certain pattern recognition problems, we tend to find that certain algorithms perform better than others, which is a consequence of how well the algorithm's fitness or cost function fits the particular problem. For any given problem, the NFLT should remind us that, in order to find the best algorithm, we need to focus on the particular problem at hand, the assumptions, the priors (extra information), the data and the cost.

IJCCI 2018 - 10th International Joint Conference on Computational Intelligence

Figure 1: Algorithm Selection Decision Tree.

What the NFLT tell us is that we are generally not going to find off-the-shelf algorithms that fit our data perfectly. We have to architect the algorithm to better fit the data. Because there are no universally applicable algorithms, we must have a methodology to architect our model in order to find optimized solutions. Figure 1 shows a decision tree that helps engineers choose the right algorithm based on the problem context and the nature of the data.

Based on Figure 1, the first step in choosing the appropriate AI algorithm is to determine the nature of the problem, that is, to establish the goal we want to achieve. Based on this, the algorithms can be grouped into four categories: Classification, Clustering, Prediction, and Visualization (or Dimensionality Reduction). After defining the nature of the problem, the next step is to focus on the nature of the training data. Here we should ask about the size of the available training data, whether the dataset has labels, and how important the features are: can we transform the features to get more relevant ones, or should we keep the original features? Answering these questions can drastically reduce the search space, in some cases leaving only one algorithm, which makes the developer's job much easier. These questions are very important because, for example, if we do not have enough examples in the training set (fewer than 100K), then Neural Networks can give poor results and a linear SVC can do a better job. As another example, if the features are important and we should preserve their original values, then PCA is not a good choice, because it modifies the features using an orthogonal transformation; a better choice would be Kernel Approximation or GA.
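The questions above can be read as a small rule set. The sketch below encodes only the rules that are spelled out in the text of this paper; it is not a transcription of Figure 1. The function names and return strings are hypothetical, and the branch for large datasets is an assumption inferred from the remark about Neural Networks.

```python
from typing import List


def suggest_classifiers(n_samples: int, labeled: bool, is_text: bool) -> List[str]:
    """Hypothetical encoding of the classification rules quoted in the text."""
    if not labeled:
        # The text does not describe the unlabeled branch, so nothing is suggested.
        return []
    if n_samples < 100_000:
        candidates = ["Linear SVC / SVM"]
        if not is_text:
            # Alternatives named in Section 5.2 for non-text data.
            candidates += ["KNeighbors", "Random Forest", "Ensemble Methods"]
        return candidates
    # Assumption: with enough examples, Neural Networks become a candidate.
    return ["Neural Network"]


def suggest_reducers(keep_original_features: bool) -> List[str]:
    """PCA transforms the features, so it is skipped when they must be kept."""
    if keep_original_features:
        return ["Kernel Approximation", "GA"]
    return ["PCA"]


thyroid_candidates = suggest_classifiers(n_samples=3772, labeled=True, is_text=False)
```

For a labeled, non-text dataset with 3772 instances, the sketch proposes SVM first and keeps the remaining classifiers as fallbacks.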

4 ARCHITECTURE OF OUR PLATFORM

The platform developed by our team is meant to offer a pragmatic tool for specialists involved in artificial intelligence related projects. This tool is not meant as a final solution but as a starting point in the process of finding the method that best fits the quality and efficiency criteria of a given application domain. Therefore the platform contains the basic functionalities that allow a specialist to collect data, process it with AI algorithms and visualize it. Although the platform was meant for general use, most of our experiments were performed in the area of anomaly and outlier detection in big datasets. Here are the functionalities we considered necessary for such a platform:

• Data harvesting tools - allow the acquisition of data from a variety of datasets having different formats (Excel, CSV, ARFF) or from different physical sources (sensor networks, smart devices, IoT); special treatment is given to real-time data collections and datasets containing anomalies;

• Data preprocessing tools - instruments meant to transform the raw data into noise-free and normalized data; typical methods included in this category are parameterized filters, transforms (e.g. FFT, wavelet) and histograms; there are also methods for determining statistical parameters of the input data such as min-max, median value, standard deviation, etc.;

• Artificial Intelligence algorithms - a wide group of methods that try to cover the most representative nodes of the presented taxonomy; our goal is to offer a rich set of possibilities from which to choose and compare; the open nature of the platform allows new methods to be added to the existing library; methods can be tested without programming knowledge, because the platform offers an easy-to-use user interface;

• Visualization tools - very important in the process of finding the best artificial intelligence methods because they offer a two-dimensional representation of otherwise multidimensional data, which is much easier for the human eye to understand;

• Generators of datasets and artificial signals - necessary in the process of validating or measuring the quality of new artificial intelligence methods; special techniques are applied in order to combine "normal" data with artificially created anomalies.

Figure 2: Application Diagram.
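As an illustration of the last item in the list, the sketch below shows one simple way to combine a "normal" signal with artificially created anomalies. It is not the generator implemented in the platform: the sine-plus-noise signal, the spike size and the function name are assumptions chosen for the example.

```python
import math
import random
from typing import List, Tuple


def make_signal_with_anomalies(
    n_points: int = 200,
    n_anomalies: int = 5,
    spike: float = 5.0,
    seed: int = 0,
) -> Tuple[List[float], List[int]]:
    """Return (signal, labels); labels[i] is 1 where an anomaly was injected."""
    rng = random.Random(seed)
    # "Normal" data: a sine wave with a small amount of Gaussian noise.
    signal = [
        math.sin(2 * math.pi * i / 50) + rng.gauss(0.0, 0.05)
        for i in range(n_points)
    ]
    labels = [0] * n_points
    # Pick distinct positions and add a large offset at each of them.
    for i in rng.sample(range(n_points), n_anomalies):
        signal[i] += spike
        labels[i] = 1
    return signal, labels


signal, labels = make_signal_with_anomalies()
```

Because the labels are known by construction, such a signal can be used to measure the precision and recall of a detection method.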

The tools mentioned above are integrated into an open architecture platform that allows continuous extension with new methods. A unique, predefined internal data format assures interoperability and interchangeability between the existing tools and functionalities.

The high-level overview of the included functionalities is shown in Figure 2.

To create a more complex model that uses multiple preprocessing techniques and/or multiple artificial intelligence algorithms, the platform supports the creation of a pipeline. The output of each pipeline step is saved in a .csv file, so the results of each step can be used separately. This makes the pipeline highly configurable: it can have multiple steps, and steps can be skipped, reordered or reused. Figure 3 shows a high-level view of the pipeline.
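The idea of a pipeline whose intermediate results are stored as .csv files can be sketched as follows. This is a minimal illustration under our own assumptions, not the platform's code: the file naming scheme, the helper functions and the two toy steps are hypothetical.

```python
import csv
import tempfile
from pathlib import Path
from typing import Callable, List, Sequence, Tuple

Rows = List[List[float]]
Step = Tuple[str, Callable[[Rows], Rows]]


def write_csv(path: Path, rows: Rows) -> None:
    with path.open("w", newline="") as handle:
        csv.writer(handle).writerows(rows)


def read_csv(path: Path) -> Rows:
    with path.open(newline="") as handle:
        return [[float(cell) for cell in row] for row in csv.reader(handle)]


def run_pipeline(rows: Rows, steps: Sequence[Step], out_dir: Path) -> Path:
    """Run the steps in order; each step's output goes to its own CSV file."""
    current = out_dir / "00_input.csv"
    write_csv(current, rows)
    for index, (name, func) in enumerate(steps, start=1):
        result = func(read_csv(current))  # each step reads the previous file
        current = out_dir / f"{index:02d}_{name}.csv"
        write_csv(current, result)
    return current


def drop_negative(rows: Rows) -> Rows:
    """Toy preprocessing step: discard rows that contain a negative value."""
    return [row for row in rows if all(value >= 0 for value in row)]


def scale_by_ten(rows: Rows) -> Rows:
    """Toy preprocessing step: divide every value by ten."""
    return [[value / 10 for value in row] for row in rows]


out_dir = Path(tempfile.mkdtemp())
final_path = run_pipeline(
    [[10.0, 20.0], [-1.0, 5.0], [30.0, 40.0]],
    [("drop_negative", drop_negative), ("scale", scale_by_ten)],
    out_dir,
)
result = read_csv(final_path)
```

Since every step only depends on the file written by the previous one, a step can be removed from the list, moved, or rerun from an existing intermediate file.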

The most important library used to create this platform is scikit-learn (SciKit-Learn, 2017). A similar tool is Weka (Weka, 1997).

5 EXPERIMENTS

The purpose of this section is to show, through some examples, the usage possibilities of the artificial intelligence platform. Here we emphasize two aspects: the possibility to choose between different anomaly detection techniques and the ability to tune the parameters of a given method in order to obtain better results.

5.1 Experiments using the Breast Cancer Dataset

The next experiments were all made on the same dataset, which contains information related to breast cancer (Lichman, 2013). This dataset contains 569 data points and each data point has 32 attributes. The dataset is labeled, showing normal (i.e. benign) and abnormal (i.e. malignant) data points. A part of the dataset was used in the training (learning) phase and the rest in the testing phase.

Figure 3: Overview of the pipeline.

The next figures show how our platform can be used by someone who has no programming knowledge, only by choosing options in the graphical user interface.

The first step is to load our dataset. We then select an AI algorithm that can classify the data into benign and malignant samples. Our first choice is the SVM (support vector machine) algorithm. In the next step we configure and train our SVM model using the loaded dataset, without preprocessing the data. In the case of the SVM, we can configure the C parameter, which is the penalty parameter of the error term; the value of gamma, which is the kernel coefficient for the rbf, poly and sigmoid kernels; and the kernel type to be used in the algorithm, which must be one of linear, poly, rbf or sigmoid. The next step of the process is to evaluate the trained model. The result of the evaluation is shown in Figure 4. As we can see in this table, the scores of this model are not very promising, which is caused by the fact that no preprocessing was performed.

In the last step we apply the Min-Max scaler as a preprocessing algorithm, while the configuration of the SVM model remains the same. From Figure 4 we can clearly see the effect of applying the correct preprocessing algorithm: the increase in the model's scores is spectacular. This demonstrates the importance of understanding our training dataset and of using proper algorithms not just for classification, but also for preprocessing.
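One plausible reason why scaling helps an SVM with an rbf kernel, offered here as an illustration rather than as an analysis of the experiment above, is that the kernel value k(a, b) = exp(-gamma * ||a - b||^2) collapses to zero when some features take values in the hundreds or thousands. The sketch below uses made-up rows and plain Python; the helper names are hypothetical, and scikit-learn's MinMaxScaler applies the same feature-wise formula.

```python
import math
from typing import List, Sequence, Tuple


def fit_min_max(rows: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Column-wise minima and maxima of the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def min_max_scale(
    row: Sequence[float], mins: Sequence[float], maxs: Sequence[float]
) -> List[float]:
    """Map each feature to [0, 1] with x' = (x - min) / (max - min)."""
    return [
        (x - lo) / (hi - lo) if hi > lo else 0.0
        for x, lo, hi in zip(row, mins, maxs)
    ]


def rbf_kernel(a: Sequence[float], b: Sequence[float], gamma: float) -> float:
    """k(a, b) = exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up rows with one large-valued and one small-valued feature.
train = [[500.0, 0.10], [1500.0, 0.12], [2500.0, 0.30]]
mins, maxs = fit_min_max(train)
scaled = [min_max_scale(row, mins, maxs) for row in train]

k_raw = rbf_kernel(train[0], train[1], gamma=0.1)       # vanishes numerically
k_scaled = rbf_kernel(scaled[0], scaled[1], gamma=0.1)  # stays informative
```

On the raw rows the large-valued feature dominates the distance, so every pair of points looks completely dissimilar to the kernel; after scaling, both features contribute on the same [0, 1] range.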

This process can be repeated using other types of algorithms: distance based, density based, hierarchical or even neural networks. The following figures show the usage of other algorithms.

In the case of KNN you can choose the number of neighbors, the algorithm used to compute the nearest neighbors, the weight function used in prediction and the distance metric to use for the tree. The default metric is Minkowski, which with power=2 is equivalent to the standard Euclidean metric. Figure 4 shows the results in a table.
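The relation between the Minkowski metric and the Euclidean metric mentioned above can be checked with a few lines of plain Python. The function below is a direct transcription of the textbook definition, not the implementation used by the platform or by scikit-learn.

```python
import math
from typing import Sequence


def minkowski(a: Sequence[float], b: Sequence[float], p: float = 2.0) -> float:
    """Minkowski distance: (sum_i |a_i - b_i|^p)^(1/p)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


a, b = [0.0, 0.0], [3.0, 4.0]
euclidean = minkowski(a, b, p=2)  # same value as math.dist(a, b)
manhattan = minkowski(a, b, p=1)  # sum of absolute differences
```

With p=2 the formula reduces to the square root of the summed squared differences, which is the Euclidean distance; with p=1 it gives the Manhattan distance.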

We can also use the platform to configure more complex models, such as Neural Networks. Configuring the Neural Network model means that you can set the alpha value, i.e. the L2 penalty (regularization term) parameter, set the solver for weight optimization, choose the activation function, and set the number of hidden layers and the number of nodes in each hidden layer. Figure 4 shows the results of the model as a confusion matrix.

In other cases, if you are not interested in the confusion matrix but want to run the training and the evaluation multiple times to see how the algorithm behaves, you can do so easily, using only the UI, with no programming knowledge.

To demonstrate this functionality, we ran the training and evaluation process 14 times. The accuracy diagram for this training and evaluation loop can be seen in Figure 5.

To configure the Random Forest algorithm to be evaluated in a loop, you can set the maximum number of estimators (decision trees), choose the slicing method, configure a cutoff depth if needed and define the step size (the number of estimators is increased by the step size in each step). Figure 5 contains a diagram that shows the accuracy of the model in each step. This is very useful when you already know which algorithm you want to use and you want to find the optimal parameters to train the model with.

Figure 4: Evaluation of the different algorithms.

Figure 5: Evaluation of the Random Forest model in a loop.

Figure 6: Scatter Plot.
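The evaluation loop described above, in which the number of estimators grows by a fixed step, can be sketched as a small helper. The names are hypothetical and the scoring function is a placeholder, not a trained model; in scikit-learn the corresponding setting is the n_estimators parameter of RandomForestClassifier.

```python
from typing import Callable, List, Tuple


def evaluate_in_loop(
    train_and_score: Callable[[int], float],
    max_estimators: int,
    step: int,
) -> List[Tuple[int, float]]:
    """Call train_and_score for step, 2*step, ... up to max_estimators."""
    if step <= 0:
        raise ValueError("step must be positive")
    return [
        (n, train_and_score(n))
        for n in range(step, max_estimators + 1, step)
    ]


def placeholder_score(n_estimators: int) -> float:
    # Stand-in for "train a forest with n_estimators trees and return accuracy".
    # It is NOT a real model; replace it with actual training and evaluation.
    return 1.0 - 1.0 / (n_estimators + 1)


curve = evaluate_in_loop(placeholder_score, max_estimators=50, step=10)
sizes = [n for n, _ in curve]
```

Plotting the second element of each pair against the first gives an accuracy diagram of the same kind as the one produced by the platform.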

On the visualization side, we can see our data in a scatter plot (Figure 6), as a histogram (Figure 7) or as a time series (Figure 8), depending on the nature of the data.

Figure 7: Histogram.

Figure 8: Time Series.

As future development we aim to add some algorithms for visualization and dimensionality reduction (like PCA) and some preprocessing algorithms, in order to cover most of the techniques and algorithms that a general-purpose platform should offer.


5.2 Using the Proposed Methodology for Choosing Algorithms in the Context of Anomaly Detection

Anomaly detection finds data points that do not fit well with the rest of the data. It has a wide range of applications such as fraud detection, surveillance, diagnosis, data cleanup, and predictive maintenance.

Let us consider the Thyroid Disease dataset, which has 3772 training instances and 3428 testing instances. It has 15 categorical and 6 real attributes. The problem is to determine whether a patient referred to the clinic is hypothyroid. Therefore three classes are built: normal (not hypothyroid), hyper function and subnormal functioning. For outlier detection, the 3772 training instances are used, with only the 6 real attributes. The hyper function class is treated as the outlier class and the other two classes as inliers, because hyper function is a clear minority class.

To find the appropriate algorithm for this particular dataset, the first question is about the nature of the problem. Because it is clear that we have a classification problem (classify examples as outliers or inliers), we can reduce the search space and go to the next step, which is to analyze the nature of the data. As we specified in the description of the Thyroid Disease dataset, we have labeled data, so we can go further and analyze the size of our dataset. The dataset contains 3772 instances, which is fewer than 100K, so we can try out the SVM algorithm.

After finding the appropriate algorithm, the next step is to use our platform to minimize the error and to maximize the precision, recall, accuracy and other evaluation scores. We ran the algorithm multiple times. First we used C=1 and Gamma=0.1 with the rbf kernel. Figure 9 shows the results of the experiment. The results are not very promising: because the recall is very small, most of the outliers are missed.

In the next step we increased both C and Gamma, which gave us a much better result, as we can see in Figure 10. We increased the recall from 0.08 to 0.6. Even though the precision dropped, the overall result is much better.
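To make the trade-off between precision and recall concrete, the sketch below computes both scores, together with their harmonic mean (F1), from two lists of labels. The labels are toy values invented for illustration and are unrelated to the Thyroid Disease results; F1 is used here only as one possible way to summarize the "overall result".

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Tuple[float, float, float]:
    """Precision, recall and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy labels (1 = outlier), invented for illustration only.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
cautious = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # flags one point, correctly
eager = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]     # flags four points, one wrongly

p_cautious, r_cautious, f1_cautious = precision_recall_f1(y_true, cautious)
p_eager, r_eager, f1_eager = precision_recall_f1(y_true, eager)
```

The cautious predictions have perfect precision but find only one of the five outliers, while the eager predictions trade some precision for a much higher recall and end up with the higher F1.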

As we keep increasing the values of C and Gamma, we get better and better results until we reach a threshold at which the model starts to overfit. We obtained the best results using C=100 and Gamma=10 (see Figure 11).

At this point we still have a choice to make: are the results good enough for our problem, or should we test other algorithms too? If the results are good enough, we can stop and use the SVM to solve our problem. If they are not, we can still choose from plenty of algorithms. As we can see in the algorithm selection tree (Figure 1), the next question is whether our data is text or not. In our case we do not have to deal with text, so we can choose between KNeighbors, Random Forest or Ensemble Methods.

Figure 9: SVM with C=1 and Gamma=0.1.

Figure 10: SVM with C=10 and Gamma=1.

With this example of using our algorithm selection tree, we demonstrated the benefits of our algorithm selection methodology and the usefulness of our platform in finding the correct parameters for the chosen algorithm. This experiment clearly shows how we can reduce the search space when a developer has to choose between algorithms, thereby increasing the productivity and velocity of the developers and implicitly decreasing the time to production and the costs of the project.


Figure 11: SVM with C=100 and Gamma=10.

6 CONCLUSIONS

The methodology and decision tree for choosing the proper algorithm based on the problem context and the nature of the data provide a coherent way for a developer to select the methods that best fit the requirements of a given application. This taxonomy was used as a theoretical background for the development of a platform that incorporates different artificial intelligence and data preprocessing techniques.

The platform implements the main functionalities needed by a developer in the process of finding the best strategy for a given domain, one that assures high-quality problem solving in a reasonable execution time, increasing the productivity of the developers, increasing the probability of success and decreasing the costs and risks. It includes multiple data acquisition, preprocessing, anomaly detection and visualization functionalities that may be combined in a specific processing flow.

Every individual functionality is implemented as an autonomous service, and multiple services may be orchestrated in a logical flow in order to obtain the final results. The richest group of functionalities is the one that contains the artificial intelligence methods. Here we tried to cover most of the development directions present today in the scientific literature, from statistical methods to signal processing and neural networks. In this way the platform can be used as a training tool for those who must develop different solutions in different areas. For the same learning purposes we included a multitude of datasets that cover a diversity of application domains.

The experimental part of the paper demonstrates some of the functionalities of the platform, including the possibility to compare the results obtained with different methods, to run the training and evaluation process multiple times and to visualize the data and the results in a meaningful way.

REFERENCES

Agrawal, S. and Agrawal, J. (2016). Survey on anomaly detection using data mining techniques. In 19th International Conference on Knowledge Based and Intelligent Information and Engineering Systems. Elsevier.

Dasgupta, A. and Nath, A. (2016). Classification of machine learning algorithms. In International Journal of Innovative Research in Advanced Engineering, volume 3. IJIRAE.

Gulden, K. U. and Nese, G. (2013). A study on multiple linear regression analysis. 106.

Gupta, M., Gao, J., Aggarwal, C. C., and Han, J. (2014). Outlier detection for temporal data: A survey. In IEEE Transactions on Knowledge and Data Engineering, volume 25. IEEE Xplore.

Ilias, M., Kostas, K., Manolis, W., and John, S., editors (2007). Proceedings of the 2007 Conference on Emerging Artificial Intelligence Applications in Computer Engineering: Real Word AI Systems with Applications in eHealth, HCI, Information Retrieval and Pervasive Technologies, Amsterdam, The Netherlands. IOS Press.

Juan, E.-T., Pedro, G.-T., and Jesus, D.-V. (2004). Anomaly detection methods in wired networks: A survey and taxonomy. Comput. Commun., 27(16):1569–1584.

Lichman (2013). UCI machine learning repository.

Murad, R., Anazida, Z., and Mohd, M. (2013). Advancements of data anomaly detection research in wireless sensor networks: A survey and open issues. 13:10087–10122.

SciKit-Learn (2017). //scikit-learn.org/stable/.

Varun, C., Arindam, B., and Vipin, K. (2009). Anomaly detection: A survey. ACM Computing Surveys, 41(3):15:1–15:58.

Weka (1997). //weka.wikispaces.com/.

Wolpert, D. H. and Macready, W. G. (1997). No free lunch theorems for optimization. In IEEE Transactions on Evolutionary Computation, volume 1. IEEE Xplore.

Zareapoor, M., Seeja, K. R., and Alam, M. A. (2012). Analysis on credit card fraud detection techniques: Based on certain design criteria. International Journal of Computer Applications, 52(3):35–42.
