Predicting the Empirical Robustness of the Ontology Reasoners based on

Machine Learning Techniques

Nourh

ene Alaya

1,2

, Sadok Ben Yahia

and Myriam Lamolle

LIASD, IUT of Montreuil, University of Paris 8, Montreuil, France

LIPAH, Faculty of Sciences of Tunis, University of Tunis, Tunis, Tunisia

Keywords:

Ontology, OWL, Reasoner, Robustness, Supervised Machine Learning, Prediction.

Abstract:

Reasoning with ontologies is one of the core tasks of research in Description Logics. A variety of reason-

ers with highly optimized algorithms have been developed to allow inference tasks on expressive ontology

languages such as OWL (DL). However, unexpected behaviours of reasoner engines is often observed in prac-

tice. Both reasoner time efﬁciency and result correctness would vary across input ontologies, which is hardly

predictable even for experienced reasoner designers. Seeking for better understanding of reasoner empirical

behaviours, we propose to use supervised machine learning techniques to automatically predict reasoner ro-

bustness from its previous running. For this purpose, we introduced a set of comprehensive ontology features.

We conducted huge body of experiments for 6 well known reasoners and using over 1000 ontologies from the

ORE’2014 corpus. Our learning results show that we could build highly accuracy reasoner robustness pre-

dictive models. Moreover, by interpreting these models, it would be possible to gain insights about particular

ontology features likely to be reasoner robustness degrading factors.

1 INTRODUCTION

The key component for working with OWL ontolo-

gies is the Reasoner. This is because knowledge in an

ontology might not be explicit, then a reasoner is re-

quired to deduce its implicit knowledge. However, the

high expressivity of OWL has increased the compu-

tational complexity of inference tasks. For instance,

it has been shown that the complexity of the consis-

tency checking of SR OI Q ontologies, the descrip-

tion logic (DL) underlying OWL 2, is of worst-case

2NExpTime-complete (Horrocks et al., 2006). There-

fore, a number of highly-optimized reasoners have

been developed, which support reasoning about ex-

pressive ontologies (Sirin et al., 2007; Glimm et al.,

2012; Steigmiller et al., 2014).

Despite the remarkable progress in optimizing

reasoning algorithms, unpredictable behaviours of

reasoner engines is often observed in practice, partic-

ularly when dealing with real world ontologies. Two

main aspects would depict this phenomena. On the

one hand, the respective authors of (Weith

oner et al.,

2007; Gonc¸alves et al., 2012) have outlined the vari-

ability of reasoner’s time efﬁciency across OWL on-

tologies. Roughly speaking, the reasoner optimiza-

tion tricks, set up by designers to overcome particu-

lar DL complexity sources, would lead to enormous

scatter in computational runtime across the ontolo-

gies, which is still hardly predictable a priori. These

ﬁndings have motivated various attempts of reasoner

runtime prediction using machine learning techniques

(Kang et al., 2012; Sazonau et al., 2014; Kang et al.,

2014). On the other hand, results reported from

the latest Ontology Reasoner Evaluation Workshops,

ORE (Gonc¸alves et al., 2013; Bail et al., 2014) have

revealed another aspect of reasoner behaviours vari-

ability, namely the correctness of the reasoning re-

sults. In fact, the evaluations were surprising as rea-

soners would derive different inferences for the same

input ontology. Therefore, an empirical correctness

checking method was established to examine reasoner

results. Actually in both ORE 2013 and 2014 com-

petitions, there were no single reasoner, which cor-

rectly processed and outperformed on all given inputs.

Even the fastest reasoner have failed to derive accu-

rate results for some ontologies, while others less per-

forming engines, have succeeded to correctly process

them. Thus, we would admit that the most desired

qualities, i.e. result correctness and time efﬁciency,

are not empirically guaranteed by all the reasoners

and for every ontology. These observations pinpoints

the hardness of understanding reasoner empirical be-

Alaya, N., Yahia, S. and Lamolle, M..

Predicting the Empirical Robustness of the Ontology Reasoners based on Machine Learning Techniques.

In Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2015) - Volume 2: KEOD, pages 61-73

ISBN: 978-989-758-158-8

haviours even for experienced and skilled reasoner

designers. Thus, it would be worthwhile to be able to

automatically predict both correctness and efﬁciency

of reasoners against given ontologies. More invalu-

able would be gaining insights about which particular

aspects in the ontology are lowering these qualities.

Obviously, learning such aspects would further im-

prove reasoner optimizations and enhance ontology

design by avoiding reasoner performances degrading

factors.

In this paper, we introduce a new approach aim-

ing to predict an ontology classiﬁcation quality that

a particular reasoner would be able to achieve by

a speciﬁc cutoff time. To the best of our knowl-

edge, no prior work tackled such an issue. We de-

signed by robustness, the required reasoner quality.

We rely on supervised machine learning techniques

in order to learn models of reasoner robustness from

their previous running. To achieve our purpose, we

proposed a set of valuable ontology features, likely

to be good indicators of ontologies’ hardness level

against reasoning tasks. Then, we carried out a huge

body of experiments using the widely recognized

ORE’2014 Framework (Bail et al., 2014). Over 1000

ontologies were processed by 6 well known reason-

ers. Given these evaluation data, reasoner predictive

models were trained by 5 of the most effective super-

vised learning algorithms. We have further improved

the accuracy of our models by employing a set of fea-

ture selection techniques. Worth of cite, no prior work

made use of the discretization to identify ontology rel-

evant features. Then, we discussed and employed a

variety of prediction assessment measures, well suited

in the case of imbalanced datasets. Thanks to our es-

tablished study, we unveiled a set of local and global

ontology key features, likely to alter the reasoner ro-

bustness. In overall, our trained reasoner robustness

predictive models have shown to be highly accurate,

which witness the worthiness of our learning process.

The rest of the paper is organized as follows. Sec-

tion 2 brieﬂy recalls basic notions that will be of use

throughout the paper. Section 3 scrutinizes the related

work approaches. Our reasoner robustness learning

process as well as the achieved results are, respec-

tively, detailed in Sections 4 - 9. Concluding remarks

as well as our future works are given in Section 10.

2 BACKGROUND

In this paper, we focus on OWL 2 (OWL Work-

ing Group, 2009) ontologies. Recommended by the

W3C, OWL 2 is based on the highly expressive De-

scription Logics SR OI Q (Horrocks et al., 2006).

This logical provides a precisely deﬁned meaning for

statements made in OWL and thus, makes it pos-

sible to automatically reason over OWL ontologies.

Among the reasoning tasks, classiﬁcation is consid-

ered as the key one. It computes the full concept

and role hierarchies in the ontology (Baader et al.,

2003). Explicit and implicit subsumption will be de-

rived to help users navigating through the ontology

towards mainly explanation and/or query answering

respective tasks. Thus, it’s supported by all modern

DL reasoners and its duration is often used as a per-

formance indicator to benchmark reasoning engines

(Abburu, 2012). From an application point of view,

an ontology should be classiﬁed regularly during its

development and maintenance in order to detect un-

desired subsumptions as soon as possible.

We recall some basic concepts of machine learn-

ing (ML) (Kotsiantis, 2007) for a better understanding

of our study. In any dataset used by machine learn-

ing algorithms, every instance is represented using the

same set of features. In our case, the instances are on-

tologies belonging to some corpus and the features

are metrics characterizing the ontology content and

design. Thus, each ontology is represented by a d-

dimensional vector x

(i)

= [x

(i)

, x

(i)

, . . . , x

(i)

] called

a feature vector, where i refers to the i-th ontology in

the dataset, i ∈ [1, N], with N denoting the total num-

ber of ontologies and d standing for the total number

of features. The latter ones may be continuous, cat-

egorical or binary. The learning is called supervised

when the dataset ontologies are given with known la-

bels. In our context, a label would describe a given

reasoner performances when processing the consid-

ered ontology, for instance a time-bin. The vector of

all labels is speciﬁc to one reasoner and denoted Y ,

where y

is the label of the i-th ontology. Thus, for

each under study reasoner a dataset is built in and de-

signed by D = [X

N,M

| Y

]. Later, the dataset is pro-

vided to a supervised learning algorithm in order to

train its data and establish a predictive model for its

corresponding reasoner. Roughly speaking, a model

is a mapping function from a set of features, to a spe-

ciﬁc label. It would be a mathematical function, a

graph or a probability distribution, etc. When a new

ontology, which does not belong to the dataset is in-

troduced, then the task is to predict its exact label,

using a reasoner predictive model.

3 RELATED WORKS

Works that attempted to predict reasoner’s runtime us-

ing supervised machine learning techniques are the

closest to our context. Authors of (Kang et al., 2012)

KEOD 2015 - 7th International Conference on Knowledge Engineering and Ontology Development

were the ﬁrst to apply supervised machine learning

techniques (Kotsiantis, 2007) to predict the ontology

classiﬁcation computational time, carried by a spe-

ciﬁc reasoner. 27 ontology metrics were computed for

each ontology. These metrics were previously pro-

posed by a work stressing on ontology design com-

plexity (Zhang et al., 2010). The labels to be predicted

were time bins speciﬁed by the authors. They learned

Random Forest based models for 4 state of art rea-

soners and obtained high prediction accuracy. More-

over, they proposed an algorithm to compute the im-

pact factors of ontology metrics according to their ef-

fectiveness in predicting classiﬁcation performances

for the different reasoners. Kang et al. have fur-

ther improved their approach, in a more recent work

(Kang et al., 2014). They replaced time bin labels by

concrete values of reasoner’s runtime and proposed

more ontology metrics. They learned regression mod-

els for 6 widely known reasoners. In addition, they

demonstrated the strengths of their predictive models

by applying them to the problem of identifying ontol-

ogy performance Hotspots (Gonc¸alves et al., 2012).

On the other hand, in (Sazonau et al., 2014), authors

claimed that Kang’s et al. metrics based on graph

translation of OWL ontologies are not effective. Thus,

they proposed another set of metrics and deployed a

dimensionality reduction method to remove the inter-

correlations between ontology features. They further

proposed a new approach to build predictive models

based on examining single ontologies rather than the

whole corpus.

In all these previous works, machine learning

techniques were set up to estimate the amount of time

a given reasoner would spend to process any input

ontology. However, no attention was paid to assess

the correctness of the reasoner derived results. In

our opinion, a reasoner that quickly but incorrectly

process an ontology is of little use. Therefore, we

believe that the effectiveness and the utility of these

approaches are still limited in meeting the need of

predicting reasoner empirical behaviours. Neverthe-

less, these works have succeeded to establish highly

accurate reasoner performances models, which is a

promising advance towards the practical understand-

ing of reasoners. Thus, we are convinced that em-

ploying machine learning techniques would be the ul-

timate approach to gain insights about the reasoner ro-

bustness facing real world ontologies. Certainly, pre-

viously deployed ML techniques need to be reviewed

and improved for a better ﬁt to our learning context.

4 PREDICTING THE

ROBUSTNESS OF THE

ONTOLOGY REASONERS

In this section, we specify, at ﬁrst, the notion of rea-

soner robustness and then, the main steps to automat-

ically learn it from empirical data.

4.1 Why Robustness?

The research question of this paper is whether it is

possible to predict the robustness of modern reason-

ers using the results of their previous running. Worth

of mention, the notion of robustness differs by the

ﬁeld of research. For software developer, (Mikol

sek,

2009) deﬁned the robustness as the capability of

the system to deliver its functions and to avoid ser-

vice failures under unforeseen conditions. Recently,

(Gonc¸alves et al., 2013) bought forward this deﬁni-

tion in order to conduct an empirical study about the

robustness of DL reasoners. Authors underlined the

need to specify the robustness judgement constrains

before assessing the reasoners. These constrains are:

1) the range of the input ontologies, 2) the reasoner

functional and non functional properties of interest,

and 3) the deﬁnition of the failure state. The instanti-

ation of the constrains would describe some reasoner

usage scenario. Thus, a reasoner may be considered

as robust under a certain scenario and non-robust un-

der another.

Given these ﬁndings, we started by setting our

proper constrains in order to describe an online ex-

ecution scenario of reasoners. In addition to the com-

putational time which should be maintained as short

as possible, we focused on assessing the correctness

of the reasoner computed results. Reasoning engines

can load and process an ontology to achieve some

reasoning task, but they can also deliver quite dis-

tinct results. Disagreement over inferences or query

answers, computed over one input ontology, would

make it hard to provide interoperability in the Se-

mantic Web. Therefore, we consider the reasoner ro-

bustness as its ability to correctly achieve a reason-

ing task for a given ontology within a ﬁxed cut-off

time. Consequently, the most robust reasoner over an

ontology corpus would be the one satisfying the cor-

rectness requirement for the greatest number of on-

tologies while maintaining the shortest computational

time. Based on this speciﬁcation, we tried to conduct

a new reasoner robustness empirical study. One ma-

jor obstacle we faced was about whether it is possi-

ble to automatically check the correctness of the rea-

soning results. In fact, little works have addressed

the issue. As previously outlined by (Gardiner et al.,

Predicting the Empirical Robustness of the Ontology Reasoners based on Machine Learning Techniques

2006), manually testing the correctness of reasoner

inferences would be relatively easy for small ontolo-

gies, but usually infeasible for real world ontologies.

Authors claimed that the most straightforward way to

automatically determine the correctness is by com-

paring answers derived by different reasoners for the

same ontology. Consistency across multiple answers

would imply a high probability of correctness, since

the reasoners have been designed and implemented

independently and using different algorithms. Luck-

ily, this testing approach was implemented in the ORE

evaluation Framework (Gonc¸alves et al., 2013). The

reasoner output was checked for correctness by a ma-

jority vote, i.e. the result returned by the most reason-

ers was considered to be correct. Certainly, this is not

a faultless method. Improving it would be advanta-

geous for our study, however it is out of the scope of

this paper.

Afterwards, we designed four labels that would

describe the termination state of a reasoning task. The

ﬁrst label describes the state of success and the others

distinguish three types of failure. These labels are: 1)

Correct (C) standing for an execution achieved within

the time limit and delivered expected results; 2) Unex-

pected (U) in the case of termination within the time

limit but delivered results that are not expected, other-

wise incorrect; 3) Timeout (T) in the case of violating

the time limit; and 4) Halt (H) describing a sudden

stop of the reasoning process owing to some execu-

tion error. Thus, the set {C, U, T, H} designs our label

space L admitted for the learning process. Intuitively,

the reasoner robustness is close to the reasoning task

to be processed. In our study, we focus on the ontol-

ogy classiﬁcation task (Section 2).

4.2 The Learning Steps

We propose the following steps for predicting the ro-

bustness of a given reasoner on individual ontologies.

Our learning steps are partially inspired by the earlier

work (Kang et al., 2012).

Step 1. Features Identiﬁcation. We carried a rigor-

ous investigation on the most valuable ontology fea-

tures, likely to have an impact on the reasoner ro-

bustness. The results of this investigation is detailed

in Section 5, where we introduced a rich set of fea-

tures covering a wide range of ontology characteris-

tics. The latter ones depict our features space F

where d stands for the space dimension.

Step 2. Ontologies Selection. An ontology corpus

C (O) should be provided to carry out the reasoner

evaluations. This corpus should be highly diversiﬁed

in terms of ontology characteristics, in order to reduce

the probability for a given reasoner to encounter only

problems it is particularly optimized for. Features,

identiﬁed in the previous step, will be computed for

each ontology O

∈ C (O), to obtain its corresponding

d-dimensional feature vector X

)

∈ F

Step 3. Reasoner Selection. Any reasoner would

be enough for the study; no knowledge about its algo-

rithm neither details about its internal working mech-

anism are needed for the training process. Thus, given

a set of reasoners S(R ), a reasoner R

∈ S(R ) will be

iteratively picked to carry on the process.

Step 4. Dataset Building. At this step, each on-

tology belonging to the corpus O

∈ C (O) will be

processed by the previously selected reasoner R

achieve the classiﬁcation task. Then, the termi-

nation state of this task will be retained l

) ∈

{C, U, T, H}. Thus, the ﬁnal training dataset of the

selected reasoner R

, designed by D

, is built in by

gathering the pairs (feature vector, label), i.e. D

{(X

)

, l

)

)), i = 1 ···N}, where N = |C (O)|.

Step 5. Feature Selection. A huge amount of fea-

tures does not necessarily improve the accuracy of

the prediction model. Commonly, feature selection

or dimensionality reduction techniques are applied to

identify the relevant features. In our study, we will

compare three different methods of feature selection,

described in Section 7. Consequently, three variants

of the initial reasoner dataset are established, called

featured datasets, each with a different and eventu-

ally reduced subset of ontology features. The initial

dataset is also maintained and called ”RAWD”.

Step 6. Learning and Assessing the Models. Each

reasoner dataset D

( j)

established in the Step 5 is pro-

vided to a supervised machine learning algorithm.

The latter train the data and build a reasoner predic-

tive model, M

(section 2):

: X ∈ F

7→

(X) ∈ Y (1)

Worth to cite, we investigated 5 well known super-

vised machine learning algorithms. Therefore, this

step will be repeated as far as the number of algo-

rithms and the number of datasets for a given rea-

soner. Then, the established models will be evaluated

against a bunch of assessment measures. Further, we

introduced a method to compare these models and ﬁg-

ure out their ”best” one. Details about the learning al-

gorithms, the assessment measures and the selection

procedure are given in Section 8.

Step 7. Unveil the Key Features. Steps 3-6 are re-

peated K times to cover all reasoners in the set S(R ),

with K = |S (R )|. As a result, K best predictive mod-

els are identiﬁed each for a reasoner, and each having

its own most relevant feature subset. Accordingly, we

KEOD 2015 - 7th International Conference on Knowledge Engineering and Ontology Development

believe that the key ontology features likely to have

impact on reasoner robustness are those the most in-

volved across the whole set of best models. We denote

this subset as the Global Key Features, in contrast to

Local Key Features, which designs features employed

in one given reasoner best predictive model.

5 ONTOLOGY FEATURES

(STEP 1)

When reviewing the state of art, we noticed that there

is no known, automatic way of constructing ”good”

ontology feature sets. Instead, distinct domain knowl-

edge should be used to identify properties of ontolo-

gies that appear likely to provide useful informations.

To accomplish our study, we reused some of previ-

ously proposed ontology features and deﬁned new

ones. Mainly, we discarded those computed based

on speciﬁc graph translation of the OWL ontology, as

there is no agreement of the way an ontology should

be translated into a graph (Kang et al., 2014). We split

the ontology features into 4 categories: (i) size de-

scription; (ii) expressivity description; (iii) structural

features; and (iv) syntactic features. Within these cat-

egories, features are intended to characterize speciﬁc

aspect of an OWL ontology. Some of the categories

are further split up. In overall, 101 ontology features

were characterized. Figure 1 lists the main ones. In

the following, we will shortly depict our feature cate-

gories. We give a more detailed description in (Alaya

et al., 2015).

Ontology Size Description. The purpose of this fea-

ture category is to characterize the size of the ontol-

ogy considering both the amount of its terms and ax-

ioms. Therefore, we designed 5 features, each records

the names of a particular OWL entity. In addition, we

computed the number of its axioms (SA) and more

particularly the logical ones (SLA).

Ontology Expressivity Description. We retained

two main features to identify the expressivity of the

ontology language, namely the OWL proﬁle

(OPR)

and the DL family name (DFN).

Ontology Structural Description. We paid a special

attention to characterize the taxonomic structure of

an ontology, i.e. its inheritance hierarchy. The lat-

ter sketches the tree like structure of subsumption re-

lationships between named classes A v B or named

properties R v S. In this category, we gathered vari-

ous features that have been deﬁned in literature to de-

scribe the ontology hierarchies. These are, basically,

For further details about OWL 2 proﬁles, the reader is

kindly referred to http://www.w3.org/TR/owl2-proﬁles/.

metrics widely used by the ontology quality evalua-

tion community (Gangemi et al., 2006; Tartir et al.,

2005; LePendu et al., 2010). The following subcat-

egories describe the essence of the retained features.

• Class and Property Hierarchical Features: ﬁve

features were speciﬁed to outline the design of the

class hierarchy (CHierarchy). They consider the

depth of this tree like structure and the distribu-

tion of the subclasses as well as the super-classes.

These features were also used to characterize the

design of the property hierarchy (PHierarchy).

• Cohesion Features: the literature provides a

plethora of various metrics to design the Cohesion

of the ontology, otherwise the degree of relatedness

of its entities. We retained the ones introduced by

(Faezeh and Weichang, 2010). Roughly speaking,

the ontology cohesion (OCOH) is a weighted ag-

gregation of the class cohesion (CCOH), the prop-

erty cohesion (PCOH) and the object property co-

hesion (OPCOH).

• Schema Richness Features: we enriched the on-

tology structural category by two additional fea-

tures proposed by (Tartir et al., 2005): the schema

relationship richness (RRichness), and the schema

attribute richness (AttrRichness). These features

are well known for ontology evaluation community

as they are part of the OntoQA tool.

Ontology Syntactic Features. When collecting fea-

tures for this category, we conducted an investigation

about the main reasoning algorithms (Baader and Sat-

tler, 2000; Motik et al., 2009). Our main purpose was

to quantify some of the general theoretical knowledge

about DL complexity sources. Thus, we gathered rel-

evant ontology features, that have inspired the imple-

mentation of well known reasoning optimization tech-

niques (Horrocks, 2003; Tsarkov et al., 2007). Fea-

tures of the current group are divided in 6 subcate-

gories, each speciﬁc to a particular ontology syntactic

component. This organization was inspired by the one

introduced by (Kang et al., 2012).

• Axioms Level: reasoners process differently each

type of axiom with different computational cost

(Baader et al., 2003). We introduced two sets of

features, (KBF) and (ATF), in order to character-

ize the different types of OWL axioms as well as

their respective relevance in the ontology. Further,

we computed the maximal and the average parsing

depth of axioms (AMP, AAP).

• Constructors Level: these concern particularly

DL class constructors

(Baader et al., 2003). In

previous reasoner prediction works (Kang et al.,

By DL class constructor, we refer to conjunction (∪),

disjunction (∩), negation (¬), quantiﬁcation (∃, ∀), etc.

Predicting the Empirical Robustness of the Ontology Reasoners based on Machine Learning Techniques

Figure 1: Ontology Features Catalog.

2012; Kang et al., 2014), authors simply counted

axioms that involve potentially hard constructors.

However, they missed that one constructor could be

invoked more than once in the same axiom. Con-

sidering this fact, we proposed a metric to com-

pute a class constructor frequency in the ontology

(CCF). Moreover, we deﬁned the density of the

overall constructors (OCCD). We also introduced

three modelling patterns of particular combinations

of constructors. We believe that these combina-

tions, whenever used in an axiom, would increase

the inference computational cost.

• Classes Level: classes in the ontology could be

named ones or complex expressions. They would

be cyclic or disjoin ones. In this subcategory, we

introduced features that pinpoint different methods

to deﬁne a class and track their impact in the ontol-

ogy TBox part.

• Properties Level: we were interested in capturing

two aspects of the ontology properties syntactic de-

scription. First, we tried to outline the relevance

of speciﬁc object property characteristics, such as

transitivity, symmetry, reﬂexivity, etc. Thus, we in-

troduced a metric, (OPCF), that measures their re-

spective frequencies in the ontology. Then, we pro-

posed two further metrics (HVC, AVC) aiming at

examining the impact of using high values in num-

ber restrictions.

• Individuals Level: we specify some of the in-

teresting characteristics of named individuals that

would be declared in the ontology. We examined

the ratio of nominals in the TBox (NNF) and com-

puted the number of individuals declared as equal

ones (ISAM) or disjoint ones (IDISJ).

6 DATA COLLECTION (STEPS

2-4)

To ensure the reliability of reasoner evaluation results,

we have chosen to reuse the ORE’2014 evaluation

Framework

, as well as, its test set ontologies. Under

this experimental environment, we conducted new DL

and EL classiﬁcation evaluations, for 6 reasoners and

using over 1000 ontologies. The motivation behind

our choice is multi-fold: ﬁrst, the event is widely rec-

ognized by the Semantic Web community; the ontol-

ogy corpus is well established and balanced through-

out easy and hard cases; and ﬁnally the description

of reasoner results is consistent with the speciﬁcation

of the robustness criterion, designed in the previous

Section 4.1.

Ontologies. In overall, we retained 1087 ontologies

from the ORE’2014 corpora

. The testing ontolo-

gies fall into both the OWL 2 DL and the OWL 2

EL proﬁles. For each proﬁle, the ontologies were fur-

ther binned according to their sizes

. The latter varies

The ORE’2014 Framework is available at

https://github.com/andreas-steigmiller/ore-2014-

competition-framework/

The whole corpus of ontologies is available for down-

load at http://zenodo.org/record/10791

The size corresponds to the number of logical axioms.

KEOD 2015 - 7th International Conference on Knowledge Engineering and Ontology Development

from small ([100,1000[), going to very large one ( ≥

100 000). In addition, the set covers more than 100

distinct DL families. Table 1 describes the distribu-

tion of ontology size bins over the OWL proﬁles. #O

stands for the number of ontologies.

Table 1: Ontology Test Set Description Summary.

Proﬁle #O #O per Size Bin

Small Medium Large V-Large

DL 487 125 124 124 113

EL 600 173 150 150 127

Reasoners. We investigated the 6 best ranked rea-

soners

in both the DL and EL classiﬁcation chal-

lenges of the ORE’2014 competition. These rea-

soners are included in our study set S(R ). They

are namely Konclude, MORe, HermiT, TrOWL,

FaCT++ and JFact. We excluded ELK despite its

good results and high rank, as it doesn’t support the

classiﬁcation of DL ontologies. We run the ORE

Framework in the sequential mode on one machine

equipped with an Intel Core I7-4470 CPU running at

3.4GHz and having 32GB RAM, where 12GB were

made available for each reasoner. We set the condi-

tion of 3 minutes time limit to classify an ontology by

a reasoner. This tight schedule would be consistent

with the chosen scenario, i.e. the online classiﬁcation

of the ontology.

Table 2 summarizes the new classiﬁcation chal-

lenge results

. We did not distinguish between re-

sults of DL and EL ontology classiﬁcation. Reasoners

are listed based on their robustness rank, that is the

number of correctly processed ontologies within the

ﬁxed cutoff time. For each reasoner, Table 2 shows

the number of ontologies within a speciﬁc robustness

bin (Section 4.1) and the range of processing time of

the correct cases.

Building the Reasoner Datasets. Each testing on-

tology was examined to compute and to record its fea-

tures. Having the evaluation results and the ontology

feature vectors, 6 datasets were established each for a

speciﬁc reasoner. It’s important to notice that our rea-

soner datasets are imbalanced. Based on the descrip-

tion made by (Hu and Dong, 2014), a dataset is con-

sidered as imbalanced, if one of the classes (minor-

ity class) contains much smaller number of examples

than the remaining classes (majority classes). In our

context, the classes are reasoner robustness labels. We

take as an example the Konclude’s dataset described

in the table 2, we can easily notice the huge difference

All ORE’2014 reasoners are available for download at

https://zenodo.org/record/11145/

All the results of our experiments are available at https://

github.com/PhdStudent2015/Classiﬁcation Results 2015.

Table 2: Ontology classiﬁcation results. The time is given in

seconds. #C, #U, #T and #H stand respectively for the num-

ber of ontologies labelled by Correct, Unexpected, Timeout

or Halt after the classiﬁcation.

Reasoner #C #U #T #H Runtime (correct)

Min Max Mean

Konclude 1030 22 27 8 0.001 54.73 1.48

MORe 971 3 110 3 0.259 144.56 3.77

HermiT 954 29 102 2 0.082 144.12 9.59

TrOWL 927 106 52 2 0.027 109.95 3.36

FaCT++ 821 13 205 48 0.019 141.87 4.88

JFact 683 13 323 68 0.019 138.34 9.68

between the number of ontologies labelled as Correct

#C(the majority) and those labelled as Unexpected #U

or Timeout #T (the minority). However, its obvious

that predicting the minor cases is much more interest-

ing for both ontology and reasoner designers, since

they describe a failure situation in processing the on-

tology. In fact, a user would probably want to know,

if it would be safe to choose Konclude to classify its

ontology. The learning from imbalanced datasets is

considered as one of the most challenging issue in the

data mining (Hu and Dong, 2014). We will be consid-

ering this aspect when training the predictive models.

7 FEATURE SELECTION (STEP 5)

We started by performing some preprocessing steps

on the reasoner datasets. Mainly, we applied feature

selection methods. The latter ones were designed to

recognize the most relevant features, by weeding out

the useless ones. In our study, we tried to track down

the subset of features, which correlate the most with

the robustness of a given reasoner. Thus, we chose

to investigate the utility of employing discretization

(Garcia et al., 2013), as a feature selection technique.

Basically, discretization stands for the transforma-

tion task of continuous data into discrete bins. More

speciﬁcally, we opted for the well known Fayyad &

Irani’s supervised discretization method (MDL). This

technique makes use of the ontology label to achieve

the transformation. If the continuous values of an on-

tology feature are discretized to a single value, then

it means the feature is useless for the learning. Con-

sequently, it can be safely removed from the dataset.

Seeking of more validity, we decided to compare the

discretization results to further feature selection tech-

niques. Thus, we carefully chose two well known

methods representative of two distinct categories of

feature selection algorithms: ﬁrst, the Relief method

(RLF), which ﬁnds relevant features based on a rank-

ing mechanism; then, the CfsSubset method (CFS),

Predicting the Empirical Robustness of the Ontology Reasoners based on Machine Learning Techniques

Table 3: Summary of feature selection results. (∩) stands

for the intersection of feature subsets.

Dataset MDL RLF CFS (∩)

Konclude

53 64 8 4

74 61 12 8

HermiT

85 61 15 10

TrOW L

81 51 13 10

FaCT ++

79 58 16 12

JFact

86 56 19 15

(∩) 50 49 1 1

which selects subsets of features that are highly cor-

related with the target label while having low inter-

correlations. All of aforementioned feature selec-

tion methods are available in the machine learning

working environment, Weka (Hall et al., 2009). Ta-

ble 3 summarizes the feature selection results when

applied to the feature space, F

, of each of the rea-

soner datasets. The reported values are the sizes of the

reduced feature subsets

computed by the respective

method. We further investigated the possible presence

of common features across these subsets, by com-

puted the size of their respective intersections.

Interestingly enough, in the all cases, we observe

that the initial feature dimension was reduced. This

would conﬁrm that there are some particular features,

which are more correlated to the reasoner robust-

ness bins. However, the reduction level of the fea-

ture selection methods varies even for the same rea-

soner dataset. Nevertheless, it would be noticed that,

for each reasoner dataset, feature selection methods

agreed on the predictive power of some number of on-

tology features. Moreover, given a selection method,

common features were identiﬁed across the datasets

of the different reasoners, particularly when using

MDL and RLF. However, the CFS method delivered

just one shared feature, this is the frequency value

(CCF) of the OWL constructor hasSelf. Worth of

cite, this feature also ﬁgures in the intersection set of

MDL’s feature subsets. Certainly, at this stage, it is

hard to decide which feature subset is having the most

relevant features for a given reasoner. This would be

concluded only after training predictive models from

these featured datasets. We believe that the subset

leading to the most accurate predictive model, is the

one having the key ontology features, likely to impact

the reasoner robustness. As mentioned in the Section

4.2, we will conduct a comparison between predictive

models derived from reasoner initial datasets, RAWD,

and those trained form the featured ones respectively

by i.e. MDL, RLF and CFS.

The initial dimension of our feature space is 101.

8 LEARNING METHODS OF

REASONER ROBUSTNESS

(STEP 6)

In this paragraph, different learning algorithms and

assessment measures will be shortly described.

Supervised Machine Learning Algorithms. Seek-

ing for diversity in the learning process, we selected

5 candidate algorithms, each one is representative of

a distinct and widely recognized category of machine

learning algorithms (Kotsiantis, 2007). All the used

implementations are available in the Weka framework,

and was applied with the default parameters of this

tool. They are namely: 1) Random Forest (RF) a com-

bination of C4.5 decision tree classiﬁers; 2) Simple

Logistic (SL) a linear logistic regression algorithm;

3) Multilayer Percetron (MP) a back propagation al-

gorithm used to build an Artiﬁcial Neural Network

model; 4) SMO a Support Vector Machine learner

with a sequential minimal optimization; and ﬁnally 5)

IBk a K-Nearest-Neighbour algorithm with normal-

ized euclidean distance.

Prediction Accuracy Measures. In previous works

(Kang et al., 2012), the accuracy standing for the frac-

tion of the correct predicted labels out of the total

number of the dataset samples, was adopted as main

evaluation metric of the predictive models. How-

ever, accuracy would be misleading in the case of

imbalanced datasets as it places more weight on the

majority class(es) than the minority one(s). Conse-

quently, high accuracy rates would be reported, even

if the predictive model is not necessarily a good one.

Thus, we looked for assessment measures known for

their appropriateness in the case of imbalanced data.

Based on the comparative study conducted by (Hu

and Dong, 2014), we retained the following ones.

Worth to cite, all of these measures are computed

based on the confusion matrix, that we describe in

what follows.

• Confusion Matrix it is a square matrix, L×L, where

L is the number of labels to be predicted. It shows

how the predictions are made by the model. The

rows correspond to the known labels of the data, i.e.

in our case {C, U, T, H}. The columns correspond to

the predictions made by the model. Each cell (i, j)

of the matrix contains the number of ontologies from

the dataset that actually have the label i and were

predicted as with label j. A perfect prediction model

would have its confusion matrix with all elements on

the diagonal.

• Precision (PR), Recall (RC), and F-Measure

(FM): these are common measures of model effec-

tiveness. Precision is a measure of how many er-

KEOD 2015 - 7th International Conference on Knowledge Engineering and Ontology Development

rors we make in predicting ontologies as being of

some label l. On the other hand, Recall assesses

how good we are in not leaving out ontologies that

should have been predicted with the label l. How-

ever, both measures are misleading when consid-

ered separated. Usually, we rather employ the F-

Measure (FM) to assess the model. This measure

combines both Precision and Recall into a single

value: FM =

2×RC×PR

RC+PR

• Kappa Coefﬁcient (κ): it is used to measure the

agreement between the predicted labels and the real

ones. The value of Kappa lies between −1 and 1,

where 0 represents agreement due to chance. The

value 1 represents a complete agreement between

both values. In rare cases, Kappa can be negative.

• Matthews Correlation Coefﬁcient (MCC): it is per-

formance measure barely inﬂuenced by imbalanced

test sets since it considers mutually accuracies and

error rates on all labels. So, it involves all values

of the confusion matrix. MCC ranges from 1 for a

perfect prediction to −1 for the worst possible pre-

diction. A MCC value close to 0 indicate a model

that performs randomly.

Training and Selecting the Best Predictive Mod-

els. A reasoner dataset, D

, is trained by each of

the above-mentioned supervised machine learning al-

gorithms. Thus, ﬁve distinct predictive models are

learned. We made use of the standard cross-validation

technique for evaluating these models. We applied a

stratiﬁed 10-fold cross-validation to ensure the ”gen-

eralizability” of the results. For the sake of sim-

plicity, we only retained the average values of the

computed assessment measures over all the cross-

validation steps.

We recall, our study covers 6 reasoners, for each 4

datasets were built in and then provided to 5 super-

vised learning algorithms, each of which have trained

it separately and produced its corresponding predic-

tive model.To sum up, 20 models were learned for ev-

ery reasoner, and assessed using 3 distinct evaluation

measures. All these steps were put forward to reveal

the reasoner best predictive model, according to the

aforementioned assessment measures. Being aware

of the amount of data to analyse, we propose to com-

pute a score index per model. By referring to it, we

will be able to establish a total order of the reasoner’s

predictive models. We called it, the Matthews Kappa

Score (MKS). As the acronym would suggest, MKS

is a weighted aggregation of the MCC and the Kappa

values computed to assess a reasoner model M

MKS(M

) =

α ∗ MCC(M

) + β ∗ Kappa(M

)

α + β

(2)

,where (α + β) = 1. Thus, the best predictive model

for a given reasoner is the one having the maximal

value of the MKS score:

RBest

= arg max

∈M R

( MKS(M

)) (3)

,where M R denotes the set of predictive models

learned for a reasoner R. In the case where multi-

ple maximal models are identiﬁed, the model with

the highest F-Measure (FM) is selected. We believe

that using MKS is the straightforward way to automat-

ically determine the best robustness predictive model

for a given reasoner. The rational behind this proposal

is twofold: ﬁrst, MCC and Kappa are widely recog-

nized as powerful assessment measures, even more

effective that FM, in the case of imbalanced datasets;

second both measures are ranging in [−1, 1], which

makes the aggregation coherent.

9 RESULTS OF REASONER

ROBUSTNESS PREDICTIVE

MODELS

The selection of a reasoner best predictive model is

achieved within two stages: the best model given a

speciﬁc reasoner dataset variant; then the best model

across all the datasets. Results of each stage will be

discussed in the following:

9.1 Best Models from RAWD Datasets

For the sake of brevity, we only report the assessment

results of the predictive models learned from reasoner

RAWD datasets. Indeed in this type of datasets, on-

tology feature vectors are full ones, counting 101 dis-

tinct features. Table 4 shows the distributions of F-

Measures (FM), MCC and Kappa (κ) across the 5

learned models and for every reasoner. The end-

ing line of table 4 displays the name of the reasoner

best predictive model, denoted by M

RBest

, under the

RAWD datasets. The selection was made according

to the MKS values

. The assessment results of M

RBest

are denoted in boldface.

A number of important observations can be made

from this table. Obviously, the RF algorithm is the

most performing learning algorithm in training the

RAWD datasets, since it derived the best predictive

models for all the 6 reasoners. Moreover, it would

be noticed that the maximal reported value of the F-

Measure (FM) for the 6 reasoner ranges from 0.86 by

RF in the case of TrOWL, going to 0.96 also by RF

in the case of Konclude. These close to optimum FM

In our experimentations, we set up α = β = 0.5.

Predicting the Empirical Robustness of the Ontology Reasoners based on Machine Learning Techniques

Table 4: Assessment summary of the reasoners’ models learned from RAWD datasets.

Konclude MORe HermiT TrOWL FaCT++ JFact

FM MCC κ FM MCC κ FM MCC κ FM MCC κ FM MCC κ FM MCC κ

RF 0.96 0.66 0.57 0.95 0.79 0.77 0.89 0.66 0.65 0.86 0.51 0.48 0.91 0.75 0.74 0.92 0.86 0.86

SL 0.94 0.40 0.33 0.94 0.74 0.72 0.88 0.62 0.62 0.82 0.34 0.27 0.87 0.66 0.64 0.90 0.80 0.80

MP 0.95 0.53 0.47 0.94 0.68 0.64 0.83 0.48 0.43 0.79 0.24 0.18 0.81 0.50 0.44 0.75 0.51 0.50

IBk 0.95 0.56 0.48 0.94 0.73 0.72 0.87 0.61 0.60 0.84 0.45 0.43 0.88 0.68 0.67 0.90 0.81 0.80

SMO 0.94 0.43 0.36 0.93 0.69 0.68 0.83 0.57 0.57 0.83 0.40 0.33 0.85 0.61 0.57 0.87 0.75 0.75

RBest

RF (MKS: 0.62) RF (MKS: 0.78) RF(MKS: 0.66) RF (MKS: 0.50) RF (MKS: 0.75) RF (MKS: 0.86)

Table 5: Best reasoner models across the different types of datasets.

Dataset Konclude MORe HermiT TrOWL FaCT++ JFact

variant M

best

MKS FM M

best

MKS FM M

best

MKS FM M

best

MKS FM M

best

MKS FM M

best

MKS FM

RAWD RF 0.62 0.96 RF 0.78 0.95 RF 0.66 0.89 RF 0.50 0.86 RF 0.75 0.91 RF 0.86 0.92

MDL RF 0.65 0.96 SL 0.80 0.96 SL 0.71 0.91 MP 0.61 0.89 RF 0.77 0.92 SMO 0.89 0.92

RLF RF 0.62 0.96 RF 0.77 0.95 RF 0.67 0.89 RF 0.50 0.87 RF 0.75 0.91 RF 0.86 0.92

CFS RF 0.65 0.96 RF 0.74 0.95 RF 0.70 0.90 RF 0.58 0.88 RF 0.74 0.90 RF 0.87 0.90

RBest

RF+CSF(8f) SL+MDL(74f) SL+MDL(85f) MP+MDL(81f) RF+MDL(79f) SMO+MDL(86f)

values indicate that our proposed set of ontology fea-

tures entails highly accurate predictive models, even

when no feature selection or dimensionality reduction

techniques are deployed a priori. Thereby, the rela-

tively high correlation between our ontology features

and the reasoners’ robustness is conﬁrmed. In overall,

the reasoner best models have achieved good MCC

and Kappa values, as their MKS scores vary between

0.5 (by RF for TrOWL) and 0.86 (by RF for JFact).

This ﬁnding proves the ability of the learned mod-

els to predict the minor classes (Timeout, Unexpected

and Halt). Nevertheless, TrOWL’s and Konclude’s

predictive models are less effective than the other rea-

soner models. At this stage, we can not conclude,

which precise aspect is lowering the MKS scores of

both models. It would be either the biased nature of

the datasets, or probably some noisy features, which

should be removed.

9.2 Best Models Across the Dataset

Types

In the table 5, each reasoner best model given a

dataset type is reported as well as its assessment val-

ues of MKS and FM. The last line of the table sum

ups the comparison by revealing, the across datasets,

reasoner best robustness predictive model, denoted by

RBest

. The dataset type and the dimension of the fea-

ture space of the M

RBest

are also indicated.

Not once the RAWD dataset, with its entire set of

101 ontology features, was listed in the ﬁnal selec-

tion of the reasoners’ best models. In most cases, the

MKS and FM values of reasoner predictive models

derived from featured datasets exceed the ones com-

puted from RAWD models. Therefore, it is quite cer-

tain, that restricting the full initial set of ontology fea-

tures to some particular subsets would improve pre-

dictive power of the reasoner robustness models. It

would also observed that the size of feature vectors of

the M

RBest

models varies from 8 in the case of Kon-

clude going to 86 in the case of JFact. This observa-

tion pinpoints that key ontology features, are close to

the reasoner under study. Reasoners implement dif-

ferent optimization techniques to overcome particular

complexity sources in the ontology and thus, indica-

tors of their robustness would also vary. Nevertheless,

we are not sure which feature selection method would

be the most suited in discovering key features for all

the reasoners. In fact, the MDL method have outper-

formed in major cases, but beaten by the CFS’s tech-

nique in one reasoner instance, i.e. Konclude. For this

reasoner, the assessment measures of the MDL’s and

CFS’s predictive models were equal. However, the

CFS method have delivered a more compact feature

set fully included in the one identiﬁed by MDL. Given

this fact, we chose the CFS’ model as the best one for

the reasoner Konclude, since it is much easier to inter-

pret. On the other hand, SL, SMO, MP and RF have

shared the podium of the most performing supervised

learning algorithms. Considering these ﬁndings, we

would conﬁrm that there is no ultimate best combi-

nation, i.e. feature selection technique and learning

algorithm, that suits all the reasoners. Accordingly,

we admit that even if the learning process may be

maintained the same, the learning techniques must

be diversiﬁed to grasp, in a better way, the reasoner-

speciﬁc empirical behaviours. In overall, the learned

reasoners’ best predictive models showed to be highly

accurate, achieving in the most cases, an over to 0.90

FM value. In addition, they are well resisting to the

KEOD 2015 - 7th International Conference on Knowledge Engineering and Ontology Development

problem of minor classes, as in all cases, their MKS

scores were over 0.60. In particular, the MKS of the

JFact best predictive model was 0.89, indicating al-

most a perfect predictive model. All of these ﬁndings

witness the high generalizability of the learned model

and their effectiveness in predicting reasoner robust-

ness for the ontology classiﬁcation task.

9.3 Unveil the Key Features

The high accuracy of the reasoner robustness predic-

tive models trained from featured datasets conﬁrmed

the validity of our assumption that particular ontol-

ogy features would help tracking down the reasoner

empirical behaviours. Henceforth, the robustness

of reasoners would be explained in terms of the

reduced set of ontology features, involved in their

respective best predictive models. We have called

these subsets, a reasoner local key features. We

have also investigated possible presence of shared

features across the different reasoners’ best predictive

models, i.e. global key features. We pinpointed their

contribution to the reasoner robustness prediction, by

computing their occurrences in the best models. A

group of 8 features were recognized to be the most

relevant indicators as they were involved in each of

the 6 best predictive models. Namely, these are: the

number of named object property (SOP), the highest

value of the max and the exact number restrictions,

i.e. HVC(max) and HVC(exact), as well as the

average value of numbers used in the different car-

dinality restrictions (AVR), two members of the set

of class constructors frequency CCF(owl:hasSelf)

and CCF(owl:maxCardinality), the ob-

ject property hard characteristics frequency

OPCF(owl:symmetricObjectProperty),

and ﬁnally the ratio

ATF(owl:functionalDataProperty). Worth

of mention, these features are the local key ones

of Konclude, but also good indicators for the other

reasoners. To have further insights about the contri-

bution of the remaining features, we distinguished

four levels of frequencies. Given O

the number of

occurrences of the feature f in the set of the 6 best

models, f has high frequency if O

≥ 5, medium one

if 3 ≤ O

≤ 4, low in case 1 ≤ O

≤ 2 and eventually

the feature could be NotUsed having O

= 0. In

overall, 69 features were found to be highly frequent,

10 were medium ones, 9 had low frequencies and

13 never been involved in the set of reasoners’ best

predictive models. The important number of highly

frequent features indicates the worthiness of our

conducted investigation about the most valuable on-

tology features. As we can not detail the frequencies

Figure 2: Frequency levels of ontology feature sets.

of each of the 101 features, we choose to report

the distribution of feature subcategories, over the

frequency levels. Figure 2 illustrates the results.

According to the established inspections, we

would assume that the highly frequent features among

best reasoner predictive models form a core group

of good indicators of reasoner robustness considering

the ontology classiﬁcation task. These features would

be recommended as starting points to conduct further

improvement in ontology modelling as well as rea-

soning optimization. We must stress that we are not

concluding that the low frequent features are unim-

portant for the reasoners. Quite the contrary, these

features could be very speciﬁc to a particular reasoner

behaviour.

10 CONCLUSIONS

In this paper, we conducted a rigorous investigation

about empirically predicting reasoner robustness. Our

main purpose was to be able to explain the empirical

behaviour of reasoners when inferring particular on-

tologies. To fulﬁll this purpose, we learned predictive

models of the robustness of 6 well known reasoners.

We put into comparison various supervised learning

algorithms and feature selection techniques in order

to train best reasoner predictive models. Thanks to

this predictive study, we unveiled sets of local and

global ontology key features likely to give insights of

ontology hardness level against reasoners. We want

to stress that these features are key ones under the se-

lected reasoner quality criterion, i.e. the robustness.

However, we cannot conﬁrm that the same features

would be always key ones when moving to different

criteria. Investigating this point would be interesting,

as to gain more insights about the most relevant fea-

tures across different reasoning quality criteria. In our

case, this purpose would be easily established, since

the whole learning steps described in this paper were

implemented in one single prototype. Our implemen-

Predicting the Empirical Robustness of the Ontology Reasoners based on Machine Learning Techniques

tation is generic enough, that it would be applied to

any reasoner, with the only requirement of provid-

ing enough amount of its running results. Our present

work could open further research perspectives. We as-

sume that the most important one would be extending

our learning steps by a ranking stage, where reasoners

could be compared based on their predicted robust-

ness for a given ontology. Such ranking would made

it possible to automatically recommend the most ro-

bust reasoner for any input ontology.

REFERENCES

Abburu, S. (2012). A survey on ontology reasoners and

comparison. International Journal of Computer Ap-

plications, 57(17):33–39.

Alaya, N., Lamolle, M., and Yahia, S. B. (2015). Towards

unveiling the ontology key features altering reasoner

performances. Technical report, IUT of Montreuil,

http://arxiv.org/abs/1509.08717, France.

Baader, F., Calvanese, D., McGuinness, D. L., Nardi, D.,

and Patel-Schneider, P. F., editors (2003). The De-

scription Logic Handbook: Theory, Implementation,

and Applications. Cambridge University Press, USA.

Baader, F. and Sattler, U. (2000). Tableau algorithms for

description logics. In Proceedings of the Interna-

tional Conference on Automated Reasoning with Ana-

lytic Tableaux and Related Methods (TABLEAUX).

Bail, S., Glimm, B., Jimnez-Ruiz, E., Matentzoglu, N., Par-

sia, B., and Steigmiller, A. (2014). Summary ore 2014

competition. In the 3rd Int. Workshop on OWL Rea-

soner Evaluation (ORE 2014), Vienna, Austria.

Faezeh, E. and Weichang, D. (2010). Canadian semantic

web. chapter A Modular Approach to Scalable Ontol-

ogy Development, pages 79–103.

Gangemi, A., Catenacci, C., Ciaramita, M., and Lehmann,

J. (2006). Modelling ontology evaluation and valida-

tion. In Proceedings of the 3rd European Semantic

Web Conference.

Garcia, S., Luengo, J., Saez, J., Lopez, V., and Herrera, F.

(2013). A survey of discretization techniques: Tax-

onomy and empirical analysis in supervised learning.

IEEE Transactions on Knowledge and Data Engineer-

ing, 25:734–750.

Gardiner, T., Tsarkov, D., and Horrocks, I. (2006). Frame-

work for an automated comparison of description

logic reasoners. In Proceedings of the International

Semantic Web Conference, USA, pages 654–667.

Glimm, B., Horrocks, I., Motik, B., Shearer, R., and Stoi-

los, G. (2012). A novel approach to ontology classiﬁ-

cation. Web Semant., 14:84–101.

Gonc¸alves, R. S., Matentzoglu, N., Parsia, B., and Sattler,

U. (2013). The empirical robustness of description

logic classiﬁcation. In Informal Proceedings of the

26th International Workshop on Description Logics,

Ulm, Germany, pages 197–208.

Gonc¸alves, R. S., Bail, S., Jim

enez-Ruiz, E., Matentzoglu,

N., Parsia, B., Glimm, B., and Kazakov, Y. (2013).

Owl reasoner evaluation (ore) workshop 2013 results:

Short report. In ORE, pages 1–18.

Gonc¸alves, R. S., Parsia, B., and Sattler, U. (2012). Per-

formance heterogeneity and approximate reasoning in

description logic ontologies. In Proceedings of the

11th International Conference on The Semantic Web,

pages 82–98.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., and Reute-

mann, P. (2009). The weka data mining software: An

update. SIGKDD Explor. Newsl., 11:10–18.

Horrocks, I. (2003). Implementation and optimisation tech-

niques. In The Description Logic Handbook: Theory,

Implementation, and Applications, chapter 9, pages

306–346. Cambridge University Press.

Horrocks, I., Kutz, O., and Sattler, U. (2006). The even

more irresistible S R OI Q . In Proceedings of the 23rd

Benelux Conference on Artiﬁcial Intelligence, pages

57–67.

Hu, B. and Dong, W. (2014). A study on cost behaviors

of binary classiﬁcation measures in class-imbalanced

problems. CoRR, abs/1403.7100.

Kang, Y.-B., Li, Y.-F., and Krishnaswamy, S. (2012). Pre-

dicting reasoning performance using ontology met-

rics. In Proceedings of the 11th International Con-

ference on The Semantic Web, pages 198–214.

Kang, Y.-B., Li, Y.-F., and Krishnaswamy, S. (2014). How

long will it take? accurate prediction of ontology rea-

soning performance. In Proceedings of the 28th AAAI

Conference on Artiﬁcial Intelligence, pages 80–86.

Kotsiantis, S. B. (2007). Supervised machine learning: A

review of classiﬁcation techniques. In Proceedings

of the Emerging Artiﬁcial Intelligence Applications in

Computer Engineering Conference., pages 3–24, The

Netherlands. IOS Press.

LePendu, P., Noy, N., Jonquet, C., Alexander, P., Shah,

N., and Musen, M. (2010). Optimize ﬁrst, buy later:

Analyzing metrics to ramp-up very large knowledge

bases. In Proceedings of The International Semantic

Web Conference, pages 486–501.

Mikol

sek, V. (2009). Dependability and robustness: State

of the art and challenges. In Software Technologies for

Future Dependable Distributed Systems, pages 25–31.

Motik, B., Shearer, R., and Horrocks, I. (2009). Hyper-

tableau reasoning for description logics. Journal of

Artiﬁcial Intelligence Research, 36:165–228.

OWL Working Group, W. (27 October 2009). OWL 2 Web

Ontology Language: Document Overview. W3C Rec-

ommendation. Available at http://www.w3.org/TR/

owl2-overview/.

Sazonau, V., Sattler, U., and Brown, G. (2014). Predicting

performance of owl reasoners: Locally or globally? In

Proceedings of the Fourteenth International Confer-

ence on Principles of Knowledge Representation and

Reasoning.

Sirin, E., Parsia, B., Grau, B. C., Kalyanpur, A., and Katz,

Y. (2007). Pellet: A practical owl-dl reasoner. Web

Semant., 5:51–53.

KEOD 2015 - 7th International Conference on Knowledge Engineering and Ontology Development

Steigmiller, A., Liebig, T., and Glimm, B. (2014). Kon-

clude: System description. Web Semantics: Science,

Services and Agents on the World Wide Web, 27(1).

Tartir, S., Arpinar, I. B., Moore, M., Sheth, A. P., and

Aleman-Meza, B. (2005). OntoQA: Metric-based on-

tology quality analysis. In Proceedings of IEEE Work-

shop on Knowledge Acquisition from Distributed,

Autonomous, Semantically Heterogeneous Data and

Knowledge Sources.

Tsarkov, D., Horrocks, I., and Patel-Schneider, P. F.

(2007). Optimizing terminological reasoning for ex-

pressive description logics. J. of Automated Reason-

ing, 39(3):277–316.

Weith

oner, T., Liebig, T., Luther, M., B

ohm, S., Henke, F.,

and Noppens, O. (2007). Real-world reasoning with

owl. In Proceedings of the 4th European Conference

on The Semantic Web, pages 296–310.

Zhang, H., Li, Y.-F., and Tan, H. B. K. (2010). Measuring

design complexity of semantic web ontologies. J. Syst.

Softw., 83(5):803–814.

Predicting the Empirical Robustness of the Ontology Reasoners based on Machine Learning Techniques