TWO METHODS FOR EVALUATING DYNAMIC ONTOLOGIES
Jaimie Murdock, Cameron Buckner and Colin Allen
Indiana University, Bloomington, IN, U.S.A.
Keywords:
Ontology evaluation, Ontology evolution.
Abstract:
Ontology evaluation poses a number of difficult challenges requiring different evaluation methodologies, par-
ticularly for a “dynamic ontology” representing a complex set of concepts and generated by a combination of
automatic and semi-automatic methods. We review evaluation methods that focus solely on syntactic (formal)
correctness, on the preservation of semantic structure, or on pragmatic utility. We propose two novel methods
for dynamic ontology evaluation and describe the use of these methods for evaluating the different taxonomic
representations that are generated at different times or with different amounts of expert feedback. The pro-
posed “volatility” and “violation” scores represent an attempt to merge syntactic and semantic considerations.
Volatility calculates the stability of the methods for ontology generation and extension. Violation measures the
degree of “ontological fit” to a text corpus representative of the domain. Combined, they support estimation
of convergence towards a stable representation of the domain. No method of evaluation can avoid making
substantive normative assumptions about what constitutes “correct” representation, but rendering those as-
sumptions explicit can help with the decision about which methods are appropriate for selecting amongst a
set of available ontologies or for tuning the design of methods used to generate a hierarchically organized
representation of a domain.
1 INTRODUCTION
The evaluation of domain ontologies that are gen-
erated by automated and semi-automated methods
presents an enduring challenge. A wide variety of
evaluation methods have been proposed; but it should
not be assumed that one or even a handful of evalua-
tion methods will cover the needs of all applications.
Ontology evaluation is as multifaceted as the domains
that ontology designers aspire to model. Projects dif-
fer in the resources available for validation, such as
a “gold standard” ontology, measures of user satis-
faction, explicitly stated assumptions about the logi-
cal or semantic structure of the domain’s conceptual-
ization, or a textual corpus or dictionary whose fit to
the ontology can be measured. They will also differ
in the goals of the evaluation: for instance, whether
they aim to use evaluation to select amongst a set of
available ontologies or to tune their methods of on-
tology design. Further, the methods will differ in the
assumptions they make about their subject domains,
for no evaluation method is possible without sub-
stantive normative assumptions as to the nature of the
“right” ontology.
At the Indiana Philosophy Ontology (InPhO)
project (http://inpho.cogs.indiana.edu), we are developing techniques to evaluate the
taxonomic structures generated by machine reason-
ing on expert feedback about automatically extracted
statistical relationships from our starting corpus, the
Stanford Encyclopedia of Philosophy (SEP). InPhO
does not assume that a single, correct view of the dis-
cipline is possible, but rather takes the pragmatic ap-
proach that some representation is better than no rep-
resentation at all (Buckner et al., 2010). Evaluation
allows us to quantify our model and makes explicit
the specific biases and assumptions underlying each
candidate taxonomy.
In this paper, we describe a pair of evaluation met-
rics we have found useful for evaluating ontologies
and our methods of ontology design. These met-
rics are designed for projects which have access to
large textual corpora, and which expect the structure
of their ontology to fit the distribution of terms in this
corpus. The volatility score (section 4.1) measures the
structural stability over the course of ontology exten-
sion and evolution. The violation score (section 4.2)
measures the semantic fit between an ontology’s tax-
onomic structure and the distribution of terms in an
underlying text corpus.
Before diving into these methodologies, we will
first situate them within the broader evaluation litera-
ture (section 2). Then we will describe the InPhO in
further detail, along with the raw materials we will be
evaluating (section 3). After this, we explore each of
the two new measures, labeling their assumptions and
demonstrating their capacity to guide the process of
ontology design.
2 STATE OF THE ART
Approaches to ontology evaluation are heavily depen-
dent on the positions taken towards ontology struc-
ture and design. Different assumptions underlying
these positions are often left implicit and this has
led to a tangled web of conflicting opinions in the
literature. However, Gangemi, Catenacci, Ciaramita
and Lehmann (2006) provide an excellent conceptual
scaffolding for use in detangling the web by establish-
ing three categories of evaluation techniques:
Structural Evaluation inspects the logical rigor
and consistency of an ontology’s encoding
scheme, typically as a directed graph (digraph) of
taxonomic and non-taxonomic relations. Struc-
tural evaluations are a measure of syntactic cor-
rectness. A few examples of structural evalua-
tion include the OntoClean system (Guarino and
Welty, 2004) and G
´
omez-P
´
erez’s (1999) paradigm
of correctness, consistency and completeness,
which was extended by Fahad & Qadir (2008).
Our proposed volatility score (Section 4.1) is a
structural evaluation of semantic consistency dur-
ing successive stages of a dynamic ontology’s it-
erative extension and evolution.
Functional Evaluation measures the suitability
of the ontology as a representation of the target
domain. Many functional evaluations follow a
“gold standard” approach, in which the candidate
ontology is compared to another work deemed
a good representation of the target domain (e.g.
Dellschaft & Staab (2008) and Maedche & Staab
(2002)). Another approach is to compare the can-
didate ontology to a corpus from which terms and
relations are extracted (Brewster et al., 2004). Our
proposed violation score (Section 4.2) is a corpus-
based functional evaluation of semantic ontologi-
cal fit.
Usability Evaluation examines the pragmatics
of an ontology’s metadata and annotation by fo-
cusing on recognition, efficiency (computational
and/or economic), and interfacing. The recog-
nition level emerges from complete documenta-
tion and effective access schemes. The efficiency
level deals with proper division of ontological re-
sources, and proper annotation for each. The in-
terfacing level is limited by Gangemi et al. (2006)
to the examination of inline annotations for inter-
face design, but these are not essential properties.
One chief measure of usability is compliance to
standards such as OWL and RDFa. Several frame-
works for social usability evaluation have been
proposed by Supekar (2004) and Gómez-Pérez (in Staab, 2004). ONTOMETRIC is an attempt
to codify the various factors in usability evalua-
tion by detailing 160 characteristics of an ontol-
ogy and then weighting these factors using semi-
automatic decision-making procedures (Lozano-
Tello and Gómez-Pérez, 2004).
These three paradigms of evaluation are real-
ized in different evaluation contexts, as identified by
Brank, Mladenic and Grobelnik (2005):
Applied. For functional and usability evalua-
tion, using the ontology to power an experimental
task can provide valuable feedback about suitabil-
ity and interoperability. Applied approaches re-
quire access to experts trained in the target domain
and/or ontology design. Velardi, Navigli, Cuc-
chiarelli, and Neri’s OntoLearn system (2005) uti-
lizes this type of applied evaluation metric. Porzel
and Malaka (2005) also use this approach within
speech recognition classification.
Social. Methods for usability evaluation have been proposed by Lozano-Tello and Gómez-Pérez (2004), Supekar (2004), and Noy (in Staab, 2004) for networks of peer-reviewed ontologies, operating in a similar manner to online shopping reviews. Most social
evaluation revolves around the ontology selection
task. These evaluations involve a purely qualita-
tive assessment and may be prone to wide varia-
tion.
Gold Standard. As mentioned above, the gold
standard approach compares the candidate on-
tologies to a fixed representation judged to be a
good representation (Maedche and Staab, 2002;
Dellschaft and Staab, 2008). These approaches
draw strength from the trainability of the au-
tomatic methods against a static target, but the
possibility of over-training of automated and
semi-automated methods for ontology population
means that the methods may not generalize well.
Corpus-based. Approaches such as those used by
Brewster, Alani, Dasmahapatra, and Wilks (2004)
calculate the “ontological fit” by identifying the
proportion of terms that overlap between the on-
tology and the corpus. This is a particularly well-
suited measure for evaluating ontology learning
algorithms. Our methods expand this measure-
ment approach to cover term relations through
both the violation and volatility measures.
This collection of evaluation paradigms and con-
textual backdrops allows us finally to consider the
type of information content being evaluated. A
“computational ontology”, such as the InPhO, is a
formally-encoded specification of the concepts and a
collection of directed taxonomic and non-taxonomic
relations between them (Buckner et al., 2010; Gruber,
1995; Noy and McGuinness, 2001). When evaluat-
ing information content, we must be careful to delin-
eate those which are node-centric (focusing on con-
cepts) from those which are edge-centric (focusing
on relations). Many authors (Maedche and Staab,
2002; Guarino and Welty, 2004; Brewster et al., 2004;
Gómez-Pérez, 1999; Velardi et al., 2005) focus upon
node-centric techniques, asking “Are the terms speci-
fied representative of the domain?” These investigate
the lexical content of an ontology. However, the se-
mantic content of an ontology is not defined solely
by the collection of terms within it, but also by the relations among these terms. Maedche & Staab (2002)
take this initial lexical evaluation and extend it to an
edge-based approach which measures the number of
shared edges in two taxonomies. The proposed viola-
tion and volatility scores (Section 4) are novel edge-
based measures which address the semantic content of
an ontology by comparing its relations to statistics derived
from a relevant corpus as a proxy for domain knowl-
edge. Additionally, these scores can provide insight
to the ontology design process by showing the con-
troversy of domain content and convergence towards
a relatively stable structure over time.
3 OUR DYNAMIC ONTOLOGY
A wide variety of projects can benefit from the de-
velopment of a computational ontology of some sub-
ject domain. Ontology science has evolved in large
part to suit the needs of large projects in medicine,
business, and the natural sciences. These domains
share a cluster of features: the underlying structures
of these domains have a relatively stable consensus,
projects are amply funded, and a primary goal is of-
ten to render interoperable large bodies of data. In
these projects, the best practices often require hir-
ing so-called “double experts” – knowledge modelers
highly trained in both ontology design and the sub-
ject domains – to produce a representation in the early
stages of a project which is optimally comprehensive
and technically precise.
There is another cluster of applications, however,
for which these practices are not ideal. These involve
projects with principles of open-access and domains
without the ample funding of the natural sciences.
Additionally, projects modeling domains in which our structural understanding is controversial or constantly evolving, and projects which utilize computational ontologies to enhance search or navigation through asynchronously updated digital resources, must account for the dynamic nature of their resources, whether it is in the underlying corpus or in the judg-
ments of the experts providing feedback on domain
structure. On the positive side, these areas often have
more opportunities to collect feedback from users
who are domain experts but lack expertise in ontol-
ogy design.
For the latter type of project we have recom-
mended an approach to design which we call dynamic
ontology. While a project in the former group prop-
erly focuses the bulk of its design effort on the pro-
duction of a single, optimally correct domain repre-
sentation, the latter cluster is better served by treating
the domain representation as tentative and disposable,
and directing its design efforts towards automating as
much of the design process as possible. Dynamic on-
tology, broadly speaking, tries to take advantage of
many data sources to iteratively derive the most useful
domain representation obtainable at the current time.
Two primary sources of data are domain experts and
text corpora. Domain experts provide abstract infor-
mation about presently-held assumptions and emer-
gent trends within a field from a source, namely their
own ideas, that is hard to examine directly. Text cor-
pora make it possible to quantify what is meant by
“domain” by providing a concrete encoding of the se-
mantic space that is available for empirical analysis,
in contrast to the ill-defined abstraction of “the do-
main is what the experts conceive of it as”. From both
kinds of sources many types of data may be gathered:
statistical relationships among terms, feedback from
domain experts, user search and navigation traces, ex-
isting metadata relationships (e.g. cross-references or
citations), and so on. As more data become available
and our understanding of the subject domain contin-
ues to evolve, the domain representation will be
dynamically extended, edited, and improved.
In dynamic ontology, problems of validation loom
especially large due to the combination of heteroge-
neous data sources. Each step in the design process
presents modelers with a panoply of choices for in-
consistency mitigation: e.g., which sources of data
to favor over others, how to settle feedback disagree-
ments, which reasoning methods to use for popula-
tion, how much feedback to solicit, and how to weigh
user feedback against statistical suggestions. The au-
tomation of ontology design is a field in its infancy,
and very little is known about the optimal choices to
satisfy specific design goals. Additionally, dynamic
ontologists might have questions regarding represen-
tational stability. If the domain is itself in flux or
controversial, modelers might want to know if they
have captured that change. The quantity of feedback
may also influence the convergence of a population
method to some stable representation. The devel-
opment of precise metrics about the relationship be-
tween an ontology and a domain may be useful in answering these questions.
The InPhO is a dynamic ontology which models
the discipline of philosophy. Our approach leverages
expert knowledge by augmenting it with machine rea-
soning, greatly reducing the need for expensive “dou-
ble experts”. The primary source of text data and do-
main experts is the Stanford Encyclopedia of Philoso-
phy (SEP, http://plato.stanford.edu)
. With over 700,000 weekly article down-
loads, the SEP is the leading digital humanities re-
source for philosophy. The corpus consists of over
1,200 articles and 14.25 million words maintained by
over 1,600 volunteer authors and subject editors. The
tremendous depth of the encyclopedia makes it im-
possible for any one person to have expertise over the
whole domain, necessitating the creation of a useful
organization scheme to provide better editorial con-
trol and content accessibility. At the same time, the
comprehensive richness of the SEP makes it a reason-
able proxy for the discipline of philosophy as a whole.
We begin with a small amount of manual ontology
construction obtained through collaboration with do-
main experts. A lexicon is established from SEP ar-
ticle titles, Wikipedia philosophy categories, n-gram
analysis and ad hoc additions by the InPhO curators.
We then build on this framework using an iterative
three-step process of data mining, feedback collec-
tion, and machine reasoning to populate and enrich
our representation of philosophy (see Figure 1).
First, the SEP is mined to create a co-occurrence
graph annotated with several statistical measures. For
each term in our lexicon, information entropy is mea-
sured, which provides an estimate of relative gen-
erality. For each graph edge, we calculate the J-
measure, which provides an estimate of semantic sim-
ilarity. From these measures we are able to generate
hypotheses about hypernym/hyponym candidates for
sets of terms in the corpus (Niepert et al., 2007). Sec-
ond, SEP authors and other volunteers verify these
hypotheses by answering questions about relational
hypotheses. This reduces the effect of any statisti-
cal anomalies which emerge from the corpus. Finally,
logic programming techniques are used to assemble
these aggregated feedback facts into a final popu-
lated ontology (Niepert et al., 2008). This knowledge
base can then be used to assist the authors, editors, and browsers of the SEP through tools such as a cross-reference generation engine and context-aware semantic search.
As was mentioned in the introduction, our prag-
matic approach recognizes the likelihood that there is
no single, correct view of the discipline. However,
even if other projects do not agree with our taxonomic
projections, our statistical data and expert evaluations
may still be useful. By exposing our data from each
of the three steps through an easy-to-use API, we en-
courage other projects to discover alternative ways
to construct meaningful and useful representations of
the discipline. Additionally, by offering an open plat-
form, we invite other projects to contribute relevant
data and expert feedback to improve the quality of the
service.
3.1 Raw Materials
In this section we describe the various components
of our project which can be exploited for ontology
evaluation.
3.1.1 Structure
The core of the InPhO is the taxonomic representation
marked by the isa relations between concepts. Con-
cepts in the InPhO may be represented as part of ei-
ther class or instance relations. Classes are specified
through the direct isa hierarchy of the taxonomy (see
below). Instances are established between a concept
and another concept which is part of the taxonomic
structure. Semantic crosslinks (hereafter, links) can
be asserted between two classes to capture the relat-
edness of ideas deemed mutually relevant by feedback
or automatic methods.
3.1.2 Statistics
The InPhO’s ontology population and extension tech-
niques rely upon an external corpus (the SEP) to
generate hypotheses about similarity and generality
relationships. From this corpus we generate a co-
occurrence graph G = (V, E) in which each node rep-
resents a term in our set of keywords. An edge be-
tween two nodes indicates that the terms co-occur at
least once.
Figure 1: The InPhO Workflow.
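As a rough illustration of this mining step (a sketch under assumptions, not the InPhO implementation), the following counts fragment-level occurrences and co-occurrences for a set of single-word lexicon terms; the fragment size, the whitespace tokenization, and the helper name cooccurrence_counts are all choices made for the example.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents, lexicon, fragment_size=200):
    """Count fragment-level occurrences and co-occurrences of lexicon terms.

    documents:     iterable of article texts (e.g. SEP articles).
    lexicon:       set of lower-cased single-word keyword terms; multi-word
                   terms would need phrase matching, omitted here for brevity.
    fragment_size: number of tokens per text fragment (an assumed value).
    """
    occ, cooc, n_fragments = Counter(), Counter(), 0
    for text in documents:
        words = text.lower().split()
        for start in range(0, len(words), fragment_size):
            terms = set(words[start:start + fragment_size]) & lexicon
            n_fragments += 1
            occ.update(terms)  # each term counted once per fragment
            # each unordered co-occurring pair counted once per fragment
            cooc.update(frozenset(pair) for pair in combinations(sorted(terms), 2))
    return occ, cooc, n_fragments
```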
For each node, the information content (Shannon
entropy) is calculated:
H(i) = -p(i) \log p(i)    (1)
For each edge, the directed J-measure (Smyth and
Goodman, 1992; Niepert et al., 2007) and conditional
entropy (Shannon, 1949) are calculated bidirectionally.
The conditional entropy calculates the information
content of a directed edge i → j. This is used as a
measure of semantic distance between two terms:
H(j | i) = p(i, j) \log \frac{p(i)}{p(i, j)}    (2)
The J-measure calculates the interestingness of in-
ducing the rule “Whenever idea i is mentioned in a
fragment of text, then idea j is mentioned as well”
(Niepert et al., 2007). This is used as a measure of
semantic similarity between two terms:
f(i → j) = p(j | i) \log \frac{p(j | i)}{p(j)} + (1 - p(j | i)) \log \frac{1 - p(j | i)}{1 - p(j)}    (3)

J(i → j) = p(i) f(i → j)    (4)
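From fragment-level counts like those sketched above, the measures in Equations 1-4 can be estimated directly. The sketch below uses maximum-likelihood probabilities and a base-2 logarithm; these choices, and the function names, are assumptions made for illustration rather than details of the InPhO codebase.

```python
import math

def probability(count, n_fragments):
    """Maximum-likelihood estimate of p(.) from fragment counts."""
    return count / n_fragments

def entropy(p_i):
    """Eq. 1: information content of a term occurring with probability p(i)."""
    return -p_i * math.log(p_i, 2)

def conditional_entropy(p_i, p_ij):
    """Eq. 2: conditional entropy of the directed edge i -> j."""
    return p_ij * math.log(p_i / p_ij, 2)

def j_measure(p_i, p_j, p_ij):
    """Eqs. 3-4: interestingness of the rule 'whenever i is mentioned, so is j'."""
    p_j_given_i = p_ij / p_i
    f = 0.0
    if p_j_given_i > 0:
        f += p_j_given_i * math.log(p_j_given_i / p_j, 2)
    if p_j_given_i < 1:
        f += (1 - p_j_given_i) * math.log((1 - p_j_given_i) / (1 - p_j), 2)
    return p_i * f
```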
3.1.3 Methods
The taxonomy itself is populated through the use of
answer set programming (Niepert et al., 2008). A
population method M(R, S, F) is specified by a set of
rules R, a seed taxonomy S, and a set of expert feed-
back or statistical hypotheses F. Changes in F allow
us to measure the impact of groups of expert feed-
back and to evaluate an ontology extension method.
Proposed ruleset changes can be evaluated by main-
taining the same set of inputs while testing variations
in R. The seed taxonomy is used to reduce the com-
putational complexity of a methodology, and changes
to this seed can be used to strengthen the ontology
design process. We currently have two years of data
collected on nightly repopulation of the published In-
PhO taxonomy, which is used for evaluation of our
ontology extension methods.
3.2 Our Challenges
As hinted above, our dynamic approach to ontology
design presents several unique challenges which re-
quire that appropriate validation methods be devel-
oped to address them. Specifically, there are a variety
of different ways that our answer set program could
infer a final populated ontology from aggregate ex-
pert feedback. For example, the program can vary in how it settles feedback inconsistencies (e.g. by leveraging user expertise in various ways (Niepert et al., 2008)), in how it checks for inconsistency between feedback facts (e.g. looking only at directly asserted inconsistencies or exploring transitivities to look for implied inconsistencies), and in how it restricts the conditions under which an instance or link relationship can be asserted (e.g. forbidding/permitting multiple classification, forbidding linking to a node when already reachable by ancestry, etc.). It is difficult or impossi-
ble to decide which of these design choices is optimal
a priori, and some precise evaluation metric would be
needed to determine which ruleset variations tend to
produce better results in certain circumstances.
Furthermore, our current methodology uses a
manually-constructed seed taxonomy and populates
this taxonomic structure through user feedback.
Many options are possible for this initial hand-coded
structure, and different experts would produce differ-
ent conceptualizations; we might want a measure of
which basic conceptualization tends to produce rep-
resentations which best fit the distribution of terms in
the SEP. More ambitiously, if we allow the answer
set program to use disjunctive branching rules with
regards to instantiation (thus creating multiple candi-
date ontologies from a single set of input), we could
produce a large space of possible ontologies consis-
tent with user feedback and a general theory of on-
tologies; the task would then be to rank these candi-
dates according to their suitability for our metadata
goals. Again, a precise evaluation metric which could
be used to select the “best” ontology from this space
is needed.
Another question concerns the amount of expert
feedback needed before we begin to see diminishing
returns. For example, we can only collect a limited
amount of feedback from volunteer SEP authors and
editors before the task becomes onerous; as such, we
want to prioritize the collection of feedback for ar-
eas of the ontology which are currently underpopu-
lated, or even pay some domain experts to address
such sparseness. To optimize efficiency, we would
want to estimate the number of feedback facts that
are needed to reach a relatively stable structure in that
area.
Finally, given that philosophy is an evolving do-
main rich with controversies, we might wonder how
much our evolving representation of that domain cap-
tures these debates as they unfold. One of the alluring
applications of dynamic ontology is to archive ver-
sions of the ontology over time and study the evo-
lution of a discipline as it unfolds. This is doubly-
relevant to our project, as both our domain corpus
(the asynchronously-edited SEP) and our subject dis-
cipline are constantly evolving. The study of this con-
troversy and the evolution resulting from it could be
greatly enhanced by using metrics to precisely char-
acterize change across multiple archived versions of
the ontology.
4 OUR SCORES
By stressing the dynamic nature of philosophy, we
do not mean to imply that the sciences lack contro-
versy, or that scientific ontologies do not need ways of
managing change. Nevertheless, whereas the sciences
typically aim for empirically-grounded consensus, the
humanities often encourage interpretation, reinterpre-
tation, and pluralistic viewpoints. In this context, the
construction of computational ontologies takes on a
social character that makes an agreed-upon gold stan-
dard unlikely, and makes individual variation of opin-
ion between experts a permanent feature of the con-
text in which ontology evaluation takes place. Be-
cause of the dynamic, social nature of the domain, we
do not try to achieve maximal correctness or stability
of the InPhO’s taxonomy of philosophical concepts
in one step. But by iteratively gathering feedback,
and improving the measures by which the ontology
fit to various corpora can be assessed, we can hope
to quantify the extent to which a stable representation
can be constructed despite controversy among users.
Our volatility score is designed to provide such a mea-
sure.
Many approaches to ontology evaluation, such as
our volatility score, focus solely on syntactic (formal)
properties of ontologies. These methods provide im-
portant techniques for assessing the quality of an on-
tology and its suitability for computational applica-
tions, but stable, well-formed syntax is no guarantee
that semantic features of the domain have been accu-
rately captured by the formalism. By using the SEP
as a proxy for the domain of philosophy, our violation
score exploits a large source of semantic information
to provide an additional estimate as to how well the
formal features of our ontology correspond to the rich
source material of the SEP.
4.1 Volatility Score
Most generally, a volatility score provides a mea-
sure of the amount of change between two or more
different versions of a populated ontology (we thank Uri Nodelman for early discussion of this idea). Such a
metric can serve a number of different purposes, in-
cluding controversy assessment and stability assess-
ment for a proposed methodology. As mentioned ear-
lier, the ever-changing corpora and domains modeled
by a dynamic ontology are riddled with controversy.
By comparing the changes between multiple archived
versions of a populated ontology through a “directed
volatility” score, we are able to track the evolution of
a knowledge base over time. At the same time, we
expect a proposed methodology to handle inconsis-
tencies gracefully. By using random samples of ex-
pert feedback, we are able to test a ruleset variation’s
stability through a “grab-bag volatility” score. By ad-
justing the size of these random samples, we can also
use this measure to determine how much feedback to
solicit before reaching a point of diminishing returns
with regards to stability.
While “volatility” represents a family of related
methods, they all share the same basic intuition that
some value is added to the aggregate volatility score
each time the method “changes its mind” about as-
serting some particular link in the ontology (e.g. an
instance switches from being asserted to not asserted
under some class). For example, consider the repre-
sentation of controversy over time: if behaviorism is
said to be highly related to philosophy of language but
a handful of expert evaluations indicate otherwise, our
model would “change its mind” about asserting a link
between behaviorism and philosophy of language. As
other experts choose sides and weigh in on the matter,
the volatility continues to increase, further pointing to
an area of conflict. To consider another application,
volatility can be used to indicate how much feedback
is needed to reach stability for some area of the ontol-
ogy by taking random subsets of feedback facts, and
assessing the amount of volatility between ontologies
generated from those random subsets. By increasing
the size of the subset, we then see how much impact
new feedback is having. Once we reach an accept-
ably low threshold for volatility, we can decide that
collecting more feedback is not worth the effort and
cost.
4.1.1 Assumptions & Requirements
Volatility measures the structural stability of a set of
ontologies or (derivatively) an ontology population
method. Many in the semantic web community hold
that domain ontologies are supposed to be authorita-
tive descriptions of the types of entities in a domain
(Smith, 2003). However, ontology development is of-
ten an iterative process (Noy and McGuinness, 2001),
especially in dynamic ontology. The volatility score
carries with it the assumption that a “final answer”
description will not respond to the metadata needs
of a dynamic corpus such as the SEP, Wikipedia, or
WordNet. Additionally, a domain can undergo wide
paradigm shifts, dramatically changing its conceptual
landscape (Kuhn, 1962). The advent of new the-
ories like quantum mechanics or new technologies
like computers, for example, radically reshaped the
conceptual landscape of philosophy. Therefore, the
volatility score must be evaluated by domain experts
to determine whether instability is due to undesirable
errors/omissions in feedback or the machine reason-
ing program, or whether it instead properly highlights
ongoing controversy within the field. In the former
case, changes to the ontology extension methods can
be made and evaluated against the old measure using
the violation score. In the latter, these highlighted ar-
eas of controversy could be used to inform research in
the field. In the case of the InPhO project, this could
help facilitate analytic metaphilosophy (see Section
6.1 of Buckner, Niepert, and Allen (2010)).
4.1.2 Formalization
There are two subfamilies of volatility scores. One is
the “directed volatility” which assesses the number of
times an instance flips from being asserted to not as-
serted given an ordered set of ontologies. “Directed
volatility” can be used to examine archived versions
of an ontology and provide feedback about ontology
extension methods. However, these directed measures
will not be useful in calculating the amount of feed-
back needed for the domain representation to reach
some desired threshold of stability, as any ordering
of populated ontologies derived from n random sam-
ples of z feedback facts would be entirely arbitrary.
Thus we want a measure which does not require the
ontologies to be ordered, but rather provides an esti-
mate of how volatile that whole set is when mutually
compared.
One way to achieve this is to consider the set of
feedback facts not as a single entity which evolves
over time, but rather as a supply of materials that can
be used to populate an ontology. In a similar man-
ner, we conceive of the populated ontology not as
a whole representation, but as a bag of inferred in-
stances. We then assess, for a set of n ontologies gen-
erated from random samples of z feedback facts and
any pair of terms P and Q, the relative proportion of
times instance_of(P, Q) is asserted vs. non-asserted.
Thus, for any two terms P and Q, the basic formula
for assessing the contribution of that pair to the over-
all volatility score is
v(P, Q) = 1 - \frac{|x - n/2|}{n/2}    (5)
where x is the number of times that the
instance_of(P, Q) is asserted in the set under
consideration. The total volatility is given by
volatility(z) = \frac{1}{count(P, Q)} \sum_{P,Q} v(P, Q)    (6)
However, a complication is introduced here in
that there are different etiologies which could lead
instance_of(P, Q) to switch from being asserted/non-
asserted. One way is for there to be a lack of any
feedback facts relevant to that instance which could
lead to the assertion of an instance_of relation; an-
other is due to the resolution of an inconsistency in
feedback facts (e.g. in one ontology a connection
is asserted between P and Q due to a user’s feed-
back, but not asserted in another because of con-
trary feedback from another user with a higher level
of expertise). In order to isolate these issues, we
adopt a “conservative” approach to assessing volatil-
ity: for any given pair of terms, we will only as-
sess a volatility contribution across the subset of on-
tologies where at least minimal raw materials are
present for asserting an instance_of relationship (e.g., more_specific(P, Q) and highly_related(P, Q)). (It follows from this that no volatility contribution is assessed for pairs
of terms which never have the raw materials for asser-
tion across those random subsets of feedback.) We
should still want to normalize this measure for the
whole set of generated ontologies, because we would
want to count an instance_of fact asserted 25 times
out of 50 relevant ontologies (i.e. ontologies gener-
ated from the relevant raw materials) as more volatile
than an instance_of fact which shifted 10 times out of
20 relevant ontologies (out of the 50 total generated).
In this case, the equation is modified to
v'(P, Q) = 1 - \frac{|x - m/2|}{m/2} \cdot \frac{m}{n}    (7)
which reduces to
v'(P, Q) = 1 - \frac{|x - m/2|}{n/2}    (8)
and gives the sum volatility of
volatility(z) = \frac{1}{count(P, Q)} \sum_{P,Q} v'(P, Q)    (9)
where m is the number of ontologies possessing raw
materials for a possible feedback assertion, and n is
the total number of ontologies generated for random
sampling of z feedback facts.
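A minimal sketch of this “grab-bag” computation follows, assuming each generated ontology has been summarized as the set of instance_of(P, Q) pairs it asserts and the set of pairs for which it possessed raw materials; these input structures and the function names are assumptions for illustration, not the InPhO answer set machinery.

```python
def pair_volatility(x, m, n):
    """Eq. 8: conservative volatility contribution of a single pair.

    x: ontologies asserting instance_of(P, Q),
    m: ontologies possessing raw materials for the pair,
    n: total ontologies generated from random samples of z feedback facts.
    """
    if m == 0:
        return None  # pair never had raw materials; it contributes nothing
    return 1.0 - abs(x - m / 2.0) / (n / 2.0)

def grab_bag_volatility(assertions, materials):
    """Eq. 9: mean pair volatility over all pairs that ever had raw materials.

    assertions: list of sets; assertions[k] holds pairs asserted by ontology k.
    materials:  list of sets; materials[k] holds pairs for which ontology k had
                the raw materials (e.g. more_specific and highly_related facts)
                to support an assertion.
    """
    n = len(assertions)
    contributions = []
    for pair in set().union(*materials):
        m = sum(pair in mats for mats in materials)
        x = sum(pair in asserted for asserted in assertions)
        v = pair_volatility(x, m, n)
        if v is not None:
            contributions.append(v)
    return sum(contributions) / len(contributions) if contributions else 0.0
```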
4.1.3 Interpretation of Results
Depending upon the modeler’s goals and assumptions
about the domain, the volatility metric can be dis-
played in different ways and given different interpre-
tations. Suppose, for example, that we want to visu-
alize an unfolding controversy in the discipline. We
may take some set of archived ontologies from the
temporal beginning and ending of the controversy,
and superimpose the volatility heat-maps for each
pairwise volatility comparison between a time slice
of the ontology and its temporal successor, coloring
areas of change, perhaps gradually fading from one
color to another as time goes on. “Hotter” areas of the
visualization indicate areas of more persistent contro-
versy, and the color shade indicates the trajectory of
the dialectic over time. This would allow an expert to
visualize the evolution of a controversy and its effects
rather effectively in a quick display.
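As one hedged illustration of such a display (not a tool shipped with the InPhO), the matplotlib sketch below assumes the pairwise comparisons have already been reduced to a list of term-by-term volatility matrices, one per pair of successive archived versions, and renders persistence and timing as two heat maps.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_volatility_history(volatility_slices, term_labels):
    """Superimpose per-pair volatility across successive archive comparisons.

    volatility_slices: list of (T x T) arrays, one per comparison of a time
                       slice with its temporal successor (assumed input format).
    term_labels:       the T term names, in matrix order.
    """
    slices = np.stack(volatility_slices)              # shape: (time, T, T)
    persistence = slices.mean(axis=0)                 # hotter = more persistent controversy
    peak_time = np.argmax(slices, axis=0) / max(len(volatility_slices) - 1, 1)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    for ax, data, cmap, title in [(ax1, persistence, "hot", "Persistent volatility"),
                                  (ax2, peak_time, "coolwarm", "Time of peak volatility")]:
        im = ax.imshow(data, cmap=cmap)
        ax.set_title(title)
        ax.set_xticks(range(len(term_labels)))
        ax.set_yticks(range(len(term_labels)))
        ax.set_xticklabels(term_labels, rotation=90)
        ax.set_yticklabels(term_labels)
        fig.colorbar(im, ax=ax)
    plt.tight_layout()
    plt.show()
```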
Suppose instead that our goal was to determine the
amount of feedback needed for comprehensive and
authoritative coverage of an area of our ontology. In
that case, the volatility metric would be summed as in-
dicated above for random samples of z feedback facts,
and the net result would provide a volatility estimate
for z facts that could be compared to measures for
other numbers of feedback facts or a predetermined
threshold. In this case, volatility indicates not con-
troversy, but rather the stability of the representation
given that number of feedback facts, as well as how
likely that representation is to change with the addi-
tion of more. Furthermore, we could look not just
at the aggregate sum of individual pair volatilities,
but rather display those on a heat map again. “Hot-
ter” areas on this visualization might indicate areas
which require more comprehensive or authoritative
expert feedback, and thus could be used to direct the
feedback solicitation process towards areas where it
is most needed.
4.1.4 Preliminary Results
While we do not currently have enough feedback facts
to reliably estimate the amount of feedback needed
to achieve diminishing returns, we have tested the
measure by taking random samples of z = 2000,
4000, 6000, and 8000 feedback facts, confirming
that volatility does indeed decrease with increas-
ing amounts of feedback even for our small data
set. A problem for small data sets, however, is that
the formalization of “grab-bag” volatility above de-
pends upon the idealization that one can draw y non-
overlapping random samples of z feedback facts from
the whole population of possible feedback. Our cur-
rent feedback consists of n = 8006 feedback facts.
This is severely limiting to the type of evaluation we
can presently do: At z = 2000, we can only take four
samples without overlap. As z approaches n, the prob-
ability that the very same feedback facts will be cho-
sen at each random sample increases exponentially
(and thus exponentially reduces the volatility metric).
While there are several possible methods to control
for this confound, we require a much larger sample of
feedback facts from which to draw our random sam-
ples. Further ideas as to how to deal with this con-
found are described in the Future Work section below.
4.2 Violation Score
For a candidate taxonomy, we introduce a “violation
score” that is computed by assessing the degree to
which its relative placement of terms diverges from
statistically generated expectations about those terms
relative locations in semantic space (as estimated by
their corpus-derived similarity and relative generality
measures). Similar to Dellschaft and Staab (2008), we
consider violation on both a local and a global level.
For local violations we only look at parent-child tax-
onomic relations. For the global violations, we look
at the weighted pathwise distance between two terms
in a taxonomy.
4.2.1 Assumptions & Requirements
One goal of ontology design is to produce a represen-
tation which captures the semantic structure of a do-
main. In order to have a concrete standard for evalua-
tion, the violation score uses the distribution of terms
in a corpus, e.g. a reference work in that domain, as
a proxy for the domain itself. Evaluation may thus
draw upon the statistical measures outlined in Section 3.1.2. However, any metric relating an ontology’s tax-
onomic relations to statistical measures carries with
it implicit assumptions regarding the semantic inter-
pretation of the ontology’s structural properties, such
as the interpretation of edges, pathwise distance, or
genealogical depth. In order for the representation
to be useful in end user applications (such as visu-
alization, semantic search, and ontology-guided con-
ceptual navigation), we consider several approaches
to interpreting ontological structure, which may be
adopted with varying degrees of strength:
Topic Neutrality. One might simply wish to reg-
iment all of the vocabulary in a common structure
representing only the isa relationships that exist
among the various terms. The goal of such a tax-
onomy is simply to enforce a hierarchical struc-
ture on all the terms in the language. According to
this approach, there is no implied semantic signifi-
cance to the node depth (aka, genealogical depth)
or to path length between pairs of nodes beyond
the hierarchical semantics of the isa relation it-
self. For example, if English contains more levels
of classificatory terms for familiar animals than
it does for relatively unfamiliar organisms, a term
such as “dog” may sit at a greater depth in the tax-
onomy from the root node than terms for other or-
ganisms that are similarly specific, but nothing of
any semantic significance is implied by this depth
(or the distance between term nodes) beyond the
existence of the intervening terms in the language.
Depth as Generality. One might desire that all
sibling nodes have approximately the same level
of generality in the target domain, making node
depth (distance from the root node) semantically
significant. On this view, the terms dog (a species)
and feline (a family) should not be at the same
depth, even if the language of the domain or cor-
pus contains the same number of lexical con-
cepts between dog and thing as between feline and
thing. Here one expects the entropy of terms at the
same depth to be highly correlated. (Edge equality provides a special case of depth as generality: the latter requires only that all edges at a given level represent the same semantic distance, whereas edge equality also requires these distances to be consistent between the different levels, e.g., the movement from a species to a genus represents the same conceptual distance as that between an order and a class.)
Leaf Specificity. One might desire that all leaf
nodes in the structure represent approximately the
same grain of analysis. On this view, regard-
less of node depth, leaves should have similar en-
tropy. Thus, for example, if hammerhead shark
and golden retriever are both leaf nodes, leaf
specificity is violated if these terms are not simi-
larly distributed across the corpus that is standing
proxy for the domain.
Choices among these desiderata are central to any
argument for edge-based taxonomic evaluation. This
is especially true for gold standard approaches which
implicitly hold the relations of two candidate ontolo-
gies to be semantically equivalent. Additionally, we
suspect that most domains have asymmetric taxo-
nomic structures: subtrees of sibling nodes are not
typically isomorphic to one another, and this means
that even within a given taxonomy, path length be-
tween nodes and node depth may not have the same
semantic significance.
In our comparison methods we assume that node
depth is topic neutral; that is, node depth bears lit-
tle correlation to specificity or generality on a global
level. However, by definition, a child node should be
more specific than its parent node. Thus, we measure
local violation by comparing the information content
of the parent and child nodes. When two terms are
reversed in specificity we can count this as a syntac-
tic violation of the taxonomic structure. Additionally,
we can expect sibling instances to be closely related
to one another and to their parent node by statistical
measures of semantic distance. An instance is in vi-
olation if it is an outlier compared to the rest of its
siblings.
We propose that overall violation is an emergent
property from these localized semantic violations.
These violations are each weighted by the magnitude
of the error, ensuring that an ontology with several
large mistakes will have greater violation than one
with many minute errors.
4.2.2 Formalization
A generality violation (g-violation) occurs when two
terms are reversed in specificity (e.g., the statistics
propose that connectionism is more specific than cog-
nitive science but the answer set asserts that cogni-
tive science is more specific). For two terms S and G,
where S is more specific than G, we hypothesize that
the conditional entropy will be higher for G given
S than for S given G.
H(G | S) > H(S | G) (10)
This makes intuitive sense if one considers the
terms dog (S) and mammal (G). The presence of the
term dog will lend far more certainty to the appear-
ance of mammal than the other way around -- men-
tioning mammal is not very predictive of dog.
If this inequality does not hold, we take this as a
generality violation (g-violation):
gv(S, G) = H(S | G) - H(G | S)    (11)
The mean of the g-violations is then taken to give
the overall g-violation.
violation_g(O) = \frac{1}{count(S, G)} \sum_{S,G} gv(S, G)    (12)
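A minimal sketch of this computation is given below; the cond_entropy callable stands in for the corpus statistics of Section 3.1.2, and the decision to accumulate the signed margin H(S | G) - H(G | S) for every asserted pair (so that correctly ordered pairs pull the mean below zero, as in the scores reported in Section 4.2.4) is an interpretive assumption.

```python
def g_violation(pairs, cond_entropy):
    """Mean generality violation over asserted instance pairs (Eqs. 11-12).

    pairs:        iterable of (S, G) tuples with S asserted as an instance of G.
    cond_entropy: assumed callable (a, b) -> H(a | b), estimated from the corpus.
    """
    # Eq. 10 expects H(G | S) > H(S | G); the signed margin penalizes reversals
    # and rewards (with negative values) correctly ordered pairs.
    margins = [cond_entropy(s, g) - cond_entropy(g, s) for s, g in pairs]
    return sum(margins) / len(margins) if margins else 0.0
```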
A similarity violation (s-violation) occurs when an
instance’s semantic similarity to its parent class is an
outlier compared to the rest of its siblings. For exam-
ple, the entity (ideas about) federalism has been ob-
served under both (ideas about) social and political
philosophy and (ideas about) forms of government.
However, the siblings of federalism under forms of
government are much closer to their parent node than
those under social and political philosophy. There-
fore, a taxonomy asserting that federalism is an in-
stance of social and political philosophy will recieve
higher violation than one in which federalism is an
instance of forms of government.
Semantic similarity can be measured using a vari-
ety of measures reviewed in Jiang and Conrath (1997)
and Resnik (1999). We use the measure presented in
Lin (1998):
sim(x_1, x_2) = \frac{2 \times \log P(C)}{\log P(x_1) + \log P(x_2)}    (13)
where x_1 and x_2 are entities in the taxonomy, and C is the most specific class which subsumes x_1 and x_2. As we are simply comparing an instance S to its parent G, we can use:

sim(S, G) = \frac{2 \times \log P(G)}{\log P(S) + \log P(G)}    (14)
The degree of s-violation can be determined by
the standard score, which normalizes the values by
standard deviation:
sv(S, G) = \frac{sim(S, G) - \mu}{\sigma}    (15)
where sim(S, G) is the raw similarity, µ is the mean of the similarity to the parent over all sibling nodes and σ is the standard deviation of this popula-
tion. The final s-violation is calculated as the mean of
s-violations.
violation_s(O) = \frac{1}{count(S, G)} \sum_{S,G} sv(S, G)    (16)
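A sketch of the s-violation computation under stated assumptions: siblings are grouped by their shared parent class, prob stands in for corpus-derived term probabilities, and the absolute standard score is used so that outliers on either side of the sibling mean raise the score. None of these choices is taken directly from the InPhO implementation.

```python
import math
from statistics import mean, pstdev

def lin_similarity(p_s, p_g):
    """Eq. 14: Lin similarity of an instance S to its parent class G."""
    return (2.0 * math.log(p_g)) / (math.log(p_s) + math.log(p_g))

def s_violation(taxonomy, prob):
    """Mean similarity violation (Eqs. 15-16).

    taxonomy: assumed dict mapping each parent class G to its asserted instances.
    prob:     assumed callable term -> corpus probability P(term), with 0 < P < 1.
    """
    scores = []
    for g, siblings in taxonomy.items():
        sims = [lin_similarity(prob(s), prob(g)) for s in siblings]
        if len(sims) < 2:
            continue                       # no sibling population to compare against
        mu, sigma = mean(sims), pstdev(sims)
        if sigma == 0:
            continue                       # all siblings equally similar to the parent
        scores.extend(abs(sim - mu) / sigma for sim in sims)   # Eq. 15 (absolute value)
    return sum(scores) / len(scores) if scores else 0.0        # Eq. 16
```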
4.2.3 Interpretation of Results
The violation score is intended as a way to select the
best representation of a given set of input parameters.
In our methodology, the violation score is used to test
variations in ruleset changes or seed taxonomies. This
evaluation can be used throughout the ontology de-
sign process to refine the methodology. We have used
violation to examine changes to the assertion of se-
mantic crosslinks and in the weighting of expert feed-
back obtained from novice philosophers, undergradu-
ate majors, graduate students, and professors of phi-
losophy.
Additionally, we are able to use the violation
score to compare different samples of expert feed-
back by using the same seed taxonomy and ruleset.
The changes in violation scores exposed a steady in-
crease in taxonomic fit from novices to undergrad-
uates to graduate students, before a slight decrease
with professors. Further investigation of violations
found that our highest-level experts were more likely
to go against the statistical prediction in often use-
ful ways, further justifying the solicitation of feed-
back. Note that this starkly illustrates the limits of this
method of corpus-based ontology validation: in this
case, we solicited expert feedback precisely because
we regarded the co-occurrence statistics as less than
perfectly reliable, and in general the judgments of ex-
perts are regarded as more trustworthy than the eval-
uation metrics generated from those co-occurrence
statistics. As such, we would obviously not infer that
the ontology generated from the inclusion of expert
feedback is less desirable than that without. In gen-
eral, one should keep in mind during evaluation that
one should not evaluate representations generated us-
ing one source of data against evaluation metrics gen-
erated using another, less-trusted source of data. In
practice, this complicates even comparisons between
different versions of the ruleset, for we must care-
fully reason through whether some particular ruleset
change could be subtly biasing the representation to-
wards or against expert feedback (e.g., in the way it
settles inconsistency between users and experts).
4.2.4 Experimental Results
Since deploying the initial version of our answer set
program (described in Niepert et al. 2008), we discov-
ered a number of possible improvements, but could
not be sure a priori which version of the ruleset would
produce better results. The violation score provides
us with a way to compare these options in terms of
their suitability. We identified three binary param-
eters along which our program can vary, and have
compared the violation scores for each possible com-
bination (resulting in a 2x2x2 matrix). The three pa-
rameters are briefly described under their abbreviated
names below.
“plink”. Our original ruleset (Niepert et al. 2008)
included non-taxonomic “links” to allow reacha-
bility between entities which were semantically
related but which, for various reasons, could not
be connected taxonomically. To minimize unnec-
essary taxonomic relations, we added a rule (here-
after, the “nins” rule) which blocked an instance X
from being asserted as an instance of a class Y if
there was also evidence that X was an instance of
class Z and Y was possibly linked (“plink”ed) to Z
(since in that case X would already be reachable
from Y via the Y → Z link). Unexpectedly, we
found that this occasionally produced an undesir-
able “reciprocal plink deadlock” (see Figure 2):
whenever links were possible from both Y → Z and Z → Y, the nins rule blocked X from be-
ing inferred as an instance of either Y or Z (and
thus X often became a taxonomic “orphan”). As
such, we created a second version of the pro-
gram which added a “no plink” restriction to the
“nins” rule, preventing this reciprocal plink situ-
ation. The “plink” parameter indicates that this
restriction was added to the nins rule.
Figure 2: The reciprocal plink problem
“voting”. An important innovation of our project
involves the stratification of user feedback into
different levels of self-reported expertise and us-
ing this information in a two-step process to re-
solve feedback inconsistencies. The first step in
this process involves the application of a “voting
filter” which settles intra-strata feedback incon-
sistencies using a voting scheme and can be com-
pleted as a preprocessing step before the answer
set program is run (as described in Niepert et al.
2009). The “voting” parameter indicates that this
filter was run.
“trans”. Much of the information on which our
program operates is derived from the transitivity
of the “more general than”/“more specific than”
feedback predicates. The second step of our
method for settling feedback inconsistencies in-
volves settling inter-strata inconsistencies, which
is completed from within our ruleset. However,
transitivities in feedback can be computed either
before or after these inter-strata inconsistencies
are resolved (the former resulting in many more
inconsistencies requiring resolution). The “trans”
parameter thus indicates that this version of the
ruleset computes transitivities before (vs. after)
our ruleset settles inter-strata inconsistencies.
Each modification was then compared to the cur-
rent ruleset using both the s-violation and g-violation
metrics using corpus statistics and user evaluations
from July 24, 2010 (see Figure 3). The number of
instances asserted is also included. As we can clearly
see, every proposed change decreased both violation
scores, with the best results provided by adopting
all three changes (g-violation was lowest when adopting the plink and voting changes but not trans; the result with all three changes was second lowest). The decrease in s-violation can be interpreted as the development of denser semantic clusters subsumed under each class. The decrease in g-violation can be interpreted as movement towards greater stratification in the hierarchy. This is quantitative evidence that the principled design choices outlined above will provide useful additions to the ontology enrichment process.

  ruleset         s-violation          g-violation          instances
                  all-in   voting      all-in    voting     all-in   voting
  current         0.8248   0.8214     -0.1125   -0.1170     417      456
  plink           0.8111   0.8089     -0.1182   -0.1227     521      568
  trans           0.8119   0.8094     -0.1133   -0.1168     452      491
  plink, trans    0.8061   0.8031     -0.1153   -0.1188     502      546

Figure 3: Violation score evaluations on the InPhO using feedback and corpus statistics from July 24, 2010.
5 FUTURE WORK
With these methods of evaluating ontology structure
and function in hand, along with preliminary results
on our limited feedback collection, we propose to
continue these evaluation experiments as new feed-
back is rapidly collected from SEP authors. These
scores will allow us to pursue a long-desired use of
our answer set programming to infer a space of popu-
lated ontologies and select an optimal one by ranking
them according to violation scores. We can then see
how consistent ruleset selection is.
We might also ask how feedback from people with
different levels of expertise in philosophy affects the
placement of terms in the InPhO. For instance, Eckert et al. (2010) have already gathered feedback data from
Amazon Mechanical Turk (AMT) users and com-
pared their responses to those of experts. Although
we know that as a whole they differ statistically from
experts, we do not yet know how much this matters
to the structure that is eventually produced from those
feedback facts.
As for the confound of overlapping samples in
the calculation of “grab-bag” volatility (see section
4.1.4), an ideal solution is to solicit more feedback,
increasing the number of non-overlapping samples
of a size z. Collecting generalized feedback from
lower levels of expertise is economically feasible us-
ing AMT. Additionally, we can isolate small sections
of the ontology to gather a very large amount of ex-
pert feedback from SEP authors in order to determine
the point of diminishing returns for that location and
extrapolate that result to estimate the amount of feed-
back required for other sections.
Finally, the InPhO has daily archives of its popu-
lated ontologies from October 23, 2008 to the present
(July 25, 2010). By using the volatility measure on
this data set, we should gain insights into our own
ability to capture controversy and convergence within
a field and be able to present that to philosophers
through the visualizations described in Section 4.1.3.
6 CONCLUSIONS
In this paper we have proposed two methods for
evaluating the structural and functional aspects of a
corpus-based dynamic ontology. Our work focuses on
the semantic evaluation of taxonomic relations, rather
than the lexical evaluation undertaken by Brewster et
al. (2004) and Dellschaft & Staab (2008). The vio-
lation score gives us a concrete measure of how well
an ontology captures the semantic similarity and gen-
erality relationships in a domain by examining statis-
tical measures on an underlying corpus. The volatil-
ity score exposes areas of high uncertainty within a
particular ontology population method, which can be
used for many purposes. Directed measures of volatil-
ity can indicate the evolution of a knowledge base
and highlight areas of controversy. Non-directed mea-
sures can indicate the stability of a ruleset variation by
using random samples of expert feedback. This can
also estimate the amount of expert feedback required
for a convergent representation. We have also examined the considerations necessary for evaluating a taxon-
omy, and demonstrated how these methods have been
used to enhance the enrichment process of the Indiana
Philosophy Ontology Project through experiments on
ruleset variations, expert feedback stratification and
stability.
ACKNOWLEDGEMENTS
During the preparation of this manuscript, the first
author was supported by grants from the Cognitive
Science Program and Hutton Honors College at Indi-
ana University. The research described in this paper
has been funded with grants from the United States
National Endowment for the Humanities Division of
Preservation and Access and the NEH Office of Digi-
tal Humanities.
REFERENCES
Brank, J., Grobelnik, M., and Mladenic, D. (2005). Survey
of ontology evaluation techniques. In Proceedings of
the Conference on Data Mining and Data Warehouses
(SiKDD).
Brewster, C., Alani, H., Dasmahapatra, S., and Wilks, Y.
(2004). Data driven ontology evaluation. In Proceed-
ings of LREC, volume 2004.
Buckner, C., Niepert, M., and Allen, C. (2010). From ency-
clopedia to ontology: Toward dynamic representation
of the discipline of philosophy. Synthese.
Dellschaft, K. and Staab, S. (2008). Strategies for the Evalu-
ation of Ontology Learning. In Buitelaar, P. and Cimi-
ano, P., editors, Ontology Learning and Population:
Bridging the Gap Between Text and Knowledge, pages
253–272. IOS Press.
Eckert, K., Niepert, M., Niemann, C., Buckner, C., Allen,
C., and Stuckenschmidt, H. (2010). Crowdsourcing
the Assembly of Concept Hierarchies. In Proceedings
of the 10th ACM/IEEE Joint Conference on Digital Li-
braries (JCDL), Brisbane, Australia. ACM Press.
Fahad, M. and Qadir, M. (2008). A Framework for Ontol-
ogy Evaluation. In Proceedings International Con-
ference on Conceptual Structures (ICCS), Toulouse,
France, July, pages 7–11. Citeseer.
Gangemi, A., Catenacci, C., Ciaramita, M., and Lehmann,
J. (2006). Modelling ontology evaluation and valida-
tion. In The Semantic Web: Research and Applica-
tions, pages 140–154. Springer.
Gómez-Pérez, A. (1999). Evaluation of taxonomic knowl-
edge in ontologies and knowledge bases. In Pro-
ceedings of the 12th Banff Knowledge Acquisition for
Knowledge-Based Systems Workshop, Banff, Alberta,
Canada.
Gruber, T. R. (1995). Toward principles for the design of
ontologies used for knowledge sharing. International
Journal of Human Computer Studies, 43(5):907–928.
Guarino, N. and Welty, C. A. (2004). An overview of Onto-
Clean. In Staab, S. and Studer, R., editors, Handbook
on ontologies, chapter 8, pages 151–159. Springer, 2
edition.
Jiang, J. and Conrath, D. (1997). Semantic similarity based
on corpus statistics and lexical taxonomy. In Proceed-
ings of International Conference Research on Com-
putational Linguistics (ROCLING X), number Rocling
X, Taiwan.
Kuhn, T. (1962). The Structure of Scientific Revolutions.
University of Chicago Press.
Lin, D. (1998). An information-theoretic definition of sim-
ilarity. In Proceedings of the 15th International Con-
ference on Machine Learning, pages 296–304. Cite-
seer.
Lozano-Tello, A. and Gómez-Pérez, A. (2004). Ontometric:
A method to choose the appropriate ontology. Journal
of Database Management, 15(2):1–18.
Maedche, A. and Staab, S. (2002). Measuring similar-
ity between ontologies. Knowledge Engineering and
Knowledge Management: Ontologies and the Seman-
tic Web, pages 15–21.
Niepert, M., Buckner, C., and Allen, C. (2007). A dy-
namic ontology for a dynamic reference work. In Pro-
ceedings of the 7th ACM/IEEE-CS joint conference on
Digital libraries, page 297. ACM.
Niepert, M., Buckner, C., and Allen, C. (2008). Answer
set programming on expert feedback to populate and
extend dynamic ontologies. In Proceedings of 21st
FLAIRS.
Noy, N. and McGuinness, D. (2001). Ontology develop-
ment 101: A guide to creating your first ontology.
Porzel, R. and Malaka, R. (2005). A task-based framework
for ontology learning, population and evaluation. In
Buitelaar, P., Cimiano, P., and Magnini, B., editors,
Ontology Learning from Text: Methods, Evaluation
and Applications. IOS Press, Amsterdam.
Resnik, P. (1999). Semantic similarity in a taxonomy:
An information-based measure and its application to
problems of ambiguity in natural language. Journal
of artificial intelligence research, 11(4):95–130.
Shannon, C. E. (1949). A mathematical theory of communi-
cation. University of Illinois Press, Urbana, Illinois.
Smith, B. (2003). Ontology. In Floridi, L., editor, Blackwell Guide to the Philosophy of Computing and Information, pages 155–166. Blackwell, Oxford.
Smyth, P. and Goodman, R. (1992). An information theo-
retic approach to rule induction from databases. IEEE
Transactions on Knowledge and Data Engineering,
4(4):301–316.
Staab, S., Gómez-Pérez, A., Daelemans, W., Reinberger,
M.-L., Guarino, N., and Noy, N. F. (2004). Why eval-
uate ontology technologies? because it works! IEEE
Intelligent Systems, 19(4):74–81.
Supekar, K. (2004). A peer-review approach for ontology
evaluation. In 8th Int. Protege Conf, pages 77–79.
Citeseer.
Velardi, P., Navigli, R., Cucchiarelli, A., and Neri, F.
(2005). Evaluation of OntoLearn, a methodology for
automatic learning of domain ontologies. In Buitelaar,
P., Cimiano, P., and Magnini, B., editors, Ontology
Learning from Text: Methods, Evaluation and Appli-
cations. IOS Press, Amsterdam.