MINING NON-TAXONOMIC CONCEPT PAIRS

FROM UNSTRUCTURED TEXT

A Concept Correlation Search Framework

Mei Kuan Wong, Syed Sibte Raza Abidi

Faculty of Computer Science, Dalhousie University, 6050 University Avenue, Halifax, NS, Canada

Ian D. Jonsen

Department of Biology, Dalhousie University, 1459 Oxford Street, Halifax, NS, Canada

Keywords: Correlation Search, Non-taxonomic Relations, Association Rule Mining, Lift Interestingness Measure.

Abstract: Ontology consists of concepts, taxonomic relations and non-taxonomic relations. The majority of the

ontology learning tools focus on discovering concepts and taxonomic relations. Very little effort has been

put on discovering non-taxonomic relations. In this paper, we present a concept correlation search

framework to discover non-taxonomic concept pairs from unstructured text. Our framework features the (a)

extraction of correlated concepts beyond ordinary search window size of a single sentence; (b) use of lift as

interestingness measure for association rule mining; (c) harness of 2- itemsets association rules from n-

itemsets association rules where n>2; and (d) identification of non-taxonomic concept pairs based on

existing domain ontology. The proposed framework has been tested with the Fisheries Oceanography

journals, and the results demonstrate significant improvements over traditional association rule approach in

search of non-taxonomic concept pairs.

1 INTRODUCTION

Ontologies serve as semantic representations of

domain-specific knowledge, and are used for

knowledge sharing, interoperability and reuse

(Shamsfard & Barforoush 2003). An ontology

consists of a set of concepts or classes, C, which is

taxonomically related by the transitive, IS-A relation

H ∈ C × C and non-taxonomically related by named

object relation R ∗ ∈ C × C × String.

Developing ontologies—i.e. ontology

engineering—is a tedious process that demands a

sound understanding of the domain and the ability to

abstract and model the knowledge. In recent years,

ontology engineering has been pursued by ‘learning’

the ontology from domain-specific documents.

Ontology learning from text involves the application

of natural language processing, text analysis and

logical reasoning methods to capture knowledge—

i.e. domain concepts, relationships between

concepts, descriptions of concepts—from documents

to serve as the building blocks of an ontology. This

approach leads to a reduction in the time, effort and

manpower required in the ontology engineering

process.

Typically, the existing ontology learning tools

focus on discovering concepts and their taxonomic

relations from texts. However, the extraction of non-

taxonomic relations, which are an integral aspect of

an ontological description of a domain, is not well-

researched (Sánchez & Moreno 2008). An example

of a non-taxonomic relation is the relation cure

between the concept pairs of doctor and patient.

Current research on discovering non-taxonomic

relations is pursued based on (a) statistical approach

and (b) semantic analysis approach. Semantic

analysis approaches rely on lexico-syntactic patterns

to discover relations between a pair of co-occurring

concepts. Statistical approaches involve studying the

distributional properties of words in order to

determine the salient concepts and then use

correlation measures between concepts to establish

potential non-taxonomic relations between them.

Association rule mining is a popular statistical

707

Kuan Wong M., Sibte Raza Abidi S. and D. Jonsen I..

MINING NON-TAXONOMIC CONCEPT PAIRS FROM UNSTRUCTURED TEXT - A Concept Correlation Search Framework.

DOI: 10.5220/0003482707070716

In Proceedings of the 7th International Conference on Web Information Systems and Technologies (WTM-2011), pages 707-716

ISBN: 978-989-8425-51-5

 2011 SCITEPRESS (Science and Technology Publications, Lda.)

method to extract non-taxonomic relations, and is

used in ontology learning tools such as Text2Onto

(Cimiano et al. 2005) and OntoLearn (Velardi et al.

2005). These ontology learning tools use association

rule mining with traditional confidence measure to

extract non-taxonomic relations. However, there are

noted limitations about confidence measure is that it

(a) is sensitive to the frequency of the concepts in

the data set and may return pairs of concepts even if

there is no association between them, and (b) suffers

from rare itemset problem whereby even if an

association rule representing an important

relationship between concepts exists but since it is

rare it is pruned altogether (Sheikh et al. 2005).

In this paper, we pursue the extraction of concept

pairs, from unstructured text, that has a non-

taxonomic relation between them. We present a

concept correlation search framework that employs a

statistical approach that is an extension to the

traditional association rule mining approach used in

ontology learning tools for non-taxonomic relation

extraction. Our approach to search for correlated

concepts has three distinct elements: (i) we

investigate the use of the lift measure (Sheikh et al.

2005), as opposed to the traditional support and

confidence measures, to establish the interestingness

between correlated concepts. The key advantage of

our use of the lift measure is that it determines how

many times more often concept X and concept

Y occurs together than expected if they were

statistically independent. Lift does not suffer from

the rare item problem (Sheikh et al. 2005); (ii) when

searching for correlated concept pairs we look

beyond the traditional one-sentence window to

include multiple adjacent sentences. Our approach is

based on the observation that quite often scientific

authors discuss correlated concepts across multiple

sentences, therefore we search correlated concepts

across two adjoining sentences; (iii) we employ a

domain ontology, as background knowledge, to filter

out the correlated concepts that have a taxonomic

relationship between them. This leaves us with a set

of non-taxonomic concept pairs that serve as

candidates for non-taxonomic relations during

ontology learning. We apply our framework to

search for non-taxonomic concept pairs for the

domain of marine biology—we worked with 374

Fisheries Oceanography journal publications over a

period of 10 years (1999-2008). We extracted 130

concept pairs out of which 108 non-taxonomic

concept pairs were identified. The results were

validated by domain experts.

2 LITERATURE REVIEW-

RELATED WORK

Ontology learning involves Machine Learning (ML)

and advance Natural Language Processing (NLP)

technologies, starting from term extraction and

concept definition to more complex tasks such as

learning taxonomic and non-taxonomic relations. In

this section, we review the state-of-the-art in

ontology learning tools specific to non-taxonomic

relation extraction.

From a statistical perspective, the pioneer

research work in non-taxonomic relation extraction

was performed by Maedche & Staab (2000) using

association rule mining. Subsequently, ontology

learning tools such as Text2Onto (Cimiano et al.

2005) and OntoLearn (Velardi et al. 2005) also

approach the non-taxonomic relation extraction task

from the statistical point of view using association

rule mining with traditional confidence measure.

Hasti (Shamsfard & Barforoush 2004), another

ontology learning tool, extracts non-taxonomic

relations from the semantic analysis point of view.

Hasti combines logical, linguistic-based, template

driven and semantic analysis methods in their non-

taxonomic relation extraction. A hybrid of both

approaches is taken by RelExt (Schutz & Buitelaar

2005) in their non-taxonomic relation extraction

where relevant terms and verbs are extracted from a

given text collection. Then, a combination of both

linguistic and statistical processing is used to

compute relations between them. The problem with

these methods is that they are dependent on sentence

structure. Thus, the search window size for

correlated concepts is short and constrained to a

single sentence. Short search window size used often

proves to be deficient in discovering relations

(Chagnoux et al. 2008).

From the literature review, it is clear that

ontology learning, especially the extraction of non-

taxonomic relations from unstructured text is a

challenging, yet much pursued area. Our work is an

extension to the traditional association rule mining

used in some of the abovementioned tools. We

pursue to look beyond single-sentence window and

use lift as the interestingness measure to yield

interesting concept pairs that represent potential

non-taxonomic relations in ontology learning

context.

WEBIST 2011 - 7th International Conference on Web Information Systems and Technologies

708

3 OUR CONCEPT

CORRELATION SEARCH

FRAMEWORK

In order to extract non-taxonomic concept pairs from

unstructured text, we propose a concept correlation

search framework, which consists of four phases:

text preprocessing, concept extractor, correlated

concept search and concept pair classifier (see

Figure 1).

Figure 1: Functional design of our concept correlation

search framework.

Phase I begins with processing the collection of

text documents to extract the sentences within the

documents. Phase II extracts domain concepts from

the sentences. Phase III takes as input the sentences

(in the order they appear in the document) and the

extracted domain concepts to find correlated concept

pairs. In Phase IV we measure the relevancy of the

extracted correlated concept pairs to identify the

concept pairs that are relevant to the domain, and

then we use background knowledge (a domain

ontology) to identify the non-taxonomically related

concept pairs.

The distinct aspects of our approach are: (a) In

Phase III our correlated concept generator searches

for correlated concept pairs beyond the traditional

one-sentence window. This allows the potential

correlation of important concepts that are spread

across two adjoining sentences thus yielding a larger

set of correlated concept pairs; (b) In Phase III, we

apply the lift interestingness measure to association

rule mining to assess the degree to which the

concept pairs are of interest within our context; (c)

In Phase III, we make use of the association rules

with more than 2 itemsets whereby we derive

indirect 2-itemsets association rules. This is

important as most of the previous work tends to

ignore these rules while solving the non-taxonomic

relations extraction problem; (d) In Phase IV, we

engage domain experts to evaluate the relevancy of

the concept pairs extracted; and (e) In Phase IV, we

leverage a domain ontology to distinguish

taxonomic concept pairs from non-taxonomic

concept pairs.

In the next few sections, we explain the methods

developed for each processing phase.

3.1 Phase I: Text Preprocessing

This phase involves the processing of the

unstructured text document which is in the form of

Portable Document Format (PDF). The PDF files are

converted to text files using pdf2Text, an open-

source software that converts PDF documents into

text files. We also remove non-essential information

from the text files such as the headers (journal title,

author information, etc.) and footers

(acknowledgements, references, etc.). The processed

text files are then combined into a single text file.

The resulting output is then (a) used in Phase II for

concept extraction and (b) further processed using a

sentence splitter developed in Perl to split the text

into a list of sentences. A total of 74,280 sentences

were produced from a total of 374 Fisheries

Oceanography journals in PDF files.

3.2 Phase II: Concept Extractor

In this phase, our main objective is to identify key

domain concepts from the processed text file in

Phase I for the domain being investigated. We use

KEA (Keyphrase Extraction Algorithm) to extract

key phrases for the document. In KEA (Witten et al.

1999), the commonly used information retrieval

method, the tf-idf weight (term frequency-inverse

document frequency) is used to rank the key phrases.

As not all key phrases generated by KEA are domain

specific, we engaged domain experts in this phase to

manually evaluate the generated key phrases. The

key phrases produced by KEA were shown to the

domain experts to determine their relevancy to the

domain. The relevant key phrases are then used to

represent key domain concepts in the domain

ontology. A total of 102 domain concepts were

selected from the top 200 key phrases generated by

MINING NON-TAXONOMIC CONCEPT PAIRS FROM UNSTRUCTURED TEXT - A Concept Correlation Search

Framework

709

KEA. These concepts are then used as candidates in

Phase III in order to find correlations between them.

3.3 Phase III: Correlated Concept

In this phase, we pursue the search for concepts that

are deemed to be correlated. These correlated

concept pairs will further be candidates for non-

taxonomic relations. We have developed four tools

to search for correlations between concepts as

follows: (i) correlated concept generator; (ii)

association rule miner; (iii) indirect 2-itemsets

association rule detector; and (iv) association rule

filter (see Figure 1). We explain the tools developed

for each task in the next few sections.

3.3.1 Task I: Correlated Concept Generator

Based on the list of sentences generated in Phase I

and the extracted concepts generated in Phase II, the

first task in correlation search is to search for all

tightly correlated concepts. In our proposed

framework, we differentiate our correlation search

by extending the search window size to multiple

adjacent sentences (see Figure 2). By doing so,

firstly, we are able to generate more correlated

concepts and secondly, we are able to minimize the

number of missing concepts by combining all

adjacent stand alone concepts.

In order to locate correlated concepts in multiple

adjacent sentences, we devised a correlated concept

generator. First, we run through all the single

sentences to locate all salient concepts.

Subsequently, we perform three iterations over the

list of sentences to combine multiple adjacent

sentences as follows:

ITERATION 1: Combine two consecutive sentences

if each adjoining sentence consists of a single

concept;

ITERATION 2: Merge two consecutive sentences if

the first sentence consists of a single concept while

the adjoining sentence consists of more than one

concept. This is the look forward strategy.

ITERATION 3: Repeat ITERATION 2 but using the

look backward strategy. In look backward strategy,

we work from the end to the start of the text file.

An example of the execution of the correlated

concept generator is shown in Figure 2. We noted

that stand alone concepts such as a, b, c and d found

in one-sentence window are captured in our

extended window size approach. In addition, more

correlated concept pairs are generated in our

approach by combining the multiple adjacent

sentences.

Figure 2: An example of our correlated concept generator

execution.

We examine the distribution of domain concepts for

the Fisheries Oceanography journal within a single

sentence and multiple adjacent sentences in order to

determine the salient concepts (see Table 1). In quest

for correlated concept pairs, sentences with more

than one salient concept are desirable. By using

original approach of one-sentence window, we noted

that only 36.3% of the sentences consist of more

than one salient concept. This means that the search

base for correlated concept pairs is constrained to

one third of the whole document. In addition,

concepts that are found solely in sentences with one

salient concept will be lost in the process of

extracting the correlated concept pairs.

Table 1: Number of sentences with salient concepts versus

search window size.

Search Window

Size

Number of

sentences with

1 salient

concept (%)

Number of

sentences with >

1 salient concept

(%)

Single Sentence

(Original

Approach)

23,725 (63.7%) 13,540 (36.3%)

Multiple

Sentences(Our

Approach)

11,615 (39.6%) 17,713 (60.4%)

Interestingly, in our proposed approach, the

percentage of sentences with more than one salient

concept has increased drastically from 36.3% to

60.4% (see Table 1). This indicates that our

proposed method of finding correlated concepts

within multiple adjacent sentences is capable of

returning more correlated concept pairs. In addition,

WEBIST 2011 - 7th International Conference on Web Information Systems and Technologies

710

the potential for losing adjacent stand alone concepts

is reduced substantially.

3.3.2 Task II: Association Rule Miner

In the second task, we mine for correlated concept

pairs through the use of association rule. Association

rule mining is a data mining technique that identifies

data or text elements that co-occur frequently within

a dataset (Agrawal et al. 1993). An association rule

describes the association among items in which

when some items are purchased in a transaction,

others are purchased too. The problem in association

rule mining can be represented as follows:

A transaction T supports an itemset X if X is

contained in T. The support for an itemset X is

defined as the ratio of the number of

transactions that supports the itemset X to the

total number of transactions. If the support for

an itemset X satisfies the user specified

minimum support threshold, then X is called

frequent itemset.

In our context, items are concepts while

transactions are sentences. It can be represented as

⇒

Y, in which X is an antecedent and Y is a

consequent of this rule, and X and Y are two

itemsets. However, in our quest for correlated

concept pairs, we treat rule X

⇒

Y, to be equivalent

to rule Y

⇒

Association rule mining typically results in large

amounts of redundant rules. Due to the large amount

of redundant rules; various measures have been

developed to help in evaluating the interestingness

of the association rules. Some of the existing

ontology learning tools such as Text2Onto (Cimiano

et al. 2005) and OntoLearn (Velardi et al. 2005) use

traditional confidence measure in extracting non-

taxonomic relations. The confidence of a rule X

⇒

is defined as the ratio of the support for the itemsets

∪

Y to the support for the itemset X. If itemset Z

= X

∪

Y is a frequent itemset and the confidence of

⇒

Y is no less than the user-specified minimum

confidence, then the rule X

⇒

Y, is an association

rule. As mentioned by (Sheikh et al. 2005), support-

confidence framework suffers from rare itemset

problem. Yet, rare itemset in an association rule may

represent an important relationship exist between

concepts. It is therefore important, from an ontology

learning standpoint, to recognize all these rare

itemsets.

In our proposed framework, we therefore use lift

as the interestingness measure for association rules.

Lift allows to measure how many times more often

X and Y occurs together than expected if they were

statistically independent. Lift does not suffer from

the rare itemset problem (Sheikh et al. 2005). The

lift measure is defined over [0,∞] and can be

interpreted as follows:

In our search for correlated concept pairs, lift

values greater than 1 is desirable. Typically, the

higher the lift value, the more likely that occurrence

of X and Y together, is not just random occurrence,

but, because of some relationships occur between

them.

In our experiment, we use Weka, an open source

data mining software to perform the association rule

mining (Hall et al. 2009). Association rules

generated in this task are further categorized into 2

groups based on the number of itemsets present in

each association rule: (a) direct 2-itemsets

association rules and (b) n-itemsets association rules

where n > 2 (see Table 2). The group of n-itemsets

association rules are then used as candidates in Task

III to further derive more indirect 2-itemsets

association rules.

Table 2: Number of association rules generated by Weka

using support-lift framework.

Description Single

sentence

Multiple

sentences

Total number of association

rules

60 156

Number of direct 2-itemsets

association rules

50 105

Number of n-itemsets

association rules (n >2)

10 51

3.3.3 Task III: Indirect 2-itemsets

Association Rule Detector (I2ARD)

The aim of this phase is to further extract indirect 2-

itemsets association rules from n-itemsets

association rules where n >2. In order to achieve this

objective, we have developed an indirect 2-itemsets

Association Rule Detector (I2ARD) using Perl. The

main idea behind this detector is to employ the anti-

monotone constraint, which means that, if an itemset

I satisfies the constraint, so does any of its subset

(Sheikh et al. 2005). In our I2ARD, we use the anti-

monotone constraint to generate indirect 2-itemsets

sub association rules from association rules with

more than 2-itemsets. An example of our I2ARD is

shown in Figure 3.

MINING NON-TAXONOMIC CONCEPT PAIRS FROM UNSTRUCTURED TEXT - A Concept Correlation Search

Framework

711

Figure 3: Examples of our Indirect 2-itemsets Association

Rule Detector (I2ARD) when n =3 and n =4.

In Table 3, we can see that our I2ARD is capable

of producing a substantial number of indirect 2-

itemsets association rules from n-itemsets

association rules where n > 2 for both original

approach and our approach. We noted that the

number of indirect 2-itemsets association rules

generated for both approaches doubled the original

number of n-itemsets association rules (n>2). This

can be attributed to the nature of the text collection,

in which the association rule miner returned

association rules with maximum of 3-itemsets. For

each 3-itemsets association rules, our I2ARD

produced two indirect 2-itemsets association rules.

Some of these indirect 2-itemsets association rules

may have existed as direct 2-itemsets association

rules discovered earlier by the association rule

miner. These redundant rules indicate that the rules

produced using our I2ARD are interesting and can

be considered for potential concept pairs.

Table 3: Number of indirect 2-itemsets association rules

generated by our I2ARD.

Description Single

sentence

Multiple

sentences

Number of n-itemsets

association rules (n>2)

10 51

Number of indirect 2-

itemsets association rules

20 102

3.3.4 Task IV: Association Rule Filter

In this task, our objective is to aggregate all direct 2-

itemsets association rules discovered in Task II with

all indirect 2-itemsets association rules discovered in

Task III. In the process of aggregation, we eliminate

all redundant and symmetric rules. The rationale of

having all symmetric rules removed is that, we treat

rule X=> Y and rule Y=> X the same in our concept

pair extraction. The resulting output is a list of

unique concept pairs that are not redundant. We

applied the association rule filter on both the original

approach of one-sentence window as well as our

approach of extending the search window size to

multiple adjacent sentences. Table 4 shows the

number of unique concept pairs produced in Phase

III. We noted that our approach of combining

multiple sentences is capable of generating more

than double the number of unique concept pairs

generated by the original approach of one-sentence

window.

Table 4: Number of unique concept pairs.

Description Single

sentence

Multiple

sentences

Number of direct 2-itemsets

association rules

50 105

Number of indirect 2-itemsets

association rules

20 102

Number of unique concept

pairs

57 130

3.4 Phase IV: Concept Pair Classifier

The final phase of our concept correlation search

framework is to make an assessment on the

correlated concept pairs found in Phase III. Our two

stages classifying strategy is to (a) classify the

correlated concept pairs to highly related, related

and not related; and (b) further classify the highly

related and related concept pairs to taxonomic

concept pairs and non-taxonomic concept pairs.

3.4.1 Stage I: Result Evaluator

Evaluation of ontology relationships learning

systems against any gold standard is notoriously

difficult as there are not many gold standards that

are available for evaluation (Gulla et al. 2009).

Nevertheless, there is always another option in

which domain experts are engaged in performing the

manual evaluation. In this stage, we have engaged 2

domain experts to rate the suggested relationships

independently. We presented all unique concept

pairs found in Phase III to the experts for their

I2ARD

W Æ X, Y, Z

W, X, Y Æ Z

I2ARD

When n = 4,

W, X Æ Y, Z

I2ARD

X Æ Y, Z X ÆY, XÆ Z

X, Y Æ Z XÆZ, YÆ Z

I2ARD

When n = 3,

WÆ Y, W ÆZ,

XÆ Y, XÆ Z

WÆ X, W ÆY,

WÆ Z

WÆ Z, X ÆZ,

YÆ Z

WEBIST 2011 - 7th International Conference on Web Information Systems and Technologies

712

review. Each expert was asked to rank the concept

pairs as highly related (there is definitely a

relationship between the two concepts with a

numerical score of 1.0), related (there is probably a

relationship between the two concepts, score of 0.5)

or not related (these two concepts are not related,

score of 0).

Based on domain experts’ feedback for each concept

pairs, we computed the average score for each

concept pair and determined its relevance as shown

in Table 5.

Table 5: Score range matrix.

Ranking Score Range

Highly Related > 0.5

Related 0.5

Not Related < 0.5

For our purpose of non-taxonomic concept pair

extraction, we are only interested in highly related

and related concept pairs.

3.4.2 Stage II: Non-taxonomic Concept

Pair Finder

We further classify the highly related and related

concept pairs identified in Stage I into taxonomic

concept pairs and non-taxonomic concept pairs. Our

approach is based on the adoption of domain

ontology and exploiting the knowledge in the

taxonomy to isolate all taxonomic relations. If any of

the highly related and related concept pairs are

found in the domain ontology having super-class and

sub-class relations, these concept pairs are then

classified as taxonomic concept pairs. The remaining

concept pairs would then be classified as non-

taxonomic concept pairs.

4 EVALUATION AND

DISCUSSION

In this section, we present the evaluation results for

various methods developed for Phase III and Phase

IV of our concept correlation search framework.

4.1 Evaluating Correlated Concept

Generator

Table 6 presents the number of highly related and

related concept pairs for different search window

size used in the correlated concept generator as

discussed in Section 3.3.1. The ranking of the

concept pairs found is determined based on

evaluation by domain experts. It is interesting to

note that our approach of using multiple adjacent

sentences as search window size offered an addition

of 81.25% of highly related concept pairs and an

addition of 153.85% of related concept pairs as

compared to the traditional approach of using a

single sentence as search window size. This

vindicates our proposed approach of extending the

search window size, and also confirms the

assumption that short search window size of a single

sentence is deficient in extracting non-taxonomic

relations.

Table 6: Number of concept pairs versus different

approaches.

Approach

Number of

Highly Related

Concept Pairs

Number of

Related Concept

Pairs

Single Sentence

(Original Approach)

32 13

Multiple Sentences

(Our Approach)

58 33

% Increase 81.25% 153.85%

4.2 Examining Support-lift Framework

To demonstrate the effectiveness of our choice of

the interestingness measure for association rules, we

compared the association rules generated by two

different interestingness measures—i.e. (a) the

support-confidence and (b) the support-lift. In the

experiment, we use minimum support value of 0.01,

minimum confidence value of 0.1 and minimum lift

value of 1.01. Table 7 displays that our proposed

approach of using support-lift framework produces

higher percentage of both highly related and related

concept pairs in comparison to the original

approach. Thus, this technique is proven to be useful

for relation extraction in which a significant lift

value can be more important than a high confidence

rule (Alvarez 2003).

Table 7: Support-confidence measure versus support-lift

measure.

Interestingness Measure

% of Highly

Related Concept

Pairs

% of Related

Concept

Pairs

Support-Confidence

(Original Approach)

44.85% 36.76%

Support-Lift

(Our Approach)

49.23% 36.92%

% Change 4.38% 0.16%

We went on further to investigate the intersection of

association rules produced by different frameworks.

The rationale behind this investigation is to assess

MINING NON-TAXONOMIC CONCEPT PAIRS FROM UNSTRUCTURED TEXT - A Concept Correlation Search

Framework

713

the significance of rules produce by each

framework. If there is a substantial number of

overlapping rules found, this indicates that each

framework is capable of generating significance

rules. Figure 4 shows the intersection of association

rules produced by our support-lift framework against

the popular support-confidence framework.

Figure 4: Rules produce by various frameworks.

Interestingly, we noted that 97 concepts pairs are

found in both frameworks and 88.7% of them are

ranked as either highly related or related (see Table

8). Besides, we also studied rules generated solely

by each framework. The rules found in support-lift

framework alone seem to be of higher relevance in

comparison to rules found in support-confidence

framework alone. This implies that correlations

generated by the support-lift approach are of more

relevance to the domain experts as compared to the

correlations generated by the support-confidence

approach.

Table 8: Relevancy of association rules produce by

various frameworks.

Framework

% of Highly

Concept Pairs

% of Highly

Related Concept

Pairs

Overlap Rules 51.55% 37.11%

Rules from Support-

Lift Framework Only

42.42% 36.36%

Rules from Support-

Confidence

Framework Only

30.77% 33.33%

4.3 Evaluating Indirect 2-itemsets

Association Rule Detector (I2ARD)

Evaluation of I2ARD involved examination of

ranking on the indirect 2-itemsets association rules.

Out of the 25 indirect 2-itemsets association rules

detected, 21 of these rules are ranked as highly

related or related. By employing our I2ARD, the

number of highly related concept pairs increase by

10.34% whereas the number of related concept pairs

increase by 45.45% (see Table 9). This signifies the

importance of mining indirect 2-itemsets association

rules from n-itemsets association rules where n>2.

Table 9: Association rule miner versus association rule

miner + I2ARD.

Approach

Number of Highly

Related Concept

Pairs

Number of

Concept Pairs

Association Rule

Miner only

58 33

Association Rule

Miner + I2ARD

64 48

% Increase 10.34% 45.45%

4.4 Evaluating the Concept Correlation

Search Framework

In order to evaluate our proposed concept correlation

search framework, we compare the outcome of our

framework with the outcome of a baseline approach.

The main difference between the baseline approach

and our concept correlation search framework is in

the components in Phase III of our framework (see

Table 10).

Table 10: Baseline approach versus our approach.

Tasks

Baseline

Approach

Our Approach

Search window size

Single

sentence

Multiple

sentence

Association Rule Miner Yes Yes

Indirect 2-itemsets

Association Rule Detector

(I2ARD)

No Yes

Table 11 exhibits the number of highly related

concept pairs and related concept pairs discovered in

both approaches. The results indicate that our

approach is capable of extracting a significantly

larger number of highly related concept pairs and

more than double the number of relevant concept

pairs in comparison to the baseline approach.

Table 11: Comparing our concept correlation search

framework against baseline.

Approach

Number of Highly

Related Concept

Pairs

Number of Related

Concept Pairs

Baseline

Approach

32 13

Our Approach 64 48

% Increase 100.00% 269.23%

WEBIST 2011 - 7th International Conference on Web Information Systems and Technologies

714

4.5 Identifying the Non-taxonomic

Relations

We use the taxonomy of existing domain ontology

to distinguish taxonomic concept pairs from non-

taxonomic concept pairs for all the highly related

and related concept pairs found using our proposed

framework. With the knowledge that taxonomic

relation exists between a super class and a subclass

represented in the domain ontology, if a concept pair

is not found to be having an ancestral relation with

each other, it can be regarded as non-taxonomic

concept pair (given the knowledge captured within

the domain ontology). From our experiment, we

have identified 4 taxonomic concept pairs and 108

non-taxonomic concept pairs out of a total of 130

concept pairs extracted (see Table 12).

Table 12: Number of concept pairs versus types.

Type

Number of Highly Related

Concept Pairs

Number of

Related Concept

Pairs

Non-

taxonomic

concept

pairs

60 48

Taxonomic

concept

pairs

4 0

Table 13 shows some examples of the taxonomic

concept pairs and non-taxonomic concept pairs for

the domain being studied. Since the fisheries

oceanography domain is a marriage between the

fisheries domain and the oceanography domain, it is

interesting to note that our proposed approach is

capable of finding correlated concept pairs within

each domain and across both domains. Concept pairs

across both domains are highlighted in bold (see

Table 13). These concept pairs are of special interest

to the domain experts as they provide clues to the

domain experts on the potential interactions between

both domains.

Table 13: Example of concept pairs generated from the

Fisheries Oceanography journals.

Taxonomic Concept

Pairs

Non-taxonomic Concept

Pairs

fish, capelin fish, length

fish, salmon

fish, temperature

salmon, pacific salmon

fish, depth

summer, seasons

summer, migration

summer, production

salinity, depth

temperature, depth

5 CONCLUDING REMARKS

In this paper, we presented a concept correlation

search framework to extract non-taxonomic concept

pairs from unstructured text, and applied it to the

marine biology domain. The novel features of our

framework are that: (a) we search for correlated

concept pairs within multiple adjacent sentences.

This is an extension to the traditional approach of

using a single sentence in search of correlated

concept pairs; (b) we apply the lift interestingness

measure to association rule mining to assess the

degree to which the concept pairs are of interest

within our context; and (c) we derive new

correlations between pairs of concepts from n-

itemsets association rules where n>2. Our results

show that these features generate more and better

concept pairs in comparison to existing ontology

learning tools that use traditional association rule

mining to mine non-taxonomic relations. Our

framework also distinguishes non-taxonomic

concept pairs from taxonomic concept pairs using

background knowledge existing in domain ontology.

These non-taxonomic concept pairs will further be

candidates for non-taxonomic relations extraction in

ontology learning. Our framework is domain-

independent and will bring us a step closer towards

the semi-automation of non-taxonomic relation

extraction in support of ontology learning.

As future line of research, we intend to work on

another least tackled problem in ontology learning,

which is the labelling of the non-taxonomic concept

pairs, i.e. to employ linguistic structure approach to

determine the most appropriate verb that connect the

correlated concept pairs.

ACKNOWLEDGEMENTS

This research is supported by a R&D grant from

CANARIE, Canada through the Network Enabled

Platform program. We would also like to extend our

gratitude to Dr. Isidora Katara for her valuable help

in the evaluation of the proposed framework.

REFERENCES

Agrawal, R., Imieliński, T. & Swami, A. 1993, "Mining

association rules between sets of items in large

databases", ACM SIGMOD Record, vol. 22, no. 2, pp.

207-216.

Alvarez, S. A. 2003, "Chi-squared computation for

association rules: Preliminary results",

MINING NON-TAXONOMIC CONCEPT PAIRS FROM UNSTRUCTURED TEXT - A Concept Correlation Search

Framework

715

Comput.Sci.Dept., Boston College, Chestnut Hill,

MA, Tech.Rep.BC-CS-2003-01.

Chagnoux, M., Hernandez, N. & Aussenac-Gilles, N.

2008, "An interactive pattern based approach for

extracting non-taxonomic relations from texts",

Workshop on Ontology Learning and Population

(associated to ECAI 2008)(OLP), University of Patras,

Patras, Greece, pp. 1–6.

Cimiano, P., Pivk, A., Schmidt-Thieme, L. & Staab, S.

2005, "Learning taxonomic relations from

heterogeneous sources of evidence", Ontology

Learning from Text: Methods, evaluation and

applications, pp. 59–73.

Gulla, J. A., Brasethvik, T. & Kvarv, G. S. 2009,

"Association Rules and Cosine Similarities in

Ontology Relationship Learning", Enterprise

Information Systems, pp. 201-212.

Hall, M., Frank, E., Holmes, G., Pfahringer, B.,

Reutemann, P. & Witten, I.H. 2009, "The WEKA data

mining software: An update", ACM SIGKDD

Explorations Newsletter, vol. 11, no. 1, pp. 10-18.

Maedche, A. & Staab, S. 2000, "Discovering conceptual

relations from text", ECAICiteseer, pp. 321.

Sánchez, D. & Moreno, A. 2008, "Learning non-

taxonomic relationships from web documents for

domain ontology construction", Data & Knowledge

Engineering, vol. 64, no. 3, pp. 600-623.

Schutz, A. & Buitelaar, P. 2005, RelExt: A Tool for

Relation Extraction from Text in Ontology Extension.

Shamsfard, M. & Barforoush, A. A. 2003, "The state of

the art in ontology learning: a framework for

comparison", The Knowledge Engineering Review,

vol. 18, no. 04, pp. 293-316.

Shamsfard, M. & Barforoush, A. A. 2004, "Learning

ontologies from natural language texts", International

Journal of Human-Computer Studies, vol. 60, no. 1,

pp. 17-63.

Sheikh, L., Tanveer, B. & Hamdani, M. 2005, "Interesting

measures for mining association rules", Multitopic

Conference, 2004. Proceedings of INMIC 2004. 8th

International IEEE, pp. 641.

Velardi, P., Navigli, R., Cucchiarelli, A., Neri, F.,

Buitelaar, P., Cimiano, P. & Magnini, B. 2005,

"Evaluation of OntoLearn, a methodology for

automatic learning of domain ontologies", Ontology

Learning from Text: Methods, evaluation and

applications, pp. 92–106.

Witten, I. H., Paynter, G. W., Frank, E., Gutwin, C. &

Nevill-Manning, C. G. 1999, "KEA: Practical

automatic keyphrase extraction", Proceedings of the

4th ACM conference on Digital libraries ACM, pp.

254.

WEBIST 2011 - 7th International Conference on Web Information Systems and Technologies

716