A REFINING METHOD OF OBTAINED ATTRIBUTES

TO CHARACTERIZE UNDEFINED CONCEPTS

USING SEARCH ENGINE

Noriyuki Okumura

and Yuto Hatakoshi

Department of Electronics and Computer Science, Nagano National College of Technology, 381-8550 Nagano, Japan

Faculty of Culture and Information Science, Doshisha University, Kyo-tanabe, 610-0394 Kyoto, Japan

Keywords:

Concept-base, Machine learning, Search engine, Time series, Obtaining attributes, Undeﬁned concept.

Abstract:

In this paper, we propose a method to resolve problems of the attributes obtaining method using WWW

search engine characterizing undeﬁned concepts, which do not exist in Concept-base. Concept-base is an

key database constituting Word Association Mechanism to perform commonsense judgment. Concept-base

was constructed automatically by electronic dictionaries and newspapers. Therefore Concept-base has about

120,000 statically deﬁned concepts. Nevertheless, it has no effective learning system. This paper proposes

a method to make Concept-base to learn concepts dynamically using Auto Feedback system. In addition,

a removal method of noise at-tributes is also proposed. We present attributes reﬁnement method that paid

attention to changing with time of the Internet. Furthermore, we inspect the effectiveness by an evaluation

experiment.

1 INTRODUCTION

The objective of this research is to construct an auto-

matically learning method using WWW for existing

Concept-base(Okumura et al., 2007). Concept-base

is a large-scale Knowledge-base constructed by elec-

tronic dictionaries and newspapers. Nevertheless, it

has no effective learning system. Therefore, we need

a machine learning system for existing Concept-base.

Concept-base has a large number of concepts,

which have some Attribute-Weight pairs. It is difﬁ-

cult to deal with new concepts such as new words,

proper nouns, and etc. generated momently be-

cause Concept-base was constructed by static data.

We proposed Auto-Feedback method(Tsuzi et al.,

2004) using a search engine

to resolve the prob-

lems. However, this method obtained Attribute-

Weight pairs in retrieved point. As were shown in

earlier reports(Gulla et al., 2007; Gordon et al., 2010),

words gathering systems were proposed. These meth-

ods were not suitable for our Concept-base because

these systems did not work in the long period. The

method which paid attention to time series(Horiuchi

and Uchida, 2011), but this method did not reﬁne at-

http://www.google.co.jp/

tributes. Consequently, this method had the problem

that different results were obtained whenever we re-

trieved new concepts.

This paper proposes a method to resolve above-

mentioned problems. Proposed method statistically

reﬁnes Attribute-Weight pairs, which were obtained

for long periods. By evaluating experiments, we

showed that the proposed method was superior to the

method in the past from the standpoint of obtaining

Attribute-Weight pairs.

2 METHOD

In the following, we present a method to reﬁne

Attribute-Weight pairs obtained by search engine.

First, our Concept-base, Auto Feedback and Revision

of Morphological Analysis are brieﬂy depicted. Sec-

ond, our proposal method was described. Finally, we

presented the evaluation method.

2.1 Concept-base

Concept-base(Okumura et al., 2007) is a large-scale

Knowledge-base constructed by electronic dictionar-

493

Okumura N. and Hatakoshi Y..

A REFINING METHOD OF OBTAINED ATTRIBUTES TO CHARACTERIZE UNDEFINED CONCEPTS USING SEARCH ENGINE.

DOI: 10.5220/0003697804930497

In Proceedings of the International Conference on Knowledge Engineering and Ontology Development (SSW-2011), pages 493-497

ISBN: 978-989-8425-80-5

 2011 SCITEPRESS (Science and Technology Publications, Lda.)

Figure 1: Concept (automobile) is extended to second or-

der attribute. This ﬁgure shows association words for the

concept (Automobile).

ies and newspapers. Headwords in dictionaries were

assumed to be concepts and content words in explana-

tion sentences were assumed to be attributes for head-

words (concepts). A concept (A) consists of pairs of

attributes (a

) which characterizing the concept (A)

and weights (w

) which mean the importance of each

attributes (eif is a natural number for each concepts,

’znum’ is a number of attributes)(1).

A = (a

, w

)|0 < i < znum+ 1. (1)

Attributes for each concept were also deﬁned in

Concept-base as concepts. Therefore, one concept

was deﬁned as attributes chain model of n-th-order

dimension. In this paper, Concept-base has about

120,000 concepts, and each concept has 30 attributes

on average. Fig.1 shows the example of concept (au-

tomobile). eAutomobilef has attributes (engine, car,

tire, etc.). eEnginef, eCarf, and eTiref are also de-

ﬁned in Concept-base. Thus, eEnginef has attributes

(Combustion, Motor, etc.).

In this paper, we aim to construct an automati-

cally learning method for Concept-base using search

engine.

2.2 Auto Feedback

An undeﬁned concept in Concept-base was input, and

the documents that were de-scribed about the unde-

ﬁned concept, were obtained from the retrieval result

pages of search engine. The words included in the re-

trieval result pages were attributes of undeﬁned con-

cepts. The weight of each attribute was granted by

t f and id f. t f was the frequency that undeﬁned con-

cepts appear in the retrieval result pages. id f was cal-

culated from the number of the retrieval pages and the

number of all pages of search engine. Table.1 showed

examples of the obtained attributes of undeﬁned con-

cepts.

In this research, we obtained 100 candidate at-

tributes descending in weight order by Auto Feed-

back. The Auto Feedback got attributes at the point

in time when I retrieved undeﬁned words. Therefore

retrieval results were inﬂuenced by a temporary topic,

and it was considered that Auto Feedback was not

able to obtain attributes deﬁnitely.

Table 1: The attributes of undeﬁned concepts gHarrison

Fordh and gFinePixh.

Harrison Ford FinePix

attributes weights attributes weights

movie 225.16 digital 331.21

actor 120.77 camera 326.95

appearance 87.46 pixel 301.11

2.3 Revision of Morphological Analysis

This paper used MeCab(Kudo et al., 2004) as a Mor-

phological Analyzer. Japanese have no custom leav-

ing a space between words like English. A problem

to divide sentences needlessly too much happened

when we used a Morphological Analyzer. It unnec-

essarily divid-ed a sentence into words by the default

MeCabfs setting. It had an original revision rule for

this problem. However, we set a simple rule without

using its rule.

1. Connecting words and phrases in the parenthesis.

2. Connecting if nouns were next to each other.

For example, in the case of a sentence gJiEVJh,

uJiEVJiNAUSICAA/of Valley of the Windjvwas di-

vided withuv,uJv,uiEVJv before reviewing setting,

and the title of the movie is divided needlessly. We

united nouns to be adjacent by uv after the setting

changed, we can extract uJiEVJv(Table 2).

2.4 Proposal Method

Auto Feedback was a method to learn undeﬁned con-

cepts on the spot. Consequently, the method paid no

attention to changing with time of Internet. Proposal

method re-peats the Auto Feedback trial many times

and reﬁnes attributes and weights of undeﬁned con-

cepts statistically (Fig. 2).

2.5 Evaluation Method

In this section, we explain the evaluation method of

our work.

Evaluation Method. Three subjects evaluated

these acquired all attributes (about 20,000 words).

We adopted attributes which two or three subjects

answered suitable as correct words. In all of Auto

Feedback trials, we calculated precision (Eq.3), recall

(Eq.4), and F-measure (Eq.4).

About Recall, there may be correct attributes other

than acquired attributes using Auto Feedback. How-

ever, it was difﬁcult to collect attributes that human

beings thought suitable by a questionnaire. In this

research, we adopted correct attributes evaluated by

KEOD 2011 - International Conference on Knowledge Engineering and Ontology Development

494

Table 2: uJiEVJv for leaving a space between words.

Before revision After revision

Acquired attributes Evaluation Acquired attributes Evaluation

Right Right

Wrong JiEVJ Right

J Wrong - -

iEVJ Right - -

Figure 2: The ﬂowchart of proposal method.

three subjects to a numerator of Recall. Therefore

Recall takes 1.0 value when we output all acquired

attributes. We evaluate the method in the past by av-

eraging all trials.

precision =

∑

i=1

correct attributes

all obtained attributes

(2)

recall =

∑

i=1

correct attributesinsellectedattributes

correct attributes

(3)

F −measure =

∑

i=1

2× precision× recall

precision+recall

(4)

Evaluation Data. We use 49 undeﬁned concepts in

Concept-base as evaluation data.

3 RESULTS

We obtained attributes for 49 undeﬁned concepts by

Auto Feedback for one month. The number of ob-

tained attributes except the repetition was 302 on av-

erage. For each undeﬁned concepts, we sorted the at-

tributes for the threshold at the number of times which

attributes was obtained every ten times in the experi-

ment period. The horizontal axis of Fig. 3 shows the

number of times that was not obtained as attributes

in entire experiment period. About evaluation data,

Figure 3: Changing of F-measure in entire experiment pe-

riod.

Figure 4: Comparison proposal method with average of all

Auto Feedback trials.

we calculated F-measure with each threshold. Fig. 3

shows a change of the average.

Numerical value of the horizontal axis of Fig. 4

means a number appropriated to undeﬁned concepts

in Table 3. The horizontal axis of Fig. 4 is sorted in

order of the average of all Auto Feedback trials using

the method in the past (250 trials).

Table 4 shows samples of the retrieval result (un-

deﬁned concept: Google) of Auto Feedback once tri-

als and the result of proposed method that reﬁned

Attribute-Weight pairs based on the number of ap-

pearance.

4 DISCUSSION

When number of times that was not obtained as at-

tributes in Fig. 3 is smaller than 200, F-measure takes

the maximum. In other words, when number of times

that was attributes were obtained is greater than 50,

F-measure takes 0.42 of the maximum. We adopted

A REFINING METHOD OF OBTAINED ATTRIBUTES TO CHARACTERIZE UNDEFINED CONCEPTS USING

SEARCH ENGINE

495

Table 3: Evaluation data (written in English notation for explanation).

Undeﬁned concepts (Evaluation Data)

1 Ichiro 18 Sorting works 35 relation between Japan & China

2 Inﬂuenza 19 Kyoto Sanga FC 36 Hokkaido Nippon-Ham Fighters

3 Barack Obama 20 Keihan Electric Railway 37 confront Waseda and Keio

4 cartel 21 All Japan University Road relay 38 outﬂow of pictures

5 Google 22 the Northen Territories 39 Hideki Matsui

6 smartphone 23 Chiba Lotte Marines 40 Yokohama APEC

7 Sony Ericsson 24 Jun Natsukawa 41 Hiro Mizushima

8 Yu Darvish 25 astronaut 42 Tomomi Kasai

9 Saeko Darvish 26 Yuko Ogura 43 Kawori Manabe

10 mine of Chile 27 Senkaku 44 Yutaka Takenouchi

11 Domino’s Pizza 28 Senkaku’s video 45 Singing contest of red and white

12 Myanmar 29 Naoko Yamazaki 46 news of Marriage

13 Your Party 30 University of Yamanashi 47 Shin-ichi Hatori

14 rare earth element 31 Hisashi Iwakuma 48 self-defense

15 LAWSON 32 new year letter 49 group infection

16 Yonaguni Island 33 a great war of Warring period - -

17 World Volley 34 Yuki Saito - -

Table 4: Comparison of obtained Attribute-Weight pairs of

Proposed Method and Auto Feedback in the past.

Auto Feedback in the past Proposed Method

Attributes Weights Attributes Weights

Retrieval Result 907.821 USA 219.071

Many 664.259 Firewall 177.136

High 442.840 How to use 133.609

China Government 404.549 Company 131.618

New 309.988 Many 128.388

China 280.353 Site 127.669

Search Engine 258.543 Access Analyzing 119.792

Detailed 243.562 Search 104.181

USA 231.078 All over the world 97.671

Facebook 221.420 Google Map 90.177

this value (200) for the threshold and sorted the at-

tributes. As a comparison experiment, we calculated

precision, recall, and F-measure for all of Auto Feed-

back trials. Fig. 4 shows relations of F-measure in

all Auto Feedback trials and sorting by the proposal

method.

When number of trials that attributes were ob-

tained is greater than 50, F-measure takes maximum

value in Fig. 3. We guessed that the reﬁnement

method of the attributes works effectively. We calcu-

lated F-measure wheneverAuto Feedback was carried

out and found average about 250 trials.

5 CONCLUSIONS

In this research, we proposed reﬁning method of ob-

tained attributes for undeﬁned concepts using Auto

Feedback. Proposed method reﬁned Auto Feedback

attributes based on the number of appearance statisti-

cally.

In addition, we showed that higher F-measure

score was provided than Auto Feed-back in the past.

It is necessary to examine weighting method for

reﬁned attributes. We showed that it is effective

to limit weights using dispersion by a precedence

study(Hatakoshi, 2010). We compound with a prece-

dence study and proposal method to gain high perfor-

mance.

ACKNOWLEDGEMENTS

This work was supported by Grant-in-Aid for Young

Scientists (B) (No. 23720222)

REFERENCES

Gordon, J., Van Durme, B. D., and Schubert, L. K. (2010).

Learning from the web: Extracting general world

knowledge from noisy text. In Proceedings of the

AAAI 2010 Workshop on Collaboratively-built Knowl-

edge Sources and Artiﬁcial Intelligence.

Gulla, J. A., Borch, H. O., and Ingvaldsen, J. E. (2007).

Ontology learning for search applications. In Pro-

ceedings of the 2007 OTM Confederated international

conference on On the move to meaningful internet sys-

tems: CoopIS, DOA, ODBASE, GADA, and IS - Vol-

ume Part I, OTM’07, pages 1050–1062, Berlin, Hei-

delberg. Springer-Verlag.

Hatakoshi, Y. (2010). An acquisition method of attributes

for unknown words considering time factor. In The

73rd National Convention of Information Processing

Society of Japan.

Horiuchi, Y. and Uchida, O. (2011). Extraction of unsteadi-

ness of concept attributes using weblog articles. In

KEOD 2011 - International Conference on Knowledge Engineering and Ontology Development

496

Proceedings of the 10th WSEAS international confer-

ence on Applied computer and applied computational

science, ACACOS’11, pages 137–141, Stevens Point,

Wisconsin, USA. World Scientiﬁc and Engineering

Academy and Society (WSEAS).

Kudo, T., Yamamoto, K., and Matsumoto, Y. (2004). Ap-

plying conditional random ﬁelds to japanese morpho-

logical analysis. In In Proc. of EMNLP, pages 230–

237.

Okumura, N., Yoshimura, E., Watabe, H., and Kawaoka,

T. (2007). An association method using concept-

base. In Proceedings of the 11th international confer-

ence, KES 2007 and XVII Italian workshop on neu-

ral networks conference on Knowledge-based intel-

ligent information and engineering systems: Part I,

KES’07/WIRN’07, pages 604–611, Berlin, Heidel-

berg. Springer-Verlag.

Tsuzi, Y., Hirokazu, W., and Tsukasa, K. (2004). The

method of acquisition of the new concept and its at-

tribute using the world wide web. In The 18th Annual

Conference of the Japanese Society for Artiﬁcial In-

telligence.

A REFINING METHOD OF OBTAINED ATTRIBUTES TO CHARACTERIZE UNDEFINED CONCEPTS USING

SEARCH ENGINE

497