CONCEPTS EXTRACTION BASED ON HTML DOCUMENTS

STRUCTURE

Rim Zarrad

, Narjes Doggaz

and Ezzeddine Zagrouba

RIADI Laboratory, Team of research SIIVA, High Institute of Computer Science , University El Manar, Tunis, Tunisia

URPAH laboratory, Faculty of Sciences of Tunis, University of Tunis El Manar, Tunis, Tunisia

Keywords: Document Structure, Keyword Extraction, Title, Web.

Abstract: The traditional methods to acquire automatically the ontology concepts from a textual corpus often privilege

the analysis of the text itself, whether they are based on a statistical or linguistic approach. In this paper, we

extend these methods by considering the document structure which provides interesting information on the

significances contained in the texts. Our approach focuses on the structure of the HTML documents in order

to extract the most relevant concepts of a given field. The candidate terms are extracted and filtered by

analyzing their occurrences in the titles and in the links belonging to the documents and by considering the

used styles.

1 INTRODUCTION

Due to the rise of the Web and the need to have

structured knowledge, a big part of research

concentrates on the formalization of ontologies

(Berners-Lee et al., 2001). The methods dealing

with ontologies building are mainly based on the

techniques of natural language processing. These

methods can be clustered in three groups according

to the chosen technique: the methods mainly based

on the distribution of the linguistic units (Bourrigault

et al., 2005), the ones based on the statistical

approaches (Velardi, 2002) and those based on the

semantic and/or the linguistic approaches (Séguéla,

1999, Morin, 1999). The methods which are based

on the distributional processing distinguish the

concepts of the ontology and organize them in

structured systems reflecting a conceptual hierarchy.

The statistical methods tend to extract conceptual

links which are difficult to dissociate without

recourse to an expert of the field. The linguistic

approach does not consider the corpus in a

comprehensive manner but locally. It is based on the

properties of the language which define those of the

objects of the field.

In this paper, we propose an approach which

differs from the traditional approaches since it uses

information on the document structure to extract

relevant information. Our approach studies each

material form in the text (title, style, and hyperlink)

in order to extract the concepts of the ontology. This

paper is organized as follows. In section 2, we

present the related works which study the document

structure in order to build ontologies. In section 3,

we present our approach by giving the different

steps to extract the concepts of the ontology. The

concept refiner step is described in details in section

4. We give in section 5 some experimental results to

evaluate the proposed approach. Finally, we

conclude our paper by a summary and some

perspectives.

2 RELATED WORKS

The traditional methods of building ontology and

concept extraction from texts often privilege the

analysis of the text itself, whether they are based on

a statistical or linguistic approach. New approaches

appeared in the last decade, they exploit the material

formatting of the texts in order to build conceptual

models or ontologies. In fact, the visual properties of

texts are not just an ornamentation of the text but

constitute an important component implied in the

significance. The approach described in (Karoui et

al., 2004) studies the structure of Web pages to build

a database table. It uses a clustering procedure to

extract the concepts of the ontology. It is based on

the TF*IDF measure (Joachims, 1997) which is a

weighting method often used in the field of

503

Zarrad R., Doggaz N. and Zagrouba E..

CONCEPTS EXTRACTION BASED ON HTML DOCUMENTS STRUCTURE.

DOI: 10.5220/0003748305030506

In Proceedings of the 4th International Conference on Agents and Artiﬁcial Intelligence (ICAART-2012), pages 503-506

ISBN: 978-989-8425-95-9

 2012 SCITEPRESS (Science and Technology Publications, Lda.)

knowledge extraction.

TFi = ni / Σj n j (1)

where n

is the number of occurrences of the i

candidate term (CT) in the document.

IDF = log ( |D| / |{dj / ti є dj }| ) (2)

Where d

is the j

document of the corpus and |D| is

the total number of documents of the corpus.

Among the methods based on the document

structure, several ones focuse on the titles belonging

to the document. The method presented by (Hazman

et al., 2009) uses the hierarchical structure of the

HTML headings for identifying new concepts and

their taxonomical relationship between seed

concepts and between each other. An analysis

(Lopez et al., 2010) was made on a corpus of web

documents. The results are interesting when

analyzing terms which appear in the titles. The titles

play an important role in the level of the material

organization of the text: they segment, make

taxonomies, and provide a denomination for the

obtained segments. Moreover, it is obvious that the

text snippets which are strongly marked from a

typographical point of view are more important than

those which have not such marking. The cognitive

psychologists proved by experiments that the styles

used in the text have a big impact on its

comprehension.

3 PROPOSED APPROACH

Our objective is to extend the traditional approaches

which aim to extract the concepts and the links

between them by considering the text structure in the

corpus. The proposed approach described below is

based on the study of several web sites in order to

find the concepts of the field.

3.1 The Corpus Constitution and

Pre-processor

The corpus must be representative of the field for

which we try to build ontology. Our approach is

based on a corpus of HTML documents collected

using Google Web Search API. It is a library offered

by the Google programmers in order to extract the

results of a request. The corpus pre-processor is

performed in three steps:

Tagger. Before being analyzed, the corpus is treated

by the Tree tagger (Schmid, 1994). This tagger

associates each instance of a word with its

grammatical category and with its canonical form.

Parser. We use HtmlParser 1.6 which is a free

library for the extraction and the text processing of

the corpus generated from the web. It extracts data

from the tags of the HTML documents.

Stop-Lists. A general stop-list is used to locate the

blank words which occur in the corpus (articles,

prepositions…). Another stop-list is created in order

to check if a title has an empty informational content

(introduction, conclusion…).

3.2 Extraction of Candidate Terms

Our objective is to detect within the corpus the

interesting terms which are candidates to represent

the concepts of ontology. They are linguistic units

which qualify an object of the real-world. We extract

two types of syntagms according to their structures.

Firstly, we extract syntagms composed by only one

word, they can be either a “Name” or a “Named

Entity”. Secondly, we consider syntagms containing

two words: they have as syntactic structure the

sequence “Adjective Name”. The CT which are

extracted will be then filtered according to their

appearance in the titles, styles and hyperlinks used in

the corpus.

4 THE CONCEPTS REFINER

To consider the relevance of a syntagms S in the

corpus, we was inspired by the measure TF*IDF

(Joachims, 1997). Indeed, we propose a new

measure CR*IDF (CR_IDF) where CR, the Corpus

Relevance of a candidate term S, corresponds to its

relevance in all the documents of the corpus. The

candidate terms will be then filtered chosing those

having a value CR_IDF higher than a given

threshold. The selected candidate terms represent the

relevant terms or concepts of the field.

Definition 1: Given a corpus C = {d

, d

,…, d

} of

m HTML documents (d

), we define the Corpus

Relevance CR of each syntagm S as the sum of its

Normalized Relevance NR in all the documents of

the corpus.

()

(, )

() (, )

(,)

RSd

CR S NR S d

RS d

⎛⎞

⎜⎟

⎝⎠

∑∑

∑

(3)

where n

(j) is the number of syntagms in the

document d

and R(S,dj)is the relevance of the

syntagm S in the document dj. The denominator is

the sum of the relevance of all the syntagms in this

document. The normalized relevance of a syntagm S

in the document d

is used to avoid the problems

ICAART 2012 - International Conference on Agents and Artificial Intelligence

504

related to the length of the document.

Definition 2: The relevance R of a syntagm S in a

document d

is computed by:

(, ) () ()

j Title Style

RSd R S R S=+

(4)

where R

Title

(S) and R

Style

(S) are the Title and Style

relevances of the syntagm S in the document.

In the rest of this section, we consider the

document d

of the corpus and we define the

relevance of each syntagm in the titles and the styles

of the document.

4.1 Title Relevance

The titles and the subtitles generally contain the

relevant terms of the field. It would be judicious,

then, to consider them for the selection of concepts.

Moreover, we note that the syntagm relevance in a

given title strongly depends on the length of this

title. Let title length be the number of syntagms

which it contains. Thus more the number of

syntagms in a title is low, more the syntagm

relevance in this title is high. To illustrate this idea,

let consider the two following titles "La

spectroscopie" (spectroscopy) and "Calendriers et

instruments de mesure du temps" (schedule and

tools of the time measurement) extracted from a

French astronomy Wikipedia document. In the first

title, the obtained syntagm (spectroscopy) is more

representative than the four syntagms composing the

second title which are less significant (schedule,

tool, time and measurement).

Definition 3: We note

()

,...,

Tt t=

the set of the

titles appearing in the level i of a HTML document

where n

is the number of titles of the level i. The

relevance r

of a syntagm S which appears in the

title

is computed as follows:

()

(5)

where

()

corresponds to the length of the title

(i.e. the number of syntagms in this title).

Applying this formula to the two titles given in

the example above, we have r

spectroscopy

= 1, r

schedule

0.25, r

tool

= 0.25, r

time

= 0.25, r

measurement

= 0.25. We can

note that these results reinforce our intuition.

Definition 4: The relevance R

(i) of a syntagm S in

all the titles appearing in the level i of a HTML

document is computed as follows:

() ( )

irt

∑

(6)

where n

is the number of titles of the level i.

Now, we can calculate the title relevance of a

syntagm in a document.

Definition 5: The Title relevance R

Title

(S)

of a

syntagm S corresponds to its relevance in all the

titles of the document. It is computed as follows:

111

() * () (*())

Title s k

iik

RS Ri ilt

−

===

∑∑∑

(7)

We consider that more the depth level of a title is

low, more the syntagm relevance in this title is high.

Indeed, we give the weight 1/i to each level i.

4.2 Style and Hyperlink Frequency

The words having a particular style (bold, italic) or

belonging to the hyperlinks of the documents are

easily detectable in the corpus by analyzing the tags

of HTML documents. For each CT extracted, we

define four coefficients: f

Bold

(S) which corresponds

to the Bold style frequency of a syntagm S in a

HTML document, f

Italic

(S) which consists in its

Italic style frequency, f

url

(S) which corresponds to

its hyperlink frequency and f

Rest

(S) corresponding to

the appearance frequency in the rest of the

document.

Definition 6: Let be Format = {Bold, Italic, URL,

Rest}. We define the style relevance R

Style

(S)

of a

syntagm S as follows:

() * ()

Style Style Style

Style Format

RS w fS

∈

∑

(8)

where w

Style

is a weight given to each style and it

was fixed experimentally.

5 EXPERIMENTATION

We have evaluated the contribution of our method

on a corpus of 23 documents of Wikipedia which are

related to the astronomy field. The total number of

candidate terms extracted from these documents is

2801. After the refinement step, only 134 concepts

(105 one-word concepts and 29 two-word concepts)

remained. The refinement of these CT is realized by

removing the syntagms belonging to a limited

number of documents and those having a CR*IDF

value less than a constant threshold. To validate the

extracted concepts, we verify their belonging to a

CONCEPTS EXTRACTION BASED ON HTML DOCUMENTS STRUCTURE

505

Table 1: Evaluation of the obtained concepts using CR, CR_IDF and TF_IDF mesures.

Measure

Extracted concepts

CR CR_IDF

TF_IDF

Type Number

Affirmed

Concepts

Precision

(%)

Affirmed

Concepts

Precision

(%)

Affirmed

Concepts

Precision

(%)

One-word

Concept

105 44 41.90% 71 67.62%

61 58.09%

Two-word

Concept

29 16 55.17% 18 62.07%

9 31.03%

Total 134 60 44.78% 89 66.42%

70 52.24 %

specialized dictionary in astronomy and space

(Cotardière and Penot, 1999). We have also fixed

Bold

=3, w

Italic

=2, w

url

=2 and w

rest

=1.

In table 1, we present the results obtained using

CR (Corpus Relevance) and CR_IDF measures. The

CR_IDF measure gives more interesting results than

CR ones. This can be explained by the fact that the

use of IDF factor removes CT which are the less

specific to the considered field. To evaluate our

approach, we compare our results (CR_IDF) with

the most used measure in the literature TF_IDF. We

note that our CR_IDF measure which considers the

documents structure gives better results than

TF_IDF. These results show the impact of

considering the documents structure in the extraction

of concepts by giving the most relevant ones.

6 CONCLUSIONS AND FUTURE

WORK

In this paper, we have defined a strategy which

extracts automatically the concepts of the ontology

from a corpus of Web documents. This strategy is

based on the study of the document structure by

extracting the typographical titles, links and

markings. Indeed, the structure of the documents

provides interesting information on the significance

contained in the texts. We can extend the work

performed by analyzing the hierarchy of the titles in

each document in order to extract the hierarchical

links to lead to ontology. The text can be then

apprehended not like a linear succession of blocks of

various natures, but like a structure of elements of

high level which include other elements.

REFERENCES

Berners-Lee, T. Hendler, J., Lassila, O, 2001. The

Semantic Web. Scientific American, pp. 28–37

Bourrigault, D., Fabre, C., Frérot, C., Jacques, M. P.,

Ozdowska, S., 2005. Syntex, analyseur syntaxique de

corpus. In: Actes des 12èmes journées sur le

Traitement Automatique des Langues Naturelles, pp.

17-20, Dourdan, France

Cotardière, P., Penot, J. P., 1999. Dictionnaire de

l'Astronomie et de l'Espace. eds. Larousse-Bordas

Hazman, M., El-Beltagy, S.R., Rafaa, A., 2009. Ontology

Learning from Domain Specific Web Documents.

International Journal of Metadata, Semantics and

Ontologies, Vol 4, Number 1-2, pp. 24-33

Joachims, T., 1997. A probabilistic analysis of the

Rocchio algorithm with TFIDF for text categorization.

In: Proc. 14th International Conference on Machine

Learning, pp. 143- 151, Morgan Kaufmann

Karoui, L., Aufaure, M. and Bennacer, N., 2004. Ontology

Discovery from Web Pages: Application to Tourism.

In: Knowledge Discovery and Ontologies Workshop at

ECML/PKDD

Lopez, C., Prince, V., Roche, M., 2010. Automatic Titling

of Electronic Documents with Noun Phrase

Extraction. In: Proceedings of Soft Computing and

Pattern Recognition, SOCPAR'10, pp. 168-171, Paris,

France

Morin, E., 1999. Using Lexico-Syntactic Patterns to

Extract Semantic Relations between terms from

Technical Corpus. In: Proceedings of the 5

International Congress on Terminology and

Knowledge Engineering (TKE'99), pp 268-278

Schmid, H., 1994. Probabilistic Part-of-Speech Tagging

Using Decision Trees. In: Proceedings of the

International Conference on New Methods in

Language Processing

Séguéla, P., 1999. Adaptation semi-automatique d'une

base de marqueurs de relations sémantiques sur des

corpus spécialisés. Revue Terminologies, Number 19,

pp. 52-61

Velardi, P., Fabriani, P. and Missikoff, M. , 2002. Using

text processing techniques to automatically enrich a

domain ontology, In Proceedings of the ACM

Conference on Formal Ontologies and Information

Systems, pp 270-284

ICAART 2012 - International Conference on Agents and Artificial Intelligence

506