SEMI-SUPERVISED INFORMATION EXTRACTION FROM
VARIABLE-LENGTH WEB-PAGE LISTS
Daniel Nikovski, Alan Esenther
Mitsubishi Electric Research Laboratories, 201 Broadway, Cambridge, U.S.A.
Akihiro Baba
Mitsubishi Electric Corporation, 5-1-1 Ofuna, Kamakura, Kanagawa 247-8501, Japan
Keywords:
Service oriented architectures, System integration, Information extraction, Web mining.
Abstract:
We propose two methods for constructing automated programs for extraction of information from a class
of web pages that are very common and of high practical significance — variable-length lists of records with
identical structure. Whereas most existing methods would require multiple example instances of the target web
page in order to be able to construct extraction rules, our algorithms require only a single example instance.
The first method analyzes the document object model (DOM) tree of the web page to identify repeatable
structure that includes all of the specified data fields of interest. The second method provides an interactive
way of discovering the list node of the DOM tree by visualizing the correspondence between portions of XPath
expressions and visual elements in the web page. Both methods construct extraction rules in the form of XPath
expressions, facilitating ease of deployment and integration with other information systems.
1 INTRODUCTION
Modern information systems integrate, aggregate, and
manage information from a large number of sources,
such as databases, user input, sensors, remote web
services provided by other information systems, etc.
Successful integration of all these data sources de-
pends critically on the existence of data exchange
protocols and application programmable interfaces
(APIs) that have been agreed upon by both the
providers and users of such information.
Service-oriented architectures (SOA) hold
much promise for facilitating the building and
re-configuration of flexible information systems that
integrate many data sources. In order for a data
source to be available for integration in an SOA
application, it is typically converted to a web service:
a software system designed to support interoperable
machine-to-machine operation over a network.
Interoperability is achieved by means of adherence to
web service standards for describing and accessing
the functionality of the service (e.g., WSDL), and
data formats and protocols for transporting the
data between clients and servers (e.g., the SOAP
protocol).
However, most of the content of what is arguably
the largest existing source of information, the
World Wide Web (WWW), is not accessible by
means of web services. The reason for this is the
intended audience of information published on the
WWW: human readers, and not computers. The vast
majority of existing web pages are intended to be read
by people, and their encoding (HTML) is optimized
for efficient rendering into graphical form by web
browsers. As a result, neither the HTML encoding,
nor the final graphical rendering are easy to interpret
by machines.
There have been several principled attempts to
make such information accessible by machines. One
solution involves representing the information as an
XML file which is either communicated directly to a
remote machine using web-services protocols such as
SOAP, or transformed to an HTML file by means of
suitable presentation templates, such as XSLT files,
and further rendered into graphical form to be read by
humans. Although this is a convenient solution, a very
small percentage of existing web applications provide
their contents in both XML and HTML formats.
One major reason for this is technical: content
in XML format is usually provided by means of a dedicated
web service, which takes time, effort, and re-
sources to develop. Furthermore, apart from the cost
issue, this can be an unsurmountable obstacle in cases
when legacy web applications have to be integrated
into a novel system, and modifications to the exist-
ing application cannot be made. This is, in fact, a
very common situation in system integration projects,
when the developers of the original application have
often left the organization since the time the applica-
tion was deployed.
Consequently, for the purposes of system integra-
tion of legacy IT systems, it is very desirable to find
ways to extract information that is intended to be used
by humans and encoded in HTML format, and present
it to remote machines in a computer-interpretable for-
mat such as XML. This is a problem that has been
under investigation in the field of information extrac-
tion. We review the existing approaches and their lim-
itations in the next section, and propose two novel al-
gorithms for semi-supervised information extraction
from web pages with lists of variable length in Sec-
tion 3. In Section 4, we describe one possible im-
plementation of the proposed methods, and its role in
a system integration solution. Section 5 reviews related
work, and Section 6 concludes the paper and proposes
directions for extending these solutions to a wider class of web pages.
2 WEB INFORMATION
EXTRACTION
The field of information extraction is an area of in-
formation technology that is concerned with extract-
ing useful information from natural language text that
is intended to be read and interpreted by humans.
Such text can be produced either by other humans
(e.g., a classified ad), or generated by machines, pos-
sibly using the content of a structured database (e.g., a
product description page on an e-commerce web site)
(Laender et al., 2002). Although humans do not nec-
essarily make a significant distinction between these
two cases, as far as their ability to interpret the text
is concerned, the difference between them has enor-
mous importance as regards the success of interpre-
tation of such text by machines. Understanding free-
form natural language that has been generated by hu-
mans is a very complicated problem whose complete
solution is not in the foreseeable future. In contrast, if
text has been generated by a machine, using a boiler-
plate template for page layout and presentation (such
as an XSLT file), and a database for actual content,
the rate of success of automated information extrac-
tion methods can be very high. This type of text is
often called semi-structured data, and due to its high
practical significance, is the focus of this paper.
The usual method for extracting information from
web pages that are output by legacy applications is by
means of programs called wrappers (van den Heuvel
and Thiran, 2003). The simplest approach is to write
such wrappers manually, for example using a general-
purpose programming language such as Java or Perl.
Since this can be difficult, tedious, and error-prone,
various methods have been proposed for automating
the development of wrappers. Although most of these
methods focus on creating extraction rules that are ap-
plied to web pages by an extraction tool, they differ
significantly according to how they apply the induced
rules on web pages, and according to how they actu-
ally induce these rules.
Regarding the first difference, some methods ap-
ply the rules directly to the stream of tokens in the
web page. In such cases, the rules can be encoded as
regular expressions, context-free grammars, or using
more advanced specialized languages. One advantage
of these methods is that they can easily filter out irrel-
evant text, e.g. interstitial ads in web pages. However,
finding the rules that would extract all the needed in-
formation and only the needed information is not a
trivial problem.
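As an illustrative sketch (our own, not taken from any of the tools discussed here), a token-stream rule might be written as a regular expression over the raw HTML; the hypothetical page fragment and pattern below show how tightly such a rule is coupled to incidental formatting:

    import re

    # Hypothetical token-stream extraction rule: match product names and
    # prices directly in the raw HTML, ignoring its tree structure.
    raw = '<li class="item">Widget A <span class="price">$19.99</span></li>'

    rule = re.compile(
        r'class="item">\s*(?P<name>[^<]+?)\s*'
        r'<span class="price">\$(?P<price>[\d.]+)')

    for match in rule.finditer(raw):
        print(match.group("name"), match.group("price"))
    # Any change in attribute order, whitespace, or intervening markup
    # silently breaks the pattern, which is why finding rules that extract
    # all and only the needed information is non-trivial.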
Other methods use the fact that a web page en-
coded in HTML is not just a stream of tokens, but
has a tree-like structure. This structure is in fact rec-
ognized by web browsers when they transform the
HTML code into a Document Object Model (DOM)
tree prior to rendering it on screen. When a needed
data item can be found in one of the leaves of the
DOM tree, an extraction rule for its retrieval can be
encoded by means of a standard XPath expression
that specifies the path that has to be traversed in the
DOM tree to reach the respective leaf. Applying such
rules to new web pages in a deployed system is very
straightforward: an embedded web browser is used to
retrieve the web page and create its DOM tree, after
which the XPath expression is applied to retrieve the
data. Fully automated tools such as W4F (Sahuguet
and Azavant, 1999) and XWRAP (Liu et al., 2000)
operate on the DOM tree, although they use different
languages for representing the extraction rules.
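To make this concrete, the following sketch (ours; it substitutes the lxml library for the embedded browser mentioned above, and the page content is invented) applies an XPath extraction rule to a retrieved page:

    from lxml import html

    # Hypothetical page: a small table of train connections.
    page_source = """
    <html><body><table>
      <tr><td>ICE 571</td><td>08:15</td><td>Platform 4</td></tr>
      <tr><td>ICE 573</td><td>09:15</td><td>Platform 6</td></tr>
    </table></body></html>"""

    # Parsing produces the DOM tree; the extraction rule is a plain XPath
    # expression selecting the second cell of every row.
    tree = html.fromstring(page_source)
    departure_times = tree.xpath("/html/body/table/tr/td[2]/text()")
    print(departure_times)  # ['08:15', '09:15']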
Regarding the second difference among wrapper
construction tools, namely how extraction rules are induced,
there are several principal approaches. Supervised
methods require explicit instruction on where the data
fields are in a web page, in the form of examples. One
practical way for a human user to do this is to point
to the data items on a rendered web page, for example
by highlighting them by means of a computer mouse,
and after that the corresponding extraction rules are
automatically generated. As noted, when the extrac-
tion rules are encoded using XPath expressions, they
can be readily generated by means of the DOM tree
of the web page.
Supervised methods work very well when the web
page has a fixed structure, and only the needed data
items vary between different instances of this web
page. However, when the structure of the DOM tree
itself changes over time, a single example of where
the needed data items are is not sufficient. Exhaustive
enumeration of all possible cases is not possible,
either: apart from being very time-consuming, it
would result in multiple extraction rules, each with
limited validity.
In contrast, unsupervised methods do not require
explicit labeling of data items within example web
pages. Instead, they require only input of several in-
stances of a web page, after which the methods can
infer from the instances what part of the web page
is data, and what part is formatting and labeling text.
These methods usually compare multiple instances of
a web page, and treat the variable part as data fields
and the non-changing part as a formatting template.
One shortcoming of those methods is that they require
multiple example instances, and it is sometimes not
clear how many examples would be sufficient for cor-
rect rule generation.
3 INFORMATION EXTRACTION
FROM VARIABLE-LENGTH
LISTS
The practical case we are considering is the extrac-
tion of data fields from web pages that contain lists of
records formatted identically, but these lists can be of
variable length. One example of such web pages is
search results for products, articles, addresses, films,
etc. — practically all search engines return such lists.
Another example is tables that contain a variable number
of rows, such as train schedules, class rosters,
sports rankings, etc. In this case, the table is the list,
each row in the table is a record, and each cell in the
table is a data field. What is common between these
cases is that the structure of the web pages varies due
to the variable number of records, but does so in a
fairly regular manner: any new records have the same
structure as the previous ones.
Writing extraction rules for this case manually can
be challenging, because the structure of the DOM tree
of the web page is variable. A supervised method
would require labeling of all occurrences of data fields
in a possibly unlimited number of instances of web
pages that contain lists of progressively increasing
length. Unsupervised methods would not need label-
ing, but would still require a dataset of web page ex-
amples of unknown size (possibly very large), and are
also likely to treat as data fields that do change
across page instances but are not of interest.
To address these deficiencies, we propose two
methods for information extraction from such pages
that require only one instance of a data record to
be labeled explicitly on only one instance of a web
page. Both methods use XPath to encode the extrac-
tion rules, but neither of them requires manual coding
of these extraction rules. The first method is com-
putationally more difficult, but does not require any
familiarity with XPath, whereas the second method is
computationally simpler, but is suitable only for users who
are already familiar with XPath expressions.
In both cases, a human user uses a web browser
embedded in a wrapper construction application
(WCA) to navigate to a target web page that contains
a list of at least two records, and pinpoints several data
fields on the web page that belong to the same record.
Let the number of records on the web page be denoted
by m, such that m ≥ 2. If there are n such data fields,
we will denote them by F_j, j = 1, ..., n. We assume that
each data field F_j contains the value of an output variable
of interest V_j, such that the value of this variable is
v_{i_0, j} for some record i_0, that is, v_{i_0, j} = F_j, j = 1, ..., n. The
goal is to be able to extract all values v_{i, j}, i = 1, ..., m,
and j = 1, ..., n, for all web pages where m varies arbitrarily,
by using only the pinpointed locations of the data
field examples F_j, j = 1, ..., n. Two methods for the generation
of extraction rules that can accomplish this are
described below.
3.1 Automated Induction of Rules
The first method starts by determining the absolute
XPath expressions for all data fields. An absolute
XPath expression consists of a sequence of tree nodes
separated by slash signs, where nodes to the left of
a slash sign represent parent nodes, and nodes to
the right of a slash sign represent children nodes in
the DOM tree. For example, the second cell in the
first record of the first table of an HTML document
could be identified by the absolute XPath expres-
sion “/html[1]/body[1]/table[1]/tbody[1]/tr[1]/td[2]”
(see Fig. 1). Let the XPath expression of data field
example F_j be denoted by E_j.
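A minimal sketch of this first step, assuming the lxml library stands in for the embedded browser (the page content and variable names below are illustrative, not taken from the tool itself):

    from lxml import html

    # Hypothetical target page: a list of two identically structured records.
    page = html.fromstring(
        "<html><body><table><tbody>"
        "<tr><td>r1-f1</td><td>r1-f2</td><td>r1-f3</td></tr>"
        "<tr><td>r2-f1</td><td>r2-f2</td><td>r2-f3</td></tr>"
        "</tbody></table></body></html>")
    root_tree = page.getroottree()

    # Suppose the user pinpointed the second and third cells of the first row.
    designated_fields = page.xpath("//tr[1]/td[2] | //tr[1]/td[3]")

    # Step 1: the absolute, indexed XPath expression E_j of each field F_j.
    # (lxml omits "[1]" for tags that occur only once, unlike the examples
    # in the text, but the resulting expressions are equivalent.)
    expressions = [root_tree.getpath(el) for el in designated_fields]
    print(expressions)
    # ['/html/body/table/tbody/tr[1]/td[2]', '/html/body/table/tbody/tr[1]/td[3]']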
In the second step of the algorithm, the least
common ancestor (LCA) node of all data field
examples F_j, j = 1, ..., n, is found. This can be achieved
by traversing all path expressions E_j simultaneously
from left to right, and stopping when a mismatch
between the tags of any two paths is encountered.
Figure 1: A typical DOM tree of an HTML document that
contains a list (table) of variable number of records (rows)
of the same structure (here, three data fields per record).
Information to be extracted is contained in the leaves of the
tree. Each node in the tree, leaf or not, can be identified by
a unique XPath expression.
The longest portion of all paths E_j that matches is
the path to the LCA node of all data fields in the
DOM tree of the HTML document. (It may or may
not correspond to the entire record i_0 that contains all
example data fields.) For instance, if two cells in the
first row of a table were designated as examples, e.g.
“/html[1]/body[1]/table[1]/tbody[1]/tr[1]/td[2]” and
“/html[1]/body[1]/table[1]/tbody[1]/tr[1]/td[4]”,
the path to their LCA node will be
“/html[1]/body[1]/table[1]/tbody[1]/tr[1]”, that
is, the LCA node will be the one that corresponds
to the first row of the table. Let this LCA node be
denoted by L.
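This common-prefix computation can be sketched as follows (an illustrative helper of our own; the expressions are the ones from the example above):

    def lca_path(expressions):
        """Longest common prefix of absolute XPath expressions, compared
        step by step (tag plus index) rather than character by character."""
        step_lists = [e.strip("/").split("/") for e in expressions]
        common = []
        for steps in zip(*step_lists):
            if len(set(steps)) != 1:   # mismatch between any two paths: stop
                break
            common.append(steps[0])
        return "/" + "/".join(common)

    e1 = "/html[1]/body[1]/table[1]/tbody[1]/tr[1]/td[2]"
    e2 = "/html[1]/body[1]/table[1]/tbody[1]/tr[1]/td[4]"
    print(lca_path([e1, e2]))
    # -> /html[1]/body[1]/table[1]/tbody[1]/tr[1], the path to the LCA node L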
In the third step of the algorithm, the sub-tree
whose root is node L is identified and stored in vari-
able S. By definition, it does contain all nodes that
correspond to example data fields, but, as noted, it
does not necessarily correspond to an entire record.
In the fourth step of the algorithm, in order to find
the DOM-tree node that corresponds to the entire list
of records, all ancestors A of node L are traversed in
a bottom up manner (starting from L and moving up
to the root of the DOM tree), while at the same time
matching the tree stored in variable S to all subtrees
whose root nodes are siblings or cousins of L, that is,
they are at the same level as L, and have A as their an-
cestor. The lowest ancestor A where a match is iden-
tified is the node that corresponds to the entire list of
records.
In our example, if the path to L is
“/html[1]/body[1]/table[1]/tbody[1]/tr[1]”, then
all ancestor nodes to L will be examined, starting
with A=“/html[1]/body[1]/table[1]/tbody[1]”. In
this case, the tree S will have a root and N children
denoted by tags “td[1]”, “td[2]”, “td[3]”, ..., “td[N]”,
where N is the number of columns in the data table,
N ≥ n. Now, since there is at least one other record
(row) with the same structure (because of the requirement
that m ≥ 2), there will be a cousin of the node
L (“/html[1]/body[1]/table[1]/tbody[1]/tr[1]”) which
is the root of a tree with the same structure as S. In
this case, this will be another row in the table, for
example L_2, “/html[1]/body[1]/table[1]/tbody[1]/tr[2]”.
At this point, a match will be established, and
A=“/html[1]/body[1]/table[1]/tbody[1]” will be
declared the list node.
Note that this method of discovering the list node
is robust to multiple nesting of tags. In this case, there
exists a “tbody” tag between the “table” and “tr” tags
in a table, so the list node will be the one correspond-
ing to the “tbody” tag, and not the one corresponding
to the “table” node. In other words, the list node is
the deepest node in the DOM-tree (respectively, the
rightmost tag in XPath) whose sub-trees correspond
to individual records.
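Steps three and four can be sketched as follows, under the simplifying assumption (ours) that two subtrees "match" when they have identical tag structure; the helpers operate on lxml elements and all names are illustrative, not part of the tool:

    from lxml import html

    def shape(node):
        """Tag structure of the subtree rooted at node, ignoring text/attributes."""
        return (node.tag, tuple(shape(child) for child in node))

    def find_list_node(lca):
        """Walk upward from the LCA node L; the lowest ancestor A under which
        some other node at L's level has the same subtree shape as S (the
        subtree rooted at L) is declared the list node."""
        target_shape = shape(lca)              # step 3: the stored subtree S
        depth, ancestor = 1, lca.getparent()
        while ancestor is not None:            # step 4: bottom-up search
            cousins = ancestor.xpath("/".join(["*"] * depth))
            if any(c is not lca and shape(c) == target_shape for c in cousins):
                return ancestor
            depth, ancestor = depth + 1, ancestor.getparent()
        return None

    page = html.fromstring(
        "<html><body><table><tbody>"
        "<tr><td>a1</td><td>a2</td><td>a3</td></tr>"
        "<tr><td>b1</td><td>b2</td><td>b3</td></tr>"
        "</tbody></table></body></html>")
    L = page.xpath("//tr[1]")[0]
    A = find_list_node(L)
    print(page.getroottree().getpath(A))   # -> /html/body/table/tbody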
Once this node has been identified, an XPath for
an individual data field F_j can be constructed by simply
omitting the qualifying index in the tag that follows
A in the XPath expression E_j. For example, if
E_1 = “/html[1]/body[1]/table[1]/tbody[1]/tr[1]/td[2]”,
then a suitable extraction path for field F_1 would
be “/html[1]/body[1]/table[1]/tbody[1]/tr/td[2]”. Its
meaning is, as expected, to return the second cell in
any row of the first table of the document.
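A sketch of this last step (again with illustrative helper names of our own): the positional index is stripped only from the record step that immediately follows the list node A, and the resulting expression can then be evaluated with any XPath engine to collect the field from every record.

    import re

    def generalize(expression, list_node_path):
        """Remove the index from the single step following the list node,
        turning ".../tbody[1]/tr[1]/td[2]" into ".../tbody[1]/tr/td[2]"."""
        head = expression[:len(list_node_path)]
        tail = expression[len(list_node_path):]
        tail = re.sub(r"^/([^/\[]+)\[\d+\]", r"/\1", tail)  # drop "[i]" on the record tag
        return head + tail

    E1 = "/html[1]/body[1]/table[1]/tbody[1]/tr[1]/td[2]"
    A  = "/html[1]/body[1]/table[1]/tbody[1]"
    print(generalize(E1, A))
    # -> /html[1]/body[1]/table[1]/tbody[1]/tr/td[2]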
3.2 Interactive Construction of Rules
The main computational effort in the automated con-
struction algorithm described above is in finding the
list node A whose immediate children correspond to
the individual records in the list. For added accuracy
and ease of implementation, this step can be com-
pleted manually by a system integration engineer in
an interactive tool, if this engineer is familiar with
XPath.
In practice, the engineer could select the node in
the XPath expression that corresponds to the entire
record in the web page that contains the data fields
F_j, j = 1, ..., n. To this end, the XPath expression E_j for
one data field F_j, j = j_1, is displayed to the user si-
multaneously with the target web page. Then, when
the user clicks on a node in the XPath expression,
the corresponding visual element of the web page is
highlighted in the web browser. When multiple nodes
correspond to the same visual element, for example
when there is nesting of multiple HTML tags, the user
should select the highest element in the DOM tree,
i.e., the node that is farthest left in the XPath expres-
sion. The selected tree node that corresponds to the
entire first record will then be an immediate child of
the list node A, per our definition. The construction of
the extraction rules can then proceed as in the com-
pletely automated method.
4 IMPLEMENTATION OF THE
INFORMATION EXTRACTION
METHODS
The proposed idea for information extraction from
web pages has been implemented as an extension of
the Firefox browser. This extension operates by al-
lowing the user to navigate to any web page using
the normal web navigation facilities of the browser,
and designate that page as a target for information ex-
traction. By highlighting individual text elements on
the web page and using a context menu, a set of data
fields F_j, j = 1, ..., n, are designated, and their XPath
expressions E_j are obtained from the DOM tree of the
target web page. All of these XPath expressions are
displayed in the Firefox extension, next to the target
web page.
By placing the mouse over individual tags in any
XPath expression, the corresponding visual element
(cell, row, table, etc.) on the target page is highlighted
using a different background color (Fig. 2). In gen-
eral, the user proceeds over the tags from right to left, stopping
when the entire list (as opposed to a single record
only) is highlighted. Clicking on the corresponding
tag then identifies the list node A, and extraction rules
can be constructed.
The implemented tool provides a high-quality
method for extracting information from structured
web pages. This is due to the use of XPath for speci-
fication of the information to be extracted; intuitively,
when the data records have been generated from a
specific data schema, the XPath expression generated
by the tool uniquely identifies a field in that schema.
In contrast, character matching methods do not pro-
vide such unique identification, and might result in a
large number of false matches.
5 RELATED WORK
The proposed technique for automated construction
of rules for information extraction from web pages
is based on two main ideas: the use of XPath for
rule specification and execution, and the analysis of
the DOM tree of the HTML document. Individu-
ally, these two ideas have been proposed previously.
Schwartz discussed the use of XPath for information
extraction from web pages, and pointed out that the
main challenge is to build the correct XPath expres-
sion that would remain robust to variations in page
structure (Schwartz, 2007). Liu proposed a variety of
algorithms for learning extraction rules from exam-
ple HTML documents that operate by analyzing the
DOM-tree of these examples (Liu, 2007). Some of
these algorithms even allowed for imperfect matching
between the sub-trees of the individual records. How-
ever, he did not use XPath as the language for encod-
ing of the rules, and furthermore, these algorithms re-
quired multiple examples of web pages, whereas both
algorithms proposed in this paper require only a single
example.
6 CONCLUSIONS
We have described two methods that allow easy con-
struction of wrappers for extracting information from
one of the most frequent and useful sources of infor-
mation — web pages with variable number of records
of identical structure. The output of most search en-
gines falls into this category, as well as most web-
based interfaces for querying databases.
Even though a number of methods for wrap-
per generation have already been proposed (some of
which are fairly advanced), most of them do not provide
a very convenient solution to this problem of high
practical importance. Both supervised and unsuper-
vised methods require a large number of examples to
be able to learn the correct extraction rules. In con-
trast, the two methods proposed in this paper would
require only one example web page, and need only
data fields in a single record to be identified. As a re-
sult, the construction of wrappers is not substantially
more difficult than browsing to a web page and high-
lighting the data fields of interest.
Once the data fields have been specified, the first
algorithm proceeds by identifying the node in the
DOM tree that comprises all data fields of interest.
Under the assumption that all other records must in-
clude an identically structured sub-tree, the algorithm
looks for such matching sub-trees elsewhere in the
DOM tree. The node that is the lowest common an-
cestor of all such sub-trees is identified as the list node
of the entire web page.
In contrast, the second proposed method relies on
interactive identification of the list node of the DOM
tree, with the help of a system integration engineer.
The method is intuitive, and requires only basic fa-
miliarity with XPath. List node identification is per-
formed by scanning the XPath expression for any sin-
gle data field in a bottom-up manner, and providing
Figure 2: An interactive tool for construction of extraction rules for information extraction from web pages.
visual feedback to the engineer as to what visual el-
ement in the web page corresponds to the currently
scanned tag. As soon as a tag is found that comprises
the entire list of records, the corresponding DOM-tree
node of this tag is identified as the list node of the web
page. Both algorithms rely on the expressive capabil-
ities and widespread use of the XPath language for
querying XML documents. This also facilitates the
deployment of wrappers when integrating them with
other information systems.
While the described methods are convenient and
handle well web pages of the described type (variable-
length lists with records of fixed structure), there exist
other variations in the structure of web pages that are
of practical interest and could be addressed in the fu-
ture by possible extensions of these algorithms. One
such variation is interstitial ads: advertisements that
appear among items (records) of interest. When such
ads are placed as direct children of the list node of
the DOM-tree, the XPath expression for a data vari-
able can accidentally match some content from the
ad. One possibility for eliminating such erroneous
matches is to include the entire context of a data vari-
able in its XPath expression, so that valid data records
are identified first, and the extraction of individual
data fields occurs only after that. We hope that this
type of matching could possibly be implemented by
modifying the XPath extraction rules in an appropri-
ate manner, to be investigated in the future.
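As a hedged illustration of this idea (our own example, not part of the implemented tool), a structural predicate in the XPath expression can require the enclosing row to look like a complete record before the field is extracted:

    from lxml import html

    # Hypothetical page in which a sponsored row is interleaved with the records.
    page = html.fromstring(
        "<html><body><table><tbody>"
        "<tr><td>ICE 571</td><td>08:15</td><td>Platform 4</td></tr>"
        "<tr><td>Sponsored</td><td>Buy now!</td></tr>"
        "<tr><td>ICE 573</td><td>09:15</td><td>Platform 6</td></tr>"
        "</tbody></table></body></html>")

    # The plain rule also picks up content from the ad row:
    print(page.xpath("/html/body/table/tbody/tr/td[2]/text()"))
    # ['08:15', 'Buy now!', '09:15']

    # Adding the record's context (here: a row with exactly three cells)
    # filters the erroneous match out:
    print(page.xpath("/html/body/table/tbody/tr[count(td)=3]/td[2]/text()"))
    # ['08:15', '09:15']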
REFERENCES
Laender, A. H. F., Ribeiro-Neto, B. A., da Silva, A. S., and
Teixeira, J. S. (2002). A brief survey of Web data ex-
traction tools. SIGMOD Record (ACM Special Interest
Group on Management of Data), 31(2):84–93.
Liu, B. (2007). Web Data Mining: Exploring Hyperlinks,
Contents, and Usage Data. Data-Centric Systems and
Applications. Springer.
Liu, L., Pu, C., and Han, W. (2000). XWRAP: An XML-
enabled wrapper construction system for web infor-
mation sources. In Proceedings of the International
Conference on Data Engineering, pages 611–621.
Sahuguet, A. and Azavant, F. (1999). Building light-weight
wrappers for legacy web data-sources using W4F. In
25th Conference on Very Large Database Systems,
pages 738–741, Edinburgh, UK.
Schwartz, R. L. (2007). HTML scraping with XPath. Linux
Magazine, 2007(4).
van den Heuvel, W.-J. and Thiran, P. (2003). A method-
ology for designing federated enterprise models with
conceptualized legacy wrappers. In Proceedings of
the Fifth International Conference on Enterprise In-
formation Systems ICEIS’03, pages 353–358.