Page Analysis by 2D Conditional Random Fields

Atsuhiro Takasu

National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda, Tokyo, Japan

Keywords:

Page Analysis, Information Extraction, 2D CRF.

Abstract:

This paper applies two-dimensional conditional random fields (2D CRF) to page analysis and information extraction. We discuss the features and labels used for information extraction by 2D CRF, and evaluate the method on the problem of extracting bibliographic components from scanned title pages of academic papers. The experimental results show that 2D CRF improves extraction performance compared with a chain-model CRF.

1 INTRODUCTION

Metadata is important for document utilization and retrieval because it lets a search focus on a specific facet. For example, we can retrieve documents written by a specific author if the metadata includes an author field. Metadata has usually been created manually. However, it is hard to enumerate in advance all the facets that may be used in retrieval, and manual creation of metadata is labor-intensive.

Information extraction (IE) plays an important role in metadata creation because documents usually contain the information needed for the metadata. Although IE is a long-standing research topic, it still attracts researchers, and various machine learning techniques have been examined to improve extraction accuracy. Among them, the conditional random field (CRF) (Lafferty et al., 2001) is a popular one. Councill et al. developed a reference string parser called ParsCit based on CRF (Councill et al., 2008). It detects reference strings in academic papers and extracts bibliographic components such as authors' names and article titles. Because it is a string parser, it uses the chain-model CRF. Zhu et al. proposed a two-dimensional CRF for IE from Web pages (Zhu et al., 2005), which exploits both layout and textual information. Other researchers have applied two-dimensional CRFs to page image understanding (Nicolas et al., 2007; Montreuil et al., 2007).

The purpose of the information extraction discussed in this paper is to segment tokens in two-dimensional space into logical units and to assign labels to them, as in (Takasu, 2008). The intended application of the proposed method is the extraction of bibliographic information such as titles and authors from academic papers. Academic papers usually contain bibliographic information on the first page, as shown in Figure 1, and in reference sections. We first separate each page into portions via page layout analysis; these portions, shown by red rectangles in Figure 1, are called cells. Cells do not always correspond to bibliographic components, which are shown by black rectangles in Figure 1. For example, the title in the figure is segmented into multiple cells. We call the bounding rectangle corresponding to a bibliographic component a logical component. The problem discussed in this paper is to regroup cells into logical components and to assign a label to each component.

Figure 1: Example of page layout. (Labeled regions in the figure: title, author, key words, abstract.)

In this paper we apply a two-dimensional CRF to the extraction of bibliographic components from pages. The task is similar to that of (Councill et al., 2008), but we use a two-dimensional CRF to exploit both layout and textual information. The rest of this paper is organized as follows. Section 2 describes the two-dimensional CRF used in this paper. Section 3 reports an experimental evaluation on bibliographic component extraction from scanned academic papers.


Takasu A..

Page Analysis by 2D Conditional Random Fields.

DOI: 10.5220/0004266505640567

In Proceedings of the 2nd International Conference on Pattern Recognition Applications and Methods (ICPRAM-2013), pages 564-567

ISBN: 978-989-8565-41-9

 2013 SCITEPRESS (Science and Technology Publications, Lda.)

Figure 2: Data Structure for 2D CRF.

Figure 3: Augmented Data Structure. (Labeled regions in the figure: top row, bottom row, left column, right column.)

2 2D MODEL FOR PAGE ANALYSIS

2.1 Data Structure

We assume that a page is decomposed into rectangles called cells by some page segmentation procedure and that the cells form an $n \times m$ matrix. In practice, some cells span multiple columns and/or rows; we assume such cells are decomposed so that the result forms an $n \times m$ matrix, as in (Zhu et al., 2005). We denote the cell in row $i$ and column $j$ by $x_{i,j}$. Each cell contains two kinds of information:

• layout information such as height, width, and font types, and

• text included in the cell.

For a page, $x := \{x_{i,j}\}_{i,j}$ denotes the set of cells in the page.
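The data structure above can be sketched in Python as follows. This is only an illustration under the assumption that the page has already been segmented into a rectangular grid; the names `Cell` and `make_page` and the particular layout fields are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Cell:
    """One cell x_{i,j}: layout information plus the text it contains."""
    left: float    # abscissa of the top-left corner (assumed field)
    top: float     # ordinate of the top-left corner (assumed field)
    width: float
    height: float
    text: str


Page = List[List[Cell]]  # page[i][j] is the cell in row i, column j (0-based)


def make_page(rows: List[List[Cell]]) -> Page:
    """Check that the rows form a rectangular n x m matrix and return it."""
    if not rows or not rows[0]:
        raise ValueError("a page needs at least one cell")
    m = len(rows[0])
    if any(len(row) != m for row in rows):
        raise ValueError("all rows must have the same number of cells")
    return rows


# A toy 2 x 2 page; the coordinates and texts are made up for illustration.
page = make_page([
    [Cell(0, 0, 200, 20, "Page Analysis by 2D"),
     Cell(200, 0, 200, 20, "Conditional Random Fields")],
    [Cell(0, 20, 200, 15, "Atsuhiro Takasu"),
     Cell(200, 20, 200, 15, "")],
])
n, m = len(page), len(page[0])
```

Python indices are 0-based here, whereas the text numbers interior cells from 1 so that row 0 and column 0 can later hold boundary cells.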

2.2 Feature Vector

Consider a set $F$ of feature functions. For a cell $x_{i,j}$, each feature function $f \in F$ calculates a feature value $f(x_{i,j})$. A feature function can be an indicator function that tests whether the cell contains a specific word, or a distance function that calculates the distance between the text of the cell and a specific string. It can also extract a layout feature such as the width of the cell, or calculate a derived quantity such as its area.

The feature functions in $F$ define a feature vector $\mathbf{x}_{i,j} := (f(x_{i,j}))_{f \in F}$. For a page, $\mathbf{x} := \{\mathbf{x}_{i,j}\}_{i,j}$ denotes the set of feature vectors of the page.
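As an illustration of the four kinds of feature function mentioned above, the following sketch builds a small set $F$ and evaluates the resulting feature vector for one cell. The dictionary representation of a cell, the choice of Levenshtein distance as the string distance, and the example word and string are assumptions made for this sketch only.

```python
from typing import Callable, Dict, List

Cell = Dict[str, object]  # assumed representation: {"text", "width", "height"}
Feature = Callable[[Cell], float]


def contains_word(word: str) -> Feature:
    """Indicator feature: 1.0 if the cell text contains `word`, else 0.0."""
    return lambda cell: 1.0 if word.lower() in str(cell["text"]).lower() else 0.0


def edit_distance_to(target: str) -> Feature:
    """Distance feature: Levenshtein distance between the cell text and `target`."""
    def feature(cell: Cell) -> float:
        text = str(cell["text"])
        prev = list(range(len(target) + 1))  # distances from the empty prefix
        for a in text:
            cur = [prev[0] + 1]
            for k, b in enumerate(target, start=1):
                cur.append(min(prev[k] + 1,               # delete a
                               cur[k - 1] + 1,            # insert b
                               prev[k - 1] + (a != b)))   # substitute
            prev = cur
        return float(prev[-1])
    return feature


def width(cell: Cell) -> float:
    """Layout feature: width of the cell."""
    return float(cell["width"])


def area(cell: Cell) -> float:
    """Derived layout feature: area of the cell."""
    return float(cell["width"]) * float(cell["height"])


F: List[Feature] = [contains_word("keyword"), edit_distance_to("Keywords:"), width, area]


def feature_vector(cell: Cell, functions: List[Feature] = F) -> List[float]:
    """The vector (f(x_{i,j}))_{f in F} for one cell."""
    return [f(cell) for f in functions]


cell = {"text": "Keywords", "width": 80.0, "height": 12.0}
vector = feature_vector(cell)  # [1.0, 1.0, 80.0, 960.0]
```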

2.3 Labels

A CRF associates the observed feature vector of each cell with a hidden label probabilistically. Let $L$ denote the set of labels. We denote the label of the cell in row $i$ and column $j$ by $z_{i,j}$. For a page, $z := \{z_{i,j}\}_{i,j}$ denotes the set of labels of the page, as in Figure 2.

Basically, the labels correspond to types of logical components such as author and title. However, a logical component may be split into multiple cells. To handle this situation, we use additional labels that denote the boundaries of the cells constituting a logical component.
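The paper does not list the additional boundary labels. Purely as an illustration, the sketch below assumes a begin/inside scheme in which the first cell of a logical component in row-major order receives a `B-` prefix and the remaining cells an `I-` prefix. The function name, the prefixes, and the `O` label for cells outside any component are all hypothetical.

```python
from typing import List, Optional, Tuple

# Each grid entry is (component type, component id), or None for a cell that
# belongs to no logical component. This input format is an assumption.
Entry = Optional[Tuple[str, int]]


def boundary_labels(grid: List[List[Entry]]) -> List[List[str]]:
    """Assign begin/inside labels to the cells of each logical component."""
    seen = set()
    labeled = []
    for row in grid:
        out_row = []
        for entry in row:
            if entry is None:
                out_row.append("O")
                continue
            kind, _component_id = entry
            prefix = "I" if entry in seen else "B"  # first cell starts the component
            seen.add(entry)
            out_row.append(f"{prefix}-{kind}")
        labeled.append(out_row)
    return labeled


grid = [[("title", 0), ("title", 0)],
        [("author", 1), None]]
labels = boundary_labels(grid)
```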

2.4 Page Boundary

In order to handle the boundary conditions of pages, we introduce special boundary labels and use the augmented label set consisting of $L$ together with these boundary labels. We add cells in

• a top row carrying its boundary label, i.e., $z_{0,j}$ for all columns,

• a bottom row carrying its boundary label, i.e., $z_{n+1,j}$ for all columns,

• a left column carrying its boundary label, i.e., $z_{i,0}$ for all rows, and

• a right column carrying its boundary label, i.e., $z_{i,m+1}$ for all rows,

as in Fig. 3.
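The augmentation can be sketched as padding the label matrix with one extra row or column on each side. The four label strings below are placeholders chosen for this sketch, and the use of a distinct label per side is an assumption.

```python
from typing import List, Optional

# Placeholder boundary labels (assumed spellings, one per side of the page).
TOP, BOTTOM, LEFT, RIGHT = "<top>", "<bottom>", "<left>", "<right>"


def augment(labels: List[List[str]]) -> List[List[Optional[str]]]:
    """Surround an n x m label matrix with boundary rows and columns.

    The result has (n + 2) x (m + 2) entries, so the interior label z_{i,j}
    sits at augmented[i][j] for 1 <= i <= n and 1 <= j <= m. The four corner
    entries are left as None because no pair of adjacent cells uses them.
    """
    n, m = len(labels), len(labels[0])
    augmented: List[List[Optional[str]]] = [[None] * (m + 2) for _ in range(n + 2)]
    for j in range(1, m + 1):
        augmented[0][j] = TOP          # z_{0,j}
        augmented[n + 1][j] = BOTTOM   # z_{n+1,j}
    for i in range(1, n + 1):
        augmented[i][0] = LEFT         # z_{i,0}
        augmented[i][m + 1] = RIGHT    # z_{i,m+1}
        for j in range(1, m + 1):
            augmented[i][j] = labels[i - 1][j - 1]
    return augmented


augmented = augment([["title", "author"]])
```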

2.5 Likelihood of 2D CRF

Two types of parameters are introduced in the CRF. The first concerns the observed feature vectors. For each feature function $f \in F$ and label $l \in L$, let $\lambda_{l,f}$ denote the weight of the combination of $f$ and $l$. Then the likelihood of observing the feature value $f(x_{i,j})$ under the hidden label $l$ is proportional to

$$\exp\left(\lambda_{l,f}\, f(x_{i,j})\right). \tag{1}$$

The second concerns the relationship between the labels of adjacent cells, and is further split into horizontal and vertical relationships. For each pair $(l, g)$ of labels in the augmented label set, we introduce horizontal and vertical parameters $\phi_{l,g}$ and $\theta_{l,g}$, respectively. Then the likelihoods that the labels $l$ and $g$ appear in horizontally and in vertically adjacent cells are proportional to

$$\exp(\phi_{l,g}) \tag{2}$$

and

$$\exp(\theta_{l,g}), \tag{3}$$

respectively.

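Using these likelihoods, the joint likelihood of the page feature vectors $\mathbf{x}$ and their labels $z$ is given by

$$\Pr(\mathbf{x}, z) \propto \exp\Bigl(\sum_{i,j} g(i,j)\Bigr) \cdot \underbrace{\exp\Bigl(\sum_{i=1}^{n} \theta_{z_{i,m},\, z_{i,m+1}}\Bigr)}_{\text{for right column}} \cdot \underbrace{\exp\Bigl(\sum_{j=1}^{m} \phi_{z_{n,j},\, z_{n+1,j}}\Bigr)}_{\text{for bottom row}}, \tag{4}$$

where we abbreviate $\sum_{i=1}^{n}\sum_{j=1}^{m}$ as $\sum_{i,j}$, and

$$g(i,j) := \sum_{f \in F} \lambda_{z_{i,j},f}\, f(x_{i,j}) + \phi_{z_{i-1,j},\, z_{i,j}} + \theta_{z_{i,j-1},\, z_{i,j}}. \tag{5}$$

Note that $n$ and $m$ differ from page to page, but we use the same symbols for simplicity.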

The conditional likelihood for the CRF is given by

$$\Pr(z \mid \mathbf{x}) := \frac{\Pr(\mathbf{x}, z)}{\Pr(\mathbf{x})} = \frac{\Pr(\mathbf{x}, z)}{\sum_{z'} \Pr(\mathbf{x}, z')}, \tag{6}$$

where the denominator is called the partition function and is denoted by $Z(\mathbf{x})$.
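To make Eqs. (4) to (6) concrete, the sketch below evaluates the unnormalized score of a labeling and obtains the partition function by enumerating every labeling of a tiny page. Enumeration is exponential in the number of cells and is used here only for illustration; it is not the inference procedure of the paper. The pairwise terms follow the index pattern of Eq. (5), with $\phi$ coupling cells $(i-1, j)$ and $(i, j)$ and $\theta$ coupling cells $(i, j-1)$ and $(i, j)$. The boundary label strings, the dictionary layout of the parameters, and the toy values are assumptions.

```python
import itertools
import math
from collections import defaultdict

# Placeholder names for the boundary labels of Section 2.4 (assumed spellings).
TOP, BOTTOM, LEFT, RIGHT = "<top>", "<bottom>", "<left>", "<right>"


def log_score(x, z, lam, phi, theta):
    """Log of the unnormalized joint likelihood in Eq. (4).

    x[i][j] is the feature vector (a list of floats) and z[i][j] the label of
    the cell in row i, column j of an n x m page (0-based Python indices).
    """
    n, m = len(z), len(z[0])

    def label(i, j):
        # 1-based lookup into the augmented grid; corners are never queried.
        if i == 0:
            return TOP
        if i == n + 1:
            return BOTTOM
        if j == 0:
            return LEFT
        if j == m + 1:
            return RIGHT
        return z[i - 1][j - 1]

    total = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            l = label(i, j)
            # g(i, j) of Eq. (5): feature term plus the two pairwise terms.
            total += sum(lam[(l, k)] * v for k, v in enumerate(x[i - 1][j - 1]))
            total += phi[(label(i - 1, j), l)] + theta[(label(i, j - 1), l)]
    # Extra terms of Eq. (4) for the right column and the bottom row.
    total += sum(theta[(label(i, m), RIGHT)] for i in range(1, n + 1))
    total += sum(phi[(label(n, j), BOTTOM)] for j in range(1, m + 1))
    return total


def conditional(x, z, lam, phi, theta, labels):
    """Pr(z | x) of Eq. (6), with Z(x) computed by enumerating every labeling."""
    n, m = len(z), len(z[0])
    logs = []
    for flat in itertools.product(labels, repeat=n * m):
        candidate = [list(flat[r * m:(r + 1) * m]) for r in range(n)]
        logs.append(log_score(x, candidate, lam, phi, theta))
    peak = max(logs)  # log-sum-exp for numerical stability
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logs))
    return math.exp(log_score(x, z, lam, phi, theta) - log_z)


labels = ["title", "author"]
x = [[[1.0], [1.0]], [[0.0], [0.0]]]            # 2 x 2 page, one feature per cell
z = [["title", "title"], ["author", "author"]]
lam, phi, theta = defaultdict(float), defaultdict(float), defaultdict(float)
uniform = conditional(x, z, lam, phi, theta, labels)  # all weights zero
lam[("title", 0)] = 2.0            # reward feature 0 under the label "title"
theta[("title", "title")] = 1.0    # reward "title" beside "title" in a row
tuned = conditional(x, z, lam, phi, theta, labels)
```

With all weights at zero every one of the 16 labelings is equally likely, so `uniform` is 1/16; the two non-zero weights then raise the probability of the labeling `z`.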

2.6 Parameter Estimation

For parameter estimation, we use training data $T$ consisting of pages, each represented by a pair $(\mathbf{x}, z)$ of the feature vectors and labels of the cells in the page.

Let us consider the regularized log-likelihood of the training data given by

$$\mathcal{L}(\lambda, \phi, \theta) := \log \prod_{(\mathbf{x}, z) \in T} \Pr(z \mid \mathbf{x}) - \sum_{f,l} \lambda_{l,f}^{2} - \sum_{l,l'} \phi_{l,l'}^{2} - \sum_{l,l'} \theta_{l,l'}^{2}, \tag{7}$$

where the second, third, and fourth terms are $L_2$ regularization of the parameters. We estimate the parameters that maximize this regularized likelihood, i.e.,

$$\operatorname*{argmax}_{\lambda, \phi, \theta} \mathcal{L}(\lambda, \phi, \theta). \tag{8}$$

We can solve the optimization problem by a quasi-Newton method such as L-BFGS, in the same way as (Zhu et al., 2005). We omit the details due to the page limitation.
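The following sketch evaluates a regularized objective of the form of Eq. (7) on a toy training set and takes one ascent step. It is not the optimizer of the paper: a finite-difference gradient step stands in for L-BFGS, the conditional likelihood is computed by brute-force enumeration, and unit-weight squared penalties are assumed. The parameter keys, boundary label strings, step size, and training page are hypothetical.

```python
import itertools
import math
from collections import defaultdict

LABELS = ["title", "author"]


def log_score(x, z, w):
    """Log of the unnormalized joint of Eq. (4); w maps parameter keys to weights."""
    n, m = len(z), len(z[0])

    def label(i, j):  # 1-based lookup with placeholder boundary labels
        if i == 0:
            return "<top>"
        if i == n + 1:
            return "<bottom>"
        if j == 0:
            return "<left>"
        if j == m + 1:
            return "<right>"
        return z[i - 1][j - 1]

    total = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            l = label(i, j)
            total += sum(w[("lam", l, k)] * v for k, v in enumerate(x[i - 1][j - 1]))
            total += w[("phi", label(i - 1, j), l)] + w[("theta", label(i, j - 1), l)]
    total += sum(w[("theta", label(i, m), "<right>")] for i in range(1, n + 1))
    total += sum(w[("phi", label(n, j), "<bottom>")] for j in range(1, m + 1))
    return total


def log_conditional(x, z, w):
    """log Pr(z | x), with the partition function obtained by enumeration."""
    n, m = len(z), len(z[0])
    logs = [log_score(x, [list(c[r * m:(r + 1) * m]) for r in range(n)], w)
            for c in itertools.product(LABELS, repeat=n * m)]
    peak = max(logs)
    return log_score(x, z, w) - (peak + math.log(sum(math.exp(v - peak) for v in logs)))


def objective(train, w):
    """Log-likelihood of the training pages minus squared penalties on all weights."""
    fit = sum(log_conditional(x, z, w) for x, z in train)
    return fit - sum(v * v for v in w.values())


def ascent_step(train, w, rate=0.1, eps=1e-6):
    """One gradient-ascent step using forward finite differences."""
    base = objective(train, w)  # also registers every parameter key in w
    grad = {}
    for key in list(w):
        w[key] += eps
        grad[key] = (objective(train, w) - base) / eps
        w[key] -= eps
    for key, value in grad.items():
        w[key] += rate * value
    return objective(train, w)


train = [([[[1.0], [0.0]]], [["title", "author"]])]  # one 1 x 2 page
w = defaultdict(float)
before = objective(train, w)   # log(1/4): four equally likely labelings
after = ascent_step(train, w)  # the step increases the objective
```

In practice the gradient has a closed form (empirical minus expected counts, minus the penalty gradient), and a library quasi-Newton routine would replace `ascent_step`.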

3 EXPERIMENTAL RESULT

We applied the proposed model to extract bibliographic components from title pages of academic journals.

3.1 Data Sets

The task in the experiment is to assign labels to cells obtained from scanned and OCRed title pages. We evaluated the proposed model using the following three kinds of academic papers.

• Papers issued by the Information Processing So-

ciety of Japan (IPSJ): In this experiment, we used

the papers issued in 2003.

• Papers issued by the (IEICE-E) : In this experi-

ment, we used the papers issued in 2003.

• Papers issued by the (IEICE-J) : In this experi-

ment, we used the papers issued in 2003 and 2004.

For each dataset, we applied 5-fold cross validation.
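A 5-fold split of a data set can be sketched as below. The paper does not state how pages were assigned to folds; the random shuffle, the seed, and the function name are assumptions of this sketch.

```python
import random
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def k_fold_splits(items: Sequence[T], k: int = 5,
                  seed: int = 0) -> Iterator[Tuple[List[T], List[T]]]:
    """Yield (train, test) pairs so that each item is held out exactly once."""
    order = list(items)
    random.Random(seed).shuffle(order)        # assumed: random fold assignment
    folds = [order[i::k] for i in range(k)]   # k folds of nearly equal size
    for held_out in range(k):
        test = folds[held_out]
        train = [item for f in range(k) if f != held_out for item in folds[f]]
        yield train, test


pages = list(range(23))  # stand-ins for the pages of one data set
splits = list(k_fold_splits(pages))
```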

3.2 Evaluation Metric

As the evaluation metric, we used the accuracy with which the model assigns labels to the cells of the test matrices. We regard the model as having labeled a matrix successfully only when it assigns the correct label to every cell in the matrix. The accuracy is defined by

$$\text{accuracy} = \frac{\text{number of successfully labeled matrices}}{\text{total number of test matrices}}.$$
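The metric is strict: a single mislabeled cell makes the whole matrix count as a failure. A minimal sketch, assuming each page is represented as a nested list of label strings (the function name is hypothetical):

```python
from typing import List

LabelMatrix = List[List[str]]


def page_accuracy(predicted: List[LabelMatrix], gold: List[LabelMatrix]) -> float:
    """Fraction of test matrices in which every cell is labeled correctly."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must contain the same number of pages")
    if not gold:
        raise ValueError("at least one test matrix is required")
    # Nested lists compare equal only if every cell label matches.
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


gold = [[["title", "title"], ["author", "author"]],
        [["title"], ["abstract"]]]
predicted = [[["title", "title"], ["author", "author"]],
             [["title"], ["author"]]]          # one wrong cell on the second page
score = page_accuracy(predicted, gold)         # 0.5
```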

3.3 Experiment Procedure

To create ground truth data, we manually extracted the bounding rectangles corresponding to logical components on the first page of each paper. The extracted logical components are the article title, authors, abstract, and keywords. Since a logical component usually consists of multiple cells, each cell is assigned the label of the logical component that contains it.
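The assignment of ground-truth labels by containment can be sketched as follows. The rectangle representation, the function names, and the default label `other` for cells outside every annotated component are assumptions, since the paper does not say how such cells were handled.

```python
from typing import List, NamedTuple, Tuple


class Rect(NamedTuple):
    left: float
    top: float
    width: float
    height: float


def contains(outer: Rect, inner: Rect) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return (outer.left <= inner.left
            and outer.top <= inner.top
            and inner.left + inner.width <= outer.left + outer.width
            and inner.top + inner.height <= outer.top + outer.height)


def label_cells(cells: List[Rect], components: List[Tuple[str, Rect]],
                default: str = "other") -> List[str]:
    """Give each cell the label of the first logical component containing it."""
    labels = []
    for cell in cells:
        label = default
        for name, box in components:
            if contains(box, cell):
                label = name
                break
        labels.append(label)
    return labels


components = [("title", Rect(0, 0, 400, 40)), ("author", Rect(0, 50, 400, 20))]
cells = [Rect(0, 0, 400, 20), Rect(0, 20, 300, 20),
         Rect(100, 50, 200, 20), Rect(0, 700, 400, 10)]
cell_labels = label_cells(cells, components)
```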

We used the following features of each cell (Ohta

et al., 2010):

• abscissa,

• ordinate,

• width,

• height,

• gap between adjacent cells,

• average character width in the cell,

• average character height in the cell,

• number of characters in the cell,

• proportion of alphanumerics,

• proportion of hiragana and katakana,

• proportion of symbols, and

• presence of predefined keywords.

Table 1: Extraction accuracy.

Method   IPSJ    IEICE-E   IEICE-J
Chain    0.938   0.949     0.798
2D       0.962   0.964     0.855
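The character-based features listed in this section could be computed along the following lines. The exact definitions are not given in the paper, so this sketch assumes that whitespace is ignored, that alphanumerics means ASCII letters and digits, that symbols are characters in the Unicode punctuation and symbol categories, and that the keyword list is a hypothetical example.

```python
import unicodedata
from typing import Dict, Sequence


def text_features(text: str,
                  keywords: Sequence[str] = ("abstract", "keywords")) -> Dict[str, float]:
    """Character-class features of the text in one cell."""
    chars = [c for c in text if not c.isspace()]
    total = len(chars)

    def proportion(predicate) -> float:
        return sum(1 for c in chars if predicate(c)) / total if total else 0.0

    return {
        "num_chars": float(total),
        # ASCII letters and digits only; full-width forms are not counted.
        "alnum": proportion(lambda c: c.isascii() and c.isalnum()),
        # Hiragana (U+3040..U+309F) and katakana (U+30A0..U+30FF) blocks.
        "kana": proportion(lambda c: "\u3040" <= c <= "\u30ff"),
        # Unicode general categories P* (punctuation) and S* (symbols).
        "symbol": proportion(lambda c: unicodedata.category(c)[0] in "PS"),
        "keyword": 1.0 if any(k in text.lower() for k in keywords) else 0.0,
    }


features = text_features("Abstract: これはテスト")
```

For the example string there are 15 non-space characters: 8 ASCII letters, 1 symbol (the colon), and 6 kana.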

3.4 Experimental Result

For comparison, we applied the chain-model CRF examined in (Ohta et al., 2010). The OCR system used in this experiment produces a character sequence from a scanned document image according to the result of its layout analysis. We converted each character sequence into a word sequence and applied the chain-model CRF to it.

Table 1 shows the extraction accuracy. "Chain" denotes the result of the chain model, whereas "2D" denotes the result of the proposed method. As shown in the table, the two-dimensional CRF achieved better accuracy than the chain model on all three data sets. The improvement was largest for the IEICE-J data set (0.798 to 0.855). This is because the OCR system often analyzed the layout of IEICE-J pages incorrectly, which produced incorrectly ordered sequences and degraded the accuracy of the chain-model CRF. In contrast, the two-dimensional CRF is not affected by the order in which the OCR system outputs cells, and therefore it improves the extraction accuracy.

4 CONCLUSIONS

This paper examined a two-dimensional CRF for extracting bibliographic components from scanned page images of academic papers. We showed experimentally that the proposed method is effective, especially for pages whose layout is analyzed incorrectly.

Currently we use a two-dimensional CRF that handles matrices. With this model we can assign a label to each cell, but we need a post-processing step that extracts logical components by merging cells. We plan to extend the model to handle tree-structured data such as XY-trees, which would enable us to extract logical components and label them simultaneously. In this paper we manually determined the augmented labels for merging cells into logical components. We are interested in designing these labels systematically.
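The post-processing step is not specified in the paper. One simple possibility, shown here only as an assumed sketch, is to merge 4-connected cells that share the same label into one logical component by flood fill; the function name and output format are hypothetical.

```python
from typing import List, Tuple

Component = Tuple[str, List[Tuple[int, int]]]


def merge_cells(labels: List[List[str]]) -> List[Component]:
    """Group 4-connected cells with equal labels into logical components."""
    n, m = len(labels), len(labels[0])
    seen = [[False] * m for _ in range(n)]
    components: List[Component] = []
    for i in range(n):
        for j in range(m):
            if seen[i][j]:
                continue
            label = labels[i][j]
            stack, members = [(i, j)], []
            seen[i][j] = True
            while stack:  # iterative flood fill from (i, j)
                a, b = stack.pop()
                members.append((a, b))
                for c, d in ((a - 1, b), (a + 1, b), (a, b - 1), (a, b + 1)):
                    if (0 <= c < n and 0 <= d < m
                            and not seen[c][d] and labels[c][d] == label):
                        seen[c][d] = True
                        stack.append((c, d))
            components.append((label, sorted(members)))
    return components


merged = merge_cells([["title", "title"],
                      ["author", "title"]])
```

This grouping cannot separate two adjacent components of the same type, which is one reason boundary labels are needed in the model itself.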

REFERENCES

Councill, I. G., Giles, C. L., and Kan, M.-Y. (2008). ParsCit: An open-source CRF reference string parsing package. In Intl. Conf. on Language Resources and Evaluation (LREC 2008), pages 661–667.

Lafferty, J., McCallum, A., and Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning (ICML 2001), pages 282–289.

Montreuil, F., Grosicki, E., Heutte, L., and Nicolas, S. (2007). Unconstrained handwritten document layout extraction using 2D conditional random fields. In International Conference on Document Analysis and Recognition (ICDAR 2009), pages 407–411.

Nicolas, S., Dardenne, J., Paquet, T., and Heutte, L. (2007). Document image segmentation using a 2D conditional random field model. In International Conference on Document Analysis and Recognition (ICDAR 2007), pages 407–411.

Ohta, M., Inoue, R., and Takasu, A. (2010). Empirical evaluation of active sampling for CRF-based analysis of pages. In International Conference on Information Reuse and Integration (IEEE IRI2010), pages 13–18.

Takasu, A. (2008). Information extraction by two dimensional parser. In Proc. IEEE Intl. Conf. on Tools with Artificial Intelligence, pages 333–340.

Zhu, J., Nie, Z., Wen, J.-R., Zhang, B., and Ma, W.-Y. (2005). 2D conditional random fields for web information extraction. In International Conference on Machine Learning (ICML 2005).

PageAnalysisby2DConditionalRandomFields

567