Feature-based Word Spotting in Ancient Printed

Documents

Khurram Khurshid

, Claudie Faure

and Nicole Vincent

Laboratoire CRIP5 – SIP, Université Paris Descartes, 45 rue des Saints-Pères

75270, Paris Cedex 06, France

UMR CNRS 5141 - GET ENST, 46 rue Barrault, 75634 Paris Cedex 13, France

Abstract. Word spotting/matching in ancient printed documents is an ex-

tremely challenging task. The classical methods, like correlation, seem to fail

when tested on ancient documents. So for that, we have formulated a multi-step

document analysis mechanism which mainly revolves around finding the words

and their characters in the text and attributing each character by some multi-

dimensional features. Words are matched by comparing these multi-

dimensional features of the characters using Dynamic Time warping (DTW).

We have tested this approach on ancient document images provided by the Bib-

liothèque Interuniversitaire de Médecine, Paris. Our Initial experiments exhibit

encouraging results having more than 90% precision and recall rates.

1 Introduction

Spotting words in documents written in the Latin alphabet has received considerable

attention lately. Although a lot of work has already been done in the field of word

spotting, it still remains an inviting and challenging field of research mainly because

the results achieved so far are not satisfactory for huge volumes of data; specially if

the document base consists of a set of ancient printed documents of relatively de-

graded quality.

Lot of work has been done in the domain of word spotting and optical character rec-

ognition in ancient document images. There are plenty of issues and problems related

to ancient printed documents which are discussed in detail in [10] and [12]. These

problems provide a big challenge for the researchers working in this domain to

achieve better results. In this paper though, we will not be addressing these issues.

Rath and Manmatha [1,11] introduced an approach which involves grouping word

images into clusters of similar words by using word image matching. Four profile

features for the word images are found which are then matched using different me-

thods [1]. [7] has used the corner feature correspondences to rank word images by

similarity in historical handwritten manuscripts. Telugu scripts have been characteri-

zed by wavelet representations of the words in [5]. But this wavelet representation

does not give good results for the Latin letters [5]. Adamek et al. introduced the

Khurshid K., Faure C. and Vincent N. (2008).

Feature-based Word Spotting in Ancient Printed Documents.

In Proceedings of the 8th International Workshop on Pattern Recognition in Information Systems, pages 193-198

 SciTePress

matching of word contours for holistic word recognition. The closed word contours

are extracted & matched using elastic contour matching technique [9].

2 Proposed Method

Our model is based on the extraction of different multi-dimensional features for the

character images for the purpose of word matching. As opposed to [1] where features

are extracted from the whole word image, we find the characters of the word and the

features are extracted from these character images, thus giving more precision in

word spotting as later proved by the results.

First of all the document image is binarized using Niblack [3] algorithm. We have

modified the original Niblack to gain better results. The text in the document image is

separated from graphics and the words in the document are extracted by applying

RLSA [2] and finding the connected components in the RLSA image. For each word

component, characters are found using its connected component analysis. The con-

nected components are then fixed using a 2-step process to get the characters for each

of which, we extract a set of features. The query word is searched in the document by

matching the character features using Dynamic Time Warping [6]. Test query is proc-

essed in the same way and the features of the query word’s characters are matched

with the already stored features of the characters in words using DTW. The words for

which the matching distance is less than a threshold are the resulting spotted words.

We now see in detail the different processing stages:

2.1 Binarization

To get better results for word spotting, binarization has to be very good. For that, we

modified the Niblack algorithm [3] to make it more efficient for the ancient docu-

ments and gain better results. The Binarization threshold is found out using the fol-

lowing formula:

NPT

kmT

∑

−

)(

(1)

Where:

k = -0.2, m = mean grey value, p

= pixel value of grey scale image , NPT =

number of pixels

The binarization is tested on different BIUM images [8], having smaller resolution

(550 x 913) as well as higher resolution (1536 x 2549). The results obtained are ex-

tremely satisfactory for different types of documents.

2.2 Words Extraction using RLSA

We consider the words as connected components in the document image where a

horizontal Run Length Smoothing Algorithm (RLSA) has been applied. So after

194

binarization we apply a horizontal RLSA [2] with threshold = 5, and find the con-

nected components in the resulting image to extract the word components (fig 1).

a b

c d

Fig. 1. a) original image b) binary image using modified Niblack c) horizontal RLSA d)

Words extracted.

The word components found in the document image also contain the figures/images

as large components. So to remove figures, we only have to remove those compo-

nents which do not satisfy the following criteria:

Component Area < (Mean comp. area x 5) AND Component height <

(Mean comp. height x 4)

2.3 Extraction of Characters

As we know that a characters in ideal case should be the connected components in the

word, so we carry out a connected component analysis on the extracted word images

to get the characters. But a character may be broken into multiple components or

multiple characters may form a single component. So once we have the components,

we need to fix them so that they correspond to characters. For that, we apply a 2-pass

fixing method (figure 2). In the first pass, we fix the characters comprising 2 or more

components on top of each other. In the 2nd pass, we fix the characters which are

broken into more components. These components overlap with each other at some

point. These include characters like r, g etc.

Fig. 2. a) Original components b) After pass-1 c) After pass-2.

2.4 Feature Extraction

We have employed a set of six features for the character images:

195

Vertical Projection Profile – sum of intensity values in vertical direction; calculated

in gray scale character image and normalized to get valued between 0 and 1.

Upper Character Profile – for a binarized character image, for each column we find

the distance of the first ink pixel from the top of bounding box.

Lower Character Profile – just like in upper word profile, here we find the distance

of the last ink pixel from the top of bounding box. Both upper and lower profiles are

normalized between 0 and 1.

Vertical Histogram – number of ink pixels in one column of binarized character image.

Ink/non-ink Transitions – to capture part of inner structure of a character, we find

the number of non-ink to ink transitions in each binarized character image column.

Middle Row Transition– for the central row of the character image we find the transi-

tional vector for ink/non-ink transitions. We place a 1 for every transition and 0 for

all the non-transitions in the row. Initially we found row transitions for 3 central rows

and applied logical OR on them to get mean central row transitional vector. It means

that if there is an ink pixel in any of the three central rows, we take it as an ink pixel.

We tested using both ways but found that using only the central row gives better

results. So we stuck to that.

For each word, we find the features for each of its characters. After processing a

document image, we create its index file in which we store the location of the word

components, the location of each of its characters and the features of each character.

2.5 Word Matching by DTW

For two words to be considered eligible for matching, we have set bounds on the ratio

of their lengths (number of characters). If the ratio does not lie within a specific inter-

val, we do not try matching the two words. Now for matching the words, we match

the features of the words’ characters using DTW. The advantage of using DTW is

that it is able to account for the nonlinear stretch and compression of the words as it

finds a common time axis[6]. So two same words which differ in dimension will be

matched correctly unlike correlation [4] where the words need to be of the same di-

mension to be matched.

So for matching 2 characters, we treat the feature vectors of both as two series X

= (x

... x

) and Y = (y

.. y

). To determine the DTW distance/cost between these

two time series, we find a matrix D of m x n order which shows the cost/ distance of

aligning the 2 subsequences. The entries of the matrix D are found as:

),(

)1,1(

),1(

)1,(

min),(

yxd

jiD

jiD +

⎪

⎭

⎪

⎬

⎫

⎪

⎩

⎪

⎨

⎧

−−

−

(2)

Here for d(x

, y

), we have used the Euclidean distance in the feature space:

Once all the values of D are calculated, the warping path is determined by back-

tracking along the minimum cost path starting from (m , n). The final matching cost is

the cost D(m , n) divided by the number of steps of the warping path. Two words are

196

ranked similar if this final matching cost is less than a character-threshold (which has

been found after a set of experimentation). More details of DTW can be found in [6]

and [1]. One character of query word is matched with different number of neighbor-

ing characters in the test word depending upon the size of the query word. For each

query character, we find its best match (if exists) from the inspected test characters

and add the character matching cost to the total word cost. After matching the charac-

ters of the 2 words, we normalize the total word cost (which is just the sum of the

individual character costs divided by the number of characters matched ). If for two

words, this normalized word-cost is less than our word-threshold, then we say that the

2 words are same.

3 Experimental Results

The absence of a benchmark image database of ancient printed document database for

word spotting motivated us not only to apply our method to some set of documents but

also to test other approaches on the same set of documents for a rational comparison.

The other approaches include correlation as word matching measurement, the word

feature representation presented in [1] and word matching using our 6 feature set for

word images. We compared the results of all these with our character matching aproach.

The experiments were carried out on the document images provided by the Biblio-

thèque Interuniversitaire de Médecine, Paris [8]. We chose 35 words of different

lengths from different document images for query. The evaluation is done by comput-

ing recall and precision percentages. Results for word spotting using correlation ap-

proach are not very satisfactory with precision just over 70%. By using feature

matching of the whole word images as proposed in [1], the precision dropped to 53%.

Testing by using our 6 features for the whole word images, it can be noticed from

table 1 that our feature set achieves higher recall and precision rates than that of [1].

And by our method of matching words using character features, the results achieved

are much better than either correlation or [1] as shown in table1.

Table 1. Comparison of different approaches.

Method Precision Recall

Correlation

71% 77%

Matching using 4 features of words[1]

53%

69%

Matching using our 6 word features 74% 85%

Matching using character features 92% 97%

4 Conclusions

We have proposed a new method for word spotting which is based on the matching of

features of the characters images by Dynamic Time Warping. The results obtained by

using this method are very encouraging. The number of false positives is very less as

197

compared to the results obtained by cross-correlation as well as [1] which shows the

prospects of taking this method even further by improving different stages and adding

more features to achieve even higher percentages. Currently, the word spotting is

done only on the horizontal text but our upcoming work is focusing on the word spot-

ting in vertical text lines. This work can also be adapted to handwritten manuscripts.

References

1. Tony M. Rath, R. Manmatha,: Word Spotting for historical documents, IJDAR (2007)

9:139-152

2. K. Y. Wang, R. G. Casey and F. M. Wahl,: Document analysis system, IBM J.

Res.Development, Vol. 26, pp. 647-656, (1982).

3. Graham Leedham, Chen Yan, Kalyan Takru, Joie Hadi Nata Tan and Li Mian,: Comparison

of Some Thresholding Algorithms for Text/Background Segmentation in Difficult Docu-

ment Images, 7th International Conference on Document Analysis and Recognition

ICDAR, (2003).

4. P. J. Burt, C. Yen, X. Xu,: Local Correlation Measures for Motion Analysis: a Comparative

Study, IEEE Conf. Pattern Recognition Image Processing (1982), pp. 269-274.

5. A. K. Pujari, C.D. Naidu, B.C. Jinaga,: An adaptive character recogniser for telugu scripts

using multiresolution analysis and associative memory, ICVGIP (2002).

6. Keogh, E. and Pazzani, M.,: Derivative Dynamic Time Warping, First SIAM International

Conference on Data Mining, Chicago, (2001).

7. Jamie L. Rothfeder, Shaolei Feng and Toni M. Rath,: Using corner feature correspondences

to rank word images by similarity, Conference on Computer Vision and Pattern Recogni-

tion Workshop, Madison, USA, (2003), pp. 30-35.

8. Digital Library of BIUM (Bibliothèque Interuniversitaire de Médecine, Paris),

http://www.bium.univ-paris5.fr/histmed/medica.htm

9. Adamek, T., O'Connor, N. E. and Smeaton, A. F.,: Word matching using single closed con-

tours for indexing handwritten historical documents, IJDAR (2007), 9, 153 - 16

10. A.Antonacopoulos, Karatzas D., Krawczyk H. and Wiszniewski B.,: The Lifecycle of a

Digital Historical Document: Structure and Content, ACM Symposium on Document En-

gineering, (2004), 147 –154.

11. Tony M. Rath, R. Manmatha,: Features for Word Spotting in Historical Manuscripts, Sev-

enth International Conference on Document Analysis and Recognition ICDAR, (2003).

12. Baird H. S.,: Difficult and urgent open problems in document image analysis for libraries,

1st International workshop on Document Image Analysis for Libraries, (2004).

198