Feature-based Word Spotting in Ancient Printed Documents

Khurram Khurshid, Claudie Faure, Nicole Vincent

2008

Abstract

Word spotting (word matching) in ancient printed documents is an extremely challenging task. Classical methods such as correlation seem to fail when tested on ancient documents. We have therefore formulated a multi-step document analysis mechanism that revolves around finding the words and their characters in the text and describing each character by a set of multi-dimensional features. Words are matched by comparing these multi-dimensional character features using Dynamic Time Warping (DTW). We have tested this approach on ancient document images provided by the Bibliothèque Interuniversitaire de Médecine, Paris. Our initial experiments show encouraging results, with precision and recall rates above 90%.
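
The DTW comparison mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the character features, the local distance, or any normalization, so the sketch assumes each word is a sequence of equal-length numeric feature vectors, uses Euclidean distance as the local cost, and returns the unnormalized alignment cost. The function name `dtw_distance` is hypothetical.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Classic DTW cost between two sequences of feature vectors.

    Assumptions (not taken from the paper): every vector has the same
    length, the local cost is Euclidean distance, and the result is the
    raw accumulated cost without length normalization.
    """
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] = best accumulated cost aligning seq_a[:i] with seq_b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # seq_a advances, seq_b element repeats
                cost[i][j - 1],      # seq_b advances, seq_a element repeats
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


# Hypothetical two-dimensional features for illustration only.
query = [(0.0, 0.0), (1.0, 1.0)]
candidate = [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0)]
print(dtw_distance(query, candidate))
```

Because DTW lets one element align with several elements of the other sequence, the two example sequences match at zero cost even though their lengths differ. This tolerance is what makes DTW attractive when character segmentation is imperfect.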


Paper Citation


in Harvard Style

Khurshid K., Faure C. and Vincent N. (2008). Feature-based Word Spotting in Ancient Printed Documents. In Proceedings of the 8th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2008), ISBN 978-989-8111-42-5, pages 193-198. DOI: 10.5220/0001729201930198


in Bibtex Style

@conference{pris08,
author={Khurram Khurshid and Claudie Faure and Nicole Vincent},
title={Feature-based Word Spotting in Ancient Printed Documents},
booktitle={Proceedings of the 8th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2008)},
year={2008},
pages={193-198},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001729201930198},
isbn={978-989-8111-42-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 8th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2008)
TI - Feature-based Word Spotting in Ancient Printed Documents
SN - 978-989-8111-42-5
AU - Khurshid K.
AU - Faure C.
AU - Vincent N.
PY - 2008
SP - 193
EP - 198
DO - 10.5220/0001729201930198