A Robust Page Frame Detection Method for Complex Historical Document Images

Mohammad Reza, Md. Rakib, Syed Bukhari, Andreas Dengel

2019

Abstract

Document layout analysis is the most important part of converting scanned page images into searchable full text. Intensive research is under way on structured and semi-structured documents (journal articles, books, magazines, invoices), but much less on historical documents. Digitizing historical documents is more challenging than digitizing regular structured documents because of poor image quality, damaged characters, and a large amount of textual and non-textual noise. In the scientific community, extraneous symbols from the neighboring page are considered textual noise, while black borders, speckles, rulers, different types of images, etc. along the border of the document are considered non-textual noise. Existing historical document analysis methods cannot handle all of this noise, which is a major reason why the output of Optical Character Recognition (OCR) contains undesired text that must be removed afterward with considerable extra effort. This paper presents a new perspective on historical document image cleanup: detecting the page frame of the document. The goal of the method is to find the actual content area of the document and ignore noise along the page border. We use morphological transforms, the line segment detector, and a geometric matching algorithm to find an ideal page frame of the document. After implementing the page frame method, we evaluate our approach on printed historical documents from the 16th to 19th centuries. The results show that OCR performance for the historical documents increased by 4.49% after applying our page frame detection method. In addition, we were able to increase OCR accuracy by around 6.69% for contemporary documents as well.
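To make the idea of a page frame concrete, the following is a minimal sketch of the morphological step only, not the authors' pipeline. It assumes a binarized page given as a nested list (1 = ink, 0 = background), removes isolated speckles with a morphological opening, and takes the bounding box of the remaining ink as a crude page frame. The line segment detector and the geometric matching stage named in the abstract are omitted, and the function names (`opening`, `estimate_page_frame`) and the kernel size are hypothetical choices made for illustration.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]  # 1 = ink (foreground), 0 = background


def _morph(img: Image, k: int, use_min: bool) -> Image:
    """Apply a k x k square erosion (use_min=True) or dilation (use_min=False)."""
    h, w = len(img), len(img[0])
    r = k // 2
    pick = min if use_min else max
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Pixels outside the image are treated as background (0).
            window = [
                img[yy][xx] if 0 <= yy < h and 0 <= xx < w else 0
                for yy in range(y - r, y + r + 1)
                for xx in range(x - r, x + r + 1)
            ]
            out[y][x] = pick(window)
    return out


def opening(img: Image, k: int = 3) -> Image:
    """Morphological opening: erosion followed by dilation.

    Removes foreground blobs smaller than the k x k kernel (speckles)
    while restoring the extent of larger regions.
    """
    return _morph(_morph(img, k, use_min=True), k, use_min=False)


def estimate_page_frame(img: Image, k: int = 3) -> Optional[Tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) of the ink left after opening.

    This is an illustrative stand-in for a page frame: an inclusive
    bounding box in pixel coordinates, or None if no ink survives.
    """
    cleaned = opening(img, k)
    rows = [y for y, row in enumerate(cleaned) if any(row)]
    cols = [x for x in range(len(cleaned[0])) if any(row[x] for row in cleaned)]
    if not rows:
        return None
    return (cols[0], rows[0], cols[-1], rows[-1])
```

Calling `estimate_page_frame(page)` on a small binary grid returns the box around the main ink region and ignores single-pixel speckles. Large border artifacts such as black scan margins would survive this opening, which is one reason a real system needs further stages such as the line-based matching the abstract describes.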


Paper Citation


in Harvard Style

Reza M., Rakib M., Bukhari S. and Dengel A. (2019). A Robust Page Frame Detection Method for Complex Historical Document Images. In Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-758-351-3, pages 556-564. DOI: 10.5220/0007382405560564


in Bibtex Style

@conference{icpram19,
author={Mohammad Reza and Md. Rakib and Syed Bukhari and Andreas Dengel},
title={A Robust Page Frame Detection Method for Complex Historical Document Images},
booktitle={Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2019},
pages={556-564},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0007382405560564},
isbn={978-989-758-351-3},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - A Robust Page Frame Detection Method for Complex Historical Document Images
SN - 978-989-758-351-3
AU - Reza M.
AU - Rakib M.
AU - Bukhari S.
AU - Dengel A.
PY - 2019
SP - 556
EP - 564
DO - 10.5220/0007382405560564