Authors: Xiaofan Feng 1 ; Abdou Youssef 1 and Sithu Sudarsan 2

Affiliations: 1 The George Washington University, United States ; 2 US Food and Drug Administration, United States

ISBN: 978-989-8565-29-7

Keyword(s): Scanned Document Identification, Maximum A Posteriori Estimation, Information Retrieval.

Related Ontology Subjects/Areas/Topics: Artificial Intelligence ; Data Mining in Electronic Commerce ; Information Extraction ; Knowledge Discovery and Information Retrieval ; Knowledge-Based Systems ; Mining Text and Semi-Structured Data ; Symbolic Systems

Abstract: Identification of low-quality scanned documents is not trivial in real-world settings. Existing research, which focuses mainly on similarity-based approaches, relies on perfect string data from a document. Studies that use image processing techniques for document identification likewise rely on clean data and large differences among templates. Both approaches lose accuracy when the data is noisy or when document templates are too similar to each other. In this paper, a probabilistic approach is proposed to identify the document template of scanned documents. The proposed algorithm works on imperfect OCR output and on document collections containing very similar templates. Experiments and analysis show that this probabilistic approach achieves high accuracy on different data sets.
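Illustrative sketch (not the paper's algorithm): the abstract does not spell out the probabilistic model, so the following only shows what a generic maximum a posteriori (MAP) template choice over noisy OCR tokens could look like. It assumes a bag-of-words likelihood with additive smoothing, so that garbled tokens lower the score without zeroing it out. The template names, the training tokens, and the functions `train` and `identify` are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical training data: template name -> tokenized example documents.
TEMPLATES = {
    "form_a": [
        ["premarket", "notification", "device", "applicant"],
        ["premarket", "notification", "device", "sponsor"],
    ],
    "form_b": [
        ["adverse", "event", "device", "report"],
        ["adverse", "event", "patient", "report"],
    ],
}


def train(templates):
    """Collect per-template token counts and log-priors from example documents."""
    vocab = {tok for docs in templates.values() for doc in docs for tok in doc}
    total_docs = sum(len(docs) for docs in templates.values())
    model = {}
    for name, docs in templates.items():
        counts = Counter(tok for doc in docs for tok in doc)
        model[name] = {
            "log_prior": math.log(len(docs) / total_docs),
            "counts": counts,
            "total": sum(counts.values()),
        }
    return model, vocab


def identify(tokens, model, vocab, alpha=1.0):
    """Return the template maximizing log P(template) + sum of log P(token | template)."""
    v = len(vocab) + 1  # one extra slot shared by all unseen (e.g. garbled) tokens
    best_name, best_score = None, -math.inf
    for name, m in model.items():
        score = m["log_prior"]
        for tok in tokens:
            # Additive smoothing: an unseen token gets a small nonzero probability.
            score += math.log((m["counts"][tok] + alpha) / (m["total"] + alpha * v))
        if score > best_score:
            best_name, best_score = name, score
    return best_name


model, vocab = train(TEMPLATES)
# Two of the three tokens are OCR-garbled; one clean token still decides the match.
guess = identify(["prernarket", "notification", "dev1ce"], model, vocab)
```

In this toy setup the smoothing term is what keeps a single misread token from ruling a template out entirely; a real system would likely also need a noise model for character-level OCR errors, which this sketch leaves out.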


Paper citation in several formats:
Feng, X.; Youssef, A. and Sudarsan, S. (2012). Robust Template Identification of Scanned Documents. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012) ISBN 978-989-8565-29-7, pages 103-110. DOI: 10.5220/0004144601030110

@conference{kdir12,
author={Xiaofan Feng and Abdou Youssef and Sithu Sudarsan},
title={Robust Template Identification of Scanned Documents},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)},
year={2012},
pages={103-110},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004144601030110},
isbn={978-989-8565-29-7},
}

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)
TI - Robust Template Identification of Scanned Documents
SN - 978-989-8565-29-7
AU - Feng, X.
AU - Youssef, A.
AU - Sudarsan, S.
PY - 2012
SP - 103
EP - 110
DO - 10.5220/0004144601030110
