Robust Template Identification of Scanned Documents

Xiaofan Feng, Abdou Youssef, Sithu Sudarsan

2012

Abstract

Identifying low-quality scanned documents is not trivial in real-world settings. Existing research, which focuses mainly on similarity-based approaches, relies on perfect string data from a document. Studies that use image processing techniques for document identification likewise rely on clean data and large differences among templates. Both approaches lose accuracy when the data are noisy or when document templates are too similar to each other. This paper proposes a probabilistic approach to identifying the template of a scanned document. The proposed algorithm works on imperfect OCR output and on document collections containing very similar templates. Experiments and analysis show that this novel probabilistic approach achieves high accuracy on different data sets.



Paper Citation


in Harvard Style

Feng X., Youssef A. and Sudarsan S. (2012). Robust Template Identification of Scanned Documents. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012) ISBN 978-989-8565-29-7, pages 103-110. DOI: 10.5220/0004144601030110


in Bibtex Style

@conference{kdir12,
author={Xiaofan Feng and Abdou Youssef and Sithu Sudarsan},
title={Robust Template Identification of Scanned Documents},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)},
year={2012},
pages={103-110},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004144601030110},
isbn={978-989-8565-29-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)
TI - Robust Template Identification of Scanned Documents
SN - 978-989-8565-29-7
AU - Feng X.
AU - Youssef A.
AU - Sudarsan S.
PY - 2012
SP - 103
EP - 110
DO - 10.5220/0004144601030110