A Word Association Based Approach for Improving Retrieval Performance from Noisy OCRed Text

Anirban Chakraborty, Kripabandhu Ghosh, Utpal Roy

2014

Abstract

OCR errors substantially degrade retrieval performance. Prior work has addressed the modelling and correction of OCR errors, but most existing systems rely on language-dependent resources or training texts to study the nature of the errors. Little research has been reported on improving retrieval performance from erroneous text when no training data is available. We propose an algorithm for detecting OCR errors and improving retrieval performance from an erroneous corpus. We present two versions of the algorithm: one based on word co-occurrence and the other based on Pointwise Mutual Information. The algorithm uses no training data and no language-specific resources such as a thesaurus. It also assumes no knowledge of the language beyond the fact that the word delimiter is a blank space. We tested the algorithm on the erroneous Bangla FIRE collection and obtained significant improvements.
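The two association measures named in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not specify. It only shows, under stated assumptions, how document-level co-occurrence counts and Pointwise Mutual Information might be computed from a corpus tokenized on blank spaces alone. The function names `association_scores` and `pair_key`, the toy corpus, and the choice of the whole document as the co-occurrence window are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def association_scores(documents):
    """Return document-level co-occurrence counts and PMI for word pairs.

    Assumption: two words co-occur when they appear in the same document.
    """
    n_docs = len(documents)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing both words of a pair

    for text in documents:
        # The only language knowledge used: words are delimited by blank space.
        vocab = sorted(set(text.split()))
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))

    pmi = {}
    for (w1, w2), joint in pair_df.items():
        p_joint = joint / n_docs
        p_w1 = word_df[w1] / n_docs
        p_w2 = word_df[w2] / n_docs
        # PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) )
        pmi[(w1, w2)] = math.log2(p_joint / (p_w1 * p_w2))
    return pair_df, pmi


def pair_key(a, b):
    """Order a word pair the same way association_scores stores it."""
    return (a, b) if a <= b else (b, a)


# Hypothetical toy corpus; "rivcr" stands in for an OCR-corrupted "river".
docs = [
    "river bank water",
    "river bank flood",
    "rivcr bank water",
    "stock market",
]
cooc, pmi = association_scores(docs)
print(cooc[pair_key("river", "bank")])
print(pmi[pair_key("rivcr", "bank")])
```

In this toy corpus the corrupted form "rivcr" shares context words ("bank", "water") with "river", which is the kind of signal a word-association approach could exploit without training data or a lexicon.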



Paper Citation


in Harvard Style

Chakraborty A., Ghosh K. and Roy U. (2014). A Word Association Based Approach for Improving Retrieval Performance from Noisy OCRed Text. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014) ISBN 978-989-758-048-2, pages 450-456. DOI: 10.5220/0005157304500456


in Bibtex Style

@conference{kdir14,
author={Anirban Chakraborty and Kripabandhu Ghosh and Utpal Roy},
title={A Word Association Based Approach for Improving Retrieval Performance from Noisy OCRed Text},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014)},
year={2014},
pages={450-456},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005157304500456},
isbn={978-989-758-048-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014)
TI - A Word Association Based Approach for Improving Retrieval Performance from Noisy OCRed Text
SN - 978-989-758-048-2
AU - Chakraborty A.
AU - Ghosh K.
AU - Roy U.
PY - 2014
SP - 450
EP - 456
DO - 10.5220/0005157304500456