Error Correction for Information Retrieval of Czech Documents

Jiří Martínek, Pavel Král

Abstract

This paper proposes a novel system for information retrieval over a set of scanned documents in the Czech language. The documents are raster images, so they are first converted into text by optical character recognition (OCR). The OCR errors are then corrected, and the corrected texts are indexed and stored in a full-text database, which makes the documents searchable. The paper describes all components of this system, with a particular focus on the proposed OCR correction method. We show experimentally that the proposed approach is effective: it corrects a significant number of errors. As a further contribution, we create a small Czech corpus for evaluating OCR error correction methods.
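
The stages named in the abstract (OCR output, error correction, indexing, search) can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not specify the correction algorithm, so nearest-match lookup against a lexicon (Python's standard `difflib`) is used here as a stand-in. The lexicon, the simulated OCR strings, and all function names are assumptions made for illustration only.

```python
import difflib
import re
from collections import defaultdict

# Hypothetical toy lexicon; a real system would need a large Czech dictionary.
LEXICON = ["oprava", "chyb", "dokument", "text", "vyhledávání", "český"]

TOKEN_RE = re.compile(r"\w+")  # \w matches Unicode letters such as "č" and "ý"


def correct_token(token, lexicon=LEXICON, cutoff=0.7):
    """Return the closest lexicon word, or the token unchanged if none is close."""
    if token in lexicon:
        return token
    matches = difflib.get_close_matches(token, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_text(text):
    """Lowercase, tokenize, and correct each token independently."""
    return [correct_token(tok) for tok in TOKEN_RE.findall(text.lower())]


def build_index(docs):
    """Map each corrected token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in correct_text(text):
            index[tok].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every (corrected) query token."""
    tokens = correct_text(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for tok in tokens[1:]:
        result &= index.get(tok, set())
    return result


# Simulated OCR output with typical character confusions (assumed examples).
ocr_docs = {
    1: "Oprava chyb v dokurnentu",  # "rn" misread for "m"
    2: "Český tcxt",                # "c" misread for "e"
}

index = build_index(ocr_docs)
print(search(index, "dokument"))
print(search(index, "text"))
```

Because correction runs before indexing, a query for the intended word still finds a document whose OCR text contained a misrecognized form. Correcting each token in isolation ignores context, which a practical method for an inflected language such as Czech would likely need to take into account.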

Paper Citation


in Harvard Style

Martínek J. and Král P. (2018). Error Correction for Information Retrieval of Czech Documents. In Proceedings of the 10th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART, ISBN 978-989-758-275-2, pages 630-634. DOI: 10.5220/0006661906300634


in Bibtex Style

@conference{icaart18,
author={Jiří Martínek and Pavel Král},
title={Error Correction for Information Retrieval of Czech Documents},
booktitle={Proceedings of the 10th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART},
year={2018},
pages={630-634},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006661906300634},
isbn={978-989-758-275-2},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 10th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART
TI - Error Correction for Information Retrieval of Czech Documents
SN - 978-989-758-275-2
AU - Martínek J.
AU - Král P.
PY - 2018
SP - 630
EP - 634
DO - 10.5220/0006661906300634