Entity Matching in OCRed Documents with Redundant Databases

Nihel Kooli, Abdel Belaïd

2015

Abstract

This paper presents an entity recognition approach on documents recognized by OCR (Optical Character Recognition). The recognition is formulated as a task of matching entities in a database with their representations in a document. A pre-processing step of entity resolution is performed on the database to provide a better representation of the entities. For this, a statistical model based on record linkage and record merge phases is used. Furthermore, documents recognized by OCR can contain noisy data and altered structure. An adapted method is proposed to retrieve the entities from their structures by tolerating possible OCR errors. A modified version of EROCS is applied to this problem by adapting the notion of segments to blocks provided by the OCR. It handles document segments to match the document to its corresponding entities. For efficiency, a process of data labeling in the document is applied in order to filter the compared entities and segments. The evaluation on business documents shows a significant improvement of matching rates compared to those of EROCS.
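As a rough illustration of what matching database entities to OCR blocks while tolerating OCR errors can look like, the sketch below scores each entity against the text blocks of a document with a character-level similarity. It is not the authors' method and not EROCS: the similarity measure (Python's difflib), the token-window comparison, the averaging of attribute scores, the 0.8 threshold, and all names (match_entity, best_window_score, the sample entities) are assumptions made for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_window_score(value: str, block: str) -> float:
    """Best similarity between `value` and any run of consecutive block
    tokens that has the same number of tokens as `value`."""
    n = len(value.split())
    tokens = block.split()
    if n == 0 or not tokens:
        return 0.0
    if len(tokens) <= n:
        return similarity(value, block)
    return max(
        similarity(value, " ".join(tokens[i:i + n]))
        for i in range(len(tokens) - n + 1)
    )


def match_entity(blocks, entities, threshold=0.8):
    """Return (entity_id, score) for the best-scoring entity, or
    (None, score) when no entity reaches `threshold` (hypothetical value).

    `blocks` is a list of OCR text blocks; `entities` maps an entity id
    to a dict of attribute name -> attribute value.
    """
    best_id, best_score = None, 0.0
    for entity_id, attributes in entities.items():
        if not attributes:
            continue
        # For each attribute, keep its best score over all blocks,
        # then average the attribute scores into one entity score.
        scores = [
            max((best_window_score(v, b) for b in blocks), default=0.0)
            for v in attributes.values()
        ]
        score = sum(scores) / len(scores)
        if score > best_score:
            best_id, best_score = entity_id, score
    if best_score < threshold:
        return None, best_score
    return best_id, best_score


# Hypothetical database records and OCR output ("o" read as "0", "I" as "l").
entities = {
    "E1": {"name": "Dupont Industries", "city": "Nancy"},
    "E2": {"name": "Martin Logistique", "city": "Metz"},
}
blocks = ["Facture Dup0nt lndustries", "54000 Nancy France"]
matched_id, matched_score = match_entity(blocks, entities)
```

With the sample data, the misread company name still scores high enough for the document to be matched to E1. A document whose blocks resemble no record falls below the threshold and yields no match.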

Table 2: Entity matching rates.


Paper Citation


in Harvard Style

Kooli N. and Belaïd A. (2015). Entity Matching in OCRed Documents with Redundant Databases. In Proceedings of the International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-758-076-5, pages 165-172. DOI: 10.5220/0005177301650172


in Bibtex Style

@conference{icpram15,
author={Nihel Kooli and Abdel Belaïd},
title={Entity Matching in OCRed Documents with Redundant Databases},
booktitle={Proceedings of the International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2015},
pages={165-172},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005177301650172},
isbn={978-989-758-076-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - Entity Matching in OCRed Documents with Redundant Databases
SN - 978-989-758-076-5
AU - Kooli N.
AU - Belaïd A.
PY - 2015
SP - 165
EP - 172
DO - 10.5220/0005177301650172