Contract Metadata Identification in Czech Scanned Documents

Hien Ha, Aleš Horák, Minh Bui

Abstract

Although born-digital documents are now prevalent, business documents are still often exchanged as scanned images, a human-readable format with one-to-one correspondence to the paper originals. Bulk processing of such scanned documents therefore requires human intervention to extract and enter the main document metadata. In this paper, we present the design and evaluation of a contract processing module in the OCRMiner system. The information extraction process combines layout properties with text analysis as input to rule-based extraction with confidence score propagation. The first results, evaluated on public Czech contract documents, reach an item extraction accuracy of almost 88%.


Paper Citation


in Harvard Style

Ha H., Horák A. and Bui M. (2021). Contract Metadata Identification in Czech Scanned Documents. In Proceedings of the 13th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART, ISBN 978-989-758-484-8, pages 795-802. DOI: 10.5220/0010243807950802


in Bibtex Style

@conference{icaart21,
author={Hien Ha and Aleš Horák and Minh Bui},
title={Contract Metadata Identification in Czech Scanned Documents},
booktitle={Proceedings of the 13th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART},
year={2021},
pages={795-802},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0010243807950802},
isbn={978-989-758-484-8},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 13th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART
TI - Contract Metadata Identification in Czech Scanned Documents
SN - 978-989-758-484-8
AU - Ha H.
AU - Horák A.
AU - Bui M.
PY - 2021
SP - 795
EP - 802
DO - 10.5220/0010243807950802