ARTIFACT: Architecture for Automated Generation of Distributed Information Extraction Pipelines

Michael Sildatke, Hendrik Karwanni, Bodo Kraft, Albert Zündorf

2022

Abstract

Companies often have to extract information from PDF documents by hand since these documents are only human-readable. To gain business value, companies attempt to automate these processes by using the newest technologies from research. In the field of table analysis, for example, several hundred approaches were introduced in 2019. The formats of those PDF documents vary enormously and may change over time. As a result, different and highly adjustable extraction strategies are necessary to process the documents automatically, even though specific steps recur across strategies. Thus, we provide an architectural pattern that ensures the modularization of strategies through microservices composed into pipelines. Crucial factors for success are identifying the most suitable pipeline and the reliability of its results. Therefore, the automated quality determination of pipelines creates two fundamental benefits. First, the provided system automatically identifies the best strategy for each input document at runtime. Second, the provided system automatically integrates new microservices into pipelines as soon as they increase overall quality. Hence, the pattern enables fast prototyping of the newest approaches from research while ensuring that they achieve the required quality to gain business value.
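The idea of composing recurring steps into pipelines and choosing among them by measured quality can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: plain Python functions stand in for microservices, quality is measured as exact-match accuracy on a small set of labeled documents, and the names `Pipeline`, `quality`, `best_pipeline`, and the toy extraction steps are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

# Assumption: a "step" stands in for one microservice. It receives the
# current document state and returns an updated copy.
Step = Callable[[dict], dict]


@dataclass(frozen=True)
class Pipeline:
    name: str
    steps: Tuple[Step, ...]

    def run(self, document: dict) -> dict:
        state = dict(document)
        for step in self.steps:
            state = step(state)
        return state


def quality(pipeline: Pipeline, labeled_docs: Sequence[Tuple[dict, str]]) -> float:
    """Fraction of labeled documents whose extracted value matches the label."""
    if not labeled_docs:
        return 0.0
    hits = sum(1 for doc, expected in labeled_docs
               if pipeline.run(doc).get("value") == expected)
    return hits / len(labeled_docs)


def best_pipeline(pipelines: Sequence[Pipeline],
                  labeled_docs: Sequence[Tuple[dict, str]],
                  threshold: float) -> Optional[Pipeline]:
    """Return the highest-scoring pipeline, or None if none reaches the threshold."""
    scored = [(quality(p, labeled_docs), p) for p in pipelines]
    if not scored:
        return None
    score, best = max(scored, key=lambda pair: pair[0])
    return best if score >= threshold else None


# Hypothetical toy steps; real strategies would call extraction services.
def lowercase(state: dict) -> dict:
    return {**state, "text": state["text"].lower()}


def extract_after_colon(state: dict) -> dict:
    text = state["text"]
    value = text.split(":", 1)[1].strip() if ":" in text else None
    return {**state, "value": value}


def extract_last_word(state: dict) -> dict:
    words = state["text"].split()
    return {**state, "value": words[-1] if words else None}


after_colon = Pipeline("after-colon", (lowercase, extract_after_colon))
last_word = Pipeline("last-word", (lowercase, extract_last_word))

# Made-up labeled examples used only to score the two toy pipelines.
LABELED: List[Tuple[dict, str]] = [
    ({"text": "Total: 42"}, "42"),
    ({"text": "Amount: 17 EUR"}, "17 eur"),
]

chosen = best_pipeline([after_colon, last_word], LABELED, threshold=0.9)
```

In this sketch, a newly added step would be adopted simply by building a further `Pipeline` that contains it and re-running `best_pipeline`; the new candidate is chosen only if it scores higher than the existing ones and still reaches the required threshold.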


Paper Citation


in Harvard Style

Sildatke M., Karwanni H., Kraft B. and Zündorf A. (2022). ARTIFACT: Architecture for Automated Generation of Distributed Information Extraction Pipelines. In Proceedings of the 24th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-758-569-2, pages 17-28. DOI: 10.5220/0010987000003179


in Bibtex Style

@conference{iceis22,
author={Michael Sildatke and Hendrik Karwanni and Bodo Kraft and Albert Zündorf},
title={ARTIFACT: Architecture for Automated Generation of Distributed Information Extraction Pipelines},
booktitle={Proceedings of the 24th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2022},
pages={17-28},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0010987000003179},
isbn={978-989-758-569-2},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 24th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - ARTIFACT: Architecture for Automated Generation of Distributed Information Extraction Pipelines
SN - 978-989-758-569-2
AU - Sildatke M.
AU - Karwanni H.
AU - Kraft B.
AU - Zündorf A.
PY - 2022
SP - 17
EP - 28
DO - 10.5220/0010987000003179