FUSION: Feature-based Processing of Heterogeneous Documents for Automated Information Extraction

Michael Sildatke, Hendrik Karwanni, Bodo Kraft, Albert Zündorf

2022

Abstract

Information Extraction (IE) processes are often business-critical but very hard to automate because the underlying data is heterogeneous. Specific document characteristics, also called features, influence the optimal way of processing. The Architecture for Automated Generation of Distributed Information Extraction Pipelines (ARTIFACT) supports businesses in successively automating their IE processes by finding optimal IE pipelines. However, ARTIFACT treats every document the same way and does not enable document-specific processing. Single solution strategies can perform extraordinarily well for documents with particular traits. Although manual approvals are superfluous for these documents, ARTIFACT provides no opportunity for Fully Automatic Processing (FAP). Therefore, we introduce an enhanced pattern that integrates an extensible and domain-independent concept of feature detection based on microservices. This yields two fundamental benefits. First, document-specific processing increases the quality of automatically generated IE pipelines. Second, the system enables FAP, which eliminates superfluous approval efforts.
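The idea of feature-based, document-specific processing can be illustrated with a minimal sketch. It is not the FUSION implementation: every name below (`Pipeline`, `detect_features`, `route`, the example detectors, the quality scores, and the 0.99 threshold) is a hypothetical assumption made for illustration, and plain in-process functions stand in for the feature-detection microservices described in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Tuple

# All names and numbers here are illustrative assumptions, not taken from the paper.
Detector = Callable[[str], bool]


@dataclass(frozen=True)
class Pipeline:
    name: str
    expected_quality: float  # assumed to be measured offline for matching documents


def detect_features(doc: str, detectors: Dict[str, Detector]) -> FrozenSet[str]:
    """Return the names of all features whose detector fires for this document."""
    return frozenset(name for name, detect in detectors.items() if detect(doc))


def route(
    doc: str,
    detectors: Dict[str, Detector],
    registry: Dict[FrozenSet[str], Pipeline],
    default: Pipeline,
    fap_threshold: float = 0.99,
) -> Tuple[Pipeline, bool]:
    """Pick the best applicable pipeline and decide whether approval can be skipped."""
    features = detect_features(doc, detectors)
    # A specialised pipeline applies only if all features it requires are present.
    candidates = [default] + [
        pipeline for required, pipeline in registry.items() if required <= features
    ]
    best = max(candidates, key=lambda p: p.expected_quality)
    fully_automatic = best.expected_quality >= fap_threshold
    return best, fully_automatic


# Toy detectors standing in for feature-detection services.
detectors: Dict[str, Detector] = {
    "has_table": lambda d: "|" in d,
    "is_german": lambda d: " und " in d,
}
registry = {
    frozenset({"has_table"}): Pipeline("table_extractor", 0.995),
    frozenset({"is_german"}): Pipeline("german_ner", 0.93),
}
default = Pipeline("generic", 0.90)

for text in ["pos | price", "Strom und Gas", "plain text"]:
    pipeline, fap = route(text, detectors, registry, default)
    print(text, "->", pipeline.name, "| fully automatic:", fap)
```

In this sketch a document without any detected feature falls back to the generic pipeline and still requires manual approval, while a document matching a highly reliable specialised pipeline is processed without approval.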

Paper Citation


in Harvard Style

Sildatke M., Karwanni H., Kraft B. and Zündorf A. (2022). FUSION: Feature-based Processing of Heterogeneous Documents for Automated Information Extraction. In Proceedings of the 17th International Conference on Software Technologies - Volume 1: ICSOFT, ISBN 978-989-758-588-3, pages 250-260. DOI: 10.5220/0011351100003266


in Bibtex Style

@conference{icsoft22,
author={Michael Sildatke and Hendrik Karwanni and Bodo Kraft and Albert Zündorf},
title={FUSION: Feature-based Processing of Heterogeneous Documents for Automated Information Extraction},
booktitle={Proceedings of the 17th International Conference on Software Technologies - Volume 1: ICSOFT},
year={2022},
pages={250-260},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0011351100003266},
isbn={978-989-758-588-3},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 17th International Conference on Software Technologies - Volume 1: ICSOFT
TI - FUSION: Feature-based Processing of Heterogeneous Documents for Automated Information Extraction
SN - 978-989-758-588-3
AU - Sildatke M.
AU - Karwanni H.
AU - Kraft B.
AU - Zündorf A.
PY - 2022
SP - 250
EP - 260
DO - 10.5220/0011351100003266