FUSION: Feature-based Processing of Heterogeneous Documents for
Automated Information Extraction

Michael Sildatke¹, Hendrik Karwanni¹, Bodo Kraft¹ and Albert Zündorf²
¹FH Aachen, University of Applied Sciences, Germany
²University of Kassel, Germany
Keywords:
Architectural Design, Refactoring and Patterns, Model-driven Software Engineering, Process Modeling,
Quality Management, Software and Systems Modeling, Enterprise Information Systems, Information
Extraction, Document Classification, Feature Detection, Software Metrics and Measurement.
Abstract:
Information Extraction (IE) processes are often business-critical, but very hard to automate due to a heterogeneous data basis. Specific document characteristics, also called features, influence the optimal way of processing. The Architecture for Automated Generation of Distributed Information Extraction Pipelines (ARTIFACT) supports businesses in successively automating their IE processes by finding optimal IE pipelines. However, ARTIFACT treats each document the same way and does not enable document-specific processing. Single solution strategies can perform extraordinarily well for documents with particular traits. While manual approvals are superfluous for these documents, ARTIFACT does not provide the opportunity for Fully Automatic Processing (FAP). Therefore, we introduce an enhanced pattern that integrates an extensible and domain-independent concept of feature detection based on microservices. This creates two fundamental benefits. First, document-specific processing increases the quality of automatically generated IE pipelines. Second, the system enables FAP to eliminate superfluous approval efforts.
1 INTRODUCTION
In many businesses, heterogeneous and human-
readable documents are the input of underlying Infor-
mation Extraction (IE) processes. Due to an unstruc-
tured data basis, classic Extract, Transform, Load
(ETL) technologies reach their limits. Environments
may change rapidly so that the number of different
formats increases over time. Flexible architectures are
the basis for businesses to react to those changes at an
early stage.
The IE processes include different tasks, like doc-
ument conversion, element decomposition and infor-
mation extraction. Single software components can
solve specific tasks, while their composition results in
document processing pipelines. Architecture for Au-
tomated Generation of Distributed Information Ex-
traction Pipelines (ARTIFACT) provides an architec-
ture that enables the separation of concerns and the
automated generation of IE pipelines. Due to auto-
matic quality determination mechanisms, the archi-
tecture finds the global best pipelines for extracting
specific information on its own.
But in practice, specific document characteristics,
also called features, can limit the set of possible so-
lution strategies. If a document does not contain ta-
bles, e.g., any table-based extraction strategies will
not be suitable for that document. In this case, the
global best pipeline does not need to be the optimal
document-specific pipeline. Therefore, opportunity
costs may arise if the system does not consider these
document features during automated pipeline gener-
ation. Furthermore, pipelines can reach excellent re-
sults for documents with specific characteristics. In
these cases, manual checks at the end of the auto-
matic process are superfluous, and Fully Automatic
Processing (FAP) is possible.
This paper extends the ARTIFACT pattern to en-
able the system to consider critical features during
the automated pipeline generation. This way, we en-
sure that the provided system always generates the
best document-specific pipeline. Furthermore, we
empower the provided system to FAP of documents.
Hence, our extension increases business value and
eliminates superfluous efforts for manual approvals.
The paper is structured as follows: Section 2
describes the project that motivates our approach.
Section 3 describes related works, while Section 4
presents the key concepts of ARTIFACT as the ar-
chitectural basis. Section 5 introduces Feature Based
Processing of Heterogeneous Documents for Auto-
mated Information Extraction (FUSION). Section 6
describes the experimental evaluation of FUSION in
the real-world project, while Section 7 summarizes
the paper and gives an overview of future work.
2 MOTIVATION
The introduced pattern is domain-independent and applicable to any domain lacking document standards. The German energy industry serves as an example to motivate the extension of ARTIFACT. Due to its microservice-based architecture and generic concepts, the approach supports scope extension and adaptability to other domains.
Service providers collect product information from about 3,150 energy suppliers with 15,000 different electricity or gas products. Since there is no standard, the published documents containing information about these products have various formats. The document basis, therefore, is heterogeneous, and service providers mostly have to extract this information, e.g., product prices, by hand.
ARTIFACT handles every document equally and
does not consider any document-specific characteris-
tics. But in practice, these characteristics, also called
features, are metadata that can influence the optimal
way of processing. While these features are mostly ir-
relevant for human extractors, they are critical for au-
tomated IE. Depending on these features, the process-
ing system should choose different solution strate-
gies. Thus, knowledge about these features and their
consideration during pipeline generation can strongly
support the success of automated IE.
Documents can have technical (domain-
independent), and content-based (domain-specific)
features. The following list contains typical technical
features of documents:
Presence or absence of price tables. If price ta-
bles are present, pipelines will have to integrate
specific table decomposition components. Also,
they need particular extractors that have table el-
ements as input. In case of the absence of tables,
the pipeline should integrate any non-table-based
extractors.
Scanned or screenshot documents. If a document is a scan or a screenshot, specific pre-processing steps will be necessary, e.g., content alignment or OCR.
Page segmentation. If the document has differ-
ent columns organizing the content, e.g., if the
document is a booklet, pipelines will have to inte-
grate specific segmentation decomposers to keep
the correct content ordering.
There are also different content-based features. In
the case of the underlying research project originating
from the German energy industry, the following list
contains an exemplary set of typical ones:
The number of listed products. A document can
list one or more products. If there is only one
product, it is clear where the presented informa-
tion belongs. If there are several products, infor-
mation has to be related to the correct product.
Price representations. Prices can have differ-
ent representations. While gross and net prices
are common in every domain, there can be more
domain-specific representations, e.g., net exclud-
ing transportation. Automated extractors have to
understand that all representations belong to the
same price.
Consumption-based prices. Products can have consumption-based prices. If customers consume 2,500 kWh per year, e.g., they will have to pay 25.45 ct/kWh. If they consume more than 2,500 kWh per year, they will have to pay 26.34 ct/kWh. Hence, extractors have to extract additional information besides the actual prices.
Time-variable prices. Products can have time-variable prices. The price for a consumed kWh can be, e.g., 26.34 cents between 06:00 and 22:00, and 25.54 cents between 22:00 and 06:00. Automated extractors have to recognize these time windows to extract prices correctly.
Presence or absence of previous prices. Some
suppliers provide previous prices belonging to a
product to have a price comparison. These prices
are irrelevant, and automated extractors have to
ignore them during extraction.
Presence or absence of regional limits. Some
suppliers limit their products to specific regions.
The presence of this information may require ad-
ditional extraction steps.
Based on related features, there are certain document
groups. All members of a particular group have simi-
lar characteristics (cf. Figure 1).
Figure 1: Feature-Based Document Group Examples (group A: has_tables: true; group B: has_tables: false).
Related document features can influence the optimal way of processing. A group A document, e.g., requires table-based processing strategies, while a group B document requires text-based ones.
Furthermore, specific document groups may have
the potential of FAP because particular strategies may
perform extraordinarily well on them. If a document
only lists one product in a simple price representation,
e.g., less complex processing strategies may always
achieve correct results.
Hence, the automatic detection of document features can enable document-specific processing and support automated IE from very heterogeneous data. Additionally, the knowledge about certain features can enable FAP.
3 RELATED WORK
The challenges mentioned are primarily related to the
fields of Business Process Automation (BPA), docu-
ment classification, IE, as well as agile development.
Business Process Management (BPM) is a dis-
cipline that combines knowledge from information
technology and management science to organize busi-
ness processes efficiently (van der Aalst, 2004).
Workflow Management (WFM) systems (Jablonski
and Bussler, 1996) or Process-Aware Information
Systems (PAISs) (Dumas et al., 2005), e.g., focus on
the best operational support for BPM.
Modern techniques like Robotic Process Automa-
tion (RPA) support the automation of business pro-
cesses. RPA replaces human interactions with robots
by automating manual inputs into the Graphical User
Interface (GUI) of an enterprise system (Ivančić et al.,
2019). In RPA systems, robots learn rule-based be-
haviors by observing users interacting with affected
systems (Aguirre and Rodriguez, 2017).
However, the underlying problem deals with an
infinite set of possible document formats, so rule-
based solution strategies are insufficient.
The developed system needs document classification capabilities to enable document-specific processes. Document classification has a long history, and the first findings originate from the 1960s (e.g., Borko and Bernick, 1963). In the area of text classification, it deals with the automated assignment of categories to a text document based on specific terms, also called features (Sebastiani, 2002).
Different algorithms solve feature-based text clas-
sification, e.g., k-nearest-neighbor (Jiang et al., 2012)
or support vector machines (Lilleberg et al., 2015).
However, these approaches and algorithms deal with
problems in which the detection of features is less rel-
evant and the classification itself is the focus.
Furthermore, the presented system must detect
features from non-machine-readable PDF documents
to solve the underlying problem. Also, the detection
of (non-)text-based features is relevant for determin-
ing critical characteristics of a document, e.g., the
presence or absence of tables. Therefore, the intro-
duced approach deals with problems in which the fea-
ture detection itself is the most relevant task.
IE deals with extracting structured information
from unstructured text (Cardie, 1997) and ETL pro-
cesses focus on the automation of extraction pipelines
(Albrecht and Naumann, 2008). Classic ETL technologies are especially suited to structured data (Skoutas and Simitsis, 2006). (Schmidts et al., 2019),
e.g., introduce an approach to process semi-structured
tables in a well-defined format. Since the underlying
problems deal with the processing of non-machine-
readable PDF documents, such approaches are not
sufficient. Furthermore, Ontology-Based Information
Extraction (OBIE) systems focus on the IE from un-
structured text (Saggion et al., 2007), but are not de-
signed for the IE from PDF documents (Wimalasuriya
and Dou, 2010).
The combination of document classification and
IE can support implementing more efficient automa-
tion processes (Liu and Ng, 1996). Characteristics of
features may change over time, or new features can become relevant for classification tasks. Furthermore, on-
going research leads to many upcoming approaches
for classification or IE tasks. In table analysis, e.g.,
several hundred approaches arose in 2019 (Hashmi
et al., 2021). Agile development approaches like Ex-
treme Programming or Scrum allow fast prototyping
of new approaches and form the basis for flexible de-
velopment processes (Beck, 2003; Rubin, 2012). Ad-
ditionally, these approaches can ensure short feedback
loops and form the basis for continuously measuring
business-relevant quality metrics (Fowler, 2013).
Emergent architectures can enable evolution by adding code to ensure adaptation to new situations (Hanson and Sussman, 2021). Also, concepts like
Domain-driven Design (DDD) require modular soft-
ware (Tarr et al., 1999; Evans and Evans, 2004).
(Newman, 2015) points out that microservices and
Microservice Architecture (MSA) can form a basis
for implementing modular and flexible software.
Derived from the aforementioned findings, the in-
troduced approach uses ARTIFACT as its basis. AR-
TIFACT provides a microservice-based architecture
that integrates business metrics and the flexible de-
sign of IE pipelines (Sildatke et al., 2022). Section 4
describes its concepts and the potential for extension
resulting from the underlying problem.
Figure 2: Overview of the ARTIFACT Pattern.
4 ARTIFACT PATTERN
Figure 2 shows an overview of the core building
blocks of the ARTIFACT pattern.
In ARTIFACT, a pipeline consists of converters,
decomposers and extractors, each representing a mi-
croservice. An extractor is always the last component
of a pipeline and its result is specific information. De-
velopers can develop new components and push them
into the component registry, which manages all avail-
able components.
The pipeline builder can build all possible
pipelines automatically with available components. It
also tests all potential pipelines against a set of gold
documents to determine the quality of each potential
pipeline. The component registry can extract spe-
cific information in the production phase with the best
available pipeline based on the determined qualities.
Every time the component registry has an update, it
will initiate the quality determination, e.g., triggered
by the push of a new component.
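The following Python sketch illustrates this generate-and-test loop, assuming components are plain callables; the names are illustrative assumptions, not part of the ARTIFACT API.

from itertools import product

# Minimal sketch of the generate-and-test loop (names are assumptions):
# every converter/decomposer/extractor combination is a candidate pipeline.
def build_pipelines(converters, decomposers, extractors):
    return list(product(converters, decomposers, extractors))

def run_pipeline(pipeline, document):
    converter, decomposer, extractor = pipeline
    return extractor(decomposer(converter(document)))

def pipeline_quality(pipeline, gold_documents):
    # share of gold documents for which the pipeline yields the expected value
    correct = sum(1 for doc in gold_documents
                  if run_pipeline(pipeline, doc["pdf"]) == doc["expected"])
    return correct / len(gold_documents)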
Since the test data has to represent real-world con-
ditions, there may be constellations in which a spe-
cific document type dominates, e.g., documents con-
taining tables. In this case, the determination of the
global-best pipeline tends to select a suitable one for
the dominating document type (cf. Figure 3).
Figure 3: Biased Determination of Global Best Pipelines.
Due to this, executing the global best pipeline may not always be the optimal document-specific solution. Suitable document-specific pipelines may exist, while the system will not take them into account following the concept of ARTIFACT (cf. Figure 4).
Figure 4: Document-Specific Best Pipeline Not Picked.
In ARTIFACT, a higher-level BPMN process con-
trols the overall IE process. The last task of this pro-
cess is a manual approval step that makes human in-
teraction mandatory. Due to this process configura-
tion, the implementation of FAP without any manual
effort is not possible. To address these challenges, the
provided approach extends the ARTIFACT pattern. It
empowers the system to generate document-specific
pipelines and creates the ability for FAP.
5 FUSION PATTERN
The following section introduces the FUSION pattern
as an extension of ARTIFACT, while Figure 5 shows
the building blocks. FUSION enables automated
feature detection for input documents by providing
the feature detector registry microservice. The label
extender and feature matcher microservices ensure
the automated generation of best document-specific
pipelines. Based on the ARTIFACT core, developers
can continuously extend the system.
Hence, it combines the concepts of document-
specific processing and feature-based quality determi-
nation. The following subsections introduce the core
concepts of FUSION.
Figure 5: Core Components of FUSION.

5.1 Label Extension

Since document features are relevant for generating the optimal document-specific pipeline, they have to be part of gold documents. To take features into account when determining pipeline qualities, we manually extend the base gold standard data as shown in Figure 6.
Figure 6: Feature Labels in Gold-Standard.
To find appropriate document groups, the system
requires information about suitable pipelines for each
gold standard document. To achieve this, FUSION
adapts the pipeline quality determination mechanism
of ARTIFACT to automatically extend the set of gold
data with required pipeline labels (cf. Figure 7).
Figure 7: Result of the Label Extender.
To extend the set of gold data, the pipeline builder
generates all possible pipelines and sends them to the
label extender. Afterwards, the label extender tests
each pipeline against each gold document. As a re-
sult, it automatically creates a list of extended gold
documents, while each extended gold document now
stores a list of suitable pipelines.
5.2 Feature Matching
The overall goal of the feature matcher is to find the
best feature-based pipelines. In other words, it deter-
mines which pipeline the system will have to choose
in production if an unknown input document has a
specific set of features. Therefore, the feature matcher
matches relevant features from extended gold docu-
ments to an ordered set of optimal pipelines (cf. Fig-
ure 8).
Figure 8: Result of the Feature Matcher.
For adaptability, developers can implement any suitable matching strategy, depending on the specific problem. In the introduced project, the feature matcher implements a rule-based strategy that builds all possible and unique combinations of boolean features.
If it knows the relevant features, the system will
be able to find the optimal pipeline for an unknown
input document using feature matching.
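To illustrate both steps, the following sketch enumerates the boolean feature combinations and looks up the best pipeline for a concrete feature set; generate_feature_combinations mirrors the helper used in Code Listing 3, while best_pipeline_for and the attribute names are our assumptions.

from itertools import product

def generate_feature_combinations(feature_names):
    # all unique true/false assignments of the boolean features
    for values in product([True, False], repeat=len(feature_names)):
        yield dict(zip(feature_names, values))

def best_pipeline_for(features, feature_based_qualities):
    # pick the highest-quality pipeline matching the exact feature set
    candidates = [q for q in feature_based_qualities if q.features == features]
    return max(candidates, key=lambda q: q.quality, default=None)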
5.3 Feature Detection
FUSION introduces automated feature detection to
determine the relevant features of an input document.
To achieve this, developers can implement differ-
ent feature detectors that detect the value of a specific
feature for any input document (cf. Figure 9).
Figure 9: In- and Output of a Feature Detector.
A feature detector is a microservice that consumes
a specific type of document and detects the value of
a particular feature, e.g., the presence or absence of
tables in a PDF document.
FUSION also introduces a feature detector reg-
istry, where developers can deploy feature detectors.
The feature detector registry manages all available
feature detectors and ensures that a registering detec-
tor meets all quality requirements. For this, the reg-
istry requests the feature detector quality determiner
(cf. Figure 10).
Figure 10: Registration to the Feature Detector Registry.
Domain experts or managers can define the neces-
sary quality criteria, as shown in Table 1.
Table 1: Quality Criteria for Relevant Features.
Information Required Quality
GLOBAL 85%
Specific Criteria for Feature 1 90%
... ...
Specific Criteria for Feature n 75%
A GLOBAL setting defines the required quality for all features; experts can overwrite it with specific criteria for particular features if necessary.
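A minimal sketch of this override rule, with the values from Table 1; the dictionary form is an assumption for illustration:

QUALITY_CRITERIA = {"GLOBAL": 0.85, "feature_1": 0.90, "feature_n": 0.75}

def required_quality(feature):
    # a feature-specific entry overwrites the GLOBAL default
    return QUALITY_CRITERIA.get(feature, QUALITY_CRITERIA["GLOBAL"])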
To determine the quality of a specific feature de-
tector, the feature detector quality determiner tests
this detector against each gold document (cf. Fig-
ure 11).
Figure 11: Quality Determination of Feature Detectors.
Afterwards, the determiner responds with the specific detector quality so that the registry can check if the detector meets the required criteria. If it reaches the requirements, the detector will be registered and becomes available for specific feature detection. Finally, the feature detector registry informs the feature matcher about the available features. The feature matcher only considers features that a corresponding detector can detect reliably.
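This reliability filter can be sketched as follows, reusing the required_quality helper sketched above; the data shapes are assumptions:

def reliable_features(measured_qualities, required_quality):
    # measured_qualities: {feature_name: determined detector quality};
    # only features with a sufficiently reliable detector are forwarded
    return [feature for feature, quality in measured_qualities.items()
            if quality >= required_quality(feature)]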
This approach enables the rapid creation of new
feature detectors or the improvement of existing ones.
5.4 Feature-based Document Processing
To be able to use document-specific pipelines, FU-
SION introduces the concept of feature-based doc-
ument processing. As a basis for that, the feature
matcher sends the information about feature-based
pipeline qualities to the component registry (pre-
sented with ARTIFACT) after each matching, trig-
gered by the registration of new feature detectors.
The document processor requests the feature de-
tector registry to generate a feature-enriched docu-
ment to achieve feature-based processing (cf. Fig-
ure 12).
Figure 12: Generation of a Feature-Enriched Document.
The component registry can now find the best
feature-based pipeline for a processed document (cf.
Figure 13).
Figure 13: Feature-Based Document Processing.
The component registry matches the document features of an incoming document against the list of feature-based pipeline qualities and executes the best-performing pipeline.
5.5 Result Routing
To realize FAP, FUSION introduces result routing.
The basis for result routing is a configuration with de-
fined criteria for FAP (cf. Table 2).
Table 2: Document Processor Configuration.
Information FAP-Limit
GLOBAL 95%
Specific Criteria for Information 1 90%
... ...
Specific Criteria for Information 2 100%
Experts can define the GLOBAL criteria to treat all information the same way. If they need to adjust the limit for specific information, e.g., because it is more or less business-critical, they can overwrite the GLOBAL configuration. If the extraction of all information meets the required FAP criteria, the system will route the document to FAP. Figure 14 shows an overview of the FUSION architecture, including the concept of result routing.
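A minimal sketch of this routing decision, with the FAP-limits from Table 2; the dictionary form and names are assumptions:

FAP_LIMITS = {"GLOBAL": 0.95, "information_1": 0.90, "information_2": 1.00}

def route(extracted_qualities):
    # extracted_qualities: {information_name: quality of the used pipeline};
    # FAP only if every information meets its (possibly overwritten) limit
    fap = all(quality >= FAP_LIMITS.get(name, FAP_LIMITS["GLOBAL"])
              for name, quality in extracted_qualities.items())
    return "FULLY_AUTOMATIC_PROCESSING" if fap else "MANUAL_APPROVAL"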
6 EXPERIMENTAL EVALUATION
In this section, we demonstrate the practical applica-
tion of our FUSION pattern in the project motivated
in Section 2. However, the approach is extensible and highly adjustable. Therefore, it allows flexible scope extension and adaptation to other domains. Since the introduced approach focuses on the automated detection of document features, we do not describe the developed IE components further.
6.1 Features & Configuration
The data basis is very heterogeneous, and we identi-
fied an initial set of several key document features that
characterize an input document. According to the de-
fined features, we developed several feature detectors.
Using the FUSION architecture, the system automati-
cally tests each feature detector against the set of gold
documents. Table 3 shows the quality of the devel-
oped detectors.
Table 3: Achieved Feature Detector Qualities.
Feature Detector Quality
HAS_TABLES 86%
IS_SCREENSHOT 56%
IS_COLUMN_SEPARATED 45%
HAS_EXACTLY_ONE_PRODUCT 86%
HAS_GROSS_PRICES 91%
HAS_NET_PRICES 91%
HAS_OTHER_PRICE_REPRESENTATIONS 87%
HAS_STAGGERED_PRODUCTS 92%
HAS_TIME_VARIABLE_PRODUCTS 92%
HAS_REGIONAL_CONDITIONS 45%
We reached the required quality for seven features during the evaluation. Therefore, the system considers seven relevant features when generating feature-based pipelines.
We defined a global FAP-limit of 95% for all information. The document processor, therefore, will route each input document that reaches this limit to FAP.
6.2 Achieved Results
Following the simple rule-based generation of all possible feature combinations with seven features leads to 2^7 = 128 feature combinations. Due to specific domain knowledge about mutually exclusive characteristics, we know that the number of feature combinations is much smaller in reality. Hence, the rule-based feature matcher determines a set of 35 unique feature combinations in 1,300 gold documents, while each feature combination has at least 30 related documents.
Table 4 shows the comparison of IE results using
the global-best pipeline (ARTIFACT) and the feature-
based pipeline (FUSION). The comparison indicates
that feature-based pipelines increase quality without
adding new extraction-specific components.
Table 4: Comparison of IE Results.
Information ARTIFACT FUSION ± %-Pts.
DateOfValidity 91% 91% + 0
BasicPrices 59% 77% + 18
CommodityPrices 54% 74% + 20
CustomerGroups 55% 84% + 29
MeteringPrices 34% 66% + 32
ProductType 55% 82% + 27
ProductCategory 55% 86% + 31
The project is still at an early stage, so improve-
ments in converters, decomposers and extractors will
also improve the results in the future.
However, the system detected six feature combinations for which specific pipelines perform extraordinarily well for all requested information. Since the extractions fulfill all FAP criteria in the test, the system will route input documents to FAP if they have one of the six feature combinations. 174 documents belong to one of the six feature combinations, so we identified a potential for FAP of about 13% in gold data.
During the productive use of the FUSION pattern, the system processed 524 unknown documents and automatically related 76 of these 524 to one of the six feature combinations fulfilling the criteria for FAP. The documents were routed to FAP, while an audit showed that all results were correct. Hence, we achieve a rate of about 14% FAP in practice.
6.3 Implementation
In the following subsection, we present an exemplary
implementation of our FUSION architecture pattern.
Due to the extensibility and flexibility of the AR-
TIFACT reference implementation, we decided to
build on this implementation. Therefore, we focus on
Figure 14: Overview of the FUSION Architecture.
the new and modified services and only briefly men-
tion unchanged services from ARTIFACT.
Our current implementation of FUSION utilizes microservices encapsulated in Docker (https://www.docker.com/) images. We again realize the communication between microservices via Representational State Transfer (REST) calls and heavily rely on OpenAPI (https://swagger.io/specification/) to generate server stubs and client SDKs.
6.3.1 Feature Detectors
As stated in Subsection 5.3, feature detectors are used
to detect a specific feature inside a document. Analo-
gous to components in ARTIFACT, we implemented
every feature detector as a microservice. Due to this,
developers can choose the most suitable language and
framework for a specific detection problem.
Each feature detector microservice offers two end-
points. The first endpoint provides information about
the detector, e.g., its name and what type of feature
it can detect. The second endpoint performs the de-
tection, receiving a document and returning a boolean
flag that indicates whether the document contains the
feature.
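A minimal FastAPI sketch of such a detector; the route paths, payloads and the placeholder heuristic are assumptions for illustration, not the project's actual interface:

from fastapi import FastAPI, File, UploadFile

app = FastAPI()

def contains_tables(pdf_bytes):
    # placeholder heuristic; a real detector would analyze the PDF layout
    return False

@app.get("/info")
def info():
    # first endpoint: detector metadata (name and detected feature type)
    return {"name": "table-detector", "detects": "HAS_TABLES"}

@app.post("/detect")
async def detect(document: UploadFile = File(...)):
    # second endpoint: receives a document, returns the boolean feature flag
    return {"value": contains_tables(await document.read())}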
6.3.2 Feature Detector Quality Determiner
As also mentioned in Subsection 5.3, we need to assess the quality of each feature detector and, therefore, implemented a FastAPI (https://fastapi.tiangolo.com/) web service that tests a detector against the set of gold documents by detecting the respective feature in each gold document. The determiner calculates the resulting quality by dividing the number of correct results by the number of gold documents.
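This computation can be sketched in a few lines; the gold document shape is an assumption:

def detector_quality(detect, feature_name, gold_documents):
    # share of gold documents on which the detector reproduces the label
    correct = sum(1 for doc in gold_documents
                  if detect(doc.pdf) == doc.features[feature_name])
    return correct / len(gold_documents)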
6.3.3 Feature Detector Registry
As stated in Subsection 5.3, we implement a mi-
croservice that manages the feature detectors. This
microservice utilizes a Java 17 stack with Maven and
Spring Boot (https://spring.io/). After we provide the required feature
quality criteria mentioned in Subsection 6.1, we can
start registering detectors at the feature detector reg-
istry. The registration process is mostly analogous to
that of the component registry of ARTIFACT.
Code Listing 1 shows the registration of a fea-
ture detector. When a feature detector tries to regis-
ter at the feature detector registry, the registry queries
the information endpoint of the detector to determine
its feature type. After that, the registry queries the
feature detector quality determiner for the detector’s
quality. If the received quality is high enough, the
registry informs the feature matcher by forwarding all
relevant features. In contrast to ARTIFACT, the reg-
istration of the feature detector does not require the
resulting pipelines, and the registry accepts the fea-
ture detector registration at this point.
void addDetector(InetSocketAddress address) {
    // query the detector's information endpoint for name and feature type
    FeatureDetectorInfo detectorInfo = requestDetectorInfo(address);
    // ask the feature detector quality determiner for the detector's quality
    FeatureDetectorQuality quality =
            requestDetectorQuality(detectorInfo, address);
    if (hasRequiredQuality(quality)) {
        Detector detector = new Detector(address, quality);
        addresses.put(detector.getName(), detector);
        // forward all reliably detectable features to the feature matcher
        List<String> allFeatures = addresses
                .values().stream()
                .map(Detector::getDetects).toList();
        notifyFeatureMatcher(allFeatures);
    }
}
Code Listing 1: Registration of New Detectors.
6.3.4 Component Registry Extension
Compared to ARTIFACT, the component registry re-
quires slight modifications to handle feature-enriched
documents. Additionally, each piece of extracted information now carries the quality of the used pipeline in order to determine if FAP is possible. Finally, as men-
tioned in Subsection 5.4, there is no longer one global
pipeline for all documents. Instead, the system uses
feature-based pipelines depending on the features of
the received document. If the system cannot find an
exact feature match, we use a pipeline whose features
are closest to the features of the document. In this
case, FAP is not possible.
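One plausible notion of "closest" is a Hamming-style distance over the boolean features; this is our assumption for illustration, as the paper leaves the concrete metric as an implementation detail:

def closest_pipeline(doc_features, feature_based_qualities):
    # pick the pipeline whose feature combination differs from the
    # document's features in the fewest positions
    def distance(quality_entry):
        return sum(doc_features[f] != quality_entry.features.get(f)
                   for f in doc_features)
    return min(feature_based_qualities, key=distance)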
6.3.5 Label Extender
The label extender microservice combines the la-
bel extension introduced in Subsection 5.1 and the
pipeline generation into one single microservice. The
label extender is a Python FastAPI web service that
also replaces the pipeline quality determiner intro-
duced in ARTIFACT. After label extension is per-
formed, the service triggers the feature matcher ser-
vice in order to generate document-specific pipelines.
Code Listing 2 illustrates the steps to perform la-
bel extension for all gold documents. For all infor-
mation types, every pipeline is run against each doc-
ument. If the pipeline result matches the expected re-
sult, the pipeline will be considered suitable for the
document.
def extend_labels(self):
    extended_docs = []
    for gold_doc in self.get_gold_documents():
        suitable_pipelines = []
        # run every pipeline for every information type against the document
        for info in DomainModel.information:
            for pipeline in self.get_pipelines_for_information(info):
                result = PipelineExecutor.execute_pipeline(
                    pipeline=pipeline,
                    gold_document=gold_doc
                )
                # a pipeline is suitable if it reproduces the expected result
                if result == gold_doc.get(info):
                    suitable_pipelines.append(pipeline)
        extended_docs.append(
            LabeledGoldDocument(
                gold_document=gold_doc,
                suitable_pipelines=suitable_pipelines
            )
        )
    return extended_docs
Code Listing 2: Label Extension of Gold Documents.
6.3.6 Feature Matcher
As explained in Subsection 5.2, the feature matcher
determines the best document-specific pipelines. As
stated there, the current implementation features a
rule-based matching strategy for pipeline generation.
Code Listing 3 illustrates the steps to create the
feature-based pipelines. For each feature combina-
tion, we select all gold documents that match this
exact feature set. After that, we extract the suitable
pipelines we received from the label extender and cal-
culate each pipeline’s quality. This results in a list of
all possible feature-based pipelines and their qualities.
After feature matching is complete, the service sends
all created feature-based pipelines to the component
registry. This enables the registry to process incom-
ing documents.
def match_features(self, labeled_docs: List[LabeledGoldDocument]):
    feature_based_pipeline_qualities = []
    for feature_combination in self.generate_feature_combinations():
        suitable_pipelines = []
        # select all gold documents that match this exact feature set
        matching_gold_docs = self.match_labeled_documents(
            labeled_docs, feature_combination
        )
        for labeled_doc in matching_gold_docs:
            suitable_pipelines.extend(labeled_doc.suitable_pipelines)
        # quality = share of matching documents a pipeline is suitable for
        for suitable_pipeline in set(suitable_pipelines):
            feature_based_pipeline_qualities.append(
                FeatureBasedPipelineQuality(
                    features=feature_combination,
                    pipeline=suitable_pipeline,
                    quality=suitable_pipelines.count(suitable_pipeline)
                            / len(matching_gold_docs)
                )
            )
    return feature_based_pipeline_qualities
Code Listing 3: Feature-Based Pipeline Generation.
6.3.7 Document Processor
The document processor is a Java Spring Boot web service that provides a Camunda (https://camunda.com/) process engine and extends the process orchestrator introduced in ARTIFACT. As stated in Subsection 5.5, this microservice
manages the overall process and decides if the infor-
mation extraction quality justifies FAP.
When the user uploads a document for informa-
tion extraction, the processor first calls the feature
detector registry to extract the features of the given
document. The resulting feature-enriched document
is then forwarded to the component registry for in-
formation extraction. After extraction, the processor
receives the extracted information together with their
estimated quality. If the quality of the resulting in-
formation is not high enough (see Subsection 6.1),
the extracted information must be checked manually.
Otherwise, the information can be processed automat-
ically.
Code Listing 4 shows the implementation of a Ca-
munda service task that receives a document from the
DOCUMENT variable and returns a document that
additionally contains the detected features in the vari-
able ENRICHED_DOCUMENT. This variable is then
used in subsequent service tasks to query the compo-
nent registry for the document information.
void execute(DelegateExecution exec) {
    // read the uploaded document from the process variable
    FileValue docFile = exec.getVariableTyped("DOCUMENT");
    Document document = toDocument(docFile);
    // let the feature detector registry enrich the document with features
    FeatureEnrichedDocument serverResult = sendDocumentToServer(document);
    exec.setVariable("ENRICHED_DOCUMENT", serverResult);
}
Code Listing 4: Detection Service Task.
In Code Listing 5, the following Camunda service task receives the ENRICHED_DOCUMENT variable and requests the extraction of all information within the document. As described above, the resulting quality decides between manual approval and FAP.
void execute(DelegateExecution exec) {
    ObjectValue docFile = exec.getVariableTyped("ENRICHED_DOCUMENT");
    FeatureEnrichedDocument document =
            docFile.getValue(FeatureEnrichedDocument.class);
    List<InformationWithQuality> result = sendDocumentToServer(document);
    if (result == null || result.isEmpty()) {
        throw new RuntimeException("No results received!");
    }
    // the overall confidence is the weakest extracted information quality
    double overallConfidence = result
        .stream()
        .mapToDouble(InformationWithQuality::confidence)
        .min().orElse(0);
    List<Information> information = result
        .stream()
        .map(InformationWithQuality::information)
        .toList();
    // route to FAP only if all qualities reach the 95% limit
    exec.setVariable("FULLY_AUTOMATED_PROCESSING",
        overallConfidence >= 0.95);
    exec.setVariable("RESULT", information);
}
Code Listing 5: Extraction Service Task.
7 CONCLUSION & FUTURE
WORK
With FUSION, we provide an extension of ARTI-
FACT as an architectural pattern that enables the de-
tection of document characteristics, called features, to
generate document-specific IE pipelines.
ARTIFACT treats each input document the same
and provides the automated generation of global-best
pipelines. It also integrates a task for manual ap-
provals, which is superfluous if an IE pipeline per-
forms extraordinarily well for a specific document.
Therefore, it does not enable FAP of documents.
To achieve document-specific processing and
FAP, we introduced the concepts of feature match-
ing (cf. Subsection 5.2), feature detection (cf. Sub-
section 5.3) and feature-based document processing
(cf. Figure 13) to generate and use feature-based
pipelines.
The experimental evaluation has shown that these
concepts increase the pipeline results in comparison
to the core concepts of ARTIFACT (cf. Table 4).
Hence, the introduced approach improves automated IE without adding new or improving existing pipeline components and, therefore, gains business value.
The evaluation also has shown that a simple rule-
based feature matching strategy is suitable in the case
of the introduced domain and relevant features. Fur-
thermore, the introduced concepts empower the im-
plemented system to find document groups for which
specific IE pipelines perform extraordinarily well.
Due to the automated feature detection of unknown
input documents, the system automatically identifies
the potential for FAP. In the case of the application
in the real-world project, we identified a potential of
14% for FAP. Therefore, the concept of result routing
(cf. Subsection 5.5) reduces the costs for superfluous
manual approval tasks and gains business value.
We would like to apply different machine-
learning-based algorithms for feature matching to find
more suitable document classifications in future work.
Also, we would like to use these algorithms to deter-
mine which features may have more or less impact on
the optimal way of processing. In parallel, we would
like to improve the quality of the detectors that cur-
rently do not reach the required quality criteria to con-
sider the corresponding features.
REFERENCES
Aguirre, S. and Rodriguez, A. (2017). Automation of
a business process using robotic process automation
(RPA): A case study. In Communications in Computer
and Information Science, Communications in com-
puter and information science, pages 65–71. Springer
International Publishing, Cham.
Albrecht, A. and Naumann, F. (2008). Managing ETL pro-
cesses. In NTII.
Beck, K. (2003). Extreme Programming - die revolu-
tionäre Methode für Softwareentwicklung in kleinen
Teams ; [das Manifest]. Pearson Deutschland GmbH,
München.
Borko, H. and Bernick, M. (1963). Automatic document
classification. J. ACM, 10(2):151–162.
Cardie, C. (1997). Empirical Methods in Information Ex-
traction. page 15.
Dumas, M., Van Der Aalst, W. M., and ter Hofstede, A.
H. M. (2005). Process-aware information systems.
John Wiley & Sons, Nashville, TN.
Evans, E. and Evans, E. J. (2004). Domain-driven De-
sign - Tackling Complexity in the Heart of Software.
Addison-Wesley Professional, Boston.
Fowler, M. (2013). An appropriate use of metrics.
Hanson, C. and Sussman, G. J. (2021). Software Design for
Flexibility - How to Avoid Programming Yourself into
a Corner. MIT Press, Cambridge.
Hashmi, K. A., Liwicki, M., Stricker, D., Afzal, M. A.,
Afzal, M. A., and Afzal, M. Z. (2021). Current Sta-
tus and Performance Analysis of Table Recognition in
Document Images with Deep Neural Networks.
Ivančić, L., Suša Vugec, D., and Bosilj Vukšić, V. (2019).
Robotic process automation: Systematic literature re-
view. In Business Process Management: Blockchain
and Central and Eastern Europe Forum, Lecture notes
in business information processing, pages 280–295.
Springer International Publishing, Cham.
Jablonski, S. and Bussler, C. (1996). Workflow Manage-
ment: Modeling Concepts, Architecture, and Imple-
mentation.
Jiang, S., Pang, G., Wu, M., and Kuang, L. (2012). An
improved k-nearest-neighbor algorithm for text cate-
gorization. Expert Syst. Appl., 39(1):1503–1509.
Lilleberg, J., Zhu, Y., and Zhang, Y. (2015). Support
vector machines and word2vec for text classification
with semantic features. In 2015 IEEE 14th Interna-
tional Conference on Cognitive Informatics & Cogni-
tive Computing (ICCI*CC). IEEE.
Liu, Q. and Ng, P. A. (1996). Document classification and
information extraction. In Document Processing and
Retrieval, pages 97–145. Springer US, Boston, MA.
Newman, S. (2015). Building Microservices: Designing
Fine-Grained Systems. O’Reilly Media, first edition
edition.
Rubin, K. S. (2012). Essential Scrum - A Practical Guide
to the Most Popular Agile Process. Addison-Wesley
Professional, Boston, 01. edition.
Saggion, H., Funk, A., Maynard, D., and Bontcheva, K.
(2007). Ontology-Based information extraction for
business intelligence. In The Semantic Web, Lecture
notes in computer science, pages 843–856. Springer
Berlin Heidelberg, Berlin, Heidelberg.
Schmidts, O., Kraft, B., Siebigteroth, I., and Zündorf, A.
(2019). Schema matching with frequent changes on
semi-structured input files: A machine learning ap-
proach on biological product data. In Proceedings of
the 21st International Conference on Enterprise Infor-
mation Systems. SCITEPRESS - Science and Technol-
ogy Publications.
Sebastiani, F. (2002). Machine learning in automated text
categorization. ACM Comput. Surv., 34(1):1–47.
Sildatke, M., Karwanni, H., Kraft, B., and Zündorf,
A. (2022). ARTIFACT: Architecture for Auto-
mated Generation of Distributed Information Extrac-
tion Pipelines. In Proceedings of the 24th Interna-
tional Conference on Enterprise Information Systems
- Volume 2, pages 17–28.
Skoutas, D. and Simitsis, A. (2006). Designing ETL pro-
cesses using semantic web technologies. In Proceed-
ings of the 9th ACM international workshop on Data
warehousing and OLAP - DOLAP ’06, New York,
New York, USA. ACM Press.
Tarr, P., Ossher, H., Harrison, W., and Sutton, S. (1999).
N degrees of separation: multi-dimensional separa-
tion of concerns. In Proceedings of the 1999 Inter-
national Conference on Software Engineering (IEEE
Cat. No.99CB37002), pages 107–119.
van der Aalst, W. M. P. (2004). Business process manage-
ment demystified: A tutorial on models, systems and
standards for workflow management. In Lectures on
Concurrency and Petri Nets, Lecture notes in com-
puter science, pages 1–65. Springer Berlin Heidel-
berg, Berlin, Heidelberg.
Wimalasuriya, D. C. and Dou, D. (2010). Ontology-based
information extraction: An introduction and a survey
of current approaches. J. Inf. Sci., 36(3):306–323.