Goal-based Evaluation of Text Mining Results in an Industrial Use Case
Jens Drawehn, Matthias Blohm, Maximilien Kintz and Monika Kochanowski
Fraunhofer Institute for Industrial Engineering IAO, Nobelstraße 12, 70569 Stuttgart, Germany
monika.kochanowski@iao.fraunhofer.de
Keywords: Text Mining, Feature Extraction, Artificial Intelligence, Evaluation.
Abstract:
Artificial intelligence boosted the interest in text mining solutions in the last few years. Especially in non-
English-speaking countries, where there might not be clear market leaders, a variety of solutions for different
text mining scenarios has become available. Most of them are geared towards specific use cases and have strengths in some scenarios and weaknesses in others. In text or page classification, standard measures like precision, recall, sensitivity or
F1-score are prevalent. However, evaluation of feature extraction results requires more tailored approaches.
We experienced many issues on the way to benchmarking feature extraction results from text, like whether
a result is correct, partly correct, helpful or useless. The main contribution of this work is a method for
designing a tailored evaluation procedure in an individual text extraction benchmark for one specific use case.
In this context, we propose a general way of mapping the common CRISP-DM process to particularities of
text mining projects. Furthermore, we describe possible goals of information extraction, the features to be
extracted, suitable evaluation criteria and a corresponding customized scoring system. This is applied in detail
in an industrial use case.
1 INTRODUCTION
How does artificial intelligence help to deal with
the loads of documents a company receives every
day? Text mining solutions have become available
in great number and with different features (Evelson et al., 2019; Davis et al., 2019; Capterra Inc., 2019; PAT Research,
2019). However, choosing the right solution for one
case is not a trivial task. Especially in non-English-
speaking countries, there might be no clear market
leaders. The solutions differ in set-up effort, strengths
and weaknesses and of course in price. For choosing
and customizing a solution, typical software selection
projects are the right choice. In the case of choosing a
text mining solution, evaluating the results of feature
extraction tasks is crucial.
In text or page classification, standard measures
from data science can be applied for giving an
overview of the outcome. For feature extraction tasks
like entity recognition, the question whether a result
is helpful, detrimental, useful or useless is more diffi-
cult to answer, depending on the application scenario.
Therefore, it is necessary to define tailored measures
for evaluating the outcome of feature extraction solu-
tions for text documents in order to choose the right
product. As a basis, standard measures like precision, recall, sensitivity and F1-score are used and applied to the goals of a specific data extraction result.
However, these measures show limits when it comes
to evaluating feature extraction results. The main con-
tribution of this paper is a method for goal-based eval-
uation of feature extraction results from text while
comparing various approaches for information extrac-
tion. We apply the method to an industrial use case,
which we use to evaluate the method.
This paper is structured as follows: Section 2 ex-
plains our general approach for mapping the common
CRISP-DM process (Chapman et al., 2000) to text
mining. Section 3 references related work in the field
of feature extraction tasks from unstructured text. In
Section 4 we apply the described method in an indus-
trial use case. Finally, Section 5 gives a conclusion
and outlook.
2 METHODOLOGY
CRISP-DM is a well-established methodology for
data mining. Our text mining approach is structured according to this model as described in Chapman et al. (2000). The focus on the evaluation of text mining results is outlined for each step in Figure 1.
Figure 1: Contributions of this paper inside CRISP-DM as
in Chapman et al. (2000).
In the first phase of Business Understanding a
metagoal is defined: finding the best text-mining
solution. This contains multiple feature extraction
tasks, which support different scenarios and business
goals. In many text mining projects, incoming docu-
ments are classified and features are extracted. Subse-
quently, this information is used to support and/or op-
timize the appropriate business process to fulfill a cer-
tain business goal. Different kinds of improvements
such as process automation, decision support and as-
sistance functions are possible. These determine the
features to be extracted and the demand on quality of
the extracted features. In this work we structure the
goals and the features accordingly.
Data Understanding comprises inspecting the
documents to be processed with regard to file for-
mat and resolution, thus ensuring that the documents
can be processed correctly by the text mining solu-
tions. Another issue to be considered is the type of
content that provides orientation on what text mining
approaches are suitable. Depending on the use case,
part of the data understanding is the analysis of possi-
ble feature extraction results, therefore preparing the
definition of metrics on these results.
In the next phase of Data Preparation, the docu-
ments are prepared to be processed by the text mining
solutions. Typical steps are file format conversions
and (if this is not done by the text mining solutions
themselves) content preparation tasks such as contrast
optimization and deskewing. Another task is the com-
pilation of a test set of documents that can be used
later to evaluate the extraction results.
The tasks of the Modeling Phase are usually car-
ried out by the providers of the text mining solutions.
In the industrial use case described later on, multiple providers exist; therefore it is possible to improve the results by combining feature extraction results in an optimization step after the evaluation phase.
For the Evaluation of feature extraction results,
standard measures like precision, recall and F1-score
exist. While such measures are suitable to evaluate
the efficiency of feature extraction methods in gen-
eral, they do not reflect the particular requirements
and the objectives of a certain usage scenario. We
introduce custom metrics for the multiple goals as de-
fined in the phase of business understanding. Further-
more, a consolidation of these metrics is considered
for finding the overall best text mining solution refer-
ring to the metagoal.
As a result of the described course of action, the
achieved knowledge about tools, features, documents
and usage scenarios can be used to iterate through the
phases of CRISP-DM (Chapman et al., 2000) and to
redefine artifacts if necessary to get improved results, for example by combining text mining solutions, fine-tuning single solutions or refining the scenarios.
The phase of Deployment is out of scope for this
paper.
3 RELATED WORK
The procedure of our use case is based on the CRISP-
DM methodology described in Chapman et al. (2000).
Other work in the area of text mining that follows CRISP-DM was done, for example, by Carnerud (2014), who described this process with the goal of topic modeling for a large set of conference proceedings papers. During the phase of data preparation, which
may include processing of large scanned document
files with many pages, automated document seg-
mentation mechanisms like those of Wiedemann and
Heyer (2017) may provide helpful preliminary work
in order to facilitate the following data extraction
tasks.
Nadeau and Sekine (2007) provide an overview of
different standard evaluation metrics for named en-
tity recognition (NER) tasks: The CONLL scoring
protocol (Tjong Kim Sang and De Meulder, 2003)
only counts exact matches, while the advanced ACE
(Doddington et al., 2004) evaluation also considers
more complex cases like partial matches or wrong
Figure 2: Information flow in our industrial use case with four companies and three provider solutions.
type classification of found entities. In MUC (Mes-
sage Understanding Conference) events (Grishman
and Sundheim, 1996), which generally target infor-
mation extraction tasks, a system’s performance is
evaluated based on two axes: whether the system is
able to find the correct text match of an entity and
whether it is able to assign the correct type to the
found entity. For time-based expressions like date of incident in our scenarios, the TIMEX2 standard (Ferro et al., 2005) may be useful.
These metrics or other variations of standard pre-
cision and recall, as introduced by Manning et al.
(2008), may serve as suitable performance indicators
for many cases. A weighted F-measure, as it was in-
troduced by Chinchor (1992), allows to regulate im-
pacts of precision and recall results on the computed
F-score depending on the corresponding goals and
scenarios.
More work that defines evaluation metrics customized for text mining was done by Suominen et al. (2009), who define, for example, an extended metric for precision and recall that is able to express partly correct values as well. Furthermore, the work of Esuli
and Sebastiani (2010) shows one way to properly as-
sess the overlaps between predicted and true result en-
tities. Similar to our case, Jiang et al. (2016) differen-
tiate between correct full matches and partly correct
matches that may not have the exact expected entity
boundaries but still provide useful information. They
also show how multiple NER systems can be com-
bined to form ensembles.
As shown by former works like that of Onan et al.
(2016), combining the outputs of different tools may
indeed increase the quality of text mining results. Es-
pecially in the field of information extraction for med-
ical documents there has been a lot of recent research,
such as the work of Singhal et al. (2016), that investi-
gates how multiple modern machine learning models
can work together efficiently in order to form power-
ful ensembles. Furthermore, approaches like that of
Hamon and Grabar (2013), who used both rule-based
and machine learning techniques at the same time for
extracting ingredient names in recipe documents, al-
ready show the great potential of such hybrid tech-
niques. However, evaluation of (combined) feature
extraction results is not trivial.
To the best of our knowledge, there exist no stan-
dard evaluation measures usable for all kinds of data
extraction tasks in general. Instead, the definition of
correct and incorrect results strongly depends on spe-
cific feature declaration and the respective problem
context and therefore requires customized evaluation
plans like the method proposed in this work.
4 APPLICATION IN AN
INDUSTRIAL USE CASE
4.1 Description of Use Case
The industrial use case deals with reports describing
incidents and their implications and is a joint project
with four participating insurance companies. Its main
objective or metagoal as described in Section 2 is to
evaluate text mining solutions. Figure 2 shows the
information flow in this project, beginning with input documents from all participating companies and ending with a tailored evaluation task for all three solutions.
The reports originate from outside the companies, mainly in paper form; they are scanned, and standard OCR (Optical Character Recognition) techniques such as Tesseract OCR (Google, 2019) are applied.
Table 1: Importance of Features for Usage Scenarios.

Feature           | Scenario 1 (Black Box Processing) | Scenario 2 (Provide Document Index)
author name       | values should be correct          | incorrect values acceptable
date of incident  | values must be correct            | values should be correct
cause of incident | incorrect values acceptable       | incorrect values acceptable
all features      | missing values acceptable         | missing values acceptable
We used three text mining solutions implementing different approaches to extract the features that are necessary for further processing. Tools 1 and 2 are rule-based and use rule languages such as UIMA Ruta (Apache Software Foundation, 2019) to extract feature values; the main effort is the creation of these rules. Tool 3 applies a machine learning approach; the main effort is annotating the correct feature values for numerous documents.
While the reports contain numerous features, we
describe our approach here by means of the three fea-
tures author name, date of incident and cause of in-
cident. Even though the document class and the fea-
tures to be extracted are consistent, the usage scenar-
ios of the four companies are different. While not ex-
ploring the usage scenarios in detail, we identified the
intended purpose and evaluated the results based on
the company-specific goals. Based on the results of
the individual tools, we created additional results by
combining result values of different tools.
4.2 Business Understanding
We identified the objectives of each participating
company and the specific demands on quality of the
extracted data. We found that we can sum up the ob-
jectives of all companies by defining two usage scenarios: (1) black box processing and (2) provide document index. In both scenarios, the goal is to extract
the three features mentioned in the last section. The
result of the multiple scenarios and business goals is
shown in an overview in Table 1 and described in the
following.
The objective of scenario 1 is to use the extracted
features for subsequent black box processing. To
avoid errors in the follow-up process, features with an
effect on control flow (such as date of incident) should
have little or no incorrect results, whereas missing
values are less critical. In other words, it is better
to accept a smaller percentage of documents with ex-
tracted feature values, as long as the values are cor-
rect. For other features that have no controlling ef-
fect (such as author name) or that serve as additional
information (such as the detailed cause of incident),
it could be better to have more extracted values by
accepting a certain rate of incorrect values. Missing
values are acceptable for all three features in this sce-
nario, since it is easy to identify the respective docu-
ments and add the missing values manually if needed.
In scenario 2 provide document index, the ex-
tracted features enable users to access the document
content quickly and easily. Since we assume the users
to have the necessary skills to recognize incorrect val-
ues and to deal with them, incorrect values are accept-
able. As in scenario 1, missing values are acceptable
here as well.
4.3 Data Understanding
In both scenarios the documents were available as multi-page TIFF files with a resolution of 200 dpi. Each document contains reports
plus additional pages such as cover letters and invoice
documents. All documents have content of mixed
type (running text, tables, pictures and form sections)
and the layouts are heterogeneous.
As described above, the features to be extracted
from the reports are author name, date of incident and
cause of incident. Author name is of data type string
and we expect the author name to be present in all
reports. Date of incident is of type date. The detailed
cause of incident is given in text form, but there may
be several causes for one single incident, thus the data
type of this feature is a set of strings. Date of incident
and cause of incident should be present in most (but
not in all) reports. It is obvious that multiple metrics
need to be defined for comparing the quality of the
feature extraction results of different tools, as partly
correct results etc. need to be considered.
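For illustration only, the following minimal sketch models one extraction result with the three features and the data types described above. It is not part of the project's tooling; the class and field names are our own assumptions.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional, Set


@dataclass
class ReportFeatures:
    """Illustrative schema for the features extracted from one incident report."""
    author_name: Optional[str] = None           # string; expected in all reports
    date_of_incident: Optional[date] = None     # single date; may be missing
    causes_of_incident: Set[str] = field(default_factory=set)  # free-text causes


# Example: a partially filled extraction result
result = ReportFeatures(author_name="Jane Doe", date_of_incident=date(2017, 12, 3))
```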
4.4 Data Preparation
Because the report pages are scanned mostly in high
quality, only minor problems related to OCR occur.
Due to the amount of information, manual separa-
tion and page classification is very time consuming
and not possible in a real life setting. As a conse-
quence, the text mining solutions process the report
pages in the same way as the additional pages, which
may cause incorrect extraction results. Another chal-
lenge is mixed content, making it more difficult for
text mining solutions to deliver correct extraction re-
sults in special cases like feature values occurring in
a table.
For evaluation purposes, a test set of 369 reports is
used. The correct values of all three features for these
documents are extracted manually. Author name is present in 92 % of the documents, date of incident in 90 % and cause of incident in 95 %. The average page
count of the test documents is 15.
4.5 Modeling
The modeling activities are carried out by the
providers of the text mining solutions and are based
on a small set of specially created example docu-
ments. Within these documents, the correct locations
of the features to be extracted are marked.
Details concerning the used tools, algorithms and
models are omitted here since they are not crucial for
our evaluation-focused approach.
4.6 Evaluation
Using the test set (see Section 4.4), it is easy to answer the question whether an extracted value is correct or not. However, in many cases the extracted value is
not correct in the proper meaning of the word, but
nonetheless useful with respect to the goals of the us-
age scenario (see Table 1). Hence, a rating is needed
for useful results, taking into consideration the usage
goals. Incorrect but useful values may occur for sev-
eral reasons. One single document may contain dif-
ferent values for one feature, or character errors may
result from OCR.
To evaluate the results of the text mining solutions
in general, the prevalent measures precision, recall
and F1-score (Manning et al., 2008) are applied to
each individual feature. Precision is the percentage
of retrieved documents that are relevant. Here this
is the percentage of documents where a tool delivers
correct feature values relative to all documents where
the tool delivers any value. Recall is the percentage of
relevant documents that are retrieved. Here, this is the
percentage of documents where a tool delivers correct
feature values relative to all documents that contain
any value. F1-Score is the harmonic mean of preci-
sion and recall. It is often used as a standard measure
for evaluation.
For textual features, it is convenient to calculate
similarity between the correct values and the extracted
values using the Levenshtein distance. In cases where
a reference data set is available, the similarity mea-
sure is used to map the extracted values to the correct
values. For the feature author name, reference data
should be available. Therefore, extracted author names that have a Levenshtein distance of at most one to the correct value are considered as useful. For the same reason, correct last names are considered useful.
Table 2 shows this rating.
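A minimal sketch of such a rating function, using a plain dynamic-programming Levenshtein distance; the function names are our own, and the rule is simplified to the two cases just described (distance of at most one to the full name, or an exact last-name match).

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def author_name_useful(extracted: str, correct_full_name: str) -> bool:
    """Rate an extracted author name as useful (cf. Table 2)."""
    last_name = correct_full_name.split()[-1]
    return (levenshtein(extracted, correct_full_name) <= 1
            or extracted == last_name)


# author_name_useful("Anna Muellor", "Anna Mueller") -> True (OCR error, distance 1)
# author_name_useful("Mueller", "Anna Mueller")      -> True (correct last name)
```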
For features of type date, ambiguities and vague-
ness are a challenge. One document may contain several date values, and the extraction result for one document may also consist of several date values. Furthermore, a period of time (such as “beginning of December 2017”) may be given in a document and, likewise, the extraction result may be a period. Table 3 shows how extraction results are rated.
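To illustrate two of the rules in Table 3, the sketch below rates an extracted date or period against the correct one. It assumes that single dates are normalized to one-day periods and that 31 days approximates the one-month tolerance; all names and the normalization are our own assumptions, and the remaining rows of Table 3 are not covered.

```python
from datetime import date, timedelta


def rate_date(correct_start: date, correct_end: date,
              extracted_start: date, extracted_end: date,
              scenario: int) -> bool:
    """Rate an extracted date/period as useful or not (cf. Table 3, partial)."""
    # Extracted period lies within the correct period: useful in both scenarios.
    if correct_start <= extracted_start and extracted_end <= correct_end:
        return True
    # Extracted period of at most ~one month containing the correct date:
    # not useful for black box processing (scenario 1), useful for indexing (scenario 2).
    covers_correct = extracted_start <= correct_start and correct_end <= extracted_end
    short_enough = (extracted_end - extracted_start) <= timedelta(days=31)
    if covers_correct and short_enough:
        return scenario == 2
    return False
```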
The last feature cause of incident is more vague
than the other features. It may appear at several
positions in continuous text and we expect no ex-
act matches for this feature. If causes are described briefly using a few keywords, it might be possible to use rating functions (as for the other features). For complex descriptions, a manual rating approach seems reasonable. This feature and its rating are shown in Table 4.
Tables 2, 3 and 4 show the usefulness of extrac-
tion values for both scenarios. Coming back to using
an evaluation based on standard measures, the extrac-
tion result can be True Positive (TP), False Positive (FP), False Negative (FN) or True Negative (TN). In contrast to classification problems, we have to distinguish two different types of FPs here. If a document contains no feature value and a tool delivers a value as extraction result, we denote this as False Positive Invented (FP_i). The second type (denoted as False Positive Missed, FP_m) occurs when a document contains a feature value and a tool delivers another value that we rate as not useful. Considering the standard definitions of precision and recall, we compute precision and recall as follows, using TP for “number of TPs” etc. in the formulas:
\[ \text{precision} = \frac{TP}{TP + FP_i + FP_m} \qquad \text{recall} = \frac{TP}{TP + FN + FP_m} \]
Note that the formula of recall differs from the standard formula since each FP_m indicates a document containing a value that should have been found (and was not). According to our definition given earlier in this section, the number of documents that contain any value for this feature is TP + FN + FP_m. On the other hand, the formula of precision is equivalent to the standard one since FP = FP_i + FP_m.
F1-score is then computed as usual:

\[ F1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \]
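A minimal sketch of these measures as code, using the counts defined above; the function name is our own.

```python
def custom_scores(tp: int, fn: int, fp_invented: int, fp_missed: int):
    """Precision, recall and F1 with the two FP types (FP_i, FP_m) distinguished."""
    precision = tp / (tp + fp_invented + fp_missed) if tp else 0.0
    recall = tp / (tp + fn + fp_missed) if tp else 0.0  # FP_m counts as a missed value
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```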
Table 2: Rating of extraction results for author name.

Situation in document | Extraction result                       | Rating for scenario 1 | Rating for scenario 2
full name             | full name                               | useful                | useful
full name             | last name                               | useful                | useful
more than one name    | at least one useful name                | useful                | useful
one name              | more than one name, at least one useful | useful                | useful
full name             | first name                              | not useful            | not useful
Table 3: Rating of extraction results for date of incident.

Situation in document | Extraction result                                             | Rating for scenario 1 | Rating for scenario 2
one date              | correct date                                                  | useful                | useful
one date              | incomplete date (year missing)                                | not useful            | not useful
one date              | period with length one month or less, including correct date | not useful            | useful
period                | date within period                                            | useful                | useful
one date              | more than one date, at least one correct                      | not useful            | useful
more than one date    | at least one correct date                                     | not useful            | useful
Table 4: Rating of extraction results for cause of incident.

Situation in document                      | Extraction result                            | Rating for scenario 1  | Rating for scenario 2
short description of cause (few keywords)  | all relevant keywords, nothing else          | useful                 | useful
short description of cause (few keywords)  | all relevant keywords plus more irrelevant text | useful              | useful
complex description                        | complex description, matching partially      | manual rating required | manual rating required
several different causes                   | at least one correct cause                   | useful                 | useful
After rating all extraction results as TP, TN, FN, FP_i and FP_m, we calculate precision, recall and F1 for all tools and all features. The next step is to identify the best results with regard to the objectives of the two scenarios. While the highest F1 is considered the best result in general, in this setting we use a tailored result.
Using Table 1, the best result is defined as follows:

Incorrect Values Acceptable: max(F1_1, F1_2, ..., F1_n), with F1_i being the F1 score of tool i.

Values should be Correct: max(F1_1, F1_2, ..., F1_n), with F1_i being the F1 score of tool i and precision_i ≥ 0.9 at the same time.

Values must be Correct: max(precision_1, precision_2, ..., precision_n), with precision_i being the precision score of tool i, where precision_i ≥ 0.9 and recall_i ≥ 0.35 at the same time.
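These selection rules can be sketched as follows; the thresholds 0.9 and 0.35 are those stated above, while the data structure and function name are our own assumptions.

```python
def best_tool(results: dict, importance: str):
    """Pick the best tool for one feature given its importance level (cf. Table 1).

    results: {tool_name: (precision, recall, f1)}
    """
    if importance == "incorrect values acceptable":
        candidates = list(results.items())
        key = lambda item: item[1][2]                    # highest F1
    elif importance == "values should be correct":
        candidates = [i for i in results.items() if i[1][0] >= 0.9]
        key = lambda item: item[1][2]                    # highest F1 with precision >= 0.9
    elif importance == "values must be correct":
        candidates = [i for i in results.items()
                      if i[1][0] >= 0.9 and i[1][1] >= 0.35]
        key = lambda item: item[1][0]                    # highest precision
    else:
        raise ValueError(importance)
    return max(candidates, key=key, default=None)
```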
These thresholds are defined analytically for the spe-
cific industrial use case and its goals. The intro-
duction and usage of a more flexible scoring system, such as a weighted F-measure (Chinchor, 1992), which allows controlling the impact of detected FPs and FNs for different scenarios, remains part of our fu-
ture work, which includes finding optimal weights
and thresholds.
4.7 Optimization
Table 5 shows the tools with the best results for each
feature and the achieved scores for precision, recall
and F1. Note that the combination of tools is already
part of the iteration phase in CRISP-DM. For exam-
ple, the tool combination C_{1,2} combines the extraction results for date of incident of tools 1 and 2, using only the coincident values of both tools. C_{1,3} uses the non-empty result values of tool 1 and, when tool 1 provides no value, the result values of tool 3.
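A minimal sketch of the two combination strategies, assuming each tool's output is a mapping from document ID to extracted value (None if no value was delivered); the function names are our own.

```python
def combine_coincident(results_a: dict, results_b: dict) -> dict:
    """Keep a value only if both tools deliver the same value (as for C_{1,2})."""
    return {doc: val for doc, val in results_a.items()
            if val is not None and results_b.get(doc) == val}


def combine_fallback(primary: dict, secondary: dict) -> dict:
    """Take the primary tool's value; fall back to the secondary tool (as for C_{1,3})."""
    docs = set(primary) | set(secondary)
    return {doc: primary.get(doc) if primary.get(doc) is not None
            else secondary.get(doc)
            for doc in docs}
```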
As mentioned before, we defined different values to be useful for scenarios 1 and 2 for the feature date of incident (see Table 3), resulting in different values for precision, recall and F1 for the same extraction results.
In comparison, Table 6 shows the results we
would have received with a standard correctness-
based evaluation that does not consider usefulness of
information but only whether the extracted features
match the expected values or not. As expected, the quality of all tools decreases for this evaluation. Additionally, in this case tool 3 provides the overall best
score, while for scenario 2, tool 1 was the clear win-
ner with respect to usefulness. In this case, the result
in Table 6 is misleading, showing the importance of
using appropriate evaluation mechanisms that are tai-
lored to the specific use case scenario and its goals in
order to receive meaningful results (see Table 5).
4.8 Learnings
Finally we used the gained knowledge to iterate
through the phases of CRISP-DM.
In the first phase of Business Understanding, it
may be beneficial to redefine the scenarios, features
and their usage goals. In the presented use case the
scenarios were predetermined and were not changed
Table 5: Best tools and their evaluation results - goal: usefulness (scenarios 1 and 2).

Feature           | Scenario (Goal) | Importance of feature       | Tool    | Precision | Recall | F1
author name       | 1               | values should be correct    | 3       | 0.925     | 0.380  | 0.539
author name       | 2               | incorrect values acceptable | 1       | 0.725     | 0.604  | 0.659
date of incident  | 1               | values must be correct      | C_{1,2} | 0.987     | 0.450  | 0.619
date of incident  | 2               | values should be correct    | C_{1,3} | 0.949     | 0.784  | 0.859
cause of incident | 1 and 2         | incorrect values acceptable | 1       | 0.820     | 0.720  | 0.767
Table 6: Best tools and their evaluation results - goal: correctness.

Feature           | Scenario (Goal) | Tool    | Precision | Recall | F1
author name       | correctness     | 3       | 0.817     | 0.336  | 0.476
date of incident  | correctness     | C_{1,3} | 0.910     | 0.753  | 0.824
cause of incident | correctness     | 3       | 0.642     | 0.280  | 0.390
during the project. On the other hand, the knowledge gained about the extraction results led to different usage goals. For instance, the objective of black box processing requires high precision values for features with an effect on the control flow. Thus, the obtained precision values may trigger a change of the usage goals, e.g. whether or not black box processing is pursued. Such decisions, in turn, may result
in different feature definitions that are more suitable
to achieve the new goals.
With regard to Data Understanding, new knowl-
edge about the documents and their content can be
used to reconsider the feature definitions. Overall, we
observed the following peculiarities:
- unexpected missing values for feature author name (assumed to be present in all documents)
- unexpected multiple values for features author name and date of incident
- incomplete or vague values (such as only the last name for author name or a period for date of incident)
- reports with attached prior reports referring to the same incident
To better handle these exceptional cases, a change of feature definitions is necessary. For example, we could
change the feature type of date of incident from (sin-
gle) date to period (start and end date). Such changes
would considerably increase the complexity of the ex-
traction task.
With regard to the Modeling Phase, it is possible
to improve the extraction performance of the tools. In case of machine learning approaches, this could mean providing more training data or adapting the machine learning algorithm. If the extraction approach is rule-based, one can try to detect systematic errors and to adapt the rule set.
When iterating through the Evaluation Phase, the rating functions and evaluation criteria should be reviewed and adapted if necessary. A reusable library with evaluation functionality covering the following subjects is part of future work:
- automatic matching/rating of time periods for common kinds of time periods to support further automatic rating of date features (rating overlapping time periods as useful)
- rating date values close to the correct value as useful (with the maximum distance as a parameter)
- matching of names using a reference database (if available), resulting in exact values (full names) that can be rated automatically
- development of rating functions for vague text features using keyword-based approaches and considering the length of text
5 CONCLUSION AND OUTLOOK
In this work we have described in detail the evaluation additions to the standard CRISP-DM process for data mining, applied to feature extraction results in a text mining project. We have described the additional arti-
facts in each phase, spanning multiple business goals
and scenarios, and described how metrics are used for
evaluating not only correct, but also useful feature ex-
traction results. We have briefly described possible
optimizations to the process and given an outlook on
future work.
We have successfully applied the methodology to
our real-world industrial use case, which consists of
three commercial software solutions and real-world
documents from four companies. It showed that
a methodical approach for defining scenarios, goals
and criteria is necessary for project success. Sev-
eral iterations were necessary to achieve a common
understanding of the goals and challenges among
customers, implementers and solution providers. The project also showed that distinguishing between scenarios, goals
and criteria is crucial for successful communication
in the project.
The first definition of usage goals was based on in-
complete information about the characteristics of doc-
uments and features. The discussion of intermediate
results with the participants led to the development
of suitable feature definitions and usage goals. Based
on this we could refine the evaluation criteria and the
rating functions for extracted values. The customers
gave the feedback that the approach is suitable and
helpful in the described setting.
In the end, we achieved a common understanding
of the influencing factors of feature extraction scenar-
ios for text, how these factors affect the evaluation
results and what solutions (or what kind of solutions)
are suitable to extract what kind of features.
In the following, we give an outlook on which im-
provements can be made especially during the evalu-
ation phase of text mining projects.
First of all, introducing a form of weighted F1 score could prove useful. Depending on the application scenario, it might be more important to find fewer, but reliable, results, which may come with a higher rate of FNs (as for the feature date of incident in our scenario 1). In contrast, for use cases similar to scenario 2, it might be wiser to allow a higher rate of FPs rather than losing important information.
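One possible realization is the classical weighted F-measure (F_beta) in the sense of Chinchor (1992): values of beta below one emphasize precision (closer to scenario 1), values above one emphasize recall (closer to scenario 2). A minimal sketch:

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted F-measure: beta < 1 emphasizes precision, beta > 1 emphasizes recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# e.g. f_beta(0.95, 0.45, beta=0.5) weights precision higher than recall
```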
Although this does not concern the three features selected from our industrial use case in this work, dealing with highly imbalanced occurrences of features within the document sets causes well-known challenges. For example, for features with a small set of target classes (like yes/no), always predicting 'yes' may achieve a high evaluation score when the target values are distributed very unevenly. Here, using Cohen's kappa (Cohen, 1960) instead of the F1 score seems to be a good option and will be investigated in the future.
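For illustration, Cohen's kappa can be computed from reference and predicted labels, e.g. with scikit-learn; the tooling and the toy data are our own assumptions, not part of the project setup.

```python
from sklearn.metrics import cohen_kappa_score

y_true = ["yes", "yes", "no", "yes", "yes", "no"]   # reference labels (toy data)
y_pred = ["yes", "yes", "yes", "yes", "yes", "no"]  # predictor biased towards "yes"
print(cohen_kappa_score(y_true, y_pred))            # agreement corrected for chance
```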
REFERENCES
Apache Software Foundation (2019). UIMA Ruta. https://uima.apache.org/ruta.html.
Capterra Inc. (2019). Text mining software.
https://www.capterra.com.de/directory/30933/text-
mining/software.
Carnerud, D. (2014). Exploration of text mining methodology through investigation of QMOD-ICQSS proceedings.
In 17th QMOD-ICQSS Conference, Prague, Czech
Republic.
Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz,
T., Shearer, C., and Wirth, R. (2000). CRISP-DM 1.0 step-by-step data mining guide. Technical report, The
CRISP-DM consortium.
Chinchor, N. (1992). MUC-4 evaluation metrics. In Proceedings of the 4th Conference on Message Understanding, MUC4 '92, pages 22-29, USA. Association
for Computational Linguistics.
Cohen, J. (1960). A coefficient of agreement for nominal
scales. Educational and Psychological Measurement,
20(1):37–46.
Davis, M., Emmott, S., Brethenoux, E., and Vashisth,
S. (2019). Market guide for text analytics.
https://www.gartner.com/en/documents/3892564.
Doddington, G., Mitchell, A., Przybocki, M., Ramshaw, L.,
Strassel, S., and Weischedel, R. (2004). The automatic content extraction (ACE) program - tasks, data, and evaluation. In Proceedings of LREC, volume 2.
Esuli, A. and Sebastiani, F. (2010). Evaluating informa-
tion extraction. In Agosti, M., Ferro, N., Peters,
C., de Rijke, M., and Smeaton, A., editors, Multilin-
gual and Multimodal Information Access Evaluation,
pages 100–111, Berlin, Heidelberg. Springer Berlin
Heidelberg.
Evelson, B., Sridharan, S., and Perdoni, R. (2019). The Forrester Wave™: AI-Based Text Analytics Platforms, Q2 2018. https://www.forrester.com/report/The+Forrester+Wave+AIBased+Text+Analytics+Platforms+Q2+2018/-/E-RES141340.
Ferro, L., Gerber, L., Mani, I., Sundheim, B., and Wilson,
G. (2005). Tides 2005 standard for the annotation of
temporal expressions. Technical report, MITRE.
Google (2019). Tesseract OCR. https://opensource.google/projects/tesseract.
Grishman, R. and Sundheim, B. (1996). Message un-
derstanding conference-6: A brief history. In Pro-
ceedings of the 16th Conference on Computational
Linguistics - Volume 1, COLING ’96, pages 466–
471, Stroudsburg, PA, USA. Association for Compu-
tational Linguistics.
Hamon, T. and Grabar, N. (2013). Extraction of ingredient names from recipes by combining linguistic annotations and CRF selection. In Proceedings of the 5th Inter-
national Workshop on Multimedia for Cooking &
Eating Activities, CEA ’13, pages 63–68, New York,
NY, USA. ACM.
Jiang, R., Banchs, R. E., and Li, H. (2016). Evaluating and
combining name entity recognition systems. In Pro-
ceedings of the Sixth Named Entity Workshop, pages
21–27, Berlin, Germany. Association for Computa-
tional Linguistics.
Manning, C. D., Raghavan, P., and Schütze, H. (2008). In-
troduction to Information Retrieval. Cambridge Uni-
versity Press, New York, NY, USA.
Nadeau, D. and Sekine, S. (2007). A survey of named entity
recognition and classification. Lingvisticae Investiga-
tiones, 30.
Onan, A., Korukoğlu, S., and Bulut, H. (2016). Ensem-
ble of keyword extraction methods and classifiers in
text classification. Expert Systems with Applications,
57:232 – 247.
PAT Research (2019). Top software for
text analysis, text mining, text analytics.
https://www.predictiveanalyticstoday.com/top-
software-for-text-analysis-text-mining-text-
analytics/.
Singhal, A., Simmons, M., and Lu, Z. (2016). Text
mining for precision medicine: automating disease-
mutation relationship extraction from biomedical lit-
erature. Journal of the American Medical Informatics
Association, 23(4):766–772.
Suominen, H. et al. (2009). Performance Evaluation Mea-
sures for Text Mining. IGI Global, Hershey, PA, USA.
Tjong Kim Sang, E. F. and De Meulder, F. (2003). Intro-
duction to the CoNLL-2003 shared task: Language-
independent named entity recognition. In Proceed-
ings of the Seventh Conference on Natural Language
Learning at HLT-NAACL 2003, pages 142–147.
Wiedemann, G. and Heyer, G. (2017). Page stream seg-
mentation with convolutional neural nets combining
textual and visual features. CoRR, abs/1710.03006.