Event Log Abstraction in Client-Server Applications
M. Amin Yazdi (1)(a), Pejman Farhadi Ghalatia (2)(b) and Benedikt Heinrichs (1)(c)
(1) IT Center, RWTH Aachen University, Aachen, Germany
(2) Aachen Institute for Advanced Study in Computational Engineering Science, RWTH Aachen University, Aachen, Germany
(a) https://orcid.org/0000-0002-0628-4644
(b) https://orcid.org/0000-0002-7627-533X
(c) https://orcid.org/0000-0003-3309-5985
Keywords: Business Process Intelligence, Event Log Abstraction, Process Discovery, Data Pre-processing.
Abstract: Process mining provides various techniques in response to the increasing demand for understanding the execution of the underlying processes of software systems. Discovery and conformance checking techniques allow for the analysis of event data and the verification of compliance. However, in real-life scenarios, the event data recorded by software systems often contain numerous activities, resulting in unstructured process models that are not usable by domain experts. Hence, event log abstraction is an essential preprocessing step to deliver a desired, human-readable abstracted model that enables process analysis. This paper provides an overview of the literature and proposes a novel approach for transforming fine-granular event logs generated by client-server applications to a higher level of abstraction suitable for domain experts for further analysis. Moreover, we demonstrate the validity of the suggested method with the help of two case studies.
1 INTRODUCTION
New software solutions emerge every day to fulfill the demand for digitalization. The continuous development of new applications is leading us toward software systems with complex internal processes. For instance, universities provide distributed software solutions to support research data management and enable the reusability, findability, and accessibility of research data (Yazdi et al., 2016; Politze et al., 2020). This results in a heterogeneous and distributed IT infrastructure with a complex and unstructured landscape of user processes. To tackle this issue, data modeling and exploring researchers' processes would help domain experts understand and monitor user behavior and discover the underlying operational processes. Such process-oriented modeling allows for the discovery of the challenges researchers face during their research life-cycle (Valdez et al., 2015).
To analyze the dynamic behavior of users and monitor the status of operational processes, we use process mining techniques, which enable us to discover process models (Yazdi, 2019). Two main process mining tasks are process discovery and conformance checking (van der Aalst, 2016). Process discovery aims to discover a process model that represents the real descriptive behavior, based on real-life events. There are various process discovery algorithms, including but not limited to the Inductive Miner (IM), Alpha miner, Fuzzy miner, and Heuristic miner. Conformance checking compares a process model with an event log to find commonalities and differences; it enables us to measure the fitness of the discovered model (Van der Aalst et al., 2012). Various software products, such as ProM (Van Dongen et al., 2005), Disco (Günther and Rozinat, 2012), and RapidMiner (Mans et al., 2014), are available to assist with process analysis tasks.
Often, software systems operate in a complex infrastructure where a user request may require many software components and microservices to execute a task. This complex infrastructure often results in an n:m relationship between server-side events and client-side events, where the server-side events are too fine-grained. Applying process discovery to these low-level event logs often yields an unstructured, so-called "Spaghetti" process model that is unsuitable for process analysis (van der Aalst, 2016). Hence, various event log abstraction techniques have been developed to transform fine-granular (low-level) event logs to a higher level of abstraction (high-level). However, the main challenge is to find a desired abstraction level that is interpretable by domain experts and still describes the reality of the business activities.
Table 1: Classification of the related work based on abstraction techniques.

Author | Grouping | Input Data | Event Perspective | Mapping Relationship | Internal Abstraction | Validity | Quality Indicator | Target Domain
(Begicheva and Lomazova, 2017) | Supervised | Continuous | Sequential | n:1 | Probabilistic | Formal | Fitness | Acyclic cases
(Mannhardt et al., 2016) | Supervised | Continuous | Sequential | n:1 | Deterministic | Real-life | Fitness & matching error | Information systems
(Tax et al., 2016) | Supervised | Continuous | Sequential | n:m | Probabilistic | Synthetic | Levenshtein distance | Parallel activities
(Liu et al., 2018) | Supervised | Continuous | Sequential | n:1 | Deterministic | Synthetic | Fitness & precision & generalization | Information systems
(de Leoni and Dündar, 2020) | Unsupervised | Continuous & discrete | Non-sequential | n:1 | Deterministic | Real-life | Fitness | Information systems
(Mannhardt and Tax, 2017) | Unsupervised | Continuous | Sequential | n:1 | Deterministic | Real-life | Fitness | Automatic pattern discovery
Our approach | Supervised | Continuous & discrete | Non-sequential | n:m | Deterministic | Real-life | Fitness & PCC | Client-server applications
To deal with the challenge of abstracting low-level events, many methods have been proposed (discussed in Section 2). Some of the existing methods rely on unsupervised learning techniques to cluster groups of events into higher-level events. On the one hand, current unsupervised learning methods are not able to acquire a suitable degree of abstraction; on the other hand, they lead to an unclear relabeling strategy for the clusters of abstracted events. These learning methods either require manual labeling by domain experts or an automatic concatenation of labels that creates long, unreadable activity names. Furthermore, several supervised learning methods exist to overcome the mentioned challenges. However, these methods either utilize training sets to guide the abstraction process or rely on domain experts' input for the task of relabeling the activity names. Hence, relabeling abstracted activity names is a non-trivial task. The challenges mentioned above motivate the need for an approach that abstracts low-level event logs to a suitable higher level of granularity.
In this paper, we describe an iterative supervised abstraction technique for client-server applications, in which client-side event logs are used as the training set to describe the server-side event logs, and the Pearson Correlation Coefficient (PCC) is employed to measure event similarity for the task of abstracting activity names. The method described in this study is built on top of the approach suggested in (Yazdi and Politze, 2020) for acquiring the server-side event logs. Accordingly, we rely on the OAuth2 workflow for recording the server-side event logs. The client-side event log plays the role of descriptive user activities that are understandable by domain experts and enables us to map low-level server-side event logs to high-level activities.
Overall, in this paper, we try to answer the follow-
ing research questions in client-server applications:
1. How can we abstract event logs so that they are
reliable and descriptive for the domain experts?
2. How to balance the abstraction granularity with
respect to process models’ fitness?
As suggested in the literature (Buijs et al., 2012), we
use the model fitness as an indicator of the suitabil-
ity of the discovered model. Furthermore, with the
help of two case studies, we evaluate the validity and
scalability of our approach.
The rest of this paper is structured as follows. Section 2 introduces the related work on event log abstraction techniques. In Section 3, we provide the preliminaries and formal models used throughout the paper. Section 4 describes the approach used to enable event log abstraction for client-server applications. We examine our approach with the help of two case studies in Section 5. In Section 6, we address the challenges and possible future work. Finally, in Section 7, we discuss the benefits and contributions of our work.
2 RELATED WORK
There has been a recent research trend on event log abstraction methods and their challenges (van Zelst et al., 2020). This section categorizes different criteria that we find essential for every abstraction method and reviews the literature accordingly. Additionally, we position our work with respect to each criterion. Table 1 provides a bird's-eye view of the related work and of how the suggested approach differs from the existing solutions.
2.1 Log Abstraction Strategies
In this part, we elaborate on several event log abstraction criteria and their benefits and drawbacks, and we position our work accordingly.
2.1.1 Grouping Strategy
The grouping strategy refers to the learning techniques used for converting low-level event logs to a higher level of granularity. There are two main categories of abstraction operations: aggregation and elimination (Smirnov et al., 2012). The elimination operation is a naive approach that omits insignificant events; however, it results in underfitting models that miss information about the eliminated events. The aggregation operation combines relatively insignificant events into a significant group of events. There are two sub-domains of the aggregation operation, namely unsupervised and supervised learning methods.
Unsupervised abstraction methods rely on completely automated techniques to group events and often result in less accurate high-level models. The labels in these methods are regenerated either by concatenating activity names or by using assumptions, without the involvement of domain knowledge. The authors of (Mannhardt and Tax, 2017) use Local Process Model (LPM) discovery to automatically identify frequent activity patterns in the process model, and the authors of (de Leoni and Dündar, 2020) use clustering methods to group activities based on the frequency of occurrences and use the clusters' centroids for relabeling the corresponding groups. Unsupervised learning methods are beneficial if no domain knowledge is available or acquiring additional information is not feasible.
On the contrary, supervised learning methods are the most commonly practiced techniques for simplifying process models. These methods rely on some form of external or additional information to group and relabel event logs. The authors of (Begicheva and Lomazova, 2017; Mannhardt et al., 2016; Tax et al., 2016; Liu et al., 2018) use supervised learning techniques either by capturing additional information or by creating a training data set to annotate fine-granular event logs. As supervised learning is considered a more reliable abstraction technique for real-life scenarios and allows for additional interventions by domain experts, our approach also focuses on a supervised learning method.
2.1.2 Input Data
Input Data notes the nature of the data in the form of discrete or continuous events. Discrete events take values from a finite set (e.g., categorical or discretized real values), whereas continuous events refer to data that can take any value (e.g., activity names). Event logs can be in the form of discrete sequences of events or continuous data resulting from the execution of activities. Most techniques covered in the literature review use some form of continuous activity names for the analysis. However, the authors of (de Leoni and Dündar, 2020) vectorize the sequences of continuous event data and transform the event log into a discrete data format. The discrete data facilitate the execution of data mining algorithms, such as analyzing the frequency of occurrences of activities within an event log. Likewise, our approach adopts both continuous event data for discovering process models and discrete event data for measuring the similarity of two activities.
2.1.3 Event Perspective
Event Perspective refers to the effect of the order in which events appear on the abstraction technique. Most procedures in the literature exploit only the order of the sequence of events as they appear in the low-level event log. The authors of (de Leoni and Dündar, 2020) use a non-sequential order of events with unsupervised abstraction methods. In the systems under study, the order of execution of events is not sequential due to the nature of distributed systems; hence, our approach has to incorporate non-sequential events.
2.1.4 Mapping Relationship
The mapping relationship describes how instances of low-level events relate to higher-level activities; there are 1:1, n:1, and n:m mapping relationships. In the case of a 1:1 mapping, no abstraction can be performed, as the level of granularity does not change, i.e., for every event there is exactly one corresponding activity. Most of the literature focuses on n:1 mappings, also known as shared functionality, where multiple low-level events correspond to a single higher-level activity. In software systems, user activities often do not correspond directly to a recognizable event on the server side, resulting in an n:m mapping relationship. We position our work as an n:m mapping relationship, since both 1:1 and n:1 mappings occur between the event logs of the two levels.
2.1.5 Internal Abstraction Modeling
Internal Abstraction Modeling highlights the use of
probabilistic or deterministic factors to identify the
sequence of event distributions. Similar to our sug-
gested approach, most of the literature focuses on
deterministic strategies. However, authors in (Be-
gicheva and Lomazova, 2017) also employ the theory
of regions to classify corresponding high-level activi-
ties, and authors in (Tax et al., 2016) adopt the Condi-
tional Random Fields (CRF) as a probabilistic factor
for the task of event sequence relabeling.
2.1.6 Validity
Validity describes the evaluation procedure used to
demonstrate the soundness of a suggested technique.
Unlike the authors of (Begicheva and Lomazova, 2017), who only use a formal mathematical representation, all other works examine their approach on either real-life or synthetic event logs. However, only the authors of (Mannhardt et al., 2016) evaluate their technique by implementing their approach in a real infrastructure. This is particularly important because, even where real-life event logs exist, they do not necessarily include the real challenges present in the field, such as noise, outliers, or missing data. We validate our method with real-life event logs by directly collecting and utilizing event logs from an existing software system to demonstrate the effectiveness of our approach.
2.1.7 Quality Indicator
These indicators report on the quality dimensions of the discovered abstracted process model. Usually, discovered models are evaluated based on metrics such as fitness, precision, and generalization, which give additional insight into the changes introduced by the event log abstraction process (van der Aalst, 2016). According to (Buijs et al., 2012), fitness quantifies the extent to which the discovered model can accurately reproduce the cases recorded in the log. Precision quantifies the extent to which the model allows only behavior seen in the event log. Generalization assesses the extent to which the resulting model will be able to reproduce future behavior of the process.
With respect to our literature study, only the authors of (Liu et al., 2018) report the quality of their approach in terms of fitness, precision, and generalization. The remaining works either report the model fitness as the only quality factor or back their work with supplementary measures, such as the Matching Error used as a threshold for excluding unreliable matches between fine-granular event logs and high-level activities (Mannhardt et al., 2016). The authors of (Tax et al., 2016) solely leverage the Levenshtein distance to express the closeness of two traces from different granularity levels. In our study, besides the model fitness, we additionally use the Pearson Correlation Coefficient to compare and identify similar high-level activities.
2.1.8 Target Domain
The target domain is the application domain that a technique primarily focuses on. Unfortunately, most of the literature is not generic and aims at a particular application domain. For instance, the solution suggested in (Begicheva and Lomazova, 2017) only targets non-recursive activities (acyclic cases). The technique in (Tax et al., 2016) targets the abstraction of parallel activities with an arbitrary execution order. The automatic pattern discovery and clustering of activities is the focus of the authors of (Mannhardt and Tax, 2017). The other works under study focus on the analysis of information systems, such as a digital whiteboard software (Mannhardt et al., 2016), user behavior analysis on software systems (Liu et al., 2018), and the analysis of the frequency of website visits (de Leoni and Dündar, 2020). On the contrary, we propose a universal approach that can be employed over client-server applications.
2.2 Summary
Despite the focus on acyclic cases in (Begicheva and Lomazova, 2017), the authors only use a formal representation to demonstrate the handling of duplication in the event logs and ignore stuttering sub-sequences of activities. Moreover, the approaches suggested by (Mannhardt et al., 2016) and (Liu et al., 2018) rely on the manual labeling of low-level events to high-level activities through interviews and observations, and on excluding traces with missing or unknown events. The method proposed in (Tax et al., 2016) is capable of dealing with parallel activities; however, it relies on excessive domain knowledge. Methods such as (de Leoni and Dündar, 2020) and (Mannhardt and Tax, 2017) take advantage of unsupervised learning techniques to relabel the abstracted traces automatically. Cluster centroids or label concatenations generate the new labels for the higher-level activities, but the acceptance of these labels by domain experts remains uninvestigated. Although the authors of (Mannhardt and Tax, 2017) use Local Process Models (LPM) and automatic pattern discovery techniques to automatically detect the start and end of activities, their approach is limited to a fixed number of candidate LPMs based on the event frequency ranking.
Overall, almost all of the authors use the Inductive Miner (IM) algorithm to discover process models. The IM algorithm can deal with infrequent behavior while ensuring the soundness of the discovered process model (Leemans et al., 2014). Additionally, different variations of Petri net notations have been used to illustrate the discovered models. Referring to Table 1, it is clear that none of the approaches is suitable for the existing infrastructure: they either rely on internal probabilistic abstraction techniques or require extensive domain knowledge for the task of relabeling the abstracted activities.
3 PRELIMINARIES
In this section, we provide the basic concepts and for-
mal models used throughout the paper.
Given a set $A$, $\sigma = \langle a_1, a_2, \ldots, a_n \rangle \in A^*$ is a finite sequence over $A$, where $A^*$ denotes the set of all finite sequences over $A$. $\sigma(i) = a_i$ denotes the $i$-th element of the sequence. $\beta(A)$ is the set of all multi-sets over the set $A$. $\mathcal{P}_{NE}(A)$ is the set of all non-empty subsets of $A$. $|\sigma| = n$ is the length of $\sigma$. For $\sigma_1, \sigma_2 \in A^*$, $\sigma_1 \sqsubseteq \sigma_2$ if $\sigma_1$ is a sub-sequence of $\sigma_2$, e.g., $\langle g, b, o \rangle \sqsubseteq \langle z, g, b, b, g, b, o, q \rangle$. For $\sigma \in A^*$, $\{a \in \sigma\}$ is the set of elements in $\sigma$, and $[a \in \sigma]$ is the multi-set of elements in $\sigma$, e.g., $[a \in \langle g, b, z, g, b \rangle] = [g^2, b^2, z]$.
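As an illustrative aside (ours, not part of the original formalization), the following minimal Python sketch mirrors the distinction between the set and the multi-set of elements of a sequence, as well as one reading of the sub-sequence relation:

from collections import Counter

# The example sequence from the text: sigma = <g, b, z, g, b>
sigma = ["g", "b", "z", "g", "b"]

elements = set(sigma)      # {a in sigma}  -> {'g', 'b', 'z'}
multiset = Counter(sigma)  # [a in sigma]  -> [g^2, b^2, z]
length = len(sigma)        # |sigma| = 5

def is_subsequence(s1, s2):
    # Interpreted here as order-preserving, not necessarily contiguous (an assumption).
    it = iter(s2)
    return all(a in it for a in s1)

print(is_subsequence(["g", "b", "o"], ["z", "g", "b", "b", "g", "b", "o", "q"]))  # True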
(Event, Event Log) An event is an n-tuple $e = (t, c, a_c, a_s)$ with $e \in L$, where $t \in T$ is the event timestamp, $c \in C$ is the case id, $a_c \in A_c$ is the client-side activity, and $a_s \in A_s$ is the server-side activity, such that for each client-side activity there is a set of corresponding server-side activities in response to a user request. We denote $\xi = T \times C \times A_c \times A_s$ as the universe of all events for the original event log, s.t. $L \subseteq \xi$. In the event log, each event can appear only once; hence, events are uniquely identifiable by their attributes. Additionally, we denote $\xi_f = T_{start} \times T_{complete} \times C \times A_c$ as the universe of abstracted events. The abstracted event log is represented as $L_f \subseteq \xi_f$, and each event $e_f \in L_f$ is an n-tuple $e_f = (t_{start}, t_{complete}, c_c, a'_c)$, where $\{t_{start}, t_{complete}\} \subseteq T$ are the start and completion timestamps of an event, respectively, representing a full software execution life-cycle, $c_c \in C$ is the case id, and $a'_c \in A_c$ is the abstracted activity name.
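Purely as an illustration (the type and field names below are ours, not prescribed by the paper), the two event universes can be represented as typed records:

from typing import NamedTuple

class Event(NamedTuple):
    """Low-level event e = (t, c, a_c, a_s) from the universe xi = T x C x A_c x A_s."""
    t: int        # event timestamp
    c: str        # case id
    a_c: str      # client-side activity
    a_s: str      # server-side activity

class AbstractedEvent(NamedTuple):
    """Abstracted event e_f = (t_start, t_complete, c, a_c') from the universe xi_f."""
    t_start: int
    t_complete: int
    c: str
    a_c: str      # abstracted (possibly concatenated) client-side activity name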
(Trace) Let $x$ be a set of activities with $x \subseteq e$. Then $\sigma_x = \langle e^x_1, e^x_2, \ldots, e^x_n \rangle$ is defined as a trace in the event log, s.t. for each $e^x_i, e^x_j \in \sigma_x$ with $1 \leq i < j \leq n$: $\pi_C(e^x_i) = \pi_C(e^x_j)$ and $\pi_T(e^x_i) \leq \pi_T(e^x_j)$. A trace $\sigma \in L$ is a sequence of activities. Note that there are no special start and end activities.
(Labeled Petri net) A Petri net is an n-tuple $N = (P, T, F)$, where $P$ is the set of places, $T$ the set of transitions, $P \cap T = \emptyset$, and $F \subseteq (P \times T) \cup (T \times P)$ is the flow relation. A labeled Petri net $N = (P, T, F, l, A)$ extends a Petri net with a partial labeling function $l: T \nrightarrow A$ that maps a transition to an activity from $A$. As an example, if $l(t) = a$, then $a$ is the observed activity when transition $t$ is executed.
4 APPROACH
Figure 1 presents an overview of our approach to abstracting event logs in client-server applications, accompanied by an example of the event logs' transformation throughout the procedure. Note that in the sample logs of Figure 1, client activity names are represented as [activityName]@[webPage] to indicate the web page on which a client activity occurs. In the following, we elaborate on every step of the recursive log abstraction.
Algorithm 1: Recursive Event Log Abstraction.
Input: Fine-granular event log L
Output: Abstracted event log L_f
1  set PCC threshold (Θ) to 1;
2  discover Petri net model for L using the Inductive Miner;
3  clone L to L' as a temporary event log;
4  while fitness ≥ 0.8 do
5      foreach L' do
6          convert L' to a vectorized weighted matrix M_{A_c × A_s};
7          foreach e do
8              calculate the vector similarity() between every a_c;
9              if similarity ≥ Θ then
10                 relabel the activity a_c by the concatenation of the two similar activity names a'_c;
11                 replace the corresponding event in L' with e' = (t, c, a'_c, a_s);
12             end
13         end
14     end
15     lower PCC threshold Θ by ϒ
16 end
17 foreach c in L' do
18     while a'_{c_i} = a'_{c_j} do
19         capture the first (t_start) and last (t_complete) occurring event for the activity a'_c;
20         create event e_f = (t_start, t_complete, c, a'_c), and append to L_f;
21     end
22 end
23 return L_f
Additionally, Algorithm 1 demonstrates the recursive log abstraction process using the PCC (Benesty et al., 2009). The PCC enables us to calculate the similarity of different client-side activities by assessing the occurrences of their corresponding server-side activities.
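As a minimal, self-contained sketch (ours; the paper defers the PCC formula itself to (Benesty et al., 2009)), the similarity of two client-side activities can be computed from their server-side occurrence vectors as follows; the example vectors are hypothetical:

import math

def pcc(x, y):
    """Pearson correlation coefficient between two equally long occurrence vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined for constant vectors; treated as dissimilar in this sketch
    return cov / math.sqrt(var_x * var_y)

# Hypothetical server-side occurrence vectors of two client-side activities,
# e.g., counts of getUser(), getInfo(), getFile() calls they trigger:
upload_manage = [3, 1, 2]
upload_admin = [6, 2, 4]
print(pcc(upload_manage, upload_admin))  # 1.0 -> merge candidates while the threshold is 1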
Our approach applies the IM algorithm to the event log for the process model discovery task (step 1 of Fig. 1). We clone the original event log $L$ to a temporary event log $L'$, which is further abstracted and reused in the recursive process (step 2 of Fig. 1). Then, the abstracted event log is replayed over the model discovered from the original log to evaluate the model fitness (step 3 of Fig. 1). During this process, the fitness of the discovered model is calculated as $\mathit{fitness} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}$.
Figure 1: The general overview of the approach and sample event logs transformation throughout the abstraction process.
The True Positive is the number of traces that align with the discovered model, and the False Negative is the number of traces that do not align with the discovered model (Jouck et al., 2018). If the discovered model can completely replay the traces from $L'$, the model's fitness is 1. Hence, as the log abstraction increases, the fitness decreases.
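Under this trace-level reading, the fitness computation reduces to the fraction of replayable traces. The sketch below assumes that a replay check is available from a process mining toolkit (e.g., ProM); the predicate name is hypothetical:

def log_fitness(traces, replays_on_model):
    """fitness = True Positive / (True Positive + False Negative): True Positive is the
    number of traces the discovered model can replay, False Negative the number it cannot."""
    tp = sum(1 for trace in traces if replays_on_model(trace))
    fn = len(traces) - tp
    return tp / (tp + fn)  # equivalently tp / len(traces)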
In line 6 of Algorithm 1, the event log with events $e = (t, c, a_c, a_s) \in L$ is transformed into a vectorized descriptive matrix $M_{A_c \times A_s}$: the first column of the matrix holds the client-side activities ($a_c$), and every other column represents a server-side activity ($a_s$). Each row carries the number of executions of the server-side activities for the corresponding client-side activity, i.e., we increment the number of $a_s$ occurrences per $a_c$ (step 4 of Fig. 1). The weighted matrix $M$ enables vector similarity analyses with higher accuracy.
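Line 6 of Algorithm 1 can be sketched as a nested counting structure (our illustration; the fixed column order over A_s is an implementation choice, not specified by the paper):

from collections import defaultdict

def build_weighted_matrix(events):
    """Build M over A_c x A_s: for every client-side activity a_c, count how often each
    server-side activity a_s occurs for it. `events` are (t, c, a_c, a_s) tuples."""
    counts = defaultdict(lambda: defaultdict(int))
    server_activities = set()
    for _t, _c, a_c, a_s in events:
        counts[a_c][a_s] += 1
        server_activities.add(a_s)
    columns = sorted(server_activities)
    # One weighted row vector per client-side activity, aligned to `columns`.
    matrix = {a_c: [row[a_s] for a_s in columns] for a_c, row in counts.items()}
    return matrix, columns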
In the block between lines 7 and 13 of Algorithm 1, we evaluate the similarity between pairs of client-side activities (step 5 of Fig. 1) and relabel the log by concatenating (with a comma) the client-side activity names whose similarity falls within the PCC threshold (step 6 of Fig. 1). The PCC evaluates the strength of the linear association between two vector variables and therefore quantifies the similarity of two activities. As a full description of the PCC is out of this paper's scope, we refer the reader to (Benesty et al., 2009) for the formula and an extensive explanation. The result of abstraction and relabeling is then cycled back into the method (step 7 of Fig. 1).
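Lines 7-13 can then be sketched as a single abstraction pass that reuses the pcc() and build_weighted_matrix() helpers from the earlier sketches (again an illustration, not the authors' implementation; transitive merges are simply resolved in later iterations of the outer loop):

def merge_similar_activities(events, theta):
    """Relabel every pair of client-side activities whose server-side occurrence vectors
    have a PCC >= theta by concatenating their names with a comma."""
    matrix, _columns = build_weighted_matrix(events)
    names = list(matrix)
    relabel = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pcc(matrix[a], matrix[b]) >= theta:
                merged = a + "," + b  # the concatenated label a_c'
                relabel[a] = relabel[b] = merged
    # Replace e = (t, c, a_c, a_s) by e' = (t, c, a_c', a_s) wherever a relabeling applies.
    return [(t, c, relabel.get(a_c, a_c), a_s) for t, c, a_c, a_s in events]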
In our proposed approach, we initialize the abstraction process with the PCC threshold Θ = 1 and gradually reduce Θ by ϒ. As shown in Figure 1, the abstraction process continues until a fitness of 0.8 is reached. We propose a fitness of 0.8 as the criterion to stop the iterative abstraction process, since below this value the model can no longer adequately replay the event log (van der Aalst, 2016; Buijs et al., 2012) and thus loses a crucial quality dimension of a process model.
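Putting the pieces together, the outer loop can be orchestrated as below. The fitness() callback stands for replaying the candidate log on the model discovered from the original log (e.g., with ProM), the step size upsilon is a tunable parameter whose default here is hypothetical, and keeping the last candidate that still meets the criterion is one possible reading of the stopping rule:

def abstract_event_log(log, fitness, upsilon=0.03, min_fitness=0.8):
    """Iteratively lower the PCC threshold from 1 by upsilon and merge similar
    client-side activities until the fitness criterion would be violated."""
    theta = 1.0
    current = list(log)  # clone L to L'
    while theta > 0:
        candidate = merge_similar_activities(current, theta)
        if fitness(candidate) < min_fitness:
            break  # stop before losing the crucial fitness quality dimension
        current = candidate
        theta -= upsilon  # lower the PCC threshold by upsilon
    return current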
The block between lines 17 and 22 transforms the event log format and thus decreases the number of events that carry the same activity within each case (step 8 of Fig. 1). At this stage, every event is accompanied by a start and a completion time, i.e., the time taken for the server-side components to process and respond to a client-side request.
Table 2: Results of each abstraction iteration over model fitness, number of activities, and total number of events.

(a) Metadata Manager tool - case study 1.
PCC similarity threshold   | 1     | 0.97-1 | 0.94-1 | 0.91-1 | 0.82-1 | 0.79-1 | 0.52-1 | 0.40-1 | 0.22-1
Fitness                    | 1     | 0.91   | 0.89   | 0.88   | 0.88   | 0.84   | 0.79   | 0.77   | 0.73
# Client activities (a_c)  | 21    | 16     | 15     | 14     | 11     | 10     | 8      | 7      | 6
# Events                   | 13794 | 10302  | 9824   | 9268   | 7261   | 6414   | 5378   | 4735   | 4019

(b) Data Archiving tool - case study 2.
PCC similarity threshold   | 1     | 0.97-1 | 0.91-1 | 0.88-1 | 0.85-1 | 0.82-1 | 0.55-1 | 0.28-1
Fitness                    | 1     | 0.98   | 0.88   | 0.83   | 0.81   | 0.76   | 0.70   | 0.59
# Client activities (a_c)  | 19    | 18     | 17     | 15     | 14     | 12     | 10     | 4
# Events                   | 10487 | 9973   | 9481   | 8297   | 7920   | 6553   | 5481   | 2203
In the case of a single occurrence of an event, the start and completion timestamps are identical, signifying an instant event. The resulting abstracted event log is $L_f \subseteq \xi_f$, where $e_f \in L_f$ and $e_f = (t_{start}, t_{complete}, c, a'_c)$.
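Finally, lines 17-22 of Algorithm 1 (step 8 of Figure 1) can be sketched as a per-case pass that collapses consecutive events carrying the same abstracted activity into a single event with start and completion timestamps (our illustration, not the authors' implementation):

from collections import defaultdict
from itertools import groupby

def to_abstracted_log(events):
    """Collapse, per case, consecutive events with the same abstracted activity a_c' into
    e_f = (t_start, t_complete, c, a_c'). A single occurrence yields t_start == t_complete,
    signifying an instant event."""
    by_case = defaultdict(list)
    for t, c, a_c, _a_s in sorted(events, key=lambda e: e[0]):  # order by timestamp
        by_case[c].append((t, a_c))
    abstracted = []
    for c, trace in by_case.items():
        for a_c, group in groupby(trace, key=lambda pair: pair[1]):
            times = [t for t, _ in group]
            abstracted.append((times[0], times[-1], c, a_c))
    return abstracted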
5 EXPERIMENTAL EVALUATIONS
To demonstrate a suggested method's quality and reliability, one needs to examine a system with at least two experiments (Dean et al., 1999). Consequently, we demonstrate the validity, robustness, and generalization of our methodology for obtaining a readable and usable abstracted event log by conducting two rounds of empirical evaluations in real-life scenarios. The event logs were obtained from information systems provided by RWTH Aachen University for the metadata management and archiving of research data. We follow the instructions of (Rafiei and van der Aalst, 2019) for securing event logs in compliance with the EU GDPR privacy regulations. Additionally, as mentioned earlier, there is an n:m relationship between client and server activities for the systems under study: every user activity triggers a set of server-side activities that process and execute the user's request for a client-side operation. For the purpose of the case studies, we collected and analyzed client-server event logs from two web applications, namely the Metadata Manager and the Data Archiving tool.
5.1 Case Study 1: Metadata Manager Tool
As the name suggests, the Metadata Manager is a web application that allows researchers at the university to provide customizable metadata schemas and enables the classification and exploration of research data by their metadata. The most typical user interactions are providing metadata schemas, searching metadata, and adding/editing metadata. Overall, we obtained event logs for a duration of six months with 573 case ids and 13794 events. The Metadata Manager generates 21 unique client-side activities and 32 server-side activities responsible for processing user requests.
Results: By applying the proposed approach, we obtained an abstracted process model portraying the presumed user interaction life-cycle. Figure 2a illustrates the complex Petri net discovered before the abstraction of the event logs, and Figure 2c shows the abstracted model discovered using our suggested approach. Table 2a presents the result of each iteration of the process abstraction until no further abstraction is possible. For the proof of concept, we executed the process beyond our suggested fitness threshold. Taking our recommended fitness criterion as the breaking point for the abstraction process, we obtain an event log with 10 activities, 6414 events, and 573 cases, with an overall fitness of 0.84. Furthermore, by exploring the results in Table 2a, we observe that a suitable PCC threshold for classifying similar events and combining them without compromising the fitness of the Metadata Manager tool is 0.79-1. Note that no further abstraction is possible below the fitness of 0.73, as no two events are similar below the PCC threshold of 0.22-1. Thus, we achieved a 52.39% decrease in the number of activities and a 53.51% reduction in the total number of events for the system under study.
(a) Metadata Manager tool Spaghetti model. (b) Data Archiving tool Spaghetti model.
(c) Metadata Manager tool abstracted model. (d) Data Archiving tool abstracted model.
Figure 2: The discovered Petri net models before and after applying the event log abstraction technique.
5.2 Case Study 2: Data Archiving Tool
The Data Archiving tool is a web application for researchers of RWTH Aachen University to archive research data and restore files on demand. The users of this service can define a storage expiry date to adhere to the requirements of funding organizations, ensuring that research data remain available even after a project has ended. The most typical user activities are uploading/downloading data, searching for a research record, and restoring research data. In this case study, we collected event logs for a duration of six months with 413 case ids and 10487 events. The Data Archiving tool generates 19 unique client-side activities and 28 server-side activities responsible for processing users' client-side requests.
Results: Figures 2b and 2d illustrate the discovered process model before and after applying the abstraction technique, respectively. Checking the abstraction results at the suggested fitness criterion, we reach an event log with 7920 events and 14 client-side activities, with a fitness of 0.81. Table 2b reports the results of every abstraction iteration until no two further activities are comparable. Moreover, we find a suitable PCC threshold of 0.85-1 for achieving the desired level of abstraction. We also observe that no further abstraction is possible below the fitness of 0.59, as no two events are similar below the PCC threshold of 0.28-1. Therefore, we achieved a 26.32% decrease in the number of activities and a 24.47% reduction in the total number of events for the system under study.
6 DISCUSSION AND FUTURE WORK
We used the Directly-Follows Graph model to illustrate the abstracted models because it is easy for domain experts to understand. We received various qualitative feedback, such as: "[...] I can now understand the user interactions and assess the user behaviors without being overwhelmed by many activities that do not represent the actual processes." and "[...] I am wondering why we have such a bad performance on restoring files, I need to investigate this.", the latter referring to a previously unknown violation of a KPI for restoring files. Thus, the proposed method could successfully identify non-functional requirements and draw developers' attention to the right software components for further technical investigation.
Despite our effort to propose a generalizable event log abstraction technique, the suggested method is only applicable to client-server applications. We also have to acknowledge that one of the most time-consuming manual efforts in this technique is evaluating the input log for noise, outliers, and anomalies. Unfortunately, despite our attempt, our case study reports lack an exploration of precision and generalization as additional quality indicators, due to the extensive computing power required to calculate these factors via the ProM implementation; these calculations never terminated. Moreover, we use the concatenation of activity names for the relabeling task; this may cause readability issues for some domain experts if the number of original activities increases drastically. Lastly, we rely on event logs being generated by the OAuth2 workflow (authorization service); consequently, there may be server-side activities that are not recorded. Hence, further study is required to validate our approach while including all executing software components.
7 CONCLUSIONS
This paper describes a novel approach for event log abstraction in client-server applications. In Section 2, we discussed the state of the art, elaborated on the benefits and shortcomings of each work, and explained how our suggested method can overcome existing limitations for the systems under study. Our suggested approach enables us to gradually abstract fine-grained event logs to higher levels without losing essential information, enabling domain experts to use the appropriately abstracted discovered model for further analysis. Besides answering our research questions, we evaluated the proposed approach with the help of two real-life experiments. Overall, this work's contributions are the adaptability of the approach to any client-server application and the automatic relabeling of abstracted activity names at a desired level of granularity. Additionally, the approach enables discovering the exact software execution life-cycle, facilitating the discovery of user behavior in client-server applications with high accuracy.
REFERENCES
Begicheva, A. K. and Lomazova, I. A. (2017). Discovering
high-level process models from event logs. Modeling
and Analysis of Information Systems, 24(2):125–140.
Benesty, J., Chen, J., Huang, Y., and Cohen, I. (2009).
Pearson correlation coefficient. In Noise reduction in
speech processing, pages 1–4. Springer.
Buijs, J. C., Van Dongen, B. F., and van Der Aalst, W. M.
(2012). On the role of fitness, precision, gener-
alization and simplicity in process discovery. In
OTM Confederated International Conferences "On the Move to Meaningful Internet Systems", pages 305–322. Springer.
de Leoni, M. and Dündar, S. (2020). Event-log abstraction
using batch session identification and clustering. In
Proceedings of the 35th Annual ACM Symposium on
Applied Computing, pages 36–44.
Dean, A., Voss, D., Draguljić, D., et al. (1999). Design and analysis of experiments, volume 1. Springer.
Günther, C. W. and Rozinat, A. (2012). Disco: Discover
your processes. BPM (Demos), 940:40–44.
Jouck, T., Bolt, A., Depaire, B., de Leoni, M., and van der
Aalst, W. M. (2018). An integrated framework for pro-
cess discovery algorithm evaluation. arXiv preprint
arXiv:1806.07222.
Leemans, S. J., Fahland, D., and Van Der Aalst, W. M.
(2014). Process and deviation exploration with induc-
tive visual miner. BPM (Demos), 1295(46):8.
Liu, C., Wang, S., Gao, S., Zhang, F., and Cheng, J. (2018).
User behavior discovery from low-level software exe-
cution log. IEEJ Transactions on Electrical and Elec-
tronic Engineering, 13(11):1624–1632.
Mannhardt, F., De Leoni, M., Reijers, H. A., Van Der Aalst,
W. M., and Toussaint, P. J. (2016). From low-level
events to activities-a pattern-based approach. In Inter-
national conference on business process management,
pages 125–141. Springer.
Mannhardt, F. and Tax, N. (2017). Unsupervised event ab-
straction using pattern abstraction and local process
models. arXiv preprint arXiv:1704.03520.
Mans, R., van der Aalst, W. M., and Verbeek, H. (2014). Supporting process mining workflows with RapidProM. BPM (Demos), 56.
Politze, M., Claus, F., Brenger, B., Yazdi, M. A., Heinrichs, B., and Schwarz, A. (2020). How to manage IT resources in research projects? Towards a collaborative scientific integration environment. European Journal of Higher Education IT, 2.
Rafiei, M. and van der Aalst, W. M. (2019). Mining roles
from event logs while preserving privacy. In Interna-
tional Conference on Business Process Management,
pages 676–689. Springer.
Smirnov, S., Reijers, H. A., and Weske, M. (2012). From
fine-grained to abstract process models: A semantic
approach. Information Systems, 37(8):784–797.
Tax, N., Sidorova, N., Haakma, R., and van der Aalst, W. M.
(2016). Event abstraction for process mining using
supervised learning techniques. In Proceedings of
SAI Intelligent Systems Conference, pages 251–269.
Springer.
Valdez, A. C., Özdemir, D., Yazdi, M. A., Schaar, A. K.,
and Ziefle, M. (2015). Orchestrating collaboration-
using visual collaboration suggestion for steering of
research clusters. Procedia Manufacturing, 3:363–
370.
van der Aalst, W. (2016). Process mining: data science in
action. Springer.
Van der Aalst, W., Adriansyah, A., and van Dongen, B.
(2012). Replaying history on process models for con-
formance checking and performance analysis. Wiley
Interdisciplinary Reviews: Data Mining and Knowl-
edge Discovery, 2(2):182–192.
Van Dongen, B. F., de Medeiros, A. K. A., Verbeek, H.,
Weijters, A., and van Der Aalst, W. M. (2005). The
prom framework: A new era in process mining tool
support. In International conference on application
and theory of petri nets, pages 444–454. Springer.
van Zelst, S. J., Mannhardt, F., de Leoni, M., and
Koschmider, A. (2020). Event abstraction in process
mining: literature review and taxonomy. Granular
Computing, pages 1–18.
Yazdi, M. A. (2019). Enabling operational support in the re-
search data life cycle. In Proceedings of the first Inter-
national Conference on Process Mining, pages 1–10.
CEUR.
Yazdi, M. A. and Politze, M. (2020). Reverse engineering:
The university distributed services. In Proceedings of
the Future Technologies Conference (FTC) 2020, Vol-
ume 2, pages 223–238. Springer.
Yazdi, M. A., Valdez, A. C., Lichtschlag, L., Ziefle, M., and
Borchers, J. (2016). Visualizing opportunities of col-
laboration in large research organizations. In Interna-
tional Conference on HCI in Business, Government,
and Organizations, pages 350–361. Springer.