DA4RDM: Data Analysis for Research Data Management Systems

M. Amin Yazdi, David Schimmel, Marcel Nellesen, Marius Politze and Matthias Müller

IT Center, RWTH Aachen University, Aachen, Germany

Keywords: Pre-processing Pipeline, Web Application, Research Data Management, Data Analysis, Process Mining, Requirement Engineering.

Abstract:

Research Data Management (RDM) systems are becoming an essential part of every researcher's academic career. Researchers often use various resources and web applications to handle their research data, which complicates maintaining data and assessing research projects against the FAIR principles. RDM platforms therefore help researchers with data administration tasks while providing the necessary tools for managing research projects. Furthermore, user engagement with such RDM platforms leaves traces of user interaction with research data, which makes studying user behavior over research data a promising area. However, running periodic data analysis studies is time-consuming and challenging: it requires scientific staff to run pre- and post-processing pipelines per use case in order to produce results that are usable by domain experts. This paper introduces Data Analysis for Research Data Management systems (DA4RDM), a scalable web application that supports reusing pre-defined pre- and post-processing pipelines so that domain experts can use the system without scientific expertise. Using real data acquired from an RDM system, we explain the tool's applicability and present preliminary findings that demonstrate its use cases and capabilities.

1 INTRODUCTION

In the era of digitalization, researchers frequently face the challenge of managing their research data in order to present their findings to other scholars and ultimately enable the reuse of research data. The Research Data Life Cycle (RDLC) in figure 1 represents the stages that every researcher faces during their academic career (Yazdi, 2019). Every RDLC stage includes several sub-activities, and each university provides various IT solutions per activity. On the one hand, the RDLC guides researchers throughout their research processes, and on the other hand, RDM systems encourage researchers to follow the FAIR principles. The FAIR principles are a set of guidelines and standards to increase the transparency and impact of research results by making research data Findable, Accessible, Interoperable, and Reusable (FAIR) (Wilkinson et al., 2016). Thus, data FAIRness enables humans or machines to understand the semantics and purpose of the data (Gargiulo et al., 2021).

https://orcid.org/0000-0002-0628-4644

https://orcid.org/0000-0002-1719-8928

https://orcid.org/0000-0002-1830-5780

https://orcid.org/0000-0003-3175-0659

https://orcid.org/0000-0003-2545-5258

Accordingly, there is a need for a system to support researchers throughout the RDLC and promote FAIR guidelines for research projects. The Collaborative Scientific Integration Environment (Coscine) is a software platform at RWTH Aachen University that helps researchers throughout the RDLC and guides them toward the FAIR principles (Politze et al., 2020). Currently, Coscine enables the integration of multiple data resources and handles metadata management for research projects. The users of this platform can define research projects, invite other researchers to projects, integrate multiple storage resources, and define specialized metadata schemas to describe research data while storing and archiving data. Coscine aims to provide a centralized web platform for RDLC activities while encouraging users to adhere to the FAIR principles.

RDM systems provide the means for managing metadata and studying the RDLC process by discovering user interactions with research data. While metadata management assists scholars in understanding the semantics of research data, investigating user interactions assists Principal Investigators (PIs) and domain experts in discovering bottlenecks in the respective processes and capturing implicit user requirements (Yazdi and Politze, 2020). By collecting and capturing the footprints of user engagement, we can discover non-trivial changes to the research data and reveal insightful information, which facilitates further analytical investigations. Process and data mining techniques can effectively discover non-trivial process models and empower data-centric studies (van der Aalst, 2016). However, despite the availability of process and data mining tools, most solutions are either offline tools or developed for specialized use cases, and they are too advanced for users without technical expertise (Kebede and Dumas, 2015; Malkawi et al., 2020; Celik and Akçetin, 2018). Thus, we need a software solution that can support the full spectrum of data preparation, pre- and post-processing pipelines, modeling, and even action suggestions for RDM systems. DA4RDM, in contrast to the existing tools, enables us to connect to live data sources, run project-specific data pre-processing pipelines, and discover process models, while allowing for further scalability and adaptability of our system.

Figure 1: Presumed research data life cycle.

Yazdi, M., Schimmel, D., Nellesen, M., Politze, M. and Müller, M.

DA4RDM: Data Analysis for Research Data Management Systems.

DOI: 10.5220/0010678700003064

In Proceedings of the 13th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2021) - Volume 3: KMIS, pages 177-183

ISBN: 978-989-758-533-3; ISSN: 2184-3228

The remainder of this paper is organized as follows. Section 2 elaborates on the characteristics and functionalities of DA4RDM. Section 3 reviews a system under study and demonstrates the findings. Section 4 addresses the challenges and possible future work, and section 5 gives a brief summary of our work.

2 DA4RDM WEB APPLICATION

DA4RDM is based on the Flask web framework and consists of four modules. As illustrated in figure 2, module 1 configures a data source, module 2 provides data cleansing tools, module 3 enables transforming data according to a project's scope, and module 4 presents the findings with customizable post-processing and data modeling. The implementation of the DA4RDM web application is publicly available on GitLab.

2.1 Characteristics

DA4RDM is developed with the following character-

istics and properties:

2.1.1 Scalability

DA4RDM provides web services for connecting data sources and for customizing and executing project-specific pre-processing pipelines, and it allows for integrating Python packages for process and data mining projects. The current implementation of DA4RDM embeds the PM4Py Python package (Berti et al., 2019) into the existing service infrastructure for process discovery tasks.

2.1.2 Extendability

As shown in figure 2, the modular construction of DA4RDM allows for the integration or extension of third-party Python packages to satisfy the needs of a data analysis study. For instance, by extending the appropriate modules for data emulation, outlier detection, or data balancing and normalization, we can define and perform various data modeling projects.

2.1.3 Accessibility

We split the accessibility of DA4RDM into two

parts. Firstly, the pipeline needs to be conﬁgured

by a data scientist. It includes connecting to a data

source, specifying a suitable data query, and setting

the pre-processing pipeline per project. Once a pre-

processing pipeline is deﬁned, the steps are stored for

later re-use as predeﬁned projects. Secondly, a non-

technical user can reuse a predeﬁned project and ap-

ply additional web-based ﬁlters, attributes, and algo-

rithms on top of data, supporting post-processing the

data models without programming knowledge. Thus,

the DA4RDM client application enables principal in-

vestigators to analyze the RDM system without tech-

nical background knowledge.

2.2 Functionalities

In the following section, we elaborate on the functionalities of the DA4RDM web application. We have divided the functionality of DA4RDM into three main parts, namely data source, data pre-processing, and process analysis.

DA4RDM GitLab repository: https://git.rwth-aachen.de/AminYazdi/da4rdm

KMIS 2021 - 13th International Conference on Knowledge Management and Information Systems

[Figure 2 modules: Source Config, Data Cleansing, Data Transformation, Data Analysis. Roles: Data Scientist (Data Pre-processing Pipeline), Principal Investigator (Client Application). Labels shown in the figure: CSV, XML, XES, ...; Data base; Missing Values Detection; Data Imputation; Emulation; Duplicates Detection; Typos Detection; Outlier Detection; Syntax Error Detection; Grouping; Data Type Conversion; Concatenation; Balancing; Normalization; Reporting; Visualizations; Validation; Data Cockpit; Filters; Process Discovery.]

Figure 2: The DA4RDM overarching infrastructure and its modules.

2.2.1 Data Source

The data source module supplies unprocessed data to DA4RDM. As shown in step 1 of figure 2, the data input module currently supports uploading local files in CSV or XES format, or a direct connection to a relational database table. In the data source user interface (see figure 3a), users can provide additional parameters for each data source to further specify a file configuration or a database query command. Although data may be stored locally, every fresh execution of a pre-processing pipeline fetches the latest instance of the specified data source. Moreover, this module is responsible for serializing the data structure into a pandas DataFrame, allowing for the data modifications required in later steps of the process. Note that XES is a standardized XML-based file format used by process mining algorithms; thus, DA4RDM handles converting input data into XES format and preparing it for process mining algorithms.
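To make the CSV-to-XES step more concrete, the following is a minimal sketch using only the Python standard library. It is not the DA4RDM implementation, which relies on pandas and PM4Py; the function name `rows_to_xes`, the sample rows, and the session value `s1` are hypothetical. It groups event rows by a case attribute and writes each case as an XES trace with the standard `concept:name` and `time:timestamp` attributes.

```python
import csv
import io
import xml.etree.ElementTree as ET
from collections import defaultdict
from datetime import datetime, timezone


def rows_to_xes(rows, case_key, activity_key, timestamp_key):
    """Group event rows by case and serialise them as a minimal XES document."""
    traces = defaultdict(list)
    for row in rows:
        traces[row[case_key]].append(row)

    log = ET.Element("log", {"xes.version": "1.0"})
    for case_id, events in traces.items():
        trace = ET.SubElement(log, "trace")
        ET.SubElement(trace, "string", {"key": "concept:name", "value": case_id})
        # Events inside a trace are ordered by their Unix timestamp.
        for row in sorted(events, key=lambda r: int(r[timestamp_key])):
            event = ET.SubElement(trace, "event")
            ET.SubElement(event, "string",
                          {"key": "concept:name", "value": row[activity_key]})
            ts = datetime.fromtimestamp(int(row[timestamp_key]), tz=timezone.utc)
            ET.SubElement(event, "date",
                          {"key": "time:timestamp", "value": ts.isoformat()})
    return ET.tostring(log, encoding="unicode")


# Hypothetical CSV upload; the column names follow Table 1.
SAMPLE_CSV = """Operation,Timestamp,SessionId
View Project,1579108918,s1
View Resource,1579109840,s1
Update Metadata,1579109897,s1
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
xes = rows_to_xes(rows, case_key="SessionId",
                  activity_key="Operation", timestamp_key="Timestamp")
```

Which column serves as the case identifier is a modeling decision; the session is used here purely for illustration.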

2.2.2 Data Pre-processing

The data pre-processing interface shown in figure 3b is responsible for data cleansing and transformation. The user interface allows for selecting a pre-defined data source and a pre-processing pipeline to initialize a data-driven study. Although the pre-processing pipeline depends heavily on the properties of the input data and the objectives of a data analysis project, this tool allows for reusing a pipeline on top of operational data. Accordingly, a data scientist specifies a data source and then develops or modifies a pre-processing pipeline as needed, using the methods mentioned in steps 2 and 3 of figure 2 to match the data structure or use case at hand. Moreover, DA4RDM creates a user session that maintains the designated data source and the pre-processing pipeline outcomes to ensure a continuous workflow throughout the client application.
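One way to picture a stored, reusable pre-processing pipeline is as an ordered list of named steps with parameters that can be saved and replayed on fresh data. The sketch below assumes this design; the step names, the `STEPS` registry, and `run_pipeline` are hypothetical and are not taken from the DA4RDM code base.

```python
import json


def drop_missing(rows, columns):
    """Remove rows where any of the given columns is empty or absent."""
    return [r for r in rows if all(r.get(c) not in (None, "") for c in columns)]


def drop_duplicates(rows):
    """Remove exact duplicate rows, keeping the first occurrence."""
    seen, unique = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


def convert_type(rows, column, to):
    """Cast one column to a basic type (data type conversion)."""
    casts = {"int": int, "float": float, "str": str}
    return [{**r, column: casts[to](r[column])} for r in rows]


# Registry of available steps (hypothetical names).
STEPS = {
    "drop_missing": drop_missing,
    "drop_duplicates": drop_duplicates,
    "convert_type": convert_type,
}


def run_pipeline(rows, pipeline):
    """Apply each stored step, in order, to the rows."""
    for step in pipeline:
        rows = STEPS[step["name"]](rows, **step.get("params", {}))
    return rows


# A pipeline defined once by a data scientist and stored as JSON for reuse.
stored = json.dumps([
    {"name": "drop_missing", "params": {"columns": ["Operation", "Timestamp"]}},
    {"name": "drop_duplicates"},
    {"name": "convert_type", "params": {"column": "Timestamp", "to": "int"}},
])

raw = [
    {"Operation": "View Project", "Timestamp": "1579108918"},
    {"Operation": "View Project", "Timestamp": "1579108918"},
    {"Operation": "", "Timestamp": "1579109840"},
    {"Operation": "Update Metadata", "Timestamp": "1579109897"},
]
clean = run_pipeline(raw, json.loads(stored))
```

Because the definition is plain data, a non-technical user can rerun it on the latest instance of a data source without touching the step implementations.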

2.2.3 Process Analysis

The process analysis interface shown in figure 3c provides the toolset and user interface that principal investigators require to further investigate data without technical knowledge. At this stage, users can apply several process mining algorithms provided by the PM4Py library to discover process models and study particular user journey scenarios based on either frequency or performance analysis. Users of DA4RDM can conveniently specify the main event attributes (timestamp, case id, and activity) necessary for process discovery based on the existing data features. Moreover, the filter module allows for further narrowing the analysis to a specified dataset. Additionally, the information module provides essential statistics of the selected dataset for the specified criteria and filters. The results section visualizes the discovered model according to a chosen algorithm, such as the Alpha miner, Heuristics miner, Inductive miner, or Directly Follows Graph (DFG) (van der Aalst, 2016).
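The role of the three event attributes can be illustrated with a frequency-based Directly Follows Graph, the simplest of the models listed above. The following standard-library sketch is only an illustration and does not use PM4Py, which DA4RDM relies on; the function `discover_dfg` and the sample events are hypothetical.

```python
from collections import Counter, defaultdict


def discover_dfg(events, case_key, activity_key, timestamp_key):
    """Count how often one activity is directly followed by another per case."""
    traces = defaultdict(list)
    for event in events:
        traces[event[case_key]].append(event)

    dfg = Counter()
    for trace in traces.values():
        trace.sort(key=lambda e: e[timestamp_key])
        for current, following in zip(trace, trace[1:]):
            dfg[(current[activity_key], following[activity_key])] += 1
    return dfg


# Hypothetical events; "case" plays the role of a resource PID.
events = [
    {"case": "r1", "activity": "Add Resource", "ts": 1},
    {"case": "r1", "activity": "Upload File", "ts": 2},
    {"case": "r1", "activity": "Update MD", "ts": 3},
    {"case": "r2", "activity": "Add Resource", "ts": 1},
    {"case": "r2", "activity": "Upload File", "ts": 4},
    {"case": "r2", "activity": "View MD", "ts": 6},
]
dfg = discover_dfg(events, "case", "activity", "ts")
```

Choosing a different case attribute, for example a project instead of a resource, regroups the same events into different traces and therefore yields a different model.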

3 SYSTEM UNDER STUDY

In the following section, we demonstrate the benefits of DA4RDM for an RDM system. We begin by elaborating on the data collection method and the properties of the data collected on the Coscine platform as the RDM system under study, and then demonstrate the procedures that enable principal investigators to utilize the data and deduce valuable insights.

3.1 Coscine Platform

As mentioned earlier, Coscine is a platform that assists researchers with data management tasks and guides a research project toward FAIR maturity. In addition, researchers may use Coscine to store their research data, provide specialized metadata, and collaborate within research projects. Coscine is under active development and is built on a microservices software architecture. Various APIs and applications are employed to interact with the system such that each API is distinguishable by topic. For instance, all requests handling file modifications are processed by the same API.

(a) UI for configuration of a data source using either a local file or a database.

(b) UI for selection of a data source and a pre-defined pre-processing pipeline.

Figure 3: User interfaces for DA4RDM web application.

Table 1: Sample of transformed data objects into features and labels, enabling further data analysis using DA4RDM.

Id | Operation | Timestamp | UserId | RoleId | SessionId | ProjectId | PID | MetadataSchema | MetadataComp. | License | Discipline | Organizations
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
1021 | View Project | 1579108918 | 29613-d8... | be29c-4e... | 4b15f... | 4e9f-97... | NULL | NULL | NULL | MIT | Mechanical Eng. | ETH, Darmstadt
1022 | View Resource | 1579109840 | 29613-d8... | be29c-4e... | 4b15f... | 4e9f-97... | ef9175... | EngMeta | NULL | MIT | Mechanical Eng. | ETH, Darmstadt
1023 | Update Metadata | 1579109897 | 29613-d8... | be29c-4e... | 4b15f... | 4e9f-97... | Xn3on4... | EngMeta | 73% | MIT | Mechanical Eng. | ETH, Darmstadt
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...

3.2 Data Model

Coscine allows us to collect user-based RDM activities while formalizing appropriate data objects and relationships between attributes. The obtained information is a set of user requests received by the server side and processed according to particular application domains. Coscine generates and captures this data as serialized JSON objects, a format that scales easily to incorporate supplementary attributes and entities without the need to extend database tables. The data fields include detailed insights into the sequence of actions and the respective RDM-relevant entities. For instance, listing 1 represents a JSON object triggered by a user attempting to update the metadata of a research data entry; from this single user action, we can deduce meta information such as its PID, the selected metadata schema, the percentage of metadata provided, the discipline, and additional associated properties.

{
  "Operation": "Update Metadata",
  "Timestamp": "1579109897",
  "UserId": "29613-d8...",
  "RoleId": "be29c-4e...",
  "SessionId": "4b15f...",
  "ProjectId": "4e9f-97...",
  "PID": "ef9175...",
  "MetadataSchema": "EngMeta",
  "MetadataCompletness": "70%",
  "License": "MIT",
  "Discipline": ["Mechanical Engineering"],
  "Organizations": ["ETH", "Darmstadt"]
}

Listing 1: Sample JSON object captured on Coscine upon user action.

Subsequently, DA4RDM provides the means to utilize a dataset produced by Coscine or any other RDM system via the user interface shown in figure 3b. It allows selecting a pre-processing pipeline to evaluate and transform data samples into a new data format suitable for a data modeling algorithm. Accordingly, all JSON objects are collected and stored in a relational database, ready for import into DA4RDM. Table 1 is an example of a dataset converted into columns of features and labels, ready for in-depth process mining analysis using the DA4RDM interface displayed in figure 3c.
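The transformation from a JSON object such as listing 1 into a row such as those in Table 1 can be sketched as follows. This is an illustrative assumption about the flattening step, not the authors' pipeline: the helper `to_row` is hypothetical, the keys are spelled as in listing 1, and the sample object is a shortened, made-up event.

```python
import json

# Column order follows Table 1; keys are spelled as in listing 1.
COLUMNS = [
    "Operation", "Timestamp", "UserId", "RoleId", "SessionId", "ProjectId",
    "PID", "MetadataSchema", "MetadataCompletness", "License",
    "Discipline", "Organizations",
]


def to_row(raw_json, columns=COLUMNS):
    """Flatten one serialized event object into a flat table row."""
    obj = json.loads(raw_json)
    row = {}
    for column in columns:
        value = obj.get(column)  # absent attributes become None (NULL in Table 1)
        if isinstance(value, list):
            value = ", ".join(value)  # list attributes become one text cell
        row[column] = value
    if row["Timestamp"] is not None:
        row["Timestamp"] = int(row["Timestamp"])
    return row


# Hypothetical, shortened event without resource-level attributes.
raw = json.dumps({
    "Operation": "View Project",
    "Timestamp": "1579108918",
    "License": "MIT",
    "Discipline": ["Mechanical Engineering"],
    "Organizations": ["ETH", "Darmstadt"],
})
row = to_row(raw)
```

Mapping missing attributes to an explicit empty value keeps every row at the same width, which is what lets events of different types share one table.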

3.3 Privacy

Concerning responsible data mining principles, it is essential to maintain privacy within user-based datasets (Rafiei et al., 2018). Accordingly, Coscine, as the system under study, generates and contains no human-readable names. Instead, GUIDs are used as unique identifiers, which enables analyzing a user's actions and studying user behavior without compromising users' identities. Moreover, by relying on server-side event logs, we eliminate the need for a client-side logger, which may cause unnecessary complications or include undesired sensitive information.

3.4 Preliminary Findings

The following section elaborates on the qualitative

and quantitative ﬁndings from two case studies using

DA4RDM that are also shown in ﬁgure 4. Initially,

we investigated user activities over resource PIDs

and then analyzed overall system performance for re-

search projects on Coscine. The goal is to demon-

strate the preliminary applications of DA4RDM in

its initial stages by discovering process models for

non-trivial user interaction paths and identifying non-

functional requirements in the system.

[Figure 4a: Petri net with activity frequencies: Delete File (3), Upload MD (73), Download File (70), Open Resource (RCV) (135), Update File (46), Update MD (78), View MD (120), Add Resource (18), Upload File (55).]

(a) Frequency analysis using a Petri net model for all activities involving resource PIDs.

[Figure 4b: Directly Follows Graph over the activities Add Member, View Users, Add Project, Open Project, Add Resource, Change Role, Open User Management, Delete File, Open Resource (RCV), Delete Project, Download File, Update MD, Upload MD, View MD, View Project, Update File, and Upload File, with edge durations between 1s and 38s.]

(b) Performance analysis using a Directly Follows Graph for all research projects. Red circles indicate violated KPIs in the system.

Figure 4: Discovered process models using the Inductive Miner via DA4RDM web application.

In the first use case, we ran a frequency analysis using the inductive miner over resource PIDs and discovered a Petri net model of the most typical user journey paths for interacting with different resources and research data. We refer our readers to (Kindler et al., 2006) for further explanation of Petri net modeling. In the top right section of figure 3c, DA4RDM reports brief statistics on the number of cases, events, activities, and variations of activities. With the help of DA4RDM, as illustrated in figure 4a, we discovered that the number of executions of the Open Resource (RCV) activity is oddly excessive with respect to the number of incoming and outgoing transitions, indicating the existence of loops in the code and the need to refactor the code in order to avoid unnecessary server-side requests.
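One simple way to look for such repeated executions is to count how often an activity immediately follows itself within a trace. The sketch below is a hypothetical check, not the analysis performed in DA4RDM, and the trace is invented for illustration.

```python
from collections import Counter
from itertools import groupby


def consecutive_repeats(trace):
    """Count, per activity, how many executions directly repeat the previous one."""
    repeats = Counter()
    for activity, run in groupby(trace):
        length = len(list(run))
        if length > 1:
            repeats[activity] += length - 1
    return repeats


# Hypothetical trace of activities over one resource PID.
trace = [
    "Open Resource (RCV)", "Open Resource (RCV)", "Open Resource (RCV)",
    "View MD", "Update MD", "Open Resource (RCV)",
]
repeats = consecutive_repeats(trace)
```

A high repeat count for a single activity relative to its total executions is a hint, though not proof, that the requests originate from a loop rather than from deliberate user actions.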

In the second use case, we investigated the overall performance of the RDM system using the DFG process model shown in figure 4b and, with the help of a domain expert, evaluated the findings (shown in red circles) against key performance indicators. Despite the expected complex and unstructured process model caused by the freedom of user interaction within the system, DA4RDM successfully assisted us in identifying previously unknown bottlenecks in the system. For instance, the transition from Open User Management to View Users should not take longer than a few seconds, and the process of creating new projects (transition from Add Project to Open Project) on Coscine was discovered to take over 70 seconds. The evidence of these software execution bottlenecks gained the developer team's attention, and, as shown in figure 4b, the performance of this process has since greatly improved.
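The comparison of transition times against key performance indicators can be sketched as a performance-annotated directly-follows computation. The code below is a minimal standard-library illustration; the function names, the traces, and the 5-second threshold are assumptions chosen for the example and are not values reported by the authors.

```python
from collections import defaultdict
from statistics import mean


def performance_dfg(traces):
    """Mean seconds between directly-following activities.

    traces maps a case id to a time-ordered list of (activity, unix_ts).
    """
    durations = defaultdict(list)
    for events in traces.values():
        for (a, t1), (b, t2) in zip(events, events[1:]):
            durations[(a, b)].append(t2 - t1)
    return {edge: mean(values) for edge, values in durations.items()}


def violations(perf, kpi_seconds):
    """Return the transitions whose mean duration exceeds their KPI limit."""
    return {
        edge: seconds
        for edge, seconds in perf.items()
        if seconds > kpi_seconds.get(edge, float("inf"))
    }


# Hypothetical project traces and a hypothetical KPI limit in seconds.
traces = {
    "p1": [("Add Project", 0), ("Open Project", 72), ("Add Resource", 90)],
    "p2": [("Add Project", 100), ("Open Project", 176)],
}
perf = performance_dfg(traces)
kpi = {("Add Project", "Open Project"): 5}
slow = violations(perf, kpi)
```

Transitions without a defined limit are never flagged, so a domain expert only has to specify limits for the transitions that matter.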

The empirical study over real data implies that the cyclic order of operations within the RDLC shown in figure 1 may not reflect the reality of how researchers interact with research data. Furthermore, despite the limited data sample size, both use cases demonstrate the capabilities of DA4RDM and how the web application could be extended, by adapting the data modeling technique or visualization to a target use case, to help non-technical users investigate their research projects against certain criteria.

4 CHALLENGES AND FUTURE WORK

To evaluate the functionality of DA4RDM with a real dataset during its development, we spent a considerable amount of time finding a suitable method for obtaining and extracting event logs from Coscine as an evolving RDM system. Nevertheless, DA4RDM is designed to be independent of its input data model and allows for data processing according to foreseeable use cases. Additionally, the current implementation only allows for a single source of truth for data input, which is a limitation for RDM systems where datasets are scattered across multiple locations. The complex process model shown in figure 4b results from actual user behavior in an environment where users are free to interact with a system without a specific order. Thus, there is a need to extend DA4RDM with the pre-processing pipeline suggested in (Yazdi et al., 2021), which abstracts event logs to achieve structured and simple process models. We have to acknowledge that Coscine is a developing RDM platform and is currently in its pilot phase; hence, the available sample dataset is limited to its beta users, and the current preliminary findings may not entirely reflect the actual user behavior in a mature RDM system.

Future work includes adding interfaces for conformance checking of the user process, identifying unfinished user journeys and triggering automatic actions, and extending the collected dataset to produce a FAIR maturity dashboard for every research project and to suggest user actions that increase research data FAIRness. Although the current DA4RDM UI allows a PI to interact with the system, the post-processing of data models proved to be too advanced; therefore, we may need to pre-scope the available features of our web application for the target audience.

5 CONCLUSION

In this paper, we discussed the benefits of DA4RDM for an RDM system. We began with a technical review of the developed web application, its characteristics, and its functionalities. Furthermore, we elaborated on an RDM platform (Coscine) as a candidate system under study and the approach used to obtain a real dataset for data modeling and analysis. Preliminary findings demonstrated the benefits of such a system for user behavior studies and for discovering non-functional requirements. Although the data extracted from Coscine is entirely based on a service layer, DA4RDM has shown itself to be adaptable to any log format according to the needs of a data modeling algorithm. Therefore, the contribution of DA4RDM is a scalable web application that enables non-technical staff to reuse pre-defined pre- and post-processing pipelines to execute data-driven studies without technical or scientific expertise.

REFERENCES

Berti, A., van Zelst, S. J., and van der Aalst, W. (2019). Process mining for Python (PM4Py): bridging the gap between process- and data science. arXiv preprint arXiv:1905.06169.

Celik, U. and Akçetin, E. (2018). Process mining tools comparison. Online Academic Journal of Information Technology, 9:97–104.

Gargiulo, P., Galimberti, P., Tammaro, A. M., and Zane, A. (2021). FAIR RDM (research data management): Italian initiatives towards EOSC implementation. In IRCDL, pages 42–52.

Kebede, M. and Dumas, M. (2015). Comparative evaluation of process mining tools. University of Tartu.

Kindler, E., Rubin, V., and Schäfer, W. (2006). Process mining and Petri net synthesis. In International Conference on Business Process Management, pages 105–116. Springer.

Malkawi, R., Saifan, A. A., Alhendawi, N., and Bani-Ismaeel, A. (2020). Data mining tools evaluation based on their quality attributes. International Journal of Advanced Science and Technology, 29(3):13867–13890.

Politze, M., Claus, F., Brenger, B., Yazdi, M. A., Heinrichs, B., and Schwarz, A. (2020). How to manage IT resources in research projects? Towards a collaborative scientific integration environment. European Journal of Higher Education IT, 2.

Rafiei, M., von Waldthausen, L., and van der Aalst, W. M. (2018). Ensuring confidentiality in process mining. Proceedings of the 8th International Symposium on Data-driven Process Discovery and Analysis (SIMPDA), 18:3–17.

van der Aalst, W. (2016). Process mining: data science in action. Springer.

Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L. B., Bourne, P. E., et al. (2016). The FAIR guiding principles for scientific data management and stewardship. Scientific Data, 3(1):1–

Yazdi, M. A. (2019). Enabling operational support in the research data life cycle. In Proceedings of the First International Conference on Process Mining, pages 1–10.

Yazdi, M. A., Farhadi, P., and Heinrichs, B. (2021). Event log abstraction in client-server applications. In IC3K 2021: Proceedings of the 13th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management: KDIR. SciTePress.

Yazdi, M. A. and Politze, M. (2020). Reverse engineering: The university distributed services. In Proceedings of the Future Technologies Conference, pages 223–238. Springer.
