Analysis and Design of Data Quality Monitoring Application using Open Source Tools: A Case Study at a Government Agency

Mohammad Reza Effendy, Tien Fabrianti Kusumasari and Muhammad Azani Hasibuan
Department of Information System, Telkom University, Bandung, Indonesia
Keywords: Data Quality, Monitoring, Data Quality Management, Open-source.
Abstract: Data is an essential component in managing information systems in an organization. Good data quality management can be used by organizations to shape business policies. Many organizations have managed their data to achieve good data quality, but many others have not. A government agency in Indonesia has problems in managing data quality, such as a large amount of null data, inconsistent data patterns, data duplication, alternating data-entry policies, and the absence of a data quality monitoring application. To overcome these problems, this paper proposes a data quality monitoring process that monitors data quality in the organization.
1 INTRODUCTION
Organizations realize that developing a quality information system brings significant business benefits. Using data to set goals and make decisions makes data an essential component of information systems (Leonard, 2018). Data governance, as part of corporate and information technology governance, is defined as the data management techniques that treat data as part of an organization's assets (Brackett and Earley, 2009). Data governance is a series of processes, policies, standards, organizations, and technologies that ensure the availability, accessibility, quality, consistency, auditability, and security of data in an organization (Panian, 2010).
When an organization takes information system quality issues seriously, it must manage the quality of its data (Leonard, 2018). Data quality plays a vital role in data governance: it builds the reputation of an organization and underpins decision making, operational processes, and transaction processes (Herzog et al., 2007). The primary purpose of handling data quality problems is to improve the accuracy, completeness, age (currency), and reliability of data. Managing data quality problems is one of the functions in data governance, namely Data Quality Management (DQM) (Brackett and Earley, 2009). The DQM process is divided into several stages, from data production through data storage, data transfer, data sharing, and data usage (Rouse, 2018). DQM consists of several phases, namely data profiling, data cleansing, data quality assessment, and data quality monitoring (Apel et al., 2015). Data quality monitoring is also described as monitoring and ensuring that data (static and streaming) conforms to business rules and can be used to determine the quality of data in an organization (Judah and Friedman, 2014).
In this study, a government agency in Indonesia exhibits common mistakes in its data entry process, for example empty data, the absence of standard data formats, and data redundancies that remain a problem for the agency. These errors cannot be monitored automatically by any application, and if left unchecked they will become an ongoing problem because the data will continue to grow. Given these conditions, a data quality monitoring application is needed to monitor data quality automatically. This study was made to design a data quality monitoring application for the case study of a government agency.
2 THEORY AND RELATED WORK
2.1 Data Quality Monitoring
The Data Management Association states that there are four processes in managing data quality: plan, deploy, monitor, and act (Brackett and Earley, 2009). The plan process assesses the scope of data quality issues. The deploy stage analyzes data profiles. The monitor stage monitors data quality and measures it against business rules. The act stage is the last stage, namely decision making to overcome and resolve data quality problems. One of the processes in managing data quality is Data Quality Monitoring (DQM). Data quality monitoring is a framework used to control the quality of data in an information system continuously, for example by using metrics, reports, or regular data profiling (Apel et al., 2015). Ehrlinger and Wöß (2017) divide the data quality monitoring process into four steps, namely data profiling and quality assessment, a data quality repository, time series analytics, and visualization.
Data profiling is a series of activities and processes to determine metadata about a data set (Abedjan et al., 2016). According to Abedjan et al. (2016), data profiling is divided into three groups, namely single column profiling, multiple column profiling, and dependencies. According to DataFlux Corporation, data profiling techniques involve three categories of analytical methods, namely column discovery, structure discovery, and relationship discovery (Apel et al., 2015).
Data quality assessment is a phase in DQM that is used to verify the source, quantity, and impact of each data item that violates predetermined data quality rules. Data quality standards consist of five dimensions: availability, usability, reliability, relevance, and presentation quality (Cai and Zhu, 2015). The DQ repository is divided into two components, namely Data Quality Metadata (DQMD) and the results of Data Quality Assessment (DQA). DQMD is a description of the data schema being assessed, while the DQA results form a database that stores assessment results over time (Ehrlinger and Wöß, 2017). Visualization of data quality results has been examined by Kandel et al. (2012), who highlight the need to automate this step. The visualization serves two purposes: the time series stored in the DQ repository can be mapped directly, and the results of time series analysis can be presented to the user.
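To make the single-column profiling step concrete, the following is a minimal sketch in Python. It assumes pandas as the profiling library (the paper does not name one) and computes the kinds of single-column metadata discussed above: null counts, distinct-value cardinality, and value distribution.

```python
import pandas as pd

def profile_column(series: pd.Series) -> dict:
    """Single-column profiling sketch: null count, cardinality, value distribution.

    Illustrates the single-column profiling category described by
    Abedjan et al. (2016); metric names here are examples only.
    """
    total = len(series)
    nulls = int(series.isna().sum())                  # "show null" style metric
    non_null = series.dropna()
    cardinality = int(non_null.nunique())             # number of distinct values
    top_values = non_null.value_counts().head(10).to_dict()  # value distribution
    return {
        "rows": total,
        "nulls": nulls,
        "null_ratio": nulls / total if total else 0.0,
        "cardinality": cardinality,
        "top_values": top_values,
    }

# Example usage with a hypothetical 'pabrik' table loaded into a DataFrame:
# df = pd.read_sql("SELECT * FROM pabrik", connection)
# report = {col: profile_column(df[col]) for col in df.columns}
```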
2.2 Pureshare
Pureshare is a dashboard development methodology produced by the Pureshare vendor. Pureshare aims to make projects that measure and manage organizational performance easier to carry out. The Pureshare development method starts with planning and design, which identifies the user needs as well as the features of the dashboard. After the user needs are known, the next step is the system and data review, which covers controlling the system, identifying data sources, accessing data, and measuring the size of the data. The next step is to design a prototype as quickly as possible to provide an overview of the dashboard system that will be created. The prototypes are then reviewed together with the user to gather feedback and are developed further according to the user's needs. Once the user approves the prototype and it suits the user's needs, the dashboard prototype is implemented in the release step.
2.3 Related Work
This research is a continuation of previous studies. Previous research by Amethyst et al. (2018) concerns the use of data profiling and focuses primarily on analyzing data by profiling it with the cardinalities, data pattern, and value distribution methods using open source applications. The profiling results are implemented as logic in an open source application and compared with other open-source applications. Another study used in this research is the work of Dwiandriani, whose main focus was building a data profiling architecture for counting null or blank data in a column.
3 PROPOSED METHODOLOGY
In developing this data quality monitoring application, there are two main focuses: the development of a dashboard as visualization and the development of the data quality monitoring architecture. In developing the data quality monitoring architecture, we used the concept studied by Ehrlinger and Wöß (2017). That architecture starts from determining the data source, followed by data profiling and data quality assessment, storing the results, time series analysis, and
finally visualization. In developing the dashboard application, we implement the Pureshare methodology. In this proposed methodology, the steps of developing the data quality monitoring architecture are performed within the Pureshare methodology.
This proposed methodology consists of four stages: plan and design, system and data review, prototype refinement, and release. In the first stage, plan and design, the developer identifies the user needs and the type of dashboard. In the second stage, system and data review, the developer identifies the data source; this process starts with data profiling, data quality assessment, and saving the results of profiling and assessment. The data profiling process has been carried out in previous research (Amethyst et al., 2018). After the system and data review stage, the next stage is prototype refinement, which aims to design the dashboard prototype. The prototype is reviewed periodically until the user approves the renewed dashboard prototype. The dashboard prototype is then implemented at the release stage, as shown in Figure 1. As Figure 1 shows, all of the steps of the data quality monitoring architecture are applied in this proposed methodology: data profiling, data quality assessment, time series analysis, and visualization.
Figure 1: Pureshare Methodology
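As an illustration of how the architecture steps could be chained (profiling, assessment, storage in a DQ repository, and later time series analysis), the sketch below stores each profiling run with a timestamp so that results can be compared over time. The repository schema, table names, and metric names are assumptions for illustration; the paper does not prescribe a specific schema.

```python
import sqlite3
from datetime import datetime

# Hypothetical DQ repository schema: one row per metric per column per run.
DDL = """
CREATE TABLE IF NOT EXISTS dq_results (
    running_id   TEXT,
    db_name      TEXT,
    table_name   TEXT,
    column_name  TEXT,
    metric       TEXT,
    value        REAL,
    run_date     TEXT
)
"""

def store_results(conn, running_id, db_name, table_name, metrics):
    """Persist one profiling run so the dashboard can query it as a time series."""
    now = datetime.now().isoformat(timespec="seconds")
    rows = [
        (running_id, db_name, table_name, col, metric, float(value), now)
        for col, col_metrics in metrics.items()
        for metric, value in col_metrics.items()
    ]
    conn.executemany("INSERT INTO dq_results VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect("dq_repository.db")
conn.execute(DDL)
# Metrics produced by a profiling step (values here are made up):
store_results(conn, "run-001", "agency_db", "pabrik",
              {"nama": {"nulls": 12.0, "null_ratio": 0.03}})
```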
4 RESULT AND ANALYSIS
4.1 Study Case Analysis
In this study, the researcher uses three tables of data from a government agency in Indonesia: a pabrik (factory) table, a trader table, and a merk (brand) table. The government agency still has difficulties in managing the quality of its data. The challenges are that there is a lot of empty data in each column, the data patterns within a column are inconsistent, and the value distribution of a column is still unknown. These problems also cannot be monitored automatically by any application, and if left unchecked they will become a continuing problem because the data will keep growing. Besides that, the agency currently runs existing applications that support the process of checking the authenticity of a product. However, these applications are not integrated with one another, there is no checking for data entry errors in the applications, and each application has its own database on a different platform.
Another problem is the alternating policies for data formats, which further reduce data quality. Considering these conditions in the government agency, a data quality monitoring application is needed that monitors data quality automatically in each database, regardless of platform. We use five profiling parameters in this case: pattern profiling, show null profiling, clustering profiling, data completeness profiling, and value distribution profiling. Pattern profiling identifies the percentage of data patterns in a column with different formats; show null profiling counts the amount of empty data in a column; clustering profiling classifies the number of clusters in each column; data completeness profiling shows the percentage of data that is valid according to the dictionary; and value distribution profiling calculates the number of items in a column.
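The sketch below illustrates two of these parameters under stated assumptions: pattern profiling implemented as a character-mask approach (one common technique; the paper does not specify the exact rule set) and data completeness checked against a reference dictionary. Column values and the dictionary are hypothetical.

```python
from collections import Counter

def value_pattern(value: str) -> str:
    """Map a value to a simple character mask, e.g. 'AB12' -> 'AANN'."""
    mask = []
    for ch in value:
        if ch.isdigit():
            mask.append("N")
        elif ch.isalpha():
            mask.append("A")
        elif ch.isspace():
            mask.append(" ")
        else:
            mask.append("S")  # symbol
    return "".join(mask)

def pattern_profile(values):
    """Share of each character pattern found in a column (pattern profiling)."""
    counts = Counter(value_pattern(v) for v in values if v is not None)
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}

def completeness_profile(values, dictionary):
    """Counts of values that match a reference dictionary (data completeness)."""
    non_empty = [v for v in values if v not in (None, "")]
    valid = sum(1 for v in non_empty if v in dictionary)
    return {"valid": valid,
            "not_in_dictionary": len(non_empty) - valid,
            "not_complete": len(values) - len(non_empty)}

# Hypothetical example values for a 'merk' column and a city dictionary:
print(pattern_profile(["AB12", "CD34", "E5"]))
print(completeness_profile(["Jakarta", "", "Bandng"], {"Jakarta", "Bandung"}))
```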
4.2 Data Quality Monitoring
Application
4.2.1 Dashboard Page
The dashboard in Figure 2 consists of five cards that show the results of each data profiling run. The pattern card shows a pie chart in which the blue part is the percentage of values that pass pattern profiling across all rules and the red part is the percentage that fail. The show null card shows the total blank/null data in each show null profiling run. The clustering card shows the total number of clusters in each clustering profiling run. The data completeness card shows a bar chart: the yellow bar describes the total valid data that matches the dictionary, the red bar describes the total of incomplete data, and a third bar describes the total of data that is not in the dictionary. The last card is the value distribution card, which shows a bar chart that outlines the amount of data distribution in each profiling run.
Figure 2: Dashboard Page
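The pass/fail percentages on the pattern card could be derived with a simple aggregation over stored profiling results; the query below is a sketch against the hypothetical dq_results schema used earlier, with assumed metric names ('pattern_pass', 'pattern_fail'), not the application's actual schema.

```python
import sqlite3

conn = sqlite3.connect("dq_repository.db")  # hypothetical DQ repository from earlier

# Aggregate pass/fail counts for the latest profiling run only.
row = conn.execute("""
    SELECT SUM(CASE WHEN metric = 'pattern_pass' THEN value ELSE 0 END) AS passed,
           SUM(CASE WHEN metric = 'pattern_fail' THEN value ELSE 0 END) AS failed
    FROM dq_results
    WHERE running_id = (SELECT running_id FROM dq_results
                        ORDER BY run_date DESC LIMIT 1)
""").fetchone()

passed = row[0] or 0
failed = row[1] or 0
total = passed + failed
if total:
    print(f"pattern passing: {passed / total:.0%}, failing: {failed / total:.0%}")
```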
4.2.2 Pattern Report Page
The pattern report page in Figure 3 consists of information about the results of pattern profiling: the column with the largest number of patterns, the column with the smallest number of patterns, the pattern passing score, the pattern failing score, the pattern checking score, and the pattern profiling table.
Figure 3: Pattern Report Page
4.2.3 Show Null Report Page
The show null report page consists of information about the results of show null profiling: the column with the highest number of blank/null values, the column with the lowest number of blank/null values, the total passed data, the total failed data, the results of all profiling runs, and the full show null profiling table, as shown in Figure 4.
Figure 4: Show Null Report Page
4.2.4 Clustering Report Page
The clustering report page in Figure 5 consists of information about the results of clustering profiling: the column with the largest number of clusters, the column with the smallest number of clusters, the cluster with the highest ratio, and the cluster with the lowest ratio.
Figure 5: Clustering Report Page
4.2.5 Data Completeness Report Page
The data completeness report page consists of information about the results of data completeness profiling: the column with the most valid data, the column with the least valid data, the completeness passed data, the completeness failed data, and the full data completeness profiling table, as shown in Figure 6.
Figure 6: Data Completeness Report Page
4.2.6 Value Distribution Report Page
The value distribution report page in Figure 7 consists of information about the results of running profiling: the column with the largest number of distributions, the column with the smallest number of distributions, the highest distribution ratio, and the lowest distribution ratio.
Figure 7: Value Distribution Report Page
4.2.7 Pattern Monitoring Page
This pattern monitoring page shows information about the results of the pattern data profiling feature. On this page, a pivot table is displayed whose data is taken from the database. Users can monitor the results of pattern data profiling based on the pivot columns provided, such as running ID, table name, column name, database name, number of patterns, and date, as shown in Figure 8.
Figure 8: Pattern Monitoring Page
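A pivot like the one described here can be produced directly from stored profiling results; the sketch below uses a pandas pivot_table over the hypothetical dq_results schema introduced earlier (the real application's schema and column names are not given in the paper and are assumed here).

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("dq_repository.db")  # hypothetical repository
df = pd.read_sql_query(
    "SELECT running_id, table_name, column_name, db_name, value, run_date "
    "FROM dq_results WHERE metric = 'pattern_count'", conn)

# Pivot: number of patterns per column, one row per run date and table.
pivot = pd.pivot_table(
    df,
    index=["run_date", "running_id", "db_name", "table_name"],
    columns="column_name",
    values="value",
    aggfunc="sum",
)
print(pivot)
```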
4.2.8 Show Null Monitoring Page
The show null monitoring page in Figure 9 shows information about the results of the show null data profiling feature. On this page, users can monitor the results of show null data profiling based on the pivot columns provided, such as running ID, table name, column name, database name, null count, and date.
Figure 9: Show Null Monitoring Page
4.2.9 Clustering Monitoring Page
This clustering monitoring page shows information about the results of running the clustering profiling feature. On this page, a pivot table is displayed whose data is taken from the database. Users can observe the results of clustering profiling based on the pivot columns provided, such as running ID, table name, column name, database name, date, clustering value, and total, as shown in Figure 10.
Figure 10: Clustering Monitoring Page
4.2.10 Data Completeness Monitoring Page
The data completeness monitoring page in Figure 11 presents information about the results of running data completeness profiling. On this page, a pivot table is displayed whose data is taken from the database. Users can observe the results of data completeness profiling based on the pivot columns. The pivot table presents data such as running ID, table name, column name, database name, date, and type.
Figure 11: Data Completeness Monitoring Page
4.2.11 Value Distribution Monitoring Page
The value distribution monitoring page presents the results of value distribution profiling. On this page, a pivot table is displayed whose data is taken from the database. Users can monitor the results of value distribution profiling based on the pivot columns provided, such as running ID, table name, column name, database name, date, column value, and column value distribution, as shown in Figure 12.
Figure 12: Value Distribution Monitoring Page
4.3 Testing
Testing of the application uses two methods: unit testing and usability testing. Unit testing was performed by the developers, while usability testing was conducted by an expert user. The measurement method for usability testing uses a Likert scale. Usability testing measures the application in terms of user interface and functionality. The result of unit testing is that no bugs were found in the application. The results of usability testing are shown in Figure 13.
Figure 13: Usability Testing of Data Quality Monitoring Application
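Since the measurement method is only described as a Likert scale, the following is a minimal sketch of how per-aspect usability scores such as those in Figure 13 could be computed from expert responses. The five-point scale, question counts, and response values are assumptions, not the study's actual data.

```python
# Hypothetical responses from an expert user on a five-point Likert scale,
# grouped by the two aspects measured: user interface and functionality.
responses = {
    "user_interface": [4, 5, 4, 4],
    "functionality": [5, 4, 5, 5],
}

for aspect, scores in responses.items():
    avg = sum(scores) / len(scores)
    print(f"{aspect}: {avg:.2f} / 5 ({avg / 5:.0%})")
```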
5 CONCLUSIONS
Based on this research, it can be concluded that this data quality monitoring application is still a prototype, because it lacks analysis for decision making and the dashboard only shows information from each profiling run. The visualization and features of the data quality monitoring application were built with the Pureshare development method, which starts from identifying needs, identifying systems and data, refining prototypes, and implementing the application, which has dashboard, reporting, and monitoring features. The application proposed in the case study has been built and adapted to the object of research. With the proposed application, the government agency used as the research object can monitor data quality based on five profiling parameters: pattern profiling, show null profiling, clustering profiling, data completeness profiling, and value distribution profiling. The research we are currently working on completes the development of data quality management at the stages of data cleansing, data integration, and data quality monitoring with other profiling parameters, and improves the performance of this application.
REFERENCES
Abedjan, Z., Golab, L., and Naumann, F. (2016). Data
profiling. In 2016 IEEE 32nd International Confer-
ence on Data Engineering (ICDE), pages 1432–1435.
IEEE.
Amethyst, S., Kusumasari, T., and Hasibuan, M. (2018).
Data pattern single column analysis for data profiling
using an open source platform. In IOP Conference
Series: Materials Science and Engineering, volume
453, page 012024. IOP Publishing.
Apel, D., Behme, W., Eberlein, R., and Merighi, C. (2015). Datenqualität erfolgreich steuern: Praxislösungen für Business-Intelligence-Projekte. dpunkt.verlag.
Brackett, M. and Earley, P. S. (2009). The DAMA guide to the data management body of knowledge (DAMA-DMBOK guide).
Cai, L. and Zhu, Y. (2015). The challenges of data quality
and data quality assessment in the big data era. Data
science journal, 14.
Ehrlinger, L. and Wöß, W. (2017). Automated data quality monitoring. In Proceedings of the 22nd MIT International Conference on Information Quality (ICIQ 2017), pages 15–1.
Herzog, T. N., Scheuren, F. J., and Winkler, W. E. (2007).
Data quality and record linkage techniques. Springer
Science & Business Media.
Judah, S. and Friedman, T. (2014). Magic quadrant for data
quality tools. Gartner.
Kandel, S., Parikh, R., Paepcke, A., Hellerstein, J. M., and
Heer, J. (2012). Profiler: Integrated statistical analysis
and visualization for data quality assessment. In Pro-
ceedings of the International Working Conference on
Advanced Visual Interfaces, pages 547–554.
Leonard, K. (2018). The role of data in business.
Panian, Z. (2010). Some practical experiences in data gov-
ernance. World Academy of Science, Engineering and
Technology, 62(1):939–946.