Implementation of Data Quality Management Application Architecture
Aji Nur Laksono, Tien Fabrianti Kusumasari and Muhammad Azani Hasibuan
Telkom University, Information Systems Department, Jl. Telekomunikasi 01, Bandung, Indonesia
Keywords:
data profiling, data cleansing, data quality, monitoring, data quality management, architecture.
Abstract:
Data is a precious asset, especially for an organization. Data must be of good, assured quality because it provides business value and supports decision-making. However, many companies today do not manage their data to a useful level of quality. In large organizations, data requirements can be very diverse: each department and division requires different business processes and IT systems, which leads to specific application systems being built on various platforms. This situation causes data to be duplicated and become inconsistent. To overcome this problem, an architecture is needed that carries out data quality management through transformation steps such as data profiling and data cleansing. In this study, the researchers developed a data quality application architecture by applying data quality management principles that include data profiling, data cleansing, and data monitoring. The results of this study are compared with similar applications to identify which features are superior in the data quality management process.
1 INTRODUCTION
In the current era of information technology, data is a critical aspect for any organization, whether profit or non-profit. Data consists of facts and figures that are processed every day, and processed data becomes valuable information that can influence the decisions taken. Data must also be of good quality, which is a vital determinant of an organization's effectiveness in providing business value (Taleb et al., 2018).
Higher data quality results in more accurate decisions. Determining the relationship between data quality dimensions and business processes helps organizations make better decisions (Panahy et al., 2014). In enterprise organizations, data requirements are very diverse. Each department and division in an enterprise has different business processes, which directs it to create a specific application system on various platforms, even though the business processes in each part still require the same data.
Poor-quality data will affect data governance in a company. Data governance is the planning, supervision, and control of data management and of the use of data and data-related resources (International, 2017). The data governance concept defines ten (10) data management functions, one of which is Data Quality Management (International, 2017). Data Quality Management (DQM) is the management of data quality in order to keep data consistent and conformant to the standards and strategies applied in the organization or company (Sabtiana et al., 2018). DQM is expected to measurably improve data quality so that business objectives can be achieved. Data quality processing consists of several processes such as data profiling, data cleansing, data monitoring, and data integration. Data profiling is the first step in data quality management, used to understand the feasibility of the data sources and the quality of each current data source (Abedjan et al., 2016). Data cleansing is a solution that can be used to overcome the data problems that commonly occur in enterprise-scale companies; it improves data quality by transforming the data to fit the business rules (Juddoo, 2015). The data monitoring stage is used to monitor and measure data quality against the business rules (International, 2017).
Previous research conducted by Febri (Dwiandriani et al., 2017) focused on designing a single-column profiling algorithm, a process for assessing data quality using the open-source Pentaho Data Integration (PDI). The results of that research are expected to be developed further so that they can later be arranged into an integrated application architecture that carries out the DQM process.
This research focuses on utilizing and refining previous research to create a web-based application
architecture that integrates the algorithms from that research, which can later be used together as features of a data quality management architecture.
2 DATA PROFILING PROCESS
One thing that can be done to maintain data quality is to conduct data profiling. Data profiling is the process of examining a user's data to gather its structure, patterns, value distribution, or other information for data governance and data quality management control (Dai et al., 2016). Data profiling is an
activity and process for determining metadata based
on a particular data set (Kusumasari et al., 2016).
The techniques and processes of data profiling can be grouped into three categories, namely Structure Discovery, Content Discovery, and Relationship Discovery (Institute, 2017). In general, structure discovery can also be categorized as structure analysis: it analyzes the metadata and the pattern structure in the data and ensures that the data has a consistent format. The point of structural analysis is to find data that does not match the expected pattern or standard. Content discovery is a process of analyzing each element of the database in more detail to check data quality (Institute, 2017). Common data problems are caused by data entries that are messy and do not match the standard pattern. Data that does not conform to the standard has the potential to cause problems, ranging from misinterpretation of the data to incorrect policies. Standardization in the content discovery process is one of the leading solutions for minimizing these problems. Relationship discovery handles the relationships between data from various sources to gain knowledge about the correlations between the data needed. Related research conducted by Febri applies a data profiling algorithm that processes a text facet of the data using a fingerprint clustering technique (Dwiandriani et al., 2017). The text facet process relates to one of the processing techniques in data profiling, namely content discovery: the algorithm determines a pattern to test the consistency of the data. Research conducted by Sandy produced a data profiling algorithm that can determine the pattern of the data (Amethyst et al., 2018); this data pattern can be categorized as content discovery. Other related research conducted by Margo created a data deduplication algorithm that can detect duplicate data between columns and tables (Gunatama et al., 2019), which can be categorized as relationship discovery.
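To make the content discovery idea concrete, the following is a simplified sketch of pattern-based consistency checking in PHP, the language of the presentation layer described later in this paper. It only illustrates the general technique; it is not the fingerprint clustering or pattern algorithm from the cited studies, and the sample column values are invented.

```php
<?php
// Simplified illustration of content discovery: map each value to a pattern
// signature (letters -> "A"/"a", digits -> "9") and flag values whose pattern
// deviates from the most common one. Not the exact algorithm from the cited research.

function patternOf(string $value): string
{
    $pattern = '';
    foreach (str_split($value) as $char) {
        if (ctype_upper($char))     { $pattern .= 'A'; }
        elseif (ctype_lower($char)) { $pattern .= 'a'; }
        elseif (ctype_digit($char)) { $pattern .= '9'; }
        else                        { $pattern .= $char; } // keep punctuation as-is
    }
    return $pattern;
}

// Hypothetical column values from a source table.
$phoneNumbers = ['0812-3456-789', '0813-2222-111', '62 812 99', '0811-7777-000'];

// Count how often each pattern occurs in the column and take the dominant one.
$counts = array_count_values(array_map('patternOf', $phoneNumbers));
arsort($counts);
$dominantPattern = array_key_first($counts);

// Values that do not follow the dominant pattern are potential anomalies.
foreach ($phoneNumbers as $value) {
    if (patternOf($value) !== $dominantPattern) {
        echo "Anomaly: $value (pattern " . patternOf($value) . ")\n";
    }
}
```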
3 DATA CLEANSING PROCESS
Data cleansing is a process that transforms data into data that conforms to the rules of an organization (Azeroual et al., 2019). Data quality problems can be solved by initiating data cleansing. In general, data quality problems are caused by differences in data format across the parts of the application platform, so that data entered into the data warehouse becomes inconsistent with other data. This causes a loss of quality in data that should be important to process. The data warehouse itself requires data cleansing: data that is loaded continuously every day may contain dirty data, and dirty data collected continuously without cleansing can have fatal consequences for decision-making. Because clean data is an essential requirement for any sales, marketing, and distribution strategy, data cleansing is one of the main problems in data processing (Kumar and Dixit, 2017).
Related research conducted by Alfi (Khoirunisa, 2017) concerns data cleansing; that study built a data cleansing algorithm using Pentaho Data Integration, based on user input and automatic logic.
4 METHOD
The method used in this study consists of four steps. The first step is to determine the proposed application architecture. The second step is to determine the feature components. The third step is the adjustment of the Pentaho algorithm. The final step is the evaluation.
Figure 1: Research Methodology
The first step, determining the proposed application architecture, analyzes which architectural models are suitable for the application being developed. The application architecture was made to turn the data cleansing and profiling algorithms from previous research into more dynamic ones. The application architecture is created as a container for executing the Pentaho files,
which can later be integrated as a data profiling fea-
ture on data quality management applications.
The next stage determines the feature components in the proposed application architecture. The researcher groups criteria as reference aspects for determining the features to be developed. From these criteria, the feature components contained in the proposed application are determined.
The next step is the adjustment of the Pentaho algorithm. The Pentaho algorithm made in previous research was developed into a dynamic algorithm that can receive data through several core parameters required by the algorithm to perform data profiling.
The final step is evaluation and comparison. In this phase, the features and components of the proposed application architecture are compared with existing applications related to data quality processing.
4.1 Application Architecture
The application was made to develop previous research on data profiling. In that research, data profiling algorithms were built with the Pentaho Data Integration tool. So that the algorithms can be used further as functions that run widely and dynamically, an architecture is designed to run them. The Pentaho algorithm is packaged as a file that can be executed using a job trigger available in Pentaho. The Pentaho algorithm can be interpreted as a logic function that executes a command to perform data profiling. The algorithm file is executed by the user through the front-end application, and the results are then stored in the database.
Figure 2: Proposed Application Architecture
Figure 2 shows the proposed architecture. The architecture is created using the three-tier architecture model, with three layers: the presentation layer, the logic layer, and the data layer.
Presentation layer: the layer that links the user and the application; it is a web application built with the PHP programming language and the Laravel framework.
Logic layer: the layer where the logic algorithm of each feature is executed. Here, Pentaho Data Integration processes the data obtained so that it runs according to the expected feature functions.
Data layer: the layer that stores data, both the source data to be processed or exported and the data produced by the processing algorithms.
The selection of this three-tier model is in line with the adjustments to the Pentaho algorithm: at the presentation layer, users enter the data needed by the data profiling algorithm. From the presentation layer, the algorithm is executed with that data passed as parameters. The algorithm is executed by triggering the Pentaho job through a terminal call of the following form:
<pentaho directory>\pan.bat /file:"<pentaho file directory>" /param:"param_name=param_value"    (1)
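As an illustration only, the following is a minimal PHP sketch of how the Laravel presentation layer might build and run this Pan command with parameters taken from a user's form input. The installation path, transformation file, and parameter names (TABLE_NAME, COLUMN_NAME) are hypothetical and would have to match the actual transformation files from the previous research.

```php
<?php
// Minimal sketch of triggering the Pan job in (1) from the PHP/Laravel
// presentation layer. Paths and parameter names are illustrative assumptions.

function runProfilingTransformation(string $table, string $column): int
{
    // Assumed locations of the Pentaho installation and the transformation file.
    $panBat  = 'C:\\pentaho\\data-integration\\Pan.bat';
    $ktrFile = 'C:\\dqm\\transformations\\pattern_profiling.ktr';

    // Form input from the presentation layer becomes named parameters,
    // following the /param:"name=value" syntax shown in (1).
    $command = sprintf(
        '%s /file:"%s" /param:"TABLE_NAME=%s" /param:"COLUMN_NAME=%s"',
        $panBat,
        $ktrFile,
        escapeshellcmd($table),
        escapeshellcmd($column)
    );

    // exec() runs the terminal command; an exit code of 0 means Pan
    // completed the transformation successfully.
    exec($command, $output, $exitCode);

    return $exitCode;
}
```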
4.2 Feature Components
To measure the data quality management function itself, the researcher divides the requirements of the DQM process into several criteria. The comparison parameters include several discussion criteria, namely data sources, core functions, and development platforms.
Figure 3: Architecture Parameters.
For the data source function itself, because the architecture applies the Pentaho Data Integration algorithms from the related research to the core data profiling and data cleansing processes, the variety of data sources that can be processed depends on the capabilities of the Pentaho tools. Pentaho Data Integration is known to be able to collect data from various data source platforms, for example relational databases, text files, and CSV files.
For the core process, namely data profiling, the researcher takes functions based on the three technical dimensions of data profiling, namely content discovery, structure discovery, and relationship discovery. The algorithms are taken from research on data profiling processes in different studies.
The following table explains the mapping of the data profiling process functions in this study.
Figure 4: Data Profiling Description.
Another study related to data cleansing, by Aulia, developed null cleansing and pattern cleansing. These functions are taken based on the implementation of the existing profiling process, namely the pattern and null profiles, because according to the rules of data cleansing, the data is cleaned after anomalous data has been identified in the previous process, namely data profiling.
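As a simple illustration of the null cleansing idea, the sketch below replaces empty or null values flagged by profiling with an agreed default. It is only a sketch under assumed data and business rules, not the algorithm from the cited study.

```php
<?php
// Simplified illustration of null cleansing: after profiling has flagged
// empty values in a column, replace them with a default so the data
// conforms to the business rule. Sample rows and the default are invented.

$rows = [
    ['trader_name' => 'PT Maju', 'city' => 'Bandung'],
    ['trader_name' => '',        'city' => 'Jakarta'],   // anomaly found by profiling
    ['trader_name' => null,      'city' => 'Surabaya'],  // anomaly found by profiling
];

$defaultName = 'UNKNOWN';  // assumed business-rule default

foreach ($rows as &$row) {
    if ($row['trader_name'] === null || trim((string) $row['trader_name']) === '') {
        $row['trader_name'] = $defaultName;
    }
}
unset($row);  // break the reference left by foreach-by-reference

print_r($rows);
```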
The monitoring function itself is the phase that follows data profiling. The data profiling results stored in the database are retrieved and displayed in graphical form. According to its function, data monitoring is carried out to watch for data anomalies periodically and consistently.
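As an illustration of this monitoring step, the sketch below shows how the presentation layer could read the stored profiling results and shape them into points for a chart. The table and column names (profiling_results, running_id, executed_at, anomaly_count) are assumptions for the example, not the schema used in this study.

```php
<?php
// Minimal sketch (assumed schema) of retrieving stored profiling results
// so the presentation layer can render them as a monitoring chart.
$pdo = new PDO('mysql:host=localhost;dbname=dqm', 'dqm_user', 'secret');

// One data point per profiling run: when it ran and how many anomalies it found.
$statement = $pdo->query(
    'SELECT running_id, executed_at, anomaly_count
     FROM profiling_results
     ORDER BY executed_at'
);

$chartPoints = [];
foreach ($statement as $row) {
    $chartPoints[] = [
        'label' => $row['executed_at'],          // x-axis: time of the run
        'value' => (int) $row['anomaly_count'],  // y-axis: anomalies detected
    ];
}

// Returned as JSON, these points can be drawn periodically by a chart library.
echo json_encode($chartPoints);
```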
4.3 Adjustment of the Pentaho
Algorithm
The application architecture is built using algorithms from other studies. However, these algorithms need to be adjusted to the application architecture used. This adjustment is carried out as a development of the previous research so that the algorithms can be used as part of the features of the application architecture that has been designed.
Figure 5: Transformation properties
Variables are included as additional parameters to replace previously fixed values such as the database configuration, the column, and the table. By changing these values into variables passed as parameters, the data can be taken from the input form in the application layer and then filled in according to the specified variable parameters. Figure 5 shows the properties of the Pentaho transformation file used to declare the parameters that will be captured and then inserted into the flow of the algorithm.
There are two ways to apply a variable as a parameter: the first is to call the parameter directly through its variable with ${parameter}; the second is to first bring it into the row stream through the Get Variables step in Pentaho.
To adjust the storage of results in the database, the component at the end of the algorithm flow that stores the result data is adjusted first. Data is input through the form in the presentation layer; this form input becomes the parameters that are passed to the application layer as variables for running the existing algorithm. In previous studies, these values were predetermined, or static. In this adjustment, the dynamic data needed is changed into new parameter variables obtained when the logic is run on the application layer using a job trigger. The job trigger used in this adjustment is Pan, which is executed through a terminal command called from the Laravel framework as the presentation layer. If all the data entered is appropriate and the algorithm is executed successfully, the data is ready to be entered into the database. Before the results are stored, a running ID is added to identify each process. In addition to the running ID, a label is also included as a reminder for the user when processing features. Finally, this additional information is entered into the database together with the data from the process.
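To illustrate this step, the following minimal sketch tags the processed output with a running ID and a label before storing it. The table name profiling_results and its columns are illustrative assumptions rather than the schema of the original research.

```php
<?php
// Minimal sketch (assumed schema) of adding a running ID and a user label
// to the profiling output before it is written to the result database.
$pdo = new PDO('mysql:host=localhost;dbname=dqm', 'dqm_user', 'secret');

// A unique running ID identifies each execution of the algorithm.
$runningId = uniqid('run_', true);
$label     = 'Pattern profiling of trader table';  // user-supplied reminder

// $profilingOutput stands in for the rows produced by the Pentaho transformation.
$profilingOutput = ['pattern' => 'Aa9', 'count' => 120];

$insert = $pdo->prepare(
    'INSERT INTO profiling_results (running_id, label, source_table, result_json)
     VALUES (:running_id, :label, :source_table, :result_json)'
);

$insert->execute([
    ':running_id'   => $runningId,
    ':label'        => $label,
    ':source_table' => 'ereg_trader',
    ':result_json'  => json_encode($profilingOutput),
]);
```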
In addition to the algorithm data, the previous database design needed a column to
store the running ID identifying each process that was run, and a column to store the details of the source data connection.
Figure 6: Adjust Pentaho algorithm
Figure 7: Algorithm before adjustment
Figure 8: Algorithm after adjustment
Figure 7 shows an example of an algorithm from a previous study on pattern profiling, to which several component steps have been added in the Pentaho Kettle transformation shown in Figure 8. The same adjustment is also made to the cleansing algorithm for the corresponding function.
Figure 9 describes the components added as algorithm adjustment steps for the application architecture.
Figure 9: Component description.
5 DISCUSSION
There are similar applications for data quality processing whose purpose is to examine the data and discover its characteristics. One of them is DataCleaner. Given the variety of open-source applications available, the researcher carries out the evaluation by comparing the proposed application with the DataCleaner application.
When compared with the DataCleaner application, there are some differences across the comparison criteria. For the data source criterion, the proposed application can accommodate text data as well as relational databases; this range is based on the ability of Pentaho Data Integration to collect data sources. In DataCleaner, the data sources can be more varied, consisting of text, relational databases, Big Data, and CRM software. Among the primary functions, namely data profiling, data cleansing, and data monitoring, only data monitoring is not found in the DataCleaner application. In the proposed application architecture, the data cleansing and data profiling features are built by developing algorithms from previous research on data processing with Pentaho Data Integration. For the development platform, the DataCleaner application was developed as a Java desktop application, whereas the proposed application architecture uses PHP at the presentation layer and Java at the application layer, namely
Pentaho Data Integration. The comparison can be seen in Figure 10.
Figure 10: Comparison Evaluation.
Next, the results of the data profiling and data cleansing functions were compared. The data analyzed in this paper were obtained from a government agency dataset containing three tables, namely ereg_pattern, ereg_pabrik, and ereg_trader, with comparison columns that differed for each function. The data deduplication function compares column data from two different tables, because data deduplication itself is a form of relationship discovery that compares the relationship between one column and another.
Figure 11: Result comparison.
Among the functions compared, there are no differences, or only insignificant ones. However, for the data completeness function, DataCleaner can only detect null or blank values, so it cannot be compared directly with the corresponding function of the application in this study, which detects data completeness in more detail at the content level. For data deduplication, the researcher did not find a similar function in the DataCleaner application.
ACKNOWLEDGEMENTS
This research is one of the studies on data qual-
ity management funded by Telkom University’s re-
search and community service directorate. We want
to thank Telkom University and the Industrial Engi-
neering Faculty for supporting the research that we
have done. We also thank all the data quality management research teams that are part of the Enterprise System Development expertise group at Telkom University.
REFERENCES
Amethyst, S., Kusumasari, T., and Hasibuan, M. (2018).
Data pattern single column analysis for data profiling
using an open source platform. In IOP Conference
Series: Materials Science and Engineering, volume
453, page 012024. IOP Publishing.
Azeroual, O., Saake, G., and Abuosba, M. (2019). Data
quality measures and data cleansing for research in-
formation systems. arXiv preprint arXiv:1901.06208.
Dai, W., Wardlaw, I., Cui, Y., Mehdi, K., Li, Y., and Long, J.
(2016). Data profiling technology of data governance
regarding big data: review and rethinking. In Informa-
tion Technology: New Generations, pages 439–450.
Springer.
Dwiandriani, F., Kusumasari, T. F., and Hasibuan, M. A.
(2017). Fingerprint clustering algorithm for data pro-
filing using pentaho data integration. In 2017 2nd
International conferences on Information Technology,
Information Systems and Electrical Engineering (ICI-
TISEE), pages 359–363. IEEE.
Gunatama, M., Fabrianti, T., and Hasibuan, M. A. (2019).
Deduplication for data profiling using open source
platform. In 2018 International Conference on In-
dustrial Enterprise and System Engineering (ICoIESE
2018). Atlantis Press.
Institute, S. (2017). What is data profiling and how does it
make big data easier?
International, D. (2017). DAMA-DMBOK: Data Manage-
ment Body of Knowledge. Technics Publications,
LLC.
Juddoo, S. (2015). Overview of data quality challenges in
the context of big data. In 2015 International Con-
ference on Computing, Communication and Security
(ICCCS), pages 1–9. IEEE.
Khoirunisa, A. N. (2017). Analysis and design of application architecture data cleansing based on open source in XYZ organization. Bandung: Telkom University.
Kumar, S. and Dixit, M. (2017). Analysis of various data
cleansing approaches: A survey. International Jour-
nal, 7(5).
Kusumasari, T. F. et al. (2016). Data profiling for data qual-
ity improvement with openrefine. In 2016 Interna-
tional Conference on Information Technology Systems
and Innovation (ICITSI), pages 1–6. IEEE.
Panahy, P. H. S., Sidi, F., Affendey, L. S., and Jabar, M. A.
(2014). The impact of data quality dimensions on
business process improvement. In 2014 4th World
Congress on Information and Communication Tech-
nologies (WICT 2014), pages 70–73. IEEE.
Sabtiana, R., Yudhoatmojo, S. B., and Hidayanto, A. N.
(2018). Data quality management maturity model: A
case study in bps-statistics of kaur regency, bengkulu
province, 2017. In 2018 6th International Conference
on Cyber and IT Service Management (CITSM), pages
1–4. IEEE.
Taleb, I., Serhani, M. A., and Dssouli, R. (2018). Big data
quality assessment model for unstructured data. In
2018 International Conference on Innovations in In-
formation Technology (IIT), pages 69–74. IEEE.