A CLOUD-BASED SOLUTION FOR DATA QUALITY IMPROVEMENT
Marco Comerio
University of Milano - Bicocca, Milano, Italy
Keywords:
Service Selection, Service Contract, Data Contract, Data Quality, Cloud Computing.
Abstract:
The application of techniques to improve the data quality of an organization is traditionally costly since differ-
ent specific tools are required. Potentially, cloud computing models could offer powerful solutions to reduce
costs. However, some challenges remain in the widespread acceptance of cloud computing models because
they require the sharing of business critical data. Therefore, services for data quality improvements in the
cloud should act in compliance with predefined contracts. This paper extends previous works on the specifi-
cation, selection and evaluation of service and data contracts. Moreover, a cloud-based architecture for data
quality improvement that supports contract-based service selection is proposed. Experimental activities on a
real scenario demonstrate the feasibility of the proposed solution.
1 INTRODUCTION
In the last decade, research has focused on defining data quality methodologies (Batini and Scannapieco, 2006) that provide guidelines for a rational process to select, customize, and apply data quality techniques to improve the quality of an organization's data. Techniques for data quality im-
provement can be divided into three different cate-
gories: (i) data cleansing techniques (e.g., standard-
ization, normalization and record linkage), which are
applied to correct errors, standardize information, and
validate data; (ii) data integration techniques (e.g.,
data and schema integration), which support the link-
ing of data elements about the same item available in
different data sources and the consolidation of these
data elements into a single view; (iii) data enrich-
ment techniques (e.g., data verification and data vali-
dation), which support the incorporation of additional
data from external sources to complete and verify ex-
isting records. The application of data cleansing, in-
tegration and enrichment techniques inside an orga-
nization is traditionally costly since different specific
tools are required.
The emerging cloud computing paradigm supports the Software-as-a-Service (SaaS) and Data-as-a-Service (DaaS) distribution models, in which software and data are made available as on-demand services to the customers. Potentially, these distribution models could offer powerful solutions to reduce the costs of performing data quality improvement inside an organization. The SaaS model (Viega, 2009) can be used to provide data quality improvement tools, and the DaaS model (Truong et al., 2011) can be used, e.g., for data enrichment activities. In this paper, SaaS and DaaS dealing with data quality improvement are referred to as data intensive services.
The possibility of using data intensive services in the cloud should be investigated in depth. By moving from tools and data sources inside an organization to data intensive services in the cloud, traditional data quality improvement activities must be re-formulated (Comerio et al., 2010). Moreover, some challenges remain in the widespread acceptance of the SaaS and DaaS delivery models because they require the sharing of business critical data (Li et al., 2009). Due to this fact, data intensive services should act in compliance with predefined agreements with the customer. The generic term contract is used to refer to these agreements; the specific term service contract is used when the contract deals with Quality of Service (QoS), business, contextual and legal terms related to a service, and the term data contract is used when the contract deals with the data managed by a service. Typically, service and data contracts are composed of one or more contractual terms that specify conditions established on the basis of non-functional parameters.
Current data intensive services are typically asso-
ciated with human-readable contracts that hinder their
automatic evaluation and selection. In the literature,
different approaches (e.g., policies, licenses, service
level agreements) for the specification of structured
contracts have been proposed. These approaches de-
fine contractual terms by means of <attribute,value>
clauses that prevent the inclusion of articulated con-
tractual terms specifying technological and business
interdependencies among non-functional parameters
(e.g., a higher price for the service that guarantees a
certain bandwidth). Moreover, these approaches sup-
port only the specification of service contracts with-
out any reference to the data managed by the services
(Comerio et al., 2009b).
Contract specification is not the only impediment to the use of data intensive services in the cloud. Current tools for data quality improvement (e.g., DataFlux dfPower Studio [1]) provide limited support for the integration of third-party services and do not support automatic contract-based service selection. Basically, even if they allow for the usage of external tools for specific data quality activities (e.g., geocoding services for data enrichment), the selection of these tools is performed manually.
Previous works provide contributions to automate the evaluation and selection of contracts (namely, the matchmaking process) w.r.t. a set of user requirements. In (De Paoli et al., 2008), a semantic metamodel, namely the Policy-Centered Metamodel (PCM), has been defined to support expressive descriptions of contractual terms according to a language-independent ontology. The PCM was exploited as an intermediate format to develop a hybrid matchmaking process (Comerio et al., 2009a), where hybrid indicates the capability of addressing both symbolic and numeric values and expressions. The matchmaking process has been implemented by the Policy Matchmaker and Ranker (PoliMaR) framework [2].
This paper extends these previous works by providing new contributions on modeling non-functional parameters that are relevant when performing data quality improvement activities in the cloud. Moreover, a cloud-based architecture that supports contract-based service selection is proposed. Experimental activities on a real scenario demonstrate the feasibility of the proposed solutions.
The rest of this paper is organized as follows. Sec-
tion 2 reports the new contribution on modeling con-
tracts for data intensive services. The proposed cloud-
based architecture is described in Section 3 and tested
in Section 4 on a real scenario. Related works are de-
scribed in Section 5. Section 6 concludes the paper
and outlines future works.
[1] http://www.dataflux.com/Downloads/Documentation/8.2.1/dfPowerGuide.pdf
[2] http://sourceforge.net/projects/polimar/
2 MODELING CONTRACTS FOR DATA INTENSIVE SERVICES
The analysis of existing contracts offered by real data intensive services (e.g., StrikeIron [3], Postcode Anywhere [4] and Jigsaw [5]) allows the identification of non-functional parameters that must be considered when modeling contracts. In order to allow the semantic specification of these parameters in PCM-based contracts for data intensive services, a reference ontology, namely the Data Contract Ontology (DCOnto), has been produced. In this section, the attention is focused on Data Rights terms (Truong et al., 2011) related to data handling and anonymity.

[3] http://www.strikeiron.com
[4] http://www.postcodeanywhere.co.uk
[5] http://www.jigsaw.com/
2.1 Modeling Data Handling
Several different authorizations on data handling can regulate the exchange of data between the data owner and the data intensive service provider. Examples of rights to be observed when a data owner sends data to the data intensive service and asks for a particular activity (e.g., data cleansing) are:

no-reuse: the service can use the data only to process the request and after that the data must be deleted. Re-use is not allowed.

reuse for commercial use (reuse4CU): the data are used to process the request and, possibly, to process third-party requests. Re-use is allowed.

data must be managed according to a particular Open Data Commons license [6]:

Public Domain Dedication and License (PDDL): the data can be copied and distributed. Moreover, the data can be modified and elaborated to produce new data assets. Distribution and re-use are allowed.

Attribution License (ODC-By): as the PDDL, but with the restriction that any distribution and re-use of the data (or of new data assets produced from them) must be attributed as required by the original license.

Open Database License (ODC-ODbL): as the ODC-By, with the additional restriction that the data (or new data assets produced from them) must be distributed under the same license. This hinders the commercial use of the data.

[6] http://www.opendatacommons.org/licenses
Data handling contractual terms are characterized by inclusion relations (⊑). As an example, if a data owner requests a data cleansing activity on data covered by an ODC-By license, a data intensive service that offers the same license for data handling, or a license that includes it (i.e., the more restrictive licenses ODC-ODbL, no-reuse or reuse4CU), must be selected. The complete set of inclusion relations on data handling terms is the following: ODC-ODbL ⊑ noReuse, PDDL ⊑ ODC-By, ODC-By ⊑ ODC-ODbL, ODC-By ⊑ reuse4CU and reuse4CU ⊑ noReuse.
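The following Python sketch (illustrative only, not part of DCOnto or PoliMaR; the dictionary encoding and function name are assumptions) shows how the inclusion relations listed above determine which offered data handling terms can satisfy a request.

# Illustrative sketch (not part of DCOnto or PoliMaR): the inclusion
# relations on data handling terms, encoded as term -> more restrictive
# terms that directly include it, and a check usable during selection.
INCLUDES = {
    "PDDL": {"ODC-By"},
    "ODC-By": {"ODC-ODbL", "reuse4CU"},
    "ODC-ODbL": {"noReuse"},
    "reuse4CU": {"noReuse"},
    "noReuse": set(),
}

def acceptable_offers(requested_term):
    # Return the requested term plus every term reachable through the
    # inclusion relations, i.e., all equally or more restrictive terms.
    result, frontier = set(), {requested_term}
    while frontier:
        term = frontier.pop()
        if term not in result:
            result.add(term)
            frontier |= INCLUDES.get(term, set())
    return result

# Example from the text: a request on ODC-By data can be satisfied by
# services offering ODC-By, ODC-ODbL, reuse4CU or noReuse.
print(acceptable_offers("ODC-By"))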
A fragment of DCOnto that models the data handling terms ODC-By and ODC-ODbL is shown in Figure 1. Note that the terms are modelled as instances of the DataHandlingType concept. The modeling of the inclusion relations is performed by means of pcm#composedOf relationInstances.
concept DataHandlingType
instance odcBy memberOf DataHandlingType
instance odcOdbl memberOf DataHandlingType
relationInstance pcm#composedOf(odcOdbl, odcBy)
Figure 1: Data handling terms modeling.
2.2 Modeling Anonymity
Anonymity is an important requirement when dealing with privacy-aware data intensive services. In general, data can be categorized into different classes. Among them, one class includes data, referred to as sensitive data, concerning private life, political or religious beliefs, and so on, while another class contains data describing the identity of individuals (e.g., first name, family name) (Coen-Porisini et al., 2010). A privacy-aware data intensive service must ensure that only authorized users can view the existing relationships between sensitive data and the identity of the individuals.
Different data modification methods (e.g., pertur-
bation and blocking) can be used when dealing with
the anonymity of data that needs to be released to
the public. To measure the anonymity, several met-
rics characterized by different levels of applicability,
complexity and generality are available. Similar to
data handling terms, the anonymity metrics present
inclusion relations. For example, let us consider the
metrics k-anonymity and l-diversity.
A dataset provides k-anonymity protection if the
information for each person contained in the dataset
cannot be distinguished from at least k-1 individu-
als whose information also appears in the dataset.
As shown in (Machanavajjhala et al., 2007), a k-anonymized dataset has some subtle but severe privacy problems (namely, the homogeneity attack and the background knowledge attack). The l-diversity metric solves these privacy problems by introducing the restriction that the k-anonymized individuals are characterized by at least l diverse sensitive values. Therefore, l-diversity provides privacy beyond k-anonymity, and so when a data owner requests k-anonymous data management, a data intensive service offering either k-anonymity or l-diversity (with k=l) must be selected.
Since k-anonymity and l-diversity assume numeric values, they must be modelled as quantitative terms. The PCM allows for the specification of single value terms and range terms by means of pcm#SingleValueExpression and pcm#RangeExpression. The relation k-anonymity ⊑ l-diversity cannot be directly modelled into DCOnto since it is valid only when k=l. However, it must be considered and evaluated during the service matchmaking process by means of predefined matching rules. Figure 2 shows a matching rule stating that a request for single value k-anonymity must be evaluated against offered terms on single value k-anonymity and l-diversity.
axiom anonymityMatching definedBy
  matchCouple(?request, ?offer, anonymity) :-
    (?request memberOf dcOnto#singleValueKAnonymity) and
    ((?offer memberOf dcOnto#singleValueKAnonymity) or
     (?offer memberOf dcOnto#singleValueLDiversity)).
Figure 2: Matching rule for single value anonymity terms.
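A minimal Python sketch of the same evaluation may help clarify how the rule of Figure 2 is intended to be applied; the function name is hypothetical, and treating a larger offered value as an at-least-as-strong guarantee is an assumption (the text only requires k=l).

# Illustrative sketch (not the PoliMaR matching rule itself): evaluation of
# a single value k-anonymity request against an offered anonymity term,
# mirroring Figure 2 plus the k=l condition discussed above.
def matches_anonymity(requested_k, offered_metric, offered_value):
    if offered_metric == "k-anonymity":
        return offered_value >= requested_k
    if offered_metric == "l-diversity":
        # l-diversity provides privacy beyond k-anonymity when l = k.
        return offered_value >= requested_k
    return False

print(matches_anonymity(5, "l-diversity", 5))   # True: l = k
print(matches_anonymity(5, "k-anonymity", 3))   # False: weaker guarantee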
3 AN ARCHITECTURE FOR DATA QUALITY IMPROVEMENT IN THE CLOUD
In (Comerio et al., 2010), a preliminary cloud-based
conceptual architecture for data quality improvement
has been proposed. Basically, this architecture allows
a data owner to invoke a data quality SaaS provid-
ing data cleansing, data integration and data enrich-
ment services. The data quality SaaS can directly of-
fer the services or it can act as a broker that selects
and invokes the best SaaS/DaaS over the cloud able
to satisfy the data owner request. The cloud-based
conceptual architecture in (Comerio et al., 2010) does
not provide support for the evaluation of service and
CLOSER2012-2ndInternationalConferenceonCloudComputingandServicesScience
224
data contracts. This limitation can be overcome with
the integration of a new component for contract-based
service selection.
In this paper, the resulting conceptual architecture is applied to extend the architecture of DataFlux dfPower Studio [7], a commercial solution for data quality engineering activities. Within dfPower Studio, the dfPower Architect supports the definition of workflows (namely, jobs) that are composed of different nodes (namely, steps) configurable via metadata properties. Furthermore, DataFlux also offers the Enterprise Integration Server (EIS) to support the execution of dfPower Architect jobs and their conversion to Web services. Through tight integration with the dfPower Architect, the EIS enables the complete reuse of jobs and services. However, the automatic selection and invocation of external data intensive services in the cloud is not supported.

[7] DataFlux has recently released a new version of the tool, changing its name to Data Management Studio.
Figure 3: DataFlux dfPower Studio extended architecture.
Figure 3 shows the extensions proposed for the architecture of DataFlux dfPower Studio. Two new steps are introduced: service selection and service invocation. The Service Selection step receives a Request specifying the requested contractual terms of the service to be selected and realizes the selection process on PCM-based contracts. The matchmaking process is realized through the following phases: a Term Matching phase, which identifies couples of comparable requested and offered contractual terms; a Local Evaluation, which computes a local degree (LD) of match for each couple; a Global Evaluation, which exploits the LDs to compute a global degree (GD) of match for each offered contract; and a Contract Selection, which sorts the contracts by GD and returns the best matching one. As mentioned in Section 1, the matchmaking process has been implemented in the PoliMaR framework; therefore, the usage of a service selection step in dfPower Architect jobs basically consists in an invocation of PoliMaR, seen as an external service.
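The following Python sketch gives an illustrative reading of the four phases; it is not the PoliMaR implementation, and the dictionary-based data structures, matcher functions and relevance-weighted aggregation are assumptions.

# Illustrative sketch of the four matchmaking phases (not PoliMaR itself):
# requests, relevance weights, contracts and matchers are hypothetical.
def matchmake(request, relevance, contracts, matchers):
    global_degrees = {}
    for service, contract in contracts.items():
        local_degrees = []
        for term, requested_value in request.items():
            # Term Matching: pair a requested term with a comparable offered term.
            if term not in contract or term not in matchers:
                local_degrees.append((relevance[term], 0.0))
                continue
            # Local Evaluation: local degree (LD) of match for the couple.
            ld = matchers[term](requested_value, contract[term])
            local_degrees.append((relevance[term], ld))
        # Global Evaluation: relevance-weighted aggregation of LDs into a GD.
        total_weight = sum(weight for weight, _ in local_degrees)
        global_degrees[service] = sum(w * ld for w, ld in local_degrees) / total_weight
    # Contract Selection: sort by GD and return the best matching contract.
    return max(global_degrees, key=global_degrees.get)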
The Service Invocation step receives the ID of the service to be invoked and the data to be sent to the service (e.g., the URI of an external data enrichment service and the data to be enriched). The service invocation step invokes a wrapper that supports the following three-phase process: (i) the mapping between the input of the service invocation step and the input of the selected service; (ii) the invocation of the selected service; and (iii) the mapping between the output of the selected service and the output of the service invocation step. The mapping rules used in phases (i) and (iii) are predefined and available in a rule repository.
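A minimal Python sketch of the wrapper is given below; the function and rule names are hypothetical, and the HTTP transport is an assumption since the paper does not prescribe one.

# Illustrative sketch of the three-phase wrapper process.
import requests  # assumed HTTP transport for the external service

def invoke_service(service_uri, payload):
    # Phase (ii): invocation of the selected service.
    return requests.post(service_uri, json=payload, timeout=30).json()

def run_wrapper(step_input, service_uri, rule_repository):
    rules = rule_repository[service_uri]              # predefined mapping rules
    service_input = rules["map_input"](step_input)    # phase (i): input mapping
    service_output = invoke_service(service_uri, service_input)
    return rules["map_output"](service_output)        # phase (iii): output mapping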
4 EXPERIMENTS
In order to verify the feasibility of the architecture
proposed in Section 3, a test on a real scenario has
been performed.
The Academic Patenting in Europe (APE-INV) [8] project aims to produce a freely available database on academic patenting in Europe that will contain reliable and comparable information on the contribution of European academic scientists to technology transfer via patenting. The reference source of raw patent data is the PATSTAT database, issued for statistical purposes by the European Patent Office (EPO) [9]. The raw data of the PATSTAT database present data quality issues (e.g., the addresses of the inventors in the database are often inaccurate, causing the presence of duplicated records).

[8] http://www.esf-ape-inv.eu
[9] http://www.epo.org
To execute the test, the following process has been
realized: (i) Geocoding Step: usage of a geocod-
ing service to verify/correct the addresses of the in-
ventors; (ii) De-duplication Step: usage of a de-
duplication service to identify and delete duplicated
records; (iii) Enrichment Step: usage of an enrich-
ment service to extend data of the inventor’s city of
residence with information about the number of in-
habitants. This information is useful for statistical
purposes.
The process has been designed as a dfPower Architect job, where the geocoding and enrichment steps are realized by means of the selection and invocation steps described in Section 3. The objective is to show that, by changing the requested contractual terms (namely, the Request), different functionally-equivalent services can be selected and invoked at run-time.

Figure 4: Examples of the selection performed for the geocoding step.
Due to space limits, the description of the test focuses only on the geocoding step. The enrichment step is performed similarly. Figure 4 shows how the selection for the geocoding step is performed. Three geocoding services are available; two of them are services in the cloud (namely, Google Geocoding [10] and Yahoo! PlaceFinder [11]) and the last one (Rooftop Geocoding) is an internal service offered by DataFlux dfPower Studio. Each service is characterized by a contract including contractual terms on the pricing type, the service coverage and typical data quality metrics (i.e., accuracy, reliability, data currency and completeness). The selection has been performed considering two different Requests containing contractual terms associated with different relevance values in [0..1] (i.e., the importance for the requestor of having each term satisfied). The former contains strictly-required (relevance=1) contractual terms on pricing type and service coverage and preferred terms (relevance=0.8) on the data quality metrics. The latter contains strictly-required contractual terms on pricing type, service coverage and data currency, and optional terms (relevance=0.2) on accuracy, reliability and completeness. Figure 4 shows that: (i) the selection step for the first Request returns Google Geocoding, since this service better satisfies the strictly-required and preferred contractual terms; and (ii) the selection for the second Request returns Yahoo! PlaceFinder, since this service better satisfies the additional strictly-required constraint on data currency.
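For illustration, the two Requests can be read as relevance-weighted term lists; the dictionary encoding and term identifiers below are assumptions, while the relevance values are those reported above.

# Illustrative encoding of the two Requests as relevance-weighted terms.
request_1 = {  # selected service: Google Geocoding
    "pricing_type": 1.0, "service_coverage": 1.0,
    "accuracy": 0.8, "reliability": 0.8, "data_currency": 0.8, "completeness": 0.8,
}
request_2 = {  # selected service: Yahoo! PlaceFinder
    "pricing_type": 1.0, "service_coverage": 1.0, "data_currency": 1.0,
    "accuracy": 0.2, "reliability": 0.2, "completeness": 0.2,
}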
Figure 5: Examples of the invocation performed for the geocoding step.

Figure 5 shows how the invocation of the Google Geocoding service is performed. For each record of the table, the wrapper composes the input for the geocoding service by merging the six fields (i.e., INADDR, INCITY, INCOUNTRY, INREGION, INZIP, INCY) that specify the address of an inventor. Then, the wrapper invokes the Google Geocoding service and receives, when available, a string with the corrected address. Finally, the wrapper, by means of a pre-defined mapping rule, splits the string and substitutes the address in the table.
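A minimal Python sketch of this per-record mapping follows; the field names are those listed above, while the comma-based merge and split rules are hypothetical stand-ins for the predefined mapping rules.

# Illustrative sketch of the per-record mapping around the geocoding invocation.
ADDRESS_FIELDS = ["INADDR", "INCITY", "INCOUNTRY", "INREGION", "INZIP", "INCY"]

def map_input(record):
    # Phase (i): merge the six address fields into a single query string.
    return ", ".join(str(record[f]) for f in ADDRESS_FIELDS if record.get(f))

def map_output(record, corrected_address):
    # Phase (iii): split the corrected address and substitute it in the record.
    parts = [p.strip() for p in corrected_address.split(",")]
    updated = dict(record)
    for field, value in zip(ADDRESS_FIELDS, parts):
        updated[field] = value
    return updated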
The usage of external services in the cloud provides positive results for data quality improvement. As an example, Google Geocoding has verified/corrected 75% of the addresses in the table. This result outperforms the one obtained using the internal Rooftop Geocoding service, due to the fact that the completeness (95%) of the data provided by Google is higher. A limitation of the proposed solution is represented by the semantic approach adopted for contract selection. The response time of the PoliMaR framework is unsatisfactory due to the slowness of semantic matching activities; moreover, as already observed in (Comerio et al., 2009a), the performance decreases dramatically as contracts grow in size and number. However, this limitation is not crucial, since the approach in this paper is not proposed for time-critical applications.

[10] http://code.google.com/intl/it-IT/apis/maps/documentation/geocoding/
[11] http://developer.yahoo.com/geo/placefinder/
5 RELATED WORK
Only a few papers (Zhou et al., 2009; Tsai et al., 2007) propose SOA-based solutions to perform data quality improvement activities. A SOA-based semantic data quality assessment framework, which supports the automatic search for data quality assessment services that fulfill user requirements, is proposed in (Zhou et al., 2009). A dynamic framework for the classification of data provenance (i.e., the origins and routes of data) in a SOA system is described in (Tsai et al., 2007). With respect to the proposed cloud-based architecture, these SOA-based solutions cover only a particular aspect (i.e., data quality assessment (Zhou et al., 2009) and data provenance (Tsai et al., 2007)) of a data quality improvement process.
A very limited number of papers (Faruquie et al., 2010; Dani et al., 2010) propose cloud-based solutions to perform data quality improvement activities. A
cloud infrastructure to offer virtualized data cleansing
that can be accessed as a transient service is presented
in (Faruquie et al., 2010). Setting up data cleansing
as a transient service gives rise to several challenges
such as (i) defining a dynamic infrastructure for the
cleansing on demand based on customer requirements
and (ii) defining data transfer and access that meet re-
quired service level agreements in terms of data pri-
vacy, security, network bandwidth and throughput.
Moreover, as further discussed in (Dani et al., 2010),
offering data cleansing as a service is a challenge be-
cause of the need to customize the rules to be applied
for different datasets. The Ripple Down Rules (RDR)
framework is proposed in (Dani et al., 2010) to lower
the manual effort required in rewriting the rules from
one source to another. The solutions in (Faruquie et al., 2010; Dani et al., 2010) face challenges similar to the ones tackled in this paper, but they focus only on data cleansing and not on a complete data quality improvement process. Moreover, contract-based service selection is not addressed.
6 CONCLUSIONS AND FUTURE WORKS
Cloud computing models offer powerful solutions to
reduce costs when performing data quality improve-
ments by using software and data offered as services
on-demand. However, since data quality improve-
ments potentially require the sharing of business criti-
cal data, these services should act in compliance with
predefined contracts. This paper has extended pre-
vious works on the definition of methods and tech-
niques for the specification, selection and evaluation
of service and data contracts. Moreover, this paper
has proposed an extension for the DataFlux dfPower
Studio architecture that supports contract-based ser-
vice selection for data quality improvement activities
in the cloud. Experimental activities on the PATSTAT
database have demonstrated the feasibility of the pro-
posed solutions.
Future work will address open issues concerning data transfer and resource allocation for data processing services over the cloud.
ACKNOWLEDGEMENTS
This work is supported by the SAS Institute srl (Grant Carlo Grandi). The author wants to thank Andrea Scrivanti for his valuable contribution to this work.
REFERENCES
Batini, C. and Scannapieco, M. (2006). Data Quality: Con-
cepts, Methodologies and Techniques (Data-Centric
Systems and Applications). Springer-Verlag.
Coen-Porisini, A., Colombo, P., and Sicari, S. (2010). Deal-
ing with anonymity in wireless sensor networks. In
Proc. of SAC 2010, pages 2216–2223. ACM.
Comerio, M., De Paoli, F., and Palmonari, M. (2009a). Effective and flexible NFP-based ranking of Web services. In Proc. of ICSOC/ServiceWave 2009, pages 546–560.
Comerio, M., Truong, H.-L., Batini, C., and Dustdar, S.
(2010). Service-oriented data quality engineering and
data publishing in the cloud. In Proc. of SOCA 2010,
pages 1–6.
Comerio, M., Truong, H.-L., De Paoli, F., and Dustdar, S.
(2009b). Evaluating contract compatibility for service
composition in the seco2 framework. In Proc. of IC-
SOC/ServiceWave 2009, pages 221–236.
Dani, M. N., Faruquie, T. A., Garg, R., Kothari, G., Mo-
hania, M. K., Prasad, K. H., Subramaniam, L. V.,
and Swamy, V. N. (2010). A knowledge acquisition
method for improving data quality in services engage-
ments. In Proc. of SCC 2010, pages 346–353.
De Paoli, F., Palmonari, M., Comerio, M., and Maurino, A.
(2008). A Meta-Model for Non-Functional Property
Descriptions of Web Services. In Proc. of ICWS 2008,
pages 393–400.
Faruquie, T. A., Prasad, K. H., Subramaniam, L. V., Mo-
hania, M. K., Venkatachaliah, G., Kulkarni, S., and
Basu, P. (2010). Data cleansing as a transient service.
In Proc. of ICDE 2010, pages 1025–1036.
Li, J., Stephenson, B., and Singhal, S. (2009). A policy
framework for data management in services market-
places. In Proc. of ARES 2009, pages 560–565.
Machanavajjhala, A., Kifer, D., Gehrke, J., and Venkitasub-
ramaniam, M. (2007). L-diversity: Privacy beyond
k-anonymity. ACM Trans. Knowl. Discov. Data, 1.
Truong, H.-L., Gangadharan, G., Comerio, M., Dustdar, S.,
and De Paoli, F. (2011). On analyzing and developing
data contracts in cloud-based data marketplaces. In
Proc. of APSCC 2011, pages 174–181.
Tsai, W.-T., Wei, X., Zhang, D., Paul, R., Chen, Y., and Chung, J.-Y. (2007). A new SOA data-provenance framework. In Proc. of ISADS 2007, pages 105–112.
Viega, J. (2009). Cloud computing and the common man.
Computer, 42:106–108.
Zhou, Y., Hanß, S., Cornils, M., Hahn, C., Niepage, S., and Schrader, T. (2009). A SOA-based data quality assessment framework in a medical science center. In Proc. of ICIQ 2009, pages 149–160.
ACLOUD-BASEDSOLUTIONFORDATAQUALITYIMPROVEMENT
227