A Novel Metric for Measuring Data Quality in Classification Applications
Jouseau Roxane, Salva Sébastien and Samir Chafik
Université Clermont-Auvergne, CNRS, Mines de Saint-Etienne, Clermont-Auvergne-INP, LIMOS, Clermont-Ferrand, France
Keywords:
Classification, Data Quality, Machine Learning, Measure, Metric.
Abstract:
Data quality is a key element for building and optimizing good learning models. Despite many attempts to
characterize data quality, there is still a need for rigorous formalization and an efficient measure of the quality
from available observations. Indeed, without a clear understanding of the training and testing processes, it is
hard to evaluate the intrinsic performance of a model. Besides, tools for measuring data quality specific to machine learning are still lacking. In this paper, we introduce and explain a novel metric to measure data
quality. This metric is based on the correlated evolution between the classification performance and the dete-
rioration of data. The proposed method has the major advantage of being model-independent. Furthermore,
we provide an interpretation of each criterion and examples of assessment levels. We confirm the utility of the
proposed metric with intensive numerical experiments and detail some illustrative cases with controlled and
interpretable qualities.
1 INTRODUCTION
During the last decades, data availability has played a
crucial role in the development and sophistication of
artificial intelligence in general and machine learning
models in particular. Yet, few works have been ded-
icated to deeply investigating the assessment of the apparent accuracy and consistency of data (Cichy and
Rass, 2019; Batini et al., 2009; Batini et al., 2016;
Pipino et al., 2002). Moreover, the term data qual-
ity has been restricted to studying the impact of some
standard criteria on the task at hand and user’s expec-
tations. Standard criteria have very often been evaluated separately or with a non-rigorous combination
of scores. For example, several methods use the term
data quality for specific purposes, such as accuracy,
completeness, or timeliness, for specific applications
and contexts (Gudivada et al., 2017). Unfortunately,
this leads to the first limitation related to data qual-
ity: The lack of an appropriate definition and, subse-
quently, an accurate measure.
Classical data evaluation is often related to a given
context: External metadata, rules, trusted references,
etc. Establishing or extracting these external elements
is usually a long process, expert-dependent and error-
prone (Ehrlinger and Wöß, 2022). Instead of develop-
ing new methods, some previous papers cite commer-
cial products used to measure data quality. A common
conclusion states that only a few tools are available.
We believe that this is a consequence of the limita-
tions detailed above. Nevertheless, we have tried to
test some of them for the same context, but they were
either out of the scope of our study, or we were unable
to set up a usable configuration.
In this work, we solve the previous limitations by
introducing an original data quality metric. We fo-
cus on the use of data in artificial intelligence appli-
cations and, more specifically, for learning models in
numerical classification. The proposed metric has the
significant advantage of providing a consistent means
to evaluate the quality of different types of data with
different numbers of classes, various domains, and
from low to high dimensionality, etc. All the steps
were constructed carefully to make the metric inter-
pretable, easy to use, and model-independent. To il-
lustrate how the metric can successfully capture a deterioration, we show that deterioration makes the classification performance decrease non-linearly and at different rates. This contradicts the unjustified prior assumption of linear behavior made in some previous works.
1.1 Related Work
Only a few tools are available to directly evalu-
ate the quality of a dataset with a metric, as most
tools only offer indicators to monitor and help with
data profiling. In the recent survey (Ehrlinger and
Wöß, 2022) the authors evaluated 11 tools: Apache
Griffin (Foundation, 2023), Ataccama ONE (Atac-
cama, 2023), DataCleaner (DataCleaner, 2023), Data-
martist (Datamartist, 2023), Experian Pandora (Expe-
rian, 2023), InformaticaDQ (Informatica, 2023), In-
foZoom & IZDQ (InfoZoom, 2023), MobyDQ (Rol-
land, 2023), OpenRefine (OpenRefine, 2023) & Met-
ricDoc (Bors et al., 2018), SAS Data Quality (SAS,
2023), and Talend Open Studio (Talend, 2023). These
tools were classified into 5 categories: accuracy, com-
pleteness, consistency, timeliness, and others. Inter-
estingly, most of these tools allow the evaluation of
one or two criteria only. They mainly focus on eas-
ing the definition of data indicators and assisting data
profiling. Ultimately, only 4 tools were distinguished:
Apache Griffin, InformaticaDQ, MobyDQ, and Met-
ricDoc.
While many of these tools focus on the attribute
level, they are not generalized for higher levels of ag-
gregation. They also consider data rules, which are not used here because such rules require expert knowledge and are often unavailable. Addition-
ally, Apache Griffin and MobyDQ require a reference
dataset. This makes them less practical as ground
truth reference datasets are not always available. In-
formaticaDQ focuses on textual data, such as ele-
ments of postal addresses, email addresses, etc., and
cannot be applied to numerical data. MetricDoc of-
fers two different time interval metrics for time-series
data, a redundancy metric on the table level and met-
rics for validity and plausibility, both defined at the
attribute level only. Finally, we tried to investigate
the Data Quality for AI API proposed by IBM (IBM,
2023). However, despite taking steps to access the
free trial version on the website, we could not secure
working access to the API, which is a problem that
the authors of (Ehrlinger and Wöß, 2022) also seem to have faced, as mentioned in their paper.
1.2 Contributions
In order to build a metric that is independent of learn-
ing models, we formulate the problem such that no
reference or expertise is needed. Instead, our met-
ric is based upon two main terms: The former eval-
uates classification performance across a wide range
of models, and the latter assesses variations of per-
formance when a low amount of errors is injected
into datasets. We show that high variations are conse-
quences of quality issues. All terms are then empiri-
cally combined to form a unique evaluation, denoted q_a. Furthermore, we show how to interpret our metric scores and express the notions of good, medium, and bad qualities. Then, we evaluate the metric q_a on 110 datasets of known quality and show that q_a is able to measure the quality correctly. We also discuss the information that can be captured by q_a, leading to an easy connection with a given result.
To the best of our knowledge, this is the first
method that proposes a clear and rigorous formulation
for measuring data quality. We summarize the main contributions of this work as follows:
1. We propose a novel rigorous metric to evaluate data quality for numerical classification, and our method can be adapted and extended to regression;
2. The proposed metric is model-independent and does not require external elements or expert supervision;
3. We use a constructive, step-by-step approach to formulate the metric, which could be generalized to other contexts.
The rest of the paper is organized as follows: Sec-
tion 2 proposes a definition for the data quality met-
ric and some thresholds to ease interpretation. Then,
the metric is evaluated on 155 datasets of varied lev-
els of quality in different contexts. Possible threats to
validity are also discussed in Section 3. Finally, we
conclude the paper in Section 4.
2 MEASURING DATA QUALITY
In (Jouseau et al., 2022), we investigated the relevance of repairing datasets according to the impact that the amount of errors included in datasets has on classification performance. This study allowed the obser-
vation of some distinct characteristics related to data
quality. In particular, we observed the two following
properties:
1. Model accuracy decreases along with data quality
when errors are injected into data. This decrease
varies across classification models and numbers
of classes;
2. The decrease in accuracy is nonlinear. It is low
when data is of good quality or when data is ex-
tremely deteriorated. However, the decrease in ac-
curacy is significantly higher between these two
states.
We illustrate these observations on three datasets
(Iris, Breast Cancer, Adult). We chose 12 standard
classification models available in (Pedregosa et al.,
2011): Logistic regression, K-Nearest Neighbors, De-
cision tree, Random forest, Ada boost, Naive Bayes,
XGboost, Support vector classification, Gaussian pro-
cess, Multi-layer perceptron, Stochastic gradient de-
scent, and Gradient boosting. We also illustrate these
observations for the error types missing values, out-
liers, and fuzzing, a.k.a. partial duplicates. We chose
these error types because we observed that they have
different impacts on model performance: Outliers and
missing values have the most impact on accuracies
and f1 scores; fuzzing tends to have less impact and
offers the benefit of simulating data generation. In
Figure 1, we present mean accuracies computed over
30 iterations of injecting controlled percentages of er-
rors, randomly generated with a uniform distribution,
in training data. We inject up to 95% of errors with a
5% increment. The 12 classification models are then
trained on these deteriorated datasets with a random
split using 80% for training and 20% for testing. Fig-
ures 1a, 1b, and 1c present the mean accuracies when
missing values, outliers, and fuzzing are respectively
injected in training data.
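To make the deterioration protocol concrete, the following sketch injects a controlled percentage of missing values, drawn uniformly at random, into a data matrix; the function name and its NumPy-based implementation are our own illustration of the procedure described above, not the authors' released code.

```python
import numpy as np

def inject_missing_values(X: np.ndarray, percentage: float, rng=None) -> np.ndarray:
    """Return a copy of X in which `percentage` (0-100) of the cells are set to NaN,
    chosen uniformly at random, mimicking the deterioration protocol described above."""
    rng = np.random.default_rng() if rng is None else rng
    X_dirty = X.astype(float, copy=True)
    n_errors = int(X_dirty.size * percentage / 100.0)
    # Draw cell indices without replacement so exactly n_errors cells are altered.
    flat_idx = rng.choice(X_dirty.size, size=n_errors, replace=False)
    rows, cols = np.unravel_index(flat_idx, X_dirty.shape)
    X_dirty[rows, cols] = np.nan
    return X_dirty

# Example: deteriorate 25% of the cells of a small 5 x 4 matrix (5 of 20 cells).
X = np.arange(20, dtype=float).reshape(5, 4)
print(inject_missing_values(X, 25.0))
```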
In Figures 1a and 1b, as per our first observation,
accuracies are high when none or a low percentage of
errors are present in the data. Unsurprisingly, these
accuracies decrease as data deterioration increases.
But, as we stated in our second observation, the de-
crease in accuracy is non-linear. This is especially
visible in Figure 1a. We can also see in Figures 1a
and 1b that the mean accuracy can drop significantly
with only a 5% increment of errors. For instance, in
Figure 1a, for the dataset Iris, the mean accuracy stays
over 0.8 up to the injection of 35% of missing values
into training data. However, at 40% of missing values,
the mean accuracy drops to 0.6, approximately. The
data fuzzing observed in Figure 1c is a particular case
with no loss of information, which is why accuracies
stay quite steady.
The analysis and formalization of these observa-
tions allow us to propose the first data quality mea-
surement using the notion of data deterioration. This
metric, denoted q_a, is composed of two parts, denoted q_{a,1} (Eq. (4)) and q_{a,2} (Eq. (7)), which respectively encode these characteristics:
- q_{a,1}: the accuracies across a set of classification models (Observation 1). q_{a,1} also accounts for the number of classes in datasets in order to allow comparisons of data quality levels between datasets with different numbers of classes;
- q_{a,2}: variations of accuracies when a low percentage of errors is injected in training sets (Observation 2). It aims to capture abnormally high accuracy variations over small dataset perturbations.
Next, we formalize these two observations in a rigor-
ous way.
2.1 Definition of q_{a,1}
We use the following notations in the remainder of the
paper: The set of models is denoted M, the set of error
types is denoted E, and D is the dataset under evalu-
ation. A(m, D) stands for the accuracy of the model
m ∈ M on D. We define A_M(D) as:

$$A_M(D) := \frac{1}{\mathrm{card}(M)} \sum_{m \in M} A(m, D) \qquad (1)$$
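As an illustration of Eq. (1), the sketch below computes A_M(D) as the average test accuracy over a small model set; the three models and the Iris data used here are placeholders for the 12 models and the datasets studied in the paper.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def mean_accuracy(models, X_train, y_train, X_test, y_test) -> float:
    """A_M(D): mean accuracy over the model set M, as in Eq. (1)."""
    scores = [m.fit(X_train, y_train).score(X_test, y_test) for m in models]  # A(m, D)
    return float(np.mean(scores))

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
M = [LogisticRegression(max_iter=1000), DecisionTreeClassifier(), KNeighborsClassifier()]
print(mean_accuracy(M, X_tr, y_tr, X_te, y_te))
```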
Observing the mean accuracy A_M(D) alone is not sufficient to express data quality. Moreover, given the number of classes c in D, we want an accuracy that is better than a random choice:

$$1 \geq A_M(D) > \frac{1}{c}, \quad c > 1 \qquad (2)$$

When the accuracy of a model is lower than 1/c, we consider the quality to be the lowest. The function δ_1 captures this statement:

$$\delta_1(A_M(D)) := \begin{cases} 1 & \text{if } A_M(D) > \frac{1}{c} \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$
Finally, we define q_{a,1} as:

$$q_{a,1}(D) := 1 - \frac{c\,A_M(D) - 1}{c - 1}\,\delta_1(A_M(D)) \qquad (4)$$

We have 0 ≤ q_{a,1}(D) ≤ 1, with q_{a,1}(D) = 0 as the best quality.
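A direct transcription of Eqs. (2)-(4) could look as follows; the helper is only a sketch of the definition under the notations above, not the authors' implementation.

```python
def q_a1(mean_acc: float, n_classes: int) -> float:
    """q_{a,1}(D) as in Eq. (4): 0 is the best quality, 1 the worst."""
    if n_classes < 2:
        raise ValueError("q_{a,1} is only defined for c > 1")
    if mean_acc <= 1.0 / n_classes:      # delta_1 = 0: no better than a random choice
        return 1.0
    # Rescale the mean accuracy from (1/c, 1] to [0, 1) and invert it.
    return 1.0 - (n_classes * mean_acc - 1.0) / (n_classes - 1.0)

# A two-class dataset with mean accuracy 0.9 gives q_{a,1} = 0.2.
assert abs(q_a1(0.9, 2) - 0.2) < 1e-9
```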
2.2 Definition of q_{a,2}
The main idea encoded by q_{a,2} is that the mean accuracies of models trained on good or bad-quality datasets are not sensitive to a small data deterioration. This is not the case for datasets of medium quality. We verify this hypothesis by computing variations of accuracy with the set of classification models M when a percentage p of an error type e ∈ E is injected in D. We denote D_{e,p} the resulting dataset.
For A_{M,e}(D) to express the variation of accuracy for a specific error e, we introduce:

$$A_{M,e}(D) := \frac{1}{\mathrm{card}(M)} \sum_{m \in M} \left| A(m, D) - A(m, D_{e,p}) \right| \qquad (5)$$
To avoid any bias in our metric, we assume that errors are injected randomly, with a uniform distribution, in training data. In our experiments, we noticed in Figures 1a and 1b that a small percentage of errors p = 5% is sufficient to capture accuracy variations. According to our experiments, a higher value of p is possible but may lead to less precise measurements of the performance loss. The goal of this parameter is to simulate small perturbations in data that can happen in real-life situations. Although small variations of accuracies are expected, we are only interested in abnormal variations. To exclude these minor variations, we define δ_2 as:

$$\delta_2(A_{M,e}(D)) := \begin{cases} 1 & \text{if } A_{M,e}(D) > p \\ 0 & \text{otherwise} \end{cases} \qquad (6)$$
Figure 1: Evolution of the mean accuracy as a function of missing values (a), outliers (b), and fuzzing (c).
q_{a,2} is then defined as:

$$q_{a,2}(D) := \min\left( \frac{10}{\mathrm{card}(E)} \sum_{e \in E} A_{M,e}(D)\, \delta_2(A_{M,e}(D)),\ 1 \right) \qquad (7)$$

We chose to add a factor of 10 in q_{a,2}(D) because we consider that an accuracy variation of 10% or more, when a small amount of errors is injected into the data, indicates bad data quality. However, this factor does not keep the result bounded by 1, so we use the minimum function to define q_{a,2}. It is worth noting that this parameter may easily be changed to meet user preferences.
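The sketch below mirrors Eqs. (5)-(7): for each error type it averages the absolute accuracy change of every model after a small perturbation, discards variations below p, and rescales by the factor of 10 discussed above. The accuracy dictionaries are assumed to be produced by retraining the models on the clean and perturbed training sets.

```python
import numpy as np

def q_a2(acc_clean: dict, acc_perturbed: dict, p: float = 0.05, factor: float = 10.0) -> float:
    """q_{a,2}(D) as in Eq. (7).

    acc_clean: {model_name: accuracy on D}
    acc_perturbed: {error_type: {model_name: accuracy on D_{e,p}}}
    p: injected error rate (as a fraction), also used as the delta_2 threshold.
    """
    contributions = []
    for acc_e in acc_perturbed.values():
        # Eq. (5): mean absolute accuracy variation for this error type.
        variation = float(np.mean([abs(acc_clean[m] - acc_e[m]) for m in acc_clean]))
        # Eq. (6): keep only abnormal variations, i.e. those larger than p.
        contributions.append(variation if variation > p else 0.0)
    return min(factor * float(np.mean(contributions)), 1.0)

# Example: a 12% mean drop under missing values, a negligible one under outliers.
clean = {"lr": 0.90, "knn": 0.88}
dirty = {"missing": {"lr": 0.78, "knn": 0.76}, "outliers": {"lr": 0.89, "knn": 0.88}}
print(q_a2(clean, dirty))  # 10 * (0.12 + 0.0) / 2 = 0.6 (up to floating-point error)
```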
2.3 Definition of the Quality Metric q_a
We define q_a as:

$$q_a(D) := \max(q_{a,1}(D),\, q_{a,2}(D)) \qquad (8)$$

We use the maximum to ensure that the metric captures the most variation of data quality, given by q_{a,1} and q_{a,2}. We are now ready to present, in Algorithm 1, all the steps required for evaluating the quality of D with q_a(D). If no trusted test set is available, we compute q_a(D) over 30 resamplings of test and train data. We studied the validity of this method and presented it in an extended version of this paper (Jouseau et al., 2023b). q_{a,1}(D) and q_{a,2}(D) are then computed as the means of the q_{a,1} and q_{a,2} values obtained from the resampled datasets.
2.4 Interpretation
In this section, we discuss the interpretation of q_a. More specifically, we propose to extract thresholds for q_a that express the notions of good, medium, or bad quality. To do this, we computed q_a on 114 datasets derived from Iris, Breast Cancer, and Adult when errors are injected in 5% increments from 0% to 95%. We only considered the error types missing values and outliers, as fuzzing does not tend to affect classifier performance. Since we control the level of errors, we estimate the different levels of quality for these datasets.
Data: Dataset D
Result: q_a(D) and its interpretation.
if D is made up of a trusted test dataset then
    LD = (D);
else
    Generate the list LD = (D_1, ..., D_30) of resampled versions of D;
end
foreach D_i ∈ LD do
    Compute q_{a,1}(D_i) as defined in Eq. (4);
    foreach error type e ∈ E do
        Create a new dataset D_{i,e,p} by injecting D_i with p% of error e, randomly generated with a uniform distribution;
    end
    Compute q_{a,2}(D_i) with Eq. (7);
end
q_{a,1}(D) = (1/30) Σ_{i=1..30} q_{a,1}(D_i);
q_{a,2}(D) = (1/30) Σ_{i=1..30} q_{a,2}(D_i);
Compute q_a(D) with Eq. (8);
Interpret q_a(D);
Algorithm 1: q_a(D) computation.
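Putting the pieces together, a minimal driver for Algorithm 1 could look as follows; `mean_accuracy`, `q_a1`, `q_a2`, and the error injectors are the illustrative helpers sketched earlier, the resampling is a plain 80/20 re-split, and the models are assumed to tolerate the injected errors (e.g. through an imputation pipeline), so this is a sketch of the procedure rather than the authors' released tool.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def evaluate_quality(X, y, models, error_injectors, n_classes, p=0.05, n_resamples=30):
    """Algorithm 1 (sketch): q_a(D) = max(q_{a,1}(D), q_{a,2}(D)) over resampled splits."""
    qa1_vals, qa2_vals = [], []
    for i in range(n_resamples):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=i)
        # Accuracies on the clean training set (used by both q_{a,1} and q_{a,2}).
        acc_clean = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
        qa1_vals.append(q_a1(float(np.mean(list(acc_clean.values()))), n_classes))
        # Accuracies after injecting p% of each error type into the training set.
        acc_perturbed = {}
        for err_name, inject in error_injectors.items():
            X_dirty = inject(X_tr, 100.0 * p)
            acc_perturbed[err_name] = {name: m.fit(X_dirty, y_tr).score(X_te, y_te)
                                       for name, m in models.items()}
        qa2_vals.append(q_a2(acc_clean, acc_perturbed, p=p))
    qa1, qa2 = float(np.mean(qa1_vals)), float(np.mean(qa2_vals))
    return max(qa1, qa2), qa1, qa2
```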
Iris and Breast Cancer are considered good quality when used with classification models in the literature, while Adult is of medium quality. We expect data quality to be good for the datasets obtained from Iris and Breast Cancer with up to 10% of errors, medium between 10% and 30%, and bad for over 30% of errors. For the datasets obtained from Adult, we expect medium quality up to 10% of errors and then bad quality.
We present, in Figures 2a and 2b, the evolution of q_a for three dataset examples when controlled percentages of missing values and outliers are respectively injected with 5% increments. Colors on the y axis depict the proposed thresholds for which q_a
Figure 2: q_a when missing values (a) and outliers (b) are injected in the datasets.
indicates good, medium, or bad data quality.
These thresholds were empirically chosen based on the following observations. In Figure 2, we see that without any deterioration, q_a measurements for these datasets are respectively 0.11, 0.2, and 0.5. Thus, we expect 0.11 and 0.2 to indicate good quality and 0.5 to indicate medium quality. Furthermore, in Figure 2b, for around 10% of errors, when data quality is expected to be altered from good to medium, we observe that values of q_a are close to 0.3 for the datasets Breast Cancer and Iris. This prompts us to set the upper threshold for good quality at 0.3, which is consistent with our first observations. Additionally, in Figure 1a, we observe that the mean accuracy for the dataset Adult starts to decrease significantly between 15% and 20% of missing values injected. This corresponds to q_a measurements between 0.55 and 0.7 in Figure 2a. From these elements, we propose to set the upper threshold for medium quality to 0.6. We find this to be a reasonable limit since, if we take the case of a dataset with two classes, being over this threshold either means that its mean accuracy is below 0.7 or that it experiences a mean variation of accuracy over 0.18 when injected with 5% of errors. These requirements indicate either a mean accuracy low enough or variations of accuracy high enough to characterize a dataset as bad quality.
We, therefore, propose the following thresholds to interpret q_a: if q_a(D) ≤ 0.3, this means that D is of good quality; if 0.3 < q_a(D) ≤ 0.6, D can be considered to be of medium quality, and examining the values of q_{a,1}(D) and q_{a,2}(D) is necessary to decide whether D can be used. Finally, 0.6 < q_a(D) means that D is of bad quality.
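For convenience, the thresholds above can be wrapped in a small helper; the cut-offs follow the interpretation proposed in this section, while the function itself is just an illustrative convenience.

```python
def interpret_qa(qa: float) -> str:
    """Map a q_a score to the quality levels proposed above."""
    if qa <= 0.3:
        return "good"
    if qa <= 0.6:
        return "medium"   # inspect q_{a,1} and q_{a,2} before deciding to use D
    return "bad"

assert interpret_qa(0.29) == "good" and interpret_qa(0.48) == "medium" and interpret_qa(0.7) == "bad"
```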
However, in Figure 2a, we can see that for 20% of missing values, we measure q_a(Iris) < 0.3. This result is unexpected since, for this percentage of errors, we do not expect good data quality. Nonetheless, it is consistent with Figure 1a, where we can observe that at 20% of missing values, accuracy is still high with Iris and is not about to experience a significant drop.
3 EMPIRICAL EVALUATION
The experiments presented in this section aim to evaluate the relevance of q_a. For the remainder of the paper, we express this relevance through 2 questions:
Q1: Can q_a characterize a dataset of good or medium quality?
Q2: Can q_a characterize a dataset of bad quality?
3.1 Empirical Setup
First, we use the classification model set M listed in
Section 2 along with the three error types: missing
values, outliers, and fuzzing. We also set the percent-
age of injected errors p = 5% as discussed in Section
2.
We evaluated q_a on 155 numeric datasets.
We used five distinctive datasets: Spambase,
Heart Disease, Abalone, Dry Beans, and Statlog
(Markelle Kelly, 1999), as well as 150 datasets we
modified by injecting controlled amounts of errors.
We selected these datasets for their varied dimen-
sions, number of classes, number of attributes, and
domains of application. The objective of the Spam-
base dataset is to predict whether an email is spam.
The Heart Disease dataset is used to predict the pres-
ence of heart diseases in patients. The Abalone
dataset is used for the prediction of the number of
rings present in abalone shells (which indicates their
ages) from physical measurements. The class imbalance in this dataset is too high to achieve reasonable accuracies. Therefore, we chose to work on a more straightforward classification task by aggregating the classes into two groups: up to eight shell rings and over eight shell rings. The Dry Beans dataset is used to predict the variety of dry beans from their physical features. Finally, the dataset Statlog is used to classify people described by attributes as good or bad credit risks.
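As an example of the class aggregation described above for Abalone, the target can be binarized as follows; the file name and the `rings` column are assumptions about the data layout rather than the authors' exact preprocessing.

```python
import pandas as pd

# Hypothetical layout: a CSV export of the UCI Abalone data with a `rings` column.
abalone = pd.read_csv("abalone.csv")
# Aggregate the 28 ring counts into two classes: up to eight rings vs. more than eight.
abalone["ring_class"] = (abalone["rings"] > 8).astype(int)
print(abalone["ring_class"].value_counts())
```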
We manually evaluated the quality of these five
datasets by measuring 7 dimensions and characteris-
tics. The results, along with our quality estimations,
are provided in Table 1. The dataset Spambase is esti-
mated to be of good quality as its number of samples
(4 601) is relatively high even for its less populated
class (1 813) compared to its number of attributes and
classes (57 and 2). Besides, its mean accuracy across
classification models is high (0.9). Datasets Abalone,
Statlog, Dry Beans, and Heart Disease are estimated
to be of medium quality mainly because their mean
accuracies are lower (respectively 0.84, 0.76, 0.68,
and 0.79). Additionally, the datasets Abalone and Dry
Beans present high levels of class imbalances, which
are usually considered quality issues. The mean ac-
curacy for the dataset Dry Beans can seem very low
compared to the other evaluation datasets, but in the
context of a seven-class classification problem, it is,
in fact, rather high.
105 additional datasets were derived from the five
previous ones by injecting missing values, outliers,
and fuzzing, separately and randomly with a uniform
distribution. Two strategies were followed: injection of 5% and 10% of errors to build datasets of good or medium quality, and injection of 30% up to 50% of errors in 5% increments to build datasets of bad quality. Datasets, results, and a prototype version of our tool for computing q_a are available in (Jouseau et al., 2023a).
3.2 Q1: Can q_a Characterize a Dataset of Good or Medium Quality?
To answer this question, we computed q_a for the five datasets presented in Table 1 as well as for the 30 datasets created by injecting 5% and 10% of missing values, outliers, or fuzzing. In Table 2, we present q_a for the five datasets without deterioration; q_a is presented with the details of q_{a,1}, q_{a,2}, and the corresponding data quality levels defined in Section 2.4. q_a was computed over the 12 classification models and 30 resamplings of training and test data. We compare these quality levels with the data quality estimations given in Table 1.
We observe with Tables 2 and 1 that q_a evaluates the data quality of the five first datasets correctly. The only dataset evaluated as of good quality by q_a is Spambase. We can also see that its values for q_{a,1} and q_{a,2} are low. This means that the classification models perform with high levels of accuracy (q_{a,1}) and that these performances would not drop significantly upon small perturbations of the dataset (q_{a,2}). Indeed, its mean accuracy is high (0.9), and even if the dataset presents a class imbalance, its least populated class is still relatively populated (1 813 samples). The other datasets in Table 2 are all classified as medium quality by q_a. Again, this is consistent with our estimated data qualities given in Table 1. The quality issues are different, though. For instance, for the dataset Abalone, q_{a,1} is relatively high, but q_{a,2} is low. This means that the accuracy across models is not very high, but its accuracy does not vary much when 5% of errors are injected in the training set. Moreover, the mean accuracy across classification models for the dataset Abalone is 0.84 (Table 1). Since the dataset only has two classes, this is not considered a very high accuracy. However, depending on the application, it can be regarded as reasonable.
Figures 3a, 3b, and 3c now respectively illustrate the values of q_a when injecting missing values, outliers, and fuzzing in the previous datasets with a 5% increment from 0 to 50%. For this question, we only focus on the 30 datasets obtained when 5% and 10% of errors are injected. We expect them to be of medium data quality. The colored zones in Figure 3 correspond to the thresholds we set in Section 2.4, for which quality is considered good, medium, or bad.
We observe in these figures that q_a indicates a good or medium quality level for 28 of these 30 datasets. In Figure 3a, q_a indicates a bad quality for the datasets Spambase and Statlog at 10% of missing values injected. For the dataset Statlog, with 10% of missing values, we observed that the mean accuracy score is equal to 0.7. This dataset also presents a class imbalance that can be worsened by the injection of missing values. Its quality evaluation is, therefore, consistent with our observations. For Spambase, the mean accuracies of the datasets with 0, 5, and 10% of missing values are 0.9, 0.85, and 0.56. These results tend to show that the dataset with 10% of errors is of bad quality for classification tasks. In Figure 3b, we observe that q_a indicates medium data quality for all datasets between 0% and 10% of injected outliers, which is consistent with our expectations.
It is worth noting that, in Figure 3c, q_a stays relatively constant, which is what we expect.
Out of the 35 datasets studied in this section, q_a is indicative of either good or medium data quality for 33 of them, which is the result we expected. For the 2 remaining cases where q_a scores are indicative of bad quality, we studied the datasets carefully and observed that q_a is correct. Thus, we conclude that q_a is able to characterize datasets of good or medium quality.
3.3 Q2: Can q_a Characterize a Dataset of Bad Quality?
To investigate this question, we computed q_a for the 50 datasets obtained after injecting controlled percentages of missing values and outliers in the initial datasets presented in Table 1. We injected from 30% to 50% of errors with a 5% increment.
Table 1: Overview of the evaluation datasets.
Dataset | Number of classes | Samples total | Number of attributes | Features | Class imbalance | Missing data | Mean accuracy | Estimated data quality
Spambase | 2 | 4 601 | 57 | integers, reals | Yes (1 813 samples in the least populated class) | None | 0.90 | good
Abalone | 28 (2 post-processing) | 4 177 | 8 | categorical, integers, reals | Yes (1 407 samples in the least populated class post-processing) | None | 0.84 | medium
Dry Beans | 7 | 13 611 | 16 | categorical, integers, reals | Yes (522 samples in the least populated class) | None | 0.68 | medium
Statlog | 2 | 1 000 (959 post-processing) | 23 | integers | Yes (275 samples in the least populated class) | On 41 samples | 0.76 | medium
Heart Disease | 5 (2 post-processing) | 303 (297 post-processing) | 13 | categorical, integers, reals | Yes (No after post-processing) | On 6 samples | 0.79 | medium
Figure 3: q_a when missing values (a), outliers (b), and fuzzing (c) are injected in all datasets.
Table 2: q_a calculation and the related quality levels.
Dataset | q_a | q_{a,1} | q_{a,2} | Quality level
Spambase | 0.29 | 0.18 | 0.29 | good
Abalone | 0.32 | 0.32 | 0.02 | medium
Dry Beans | 0.36 | 0.36 | 0.17 | medium
Statlog | 0.48 | 0.48 | 0.13 | medium
Heart Disease | 0.42 | 0.42 | 0.11 | medium
At this level of errors for missing values and outliers, we assumed that the 50 resulting datasets were of bad quality. We do not consider fuzzing here, as we expected that the injection of this error type should keep the quality close to that of the original datasets. The computed q_a scores are again presented in Figure 3.
Out of the 50 datasets studied in this section, q_a indicated bad data quality for 37 of them. For example, we can see in Figure 3a that when 30% of missing values are present in datasets, q_a correctly indicates bad data quality, except for the dataset Abalone up to 45%. We observe similar results for outliers. We hence investigated why q_a indicates medium data quality for the 9 datasets obtained from Abalone after injecting 30 to 45% of missing values or outliers. The mean accuracies measured from these datasets stay over 75% and then drop (abruptly in Figure 3a). This is actually consistent with the Abalone properties given in Table 1, where we observe that Abalone has a high number of samples (4 177) for a low number of classes and attributes (2 and 8). Hence, our own interpretation of quality was wrong for these 9 datasets. We can indeed estimate that these datasets are of medium quality with regard to these characteristics. The mean accuracy of the dataset Spambase stays over 0.7 up until the injection of 45% of outliers. Its quality evaluation without the injection of errors was good (as opposed to medium for the other evaluation datasets). This could explain its higher resistance to outliers. Moreover, for the dataset Dry Beans at 30% of outliers, the mean accuracy is evaluated at 0.5, which seems low but is still much better than a random guess with seven classes (1/7).
In summary, q_a indicates bad data quality for 37 datasets. For the 13 cases where q_a indicates medium data quality when bad data quality was expected, we observed that the data characteristics and mean accuracies were consistent with a medium data quality level. These observations allow us to conclude that q_a is able to characterize datasets of bad quality.
3.4 Threats to Validity
In this section, we address 7 possible threats to the validity of this study. We identified 3 internal threats
and 4 external threats.
The internal threats we identified are the imple-
mentation of the classification models, the hyper-
parametrization of the classification models, and the
number of datasets used to define the thresholds to
interpret q_a. To address the first threat, we imple-
mented classification models using scikit-learn (Pe-
dregosa et al., 2011), a widely used library. To
limit the second threat, we used a grid search to set
the hyper-parameters for classification models on all
datasets without any deterioration and then used these
settings for the rest of the experiments. To limit the
third threat, we selected three widely used datasets,
along with the 114 datasets obtained by injecting con-
trolled percentages of errors.
The 4 external threats are the choice of the datasets
for the evaluation, the choice of classification mod-
els, the generation of errors, and the combination of
errors. We tried to limit the first threat by choosing
datasets that are widely used and have different di-
mensions. We also selected datasets that cover var-
ious applications and ranges of dimensions. We se-
lected a wide range of classification approaches to ad-
dress the second threat. To limit the third one, we
decided to generate errors randomly, with a uniform
distribution. Finally, we decided to study errors sepa-
rately to limit the fourth threat. However, we plan to
extend this work to error combinations in future work.
4 CONCLUSION
In this paper, we have introduced a novel metric to
measure data quality. The main advantage of the pro-
posed metric is that it is independent of learning models and expert knowledge. Furthermore, it does not re-
quire external reference data. As a consequence, it of-
fers the possibility to compare different datasets. We
have extensively tested and evaluated the proposed
metric and have shown that it is able to characterize
data quality correctly.
REFERENCES
Ataccama (2023). Ataccamaone. https://www.ataccama.
com/platform.
Batini, C., Cappiello, C., Francalanci, C., and Maurino, A.
(2009). Methodologies for data quality assessment
and improvement. In ACM computing surveys.
Batini, C., Scannapieco, M., et al. (2016). Data and infor-
mation quality. Springer.
Bors, C., Gschwandtner, T., Kriglstein, S., Miksch, S.,
and Pohl, M. (2018). Visual interactive creation,
customization, and analysis of data quality metrics.
In Journal of Data and Information Quality (JDIQ)
ACM.
Cichy, C. and Rass, S. (2019). An overview of data quality
frameworks. In IEEE Access.
DataCleaner (2023). Datacleaner. https://datacleaner.
github.io/.
Datamartist (2023). Datamartist. http://www.datamartist.
com/.
Ehrlinger, L. and Wöß, W. (2022). A survey of data quality measurement and monitoring tools. In Frontiers in big data.
Experian (2023). User manual version 5.9.
https://www.edq.com/globalassets/documentation/
pandora/pandora\manual\590.pdf.
Foundation, A. (2023). Apache griffin user guide.
https://github.com/apache/griffin/blob/master/
griffin-doc/ui/user-guide.md.
Gudivada, V., Apon, A., and Ding, J. (2017). Data qual-
ity considerations for big data and machine learning:
Going beyond data cleaning and transformations. In
International Journal on Advances in Software.
IBM (2023). Ibm data quality for ai api.
https://developer.ibm.com/apis/catalog/
dataquality4ai--data-quality-for-ai/Introduction.
Informatica (2023). What is data quality?
https://www.informatica.com/resources/articles/
what-is-data-quality.html.
InfoZoom (2023). Infozoom & izdq. https://www.
infozoom.com/en/products/infozoom-data-quality/.
Jouseau, R., Salva, S., and Samir, C. (2022). On study-
ing the effect of data quality on classification per-
formances. In 23rd International Conference on In-
telligent Data Engineering and Automated Learning
(IDEAL). Springer.
Jouseau, R., Salva, S., and Samir, C. (2023a). Ad-
ditional resources for the reproducibility of the
experiment. https://gitlab.com/roxane.jouseau/
measuring-data-quality-for-classification-tasks.
Jouseau, R., Salva, S., and Samir, C. (2023b). A novel met-
ric for measuring data quality in classification appli-
cations (extended version). https://arxiv.org/abs/2312.
08066.
Markelle Kelly, Rachel Longjohn, K. N. (1999). The uci
machine learning repository. https://archive.ics.uci.
edu.
OpenRefine (2023). Openrefine. https://github.com/
OpenRefine/OpenRefine.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,
Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P.,
Weiss, R., Dubourg, V., Vanderplas, J., Passos, A.,
Cournapeau, D., Brucher, M., Perrot, M., and Duch-
esnay, E. (2011). Scikit-learn: Machine learning in
python. In Journal of Machine Learning Research.
Pipino, L. L., Lee, Y. W., and Wang, R. Y. (2002). Data
quality assessment. In Communications of the ACM.
Rolland, A. (2023). Mobydq. https://ubisoft.github.io/
mobydq.
SAS (2023). Dataflux data management studio 2.7:
User guide. http://support.sas.com/documentation/
onlinedoc/dfdmstudio/2.7/dmpdmsug/dfUnity.html.
Talend (2023). Talend open studio for data quality
user guide 7.0.1m2. http://download-mirror1.
talend.com/top/user-guide-download/V552/
TalendOpenStudio DQ UG 5.5.2 EN.pdf.