Toward Formal Data Set Verification for Building Effective Machine
Learning Models
Jorge López, Maxime Labonne and Claude Poletti
Airbus Defence and Space, Issy-Les-Moulineaux, France
Keywords:
Machine Learning, Data Set Collection, Formal Verification, Trusted Artificial Intelligence.
Abstract:
In order to properly train a machine learning model, data must be properly collected. To guarantee a proper
data collection, verifying that the collected data set holds certain properties is a possible solution; for example,
guaranteeing that the data set contains samples across the whole input space, or that the data set is balanced
w.r.t. the different classes. We present a formal approach for verifying a set of arbitrarily stated properties over
a data set. The proposed approach relies on the transformation of the data set into a first order logic formula,
which can later be verified w.r.t. the different properties, also stated in the same logic. A prototype tool, which
uses the z3 solver, has been developed; the prototype can take as an input a set of properties stated in a formal
language and formally verify a given data set w.r.t. the given set of properties. Preliminary experimental
results show the feasibility and performance of the proposed approach, and furthermore its flexibility for
expressing properties of interest.
1 INTRODUCTION
In the past few decades, Machine Learning (ML) has
gained a lot of attention, partially due to the creation
of software libraries (e.g., (Pedregosa et al., 2011))
that ease the usage of complex algorithms. In this
context, the volume of stored data has dramatically
increased over the last few years. However, an often
overlooked task is the data extraction and collection
to create proper data sets to train efficient machine
learning models.
When retrieving information for the data set col-
lection, there are key points to take into consideration.
The reason is that ML models generalize their output
based on the training (seen) data. However, a problem
that is commonly encountered is that a model is
expected to generalize well to unseen regions of the
input space, while such regions do not behave in accordance
with the provided training data. Another problem
that often occurs is that there is a class in the data
set which is underrepresented (e.g., for an anomaly
detection data set, 99% of the examples are normal
events). In general, many data biases can occur in a
collected data set. A simple strategy while collecting
data sets is to collect a large number of entries, con-
jecturing that important data are likely to be found if
more data are available. However, this strategy does not
guarantee correct results, and moreover, large data sets can
cause ML models to be trained for longer than necessary;
this in turn can make certain algorithms, which
may otherwise yield accurate results, unusable in such cases.
Additionally, with the proliferation of machine gener-
ated data sets, for example via Generative Adversarial
Networks, assuring that the generated data set holds
some properties of interest is of utmost importance.
In order to guide the collection of a proper data
set to effectively train a ML model, verifying that a
partially collected data set holds certain properties of
interest is a possible solution. This verification can be
done with the use of formal methods, such as
Satisfiability Modulo Theories (SMT) (Barrett
and Tinelli, 2018). With a formal proof that the data
set holds certain properties, it is feasible to create a
formal specification of a data set. Whenever this spec-
ification is violated (certain properties do not hold),
identifying the properties that do not hold may help
to diagnose the missing information. This paper is
devoted to the formal verification of machine learn-
ing data sets through the use of SMT (for preliminary
concepts on ML and SMT, see Section 2). The ap-
proach is based on the encoding of the data set into
a Many-Sorted First Order Logic (MSFOL) formula
which is later verified together with the desired set of
properties (see Section 3).
A tool for the verification of data sets has been
developed. The tool relies on the use of the widely-
known z3 (De Moura and Bjørner, 2008) solver. Preliminary
experimental results show that, in spite of the
high computational complexity of SMT procedures,
data set properties can be verified in a reasonable
amount of time (see Section 4).
It is important to note that verifying certain properties
over a data set is a task which is widely
considered necessary, and a norm for many practitioners.
However, in the literature, very few researchers
focus on the automatic validation of data sets
(see for example (Carvallo et al., 2017)). Further-
more, to the best of our knowledge, there is no work
which aims at providing means for the verification of
arbitrarily stated properties, and moreover, in a formal
manner. In this light, this paper aims at exploring this
direction.
2 PRELIMINARIES
2.1 Machine Learning and Structured
Data Sets
We consider that a structured machine learning data
set contains examples along with their expected
outputs. Given the inputs and expected outputs, the
final goal of a supervised ML algorithm is to learn
how to map a training example to its expected output.
For an unsupervised ML algorithm the goal is to learn
patterns from the data; thus, the expected output does
not exist. In our work, we consider that the expected
outputs are always present, and thus, a data set for un-
supervised machine learning (where there are no ex-
pected outputs) has the same expected output for all
training examples. Further, we consider only struc-
tured data sets.
Formally, the inputs are called features or parameters.
A feature vector, denoted as X, is an n-tuple of
the different inputs, x_1, x_2, . . . , x_n. The expected output
for a given feature vector is called a label, denoted
simply as y, and the possible set of outputs is respectively
denoted as Y. The set of examples, called a
training data set, consists of pairs of a feature vector
and a label; each pair is called a training example,
denoted as (X, y). For convenience, we represent the
data set as a matrix D (of dimension m × n) and a vector O
(of dimension m), where D contains the feature vectors and O
contains the expected outputs for a data set of cardinality m.
The vector representing the i-th row (training vector) is denoted
as D_i, and its associated expected output as O_i.
Likewise, the j-th feature (column vector) is denoted
as D^T_j (D^T denotes the transpose of the matrix D). Finally,
the j-th parameter of the i-th training example
is denoted by the matrix element d_{i,j}.
2.2 Satisfiability Modulo
Theories (SMT)
SMT is a decision problem: for a given first
order logic formula φ, it consists in searching whether φ is satisfiable w.r.t.
a set of background theories. For example, w.r.t. integer
linear arithmetic, the following formula is satisfiable:
Φ = (x ∈ Z) ∧ (y ∈ Z) ∧ (x < y) ∧ (x < 0) ∧ (y > 0) ∧ (x + y > 0);
the formula can be satisfied, for instance,
by the interpretation x = −1, y = 2. The importance
of restricting the interpretation of certain function
and predicate symbols in a first-order logic formula
(according to a background theory T) is that
specialized decision procedures have been proposed,
thus making the problem of checking the satisfiability
of such formulas decidable.
It is important to note that many of the applica-
tions that use SMT involve different data types (Bar-
rett and Tinelli, 2018). Therefore, SMT usually works
with a sorted (typed) version of first order logic (Man-
zano, 1993). Essentially, in SMT there exists a finite
set of sort symbols (types) S and an infinite set of vari-
ables X for the (sorted) formulas, where each variable
has a unique associated sort in S. This is an over-
simplification of a many-sorted first order logic (MS-
FOL). As MSFOL is useful to express our formulas
of interest, in the next subsection we provide a formal
definition of its syntax (Finkbeiner and Zarba, 2006;
Barrett and Tinelli, 2018; Barrett et al., 2009).
2.2.1 Many-sorted First-order Logic Syntax
A signature is a tuple Σ = (S,C, F, P), where S is a
non-empty and finite set of sorts, C is a countable set
of constant symbols whose sorts belong to S, F and P
are countable sets of function and predicate symbols
correspondingly whose arities are constructed using
sorts that belong to S. Predicates and functions have
an associated arity in the form σ_1 × σ_2 × . . . × σ_n → σ,
where n ≥ 1 and σ_1, σ_2, . . . , σ_n, σ ∈ S.
A Σ-term of sort σ is either: (i) each variable x
of sort (type) σ, where σ ∈ S; (ii) each constant c of
sort (type) σ, where σ ∈ S; and (iii) for f ∈ F with arity
σ_1 × σ_2 × . . . × σ_n → σ, f(t_1, . . . , t_n) is a term of sort σ,
where each t_i (for i ∈ {1, . . . , n}) is a Σ-term of sort σ_i.
A Σ-atom (Σ-atomic formula) is an expression
in the form s = t or p(t_1, t_2, . . . , t_n), where = denotes
the equality symbol, s and t are Σ-terms of
the same sort, t_1, t_2, . . . , t_n are Σ-terms of sorts σ_1, σ_2,
. . . , σ_n ∈ S, respectively, and p is a predicate of arity
σ_1 × σ_2 × . . . × σ_n.
A Σ-formula is either: (i) a Σ-atom; (ii) if φ is a
Σ-formula, ¬φ is a Σ-formula, where ¬ denotes negation;
(iii) if both φ, ψ are Σ-formulas, then φ ∧ ψ and
φ ∨ ψ are Σ-formulas (likewise, the short notations
φ → ψ and φ ↔ ψ for ¬φ ∨ ψ and (φ ∧ ψ) ∨ (¬φ ∧ ¬ψ),
respectively); finally, (iv) if φ is a Σ-formula and x is a variable
of sort σ, then ∃x ∈ σ φ (x ∈ σ is used to indicate
that x has the sort σ) is a Σ-formula (likewise,
the short notation ∀x ∈ σ φ for ¬∃x ∈ σ ¬φ), where
∃ denotes the existential quantifier and ∀ denotes the
universal quantifier, as usual.
We leave out the formal semantics of MSFOL formulas,
their interpretations and satisfiability, as we
feel it would unnecessarily burden the paper with unused
formalism. However, we briefly discuss some aspects
of MSFOL formula satisfiability. As previously mentioned,
for some signatures there exist decision procedures
which help to determine if a given formula is
satisfiable. For example, consider the signature with a
single sort R, all rational number constants, the functions
+ and −, and the predicate symbol ≤; SMT will interpret
the constants, functions and predicates in the
usual real arithmetic sense. The satisfiability of Σ-formulas
for this theory (real arithmetic) is decidable,
even for formulas with quantifiers (Barrett and Tinelli,
2018; Manna and Zarba, 2003), i.e., for some infinite
domain theories, there exist procedures (often seeking to
“eliminate” the quantifiers and obtain an equivalent
quantifier-free formula) to decide if a
given quantified formula is satisfiable. Therefore, the
satisfiability of formulas such as ∃n ∈ R ∀x ∈ R (x + n = x)
can be automatically determined (via a computer program
implementing the decision procedure, i.e., an
SMT solver). If a formula is satisfiable, there exists an
interpretation (or model) for the formula, i.e., a set of
concrete values for the variables, predicates and functions
of the formula that makes this formula evaluate
to TRUE.
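As an illustration, such a quantified real-arithmetic formula can be handed directly to a solver. The sketch below uses z3's Python bindings (an illustrative assumption, not the paper's tool); the existentially quantified n is modeled as a free constant, so that a model also provides a concrete witness for it.

# A minimal sketch (z3 Python bindings) of deciding a quantified
# real-arithmetic formula; n is a free real constant, so asserting the
# universally quantified body corresponds to the existential closure over n.
from z3 import Real, ForAll, Solver, sat

n, x = Real('n'), Real('x')
s = Solver()
s.add(ForAll(x, x + n == x))   # exists n in R . forall x in R . x + n = x

if s.check() == sat:
    print(s.model()[n])        # the only possible witness is n = 0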
3 DATA SET ENCODING AND
FORMAL VERIFICATION
As previously mentioned (see Section 2), a ML data
set is composed of a matrix D (of dimension m × n) and a vector O
(of dimension m), where m is the number of training examples, n the
number of features, D contains the training examples,
and O the expected outputs. However, note that in our
definition of this matrix we never mentioned the type
of each feature in the data set. In general, there is no
theoretical limitation on the type of these features;
nonetheless, for practical reasons, we consider that
all features are real valued. The main reason is that
otherwise additional information would be required
for each of the features. Moreover, in practice, well-known
libraries work with real-valued features. As
usual, for those features which are not naturally real,
an encoding must be found (for example, one-hot encoding
for categorical features). Thus, we consider that
d_{i,j}, o_i ∈ R, ∀i ∈ {1, . . . , m}, ∀j ∈ {1, . . . , n}.
Additionally, we assume that O is always present in
the data sets, independently of whether the data set is meant
for supervised or unsupervised machine learning. If a
data set is not labeled, then ∀i, k ∈ {1, . . . , m}, o_i = o_k.
Encoding a ML Data Set as a MSFOL Formula.
Having a convenient formal description of a data set
eases its encoding as a MSFOL formula.
To encode the data as a formula, we make use
of the theory of arrays (which considers basic read and
write axioms (Stump et al., 2001)). We denote that an object a is
of sort array, with indices of type (sort) T1 and holding
objects of type T2, as a ∈ A_{T1,T2}. Indeed, a data
set can be encoded using Algorithm 1; the algorithm
creates a formula that is satisfiable by a model of arrays
which represent the data set.
Algorithm 1: Data set encoding.
Input: A data set D (of dimension M × N, with N features and M training examples) and its expected output vector O (of dimension M)
Output: A MSFOL formula representation of the data set, φ
Step 0: Set φ ← TRUE, set labels ← ARRAY(), and set L ← 0;
Step 1: Set φ ← φ ∧ (m, n, l ∈ Z) ∧ (m = M) ∧ (n = N);
Step 2: Set φ ← φ ∧ (D ∈ A_{Z,A_{Z,R}}) ∧ (O ∈ A_{Z,R}) ∧ (ℒ ∈ A_{Z,R});
Step 3: for i ← 0; i < M; i ← i + 1 do
    Set add ← TRUE;
    for j ← 0; j < N; j ← j + 1 do
        Set φ ← φ ∧ (D[i][j] = d_{i,j});
    Set φ ← φ ∧ (O[i] = o_i);
    for k ← 0; k < L; k ← k + 1 do
        if labels[k] = o_i then
            Set add ← FALSE;
    if add then
        Set labels[L] ← o_i;
        Set φ ← φ ∧ (ℒ[L] = o_i);
        Set L ← L + 1;
Step 4: Set φ ← φ ∧ (l = L) and return φ
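For concreteness, the encoding of Algorithm 1 can be sketched with z3's Python bindings as follows. This is only an assumption-laden illustration: the paper's prototype is written in Julia, and the helper name encode_dataset (as well as the array name Lab used for ℒ) is hypothetical.

# A minimal sketch of Algorithm 1 using z3's Python bindings (z3-solver).
from z3 import Int, Array, ArraySort, IntSort, RealSort, RealVal, And

def encode_dataset(d, o):
    """Encode a data set (list of feature rows d, expected outputs o)
    as a MSFOL formula over the theory of arrays."""
    M, N = len(d), len(d[0])
    m, n, l = Int('m'), Int('n'), Int('l')
    D = Array('D', IntSort(), ArraySort(IntSort(), RealSort()))  # D in A_{Z,A_{Z,R}}
    O = Array('O', IntSort(), RealSort())                        # O in A_{Z,R}
    Lab = Array('Lab', IntSort(), RealSort())                    # the label array
    conjuncts = [m == M, n == N]           # Step 1 (sorts are implicit in z3)
    labels, L = [], 0                      # Step 0: distinct labels seen so far
    for i in range(M):                     # Step 3
        for j in range(N):
            conjuncts.append(D[i][j] == RealVal(str(d[i][j])))
        conjuncts.append(O[i] == RealVal(str(o[i])))
        if o[i] not in labels:             # record each distinct label once
            conjuncts.append(Lab[L] == RealVal(str(o[i])))
            labels.append(o[i])
            L += 1
    conjuncts.append(l == L)               # Step 4
    return And(conjuncts), (m, n, l, D, O, Lab)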
3.1 Formal Verification of Data Sets
Indeed, a data set can be formally defined as an MSFOL
formula φ_ds which holds the following properties:
φ_ds is a conjunction of five main parts, that is, i)
the assertion that an integer variable m is of the size of
the number of training examples, a variable n is of the
size of the number of features, and a variable l is of the size of
the number of distinct labels; ii) the assertion that D is a two-dimensional
(integer indexed) real-valued array (of
size m × n) and O, ℒ are integer-indexed real-valued
arrays (of size m and l, respectively); iii) D[i][j] contains
the j-th feature value for the i-th training example;
iv) O[i] contains the expected output for the i-th
training example; and, v) ℒ[i] contains the i-th (distinct)
label.
We assume that we want to verify k properties
over the data set, and furthermore, that these properties
are also expressed in MSFOL. Indeed, MSFOL
allows expressing many properties of interest (in Section 3.2
we showcase its expressiveness). Therefore,
we assume that we are given MSFOL formulas π_1, . . . , π_k
to verify. These properties involve the variables
in φ_ds. Additionally, we assume that these formulas
should all hold independently over the data set, and
that their conjunction is satisfiable. Thus, we impose the
restriction that π_x ∧ π_y is satisfiable, for all x, y ∈ {1, . . . , k};
we call this set of properties the data set specification σ.
This means that two properties may not
contradict each other. For example, it cannot be required
that the data set has more than 30 training examples
and at the same time that it must have at most
20 ((π_1 ≡ (m > 30)) ∧ (π_2 ≡ (m ≤ 20))). Additionally,
that the conjunction of properties must be satisfiable
means that there is an interpretation that makes this
formula (the conjunction) evaluate to TRUE, i.e., there
exists a data set which can satisfy this specification.
Otherwise, the verification of any data set is useless,
as no data set can hold such a set of properties.
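Checking that a specification is itself consistent can be done with the solver alone, before any data set is considered. The following minimal sketch (z3 Python bindings, an illustrative assumption) checks the contradictory pair of properties from the example above.

# A minimal sketch (z3 Python bindings): the specification pi_1 /\ pi_2 from
# the text is inconsistent, so no data set can conform to it.
from z3 import Int, Solver, sat

m = Int('m')
pi_1 = m > 30     # "more than 30 training examples"
pi_2 = m <= 20    # "at most 20 training examples"

s = Solver()
s.add(pi_1, pi_2)            # sigma = pi_1 /\ pi_2
print(s.check() == sat)      # False: the specification cannot be satisfied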
The Formal Data Set Verification Problem. The problem can
be reduced to the following: given a data set formula
φ_ds (created using Algorithm 1 from D and O) and
a data set specification σ = π_1 ∧ . . . ∧ π_k, is φ_ds ∧ σ satisfiable?
If the conjunction of these formulas is satisfiable,
then each of the properties must hold for the
data set, as the conjunction of all properties is satisfiable
by itself; if the conjunction is satisfiable, we say
that the data set holds the properties π_1, . . . , π_k, or that
the data set conforms to the specification σ. Perhaps
this is quite an abstract view of the problem. For that
reason, in the following subsection we provide concrete
examples that should help the reader to better
understand.
3.2 Example Data Set and Properties
First, let us consider a very small data set:

D =
  0.051267   0.69956
  0.092742   0.68494
  0.21371    0.69225
  0.375      0.50219
  0.51325    0.46564
  0.52477    0.2098
  0.39804    0.034357
  0.30588    0.19225
  0.016705   0.40424
  0.13191    0.51389

O = (1, 0, −1, 1, 1, 1, 1, 1, 1, 1)^T

After applying Algorithm 1 to D and O as shown
before, the output (φ_ds) is:

(m, n, l ∈ Z) ∧ (m = 10) ∧ (n = 2)
∧ (D ∈ A_{Z,A_{Z,R}}) ∧ (O ∈ A_{Z,R}) ∧ (ℒ ∈ A_{Z,R})
∧ (D[0][0] = 0.051267) ∧ (D[0][1] = 0.69956) ∧ (O[0] = 1) ∧ (ℒ[0] = 1)
∧ (D[1][0] = 0.092742) ∧ (D[1][1] = 0.68494) ∧ (O[1] = 0) ∧ (ℒ[1] = 0)
∧ (D[2][0] = 0.21371) ∧ (D[2][1] = 0.69225) ∧ (O[2] = −1) ∧ (ℒ[2] = −1)
∧ (D[3][0] = 0.375) ∧ (D[3][1] = 0.50219) ∧ (O[3] = 1)
∧ (D[4][0] = 0.51325) ∧ (D[4][1] = 0.46564) ∧ (O[4] = 1)
∧ (D[5][0] = 0.52477) ∧ (D[5][1] = 0.2098) ∧ (O[5] = 1)
∧ (D[6][0] = 0.39804) ∧ (D[6][1] = 0.034357) ∧ (O[6] = 1)
∧ (D[7][0] = 0.30588) ∧ (D[7][1] = 0.19225) ∧ (O[7] = 1)
∧ (D[8][0] = 0.016705) ∧ (D[8][1] = 0.40424) ∧ (O[8] = 1)
∧ (D[9][0] = 0.13191) ∧ (D[9][1] = 0.51389) ∧ (O[9] = 1) ∧ (l = 3)
Let us start by showcasing very simple properties
and how their formal verification works. Suppose
the specification consists of a single property: “the
data set must contain at least 100 training examples”;
this property can be expressed in MSFOL simply as
π_# ≡ (m ≥ 100). Notice how φ_ds ∧ π_# is not satisfiable,
as there does not exist an interpretation that makes
it evaluate to TRUE; particularly, if m is greater than
99, then the clause (in φ_ds) m = 10 cannot evaluate to
TRUE, and since this is a conjunction, φ_ds ∧ π_# evaluates
to FALSE. Similarly, if m is 10, then π_# makes
the conjunction evaluate to FALSE. Thus, we say that
the data set does not hold the property π_#.
Let us now examine more complex properties
that can be formally verified over the data set. A
slightly more complex property to verify is: “the data
set must be min-max normalized”, which can be expressed
in MSFOL as π_± ≡ ¬∃(i, j ∈ Z)((i ≥ 0) ∧ (i < m) ∧
(j ≥ 0) ∧ (j < n) ∧ ((D[i][j] < min) ∨ (D[i][j] > max))).
Certainly, min and max are defined constants
(e.g., −1 and 1), and either these variables must be defined
or the values must be replaced; for min = −1 and
max = 1, φ_ds holds the property π_± (as φ_ds ∧ π_± is
satisfiable).
The previous properties are useful to showcase
how easy it is to translate desired properties into the
formalism. However, verifying such properties is
quite simple, and furthermore can be uninteresting,
as the data set can, for example, be normalized later on.
As previously stated, our motivation comes
from the proper extraction and collection of the data
set. We have discussed the case where training examples
are provided for some regions of the input
space while some other regions are overlooked. To
verify that “the data set is sampled across the whole
input space”, the following property can be verified:
π_δ ≡ ¬∃(p ∈ A_{Z,R}) ∀(i ∈ Z) (((i ≥ 0) ∧ (i < m)) ⟹
(√(Σ_{j=0..n−1} (p[j] − D[i][j])²) > δ)); the property basically
states that there does not exist a point whose
Euclidean distance to every training example is greater than a chosen constant
δ. As an example, for δ = 1, our example data set
does not hold the property π_δ, as there exists
a point in the input space with greater Euclidean
distance to all training examples, for example p[0] = 2 and p[1] = 2. Note
that the property never specifies the minimum or maximum
values of the input space, and thus, it is likely
that no data set is sampled over an infinite domain.
An easy solution is to add such constraints to π_δ, i.e.,
¬∃(l ∈ Z)((l ≥ 0) ∧ (l < n) ∧ ((p[l] > max) ∨ (p[l] < min))),
for given max and min constants. We draw
the reader's attention to the fact that a formal specification
must be well stated; this is an assumption
of our work and, generally, of any formal verification
strategy.
Finally, note that sometimes it is more convenient
to state negated properties. For example, to verify that
the data set is balanced, we can verify the following
property: “there is no class which has fewer than
m/(βl) samples,” where l is the number of different outputs
(labels) and β is a chosen constant. This property
states that the data set must have an equal amount
of samples per class, up to a given constant. For example, if
β = 1 the data set must be perfectly balanced, while
if β = 2 only half of the samples (of a perfectly balanced
data set) are required per class. It is important
to state that unbalanced data sets represent a real
problem for current machine learning algorithms, and
moreover, they are often encountered in practice. Accordingly,
researchers actively try to tackle the problem
(see for example (Lemaître et al., 2017)). Indeed,
it may not be that intuitive how to state this property
in first order logic. There are many particularities that
must be considered; for example, the fact that there is
no notion of loops in first order logic, while we need
to define a function to count the number of instances
where a given label appears. To overcome this particular
problem, a recursive function can be defined (a sketch is
shown after this paragraph). In
order to keep the paper readable, we avoid these definitions
and simply denote defined functions in mathematical
bold font. The interested reader can refer
to the prototype implementation section (Section 4)
and correspondingly to the tool's repository to check
the full property implementation. We state the aforementioned
property as: π_β ≡ ¬∃(i ∈ Z)((i ≥ 0) ∧ (i <
l) ∧ (S(O, ℒ[i], m) < m/(βl))), where S(A, v, s) is a function
that returns the number of times the value v is
found in an array A up to index s; that is, how
many times the label ℒ[i] appears in the output array O.
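As a sketch of how such a defined (recursive) function can look, the following uses z3's Python bindings; the names S, Lab and the constant beta are illustrative assumptions, not the tool's exact encoding, and the inequality is stated as S · β · l < m to stay in integer arithmetic.

# A minimal sketch (z3 Python bindings) of the recursive counting function S
# and the (negated) balance property pi_beta.
from z3 import (Int, Ints, Real, IntVal, Array, ArraySort, IntSort, RealSort,
                RecFunction, RecAddDefinition, If, Exists, Not, And)

O = Array('O', IntSort(), RealSort())       # expected outputs
Lab = Array('Lab', IntSort(), RealSort())   # the distinct labels
m, l, i = Ints('m l i')
beta = 2                                    # chosen constant

# S(A, v, s): number of times value v appears in array A among indices 0..s-1
A = Array('A', IntSort(), RealSort())
v, s = Real('v'), Int('s')
S = RecFunction('S', ArraySort(IntSort(), RealSort()), RealSort(), IntSort(), IntSort())
RecAddDefinition(S, [A, v, s],
                 If(s <= 0, IntVal(0),
                    If(A[s - 1] == v, IntVal(1), IntVal(0)) + S(A, v, s - 1)))

# pi_beta: there is no class with fewer than m/(beta*l) samples
# (written as S(O, Lab[i], m) * beta * l < m to avoid real division)
pi_beta = Not(Exists(i, And(i >= 0, i < l, S(O, Lab[i], m) * beta * l < m)))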
We have exemplified different properties that can
be formally verified over data sets. We do not focus
on an extensive list of properties but, rather, on providing
means for formally verifying any property over
a given data set. We could state many more properties,
for example, that there are no contradicting training
examples in the data set, i.e., there do not exist two
equal rows in D, with different indices, for which
the corresponding elements in O differ (a sketch of this
property is given below). We conclude this
section with these examples. However, we note that,
as shown in the previous examples, the formalism is
quite flexible for expressing real properties of interest.
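As an illustration of this last property, a minimal sketch in z3's Python bindings (an assumption, not the tool's encoding) is given here; it reuses the array declarations of the data set encoding.

# A minimal sketch (z3 Python bindings) of the "no contradicting training
# examples" property: no two distinct rows agree on all features but differ
# in their expected output.
from z3 import (Array, ArraySort, IntSort, RealSort, Ints,
                ForAll, Exists, Not, And, Implies)

D = Array('D', IntSort(), ArraySort(IntSort(), RealSort()))
O = Array('O', IntSort(), RealSort())
m, n = Ints('m n')
i, k, j = Ints('i k j')

no_contradictions = Not(Exists([i, k], And(
    i >= 0, i < m, k >= 0, k < m, i != k,
    ForAll(j, Implies(And(j >= 0, j < n), D[i][j] == D[k][j])),
    O[i] != O[k])))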
Discussion On Standard and Domain-specific
Properties. We have showcased the flexibility of
the proposed approach with somewhat standard properties
to check. One can imagine more of these properties;
for example, guaranteeing that there are no outlier
training examples (training examples which have extreme
values, far from the rest of the data points) in the data set can be logically
expressed as finding points in the space with
high variance. Nonetheless, it is interesting to point
out that the approach is generic, and domain-specific
properties coming from expert knowledge can also be
used to formulate properties. For example, consider
a real estate data set, where two categorical features,
isHouse and isApt, cannot both be TRUE at the same
time. Similarly, for other specific domains such as computer
networks, network packets cannot have
mutually exclusive headers (e.g., transmission control
protocol and user datagram protocol). For the first
case (real estate), the property can be simply stated as
¬∃(i ∈ Z)((i ≥ 0) ∧ (i < m) ∧ (D[i][J] = D[i][K])), assuming
J and K are the corresponding indices for the
isHouse and isApt features. In general, as the properties
to check can be added or removed arbitrarily,
checking a particular set of those for a particular data
set is possible (see the tool's description in the next
section).
Discussion On the Usability and Computational
Complexity of the Proposed Approach. The proposed
approach reduces the problem to the satisfiability
of many-sorted first order logic. The problem
is known to be NP-hard (for formulas without quantifiers),
and depending on the theories involved, the
problem becomes computationally harder (PSPACE-hard,
or even exponential or doubly exponential). This
fact may lead the reader to believe that the approach
is not feasible for real-life data sets. This is true only
if the particular instances belong to the worst case. In
reality, millions of problem instances which are considered
intractable in theory are easily solved, as the instances do not fall into
the worst case. Additionally, as most of the showcased
properties can be individually computed via ad-hoc
procedures (with small programs), it may seem
as if the approach is cumbersome. However, note that
not all properties are easily computable. Take for instance
the property of guaranteeing that there does not
exist a point in the space which is more distant (than a
given constant) from all data set points; brute-force enumeration
is not feasible. Furthermore, some of the
properties to be verified may be computationally hard
by themselves. As we showcase in the next section,
our preliminary experimental results show that small
data sets are verified in milliseconds. It is interesting
to observe that many of the properties can be verified
by batches in parallel. For example, the property of
verifying that the data set is min/max normalized can
be checked independently, as there are no dependencies
between the data; in general, this holds for any property that
does not relate two training examples. However, this
is out of the scope of this initial work. Finally, we also
note that our approach is targeted at properly constructing
a data set. However, it is still useful for existing
data sets which cannot be modified. Even if the collection
is finished, knowing whether the data set holds a
given property or not is the first step toward fixing or
circumventing the problem.
4 TOOL DEVELOPMENT AND
EXPERIMENTAL RESULTS
In order to assess the feasibility and efficiency of the
proposed approach, a prototype tool has been devel-
oped in Julia (Bezanson et al., 2017). Generally,
speaking, the tool takes as an input: a Comma Sepa-
rated Values (CSV) file as a data set, assuming that the
last column of each row must be the expected output
for the training example (remainder of the columns);
a directory, where the properties to be checked are
stored, one per file in the SMT-LIB language.
SMT-LIB. is a language that many SMT solvers
can take as an input and its syntax is quite intuitive.
For example, for expressing the property @(i, j
Z)((i 0) (i < n) ( j 0) ( j < m) ((D[i][ j] <
min) (D[i][ j] > max))) can be simply done in SMT-
LIB as shown in Listing 1.
Listing 1: π_± in SMT-LIB.
(assert
  (not
    (exists ((i Int) (j Int))
      (and
        (>= i 0)
        (< i m)
        (>= j 0)
        (< j n)
        (or
          (< (select (select D i) j) min)
          (> (select (select D i) j) max)
        )
      )
    )
  )
)
The tool works as described in Algorithm 2. Note
that SMT(·) denotes an SMT procedure call to determine if
the given formula is satisfiable. In our tool, we use
the z3 (De Moura and Bjørner, 2008) solver (which
takes as an input the SMT-LIB format). The interested
reader can check the properties stated in SMT-LIB and
more information about our tool in the tool's repository
(López, 2021).
4.1 Preliminary Experimental Results
All experiments were executed on commodity hardware
with the intention to showcase the performance
of the proposed approach. The experiments were
performed on Ubuntu 20.04 LTS with 4 Intel(R)
Core(TM) i5-6300U CPU cores @ 2.40GHz and 8GB of
RAM.
Algorithm 2: Data Set Verification.
Input: A CSV data set file f (with n ≥ 1 features and m ≥ 1 training examples), and a directory d containing property files
Output: Verdicts for each property π ∈ d
Step 0: Read f and store it into the arrays D and O, and set m and n, correspondingly;
Step 1: Use Algorithm 1 to obtain φ_ds from D, O, m, and n;
Step 2: foreach file p ∈ d do
    Read the contents of p into the formula π;
    if SMT(φ_ds ∧ π) is satisfiable then
        display(π holds in the data set f)
    else
        display(π does not hold for the data set f)
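A minimal sketch of this workflow with z3's Python bindings is shown below. It is an illustrative assumption rather than the actual tool (DSVerif is written in Julia): the helper encode_dataset comes from the Algorithm 1 sketch, the property files are assumed to be .smt2 files referring to the declared names m, n, l, D, O and Lab, and the decls mapping makes those declarations visible to the SMT-LIB parser.

# A minimal sketch of Algorithm 2 using z3's Python bindings.
import csv
import glob
from z3 import Solver, parse_smt2_file, sat

def verify_dataset(csv_file, prop_dir):
    # Step 0: read the CSV file; the last column is the expected output
    with open(csv_file) as fh:
        rows = [[float(v) for v in r] for r in csv.reader(fh) if r]
    d = [r[:-1] for r in rows]
    o = [r[-1] for r in rows]

    # Step 1: encode the data set as a MSFOL formula (Algorithm 1 sketch above)
    phi_ds, (m, n, l, D, O, Lab) = encode_dataset(d, o)
    decls = {'m': m, 'n': n, 'l': l, 'D': D, 'O': O, 'Lab': Lab}

    # Step 2: check phi_ds /\ pi for each property file (SMT-LIB format)
    for prop_file in sorted(glob.glob(prop_dir + '/*.smt2')):
        s = Solver()
        s.add(phi_ds)
        s.add(parse_smt2_file(prop_file, decls=decls))   # the property pi
        verdict = 'holds' if s.check() == sat else 'does not hold'
        print(prop_file, verdict, 'for the data set', csv_file)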
In order to evaluate the feasibility of our proposed
solution, the properties π_#, π_±, π_δ and π_β have been
encoded in SMT-LIB, and a data set was incrementally
tested. We present the results of both the performance
and the satisfiability of the properties w.r.t. the data
sets in Figures 1 and 2, respectively. As can be seen, the
performance of the proposed approach is acceptable;
as in any formal verification approach, the decision procedures
are often exponential in the worst case. For
formally guaranteeing that the data set holds certain
properties of interest, this procedure can be executed
once, in which case the running time is not much of a
constraint. Our preliminary experimental evaluation
shows that properties are solved fast (milliseconds
per hundreds of training examples), especially simple
properties (e.g., π_#).
Figure 1: Performance of formal data set verification (verification time in seconds vs. the number of training examples m, for the properties π_δ, π_#, π_± and π_β).
Figure 2: Satisfiability of properties (w.r.t. the data set conjunction); the verdict (Sat., Unsat., or Unknown) vs. the number of training examples m, for π_δ, π_#, π_± and π_β.

It is interesting to observe the satisfiability of the
properties. It is normal that, when adding more training
examples, the data set may become balanced or unbalanced
(π_β); it is also normal that all data sets which
have less than 100 training examples fail the property
π_#. One can conclude that the example data set is
also well min/max normalized, as π_± is always satisfiable.
Finally, note that even if the language allows
it and the solver can read the property π_δ, the property
is very complicated, as it is quantified over an array;
the solver cannot process such a complex formulation
and so the property always returns an unknown status.
We envision different strategies to overcome this
problem. For example, instead of formulating the
problem as it is, one can pre-process the dimension of the
training vectors and state the formula quantified over
n reals (∃p_1, . . . , p_n ∈ R (ψ)). This should effectively
reduce the complexity of the formula; however, this
may require a Domain Specific Language (DSL) for
stating properties of interest. Note, however, that this
approach is out of the scope of this initial study.
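A minimal sketch of this reformulation for n = 2, using z3's Python bindings, is the following; the constants δ, min and max, and all names, are illustrative assumptions, and squared distances are used to avoid the square root.

# A minimal sketch (z3 Python bindings) of the coverage property pi_delta
# reformulated for a fixed dimension n = 2: quantify over two real variables
# p1, p2 instead of over an array.
from z3 import (Reals, Int, Array, ArraySort, IntSort, RealSort,
                Exists, ForAll, Implies, And, Not)

D = Array('D', IntSort(), ArraySort(IntSort(), RealSort()))
m, i = Int('m'), Int('i')
p1, p2 = Reals('p1 p2')
delta, lo, hi = 1, -1, 1     # chosen constants (delta, min, max)

# No point (p1, p2) inside [lo, hi]^2 is farther than delta from every
# training example (squared distances avoid the square root).
pi_delta_n2 = Not(Exists([p1, p2], And(
    p1 >= lo, p1 <= hi, p2 >= lo, p2 <= hi,
    ForAll(i, Implies(And(i >= 0, i < m),
                      (p1 - D[i][0]) ** 2 + (p2 - D[i][1]) ** 2 > delta ** 2)))))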
5 CONCLUSION AND FUTURE
WORK
In this paper, we have proposed a formal data set ver-
ification approach. Such formal verification can be
used for guaranteeing that the data extraction is ade-
quate for properly training a machine learning model.
We have showcased different formal properties to be
verified over data sets, and experimentally shown
that the approach is feasible and, furthermore, flexible
w.r.t. the semantic capabilities of the proposed formalism.
As for future work we plan to test the performance
of the approach on large scale data sets. Considering
performance enhancement by automatically recogniz-
ing properties that can be verified by batches in paral-
lel is another interesting direction. Also, we intend to
further investigate DSLs for property specification (as
discussed in Section 4). Additionally, as each of the
training examples gets translated into a part of a for-
mula, it is interesting to try to remove some training
examples when a property is not satisfiable, in order to
obtain a satisfiable one; this would allow automatically
repairing data sets w.r.t. a set of properties. Nev-
ertheless, different elements must be taken into con-
sideration, for example, the fact that the model found
by the solver may include other training examples,
which are fictitious. Finally, an interesting direction
is to consider the formal verification of unstructured
data for machine learning.
REFERENCES
Barrett, C., Sebastiani, R., Seshia, S., and Tinelli, C. (2009).
Satisfiability modulo theories, volume 185 of Fron-
tiers in Artificial Intelligence and Applications, pages
825–885. IOS Press, 1 edition.
Barrett, C. and Tinelli, C. (2018). Satisfiability modulo the-
ories. In Handbook of Model Checking, pages 305–
343. Springer.
Bezanson, J., Edelman, A., Karpinski, S., and Shah, V. B.
(2017). Julia: A fresh approach to numerical comput-
ing. SIAM review, 59(1):65–98.
Carvallo, P., Cavalli, A. R., and Kushik, N. (2017). Automatic
derivation and validation of a cloud dataset for
insider threat detection. In Proceedings of the 12th
International Conference on Software Technologies -
ICSOFT, pages 480–487. INSTICC, SciTePress.
De Moura, L. and Bjørner, N. (2008). Z3: An efficient smt
solver. In International conference on Tools and Algo-
rithms for the Construction and Analysis of Systems,
pages 337–340. Springer.
Finkbeiner, B. and Zarba, C. G. (2006). Many-sorted
logic. https://www.react.uni-saarland.de/teaching/
decision-procedures-verification-06/ch01.pdf. Last
Accessed: 2020-05-12.
Lemaître, G., Nogueira, F., and Aridas, C. K. (2017).
Imbalanced-learn: A python toolbox to tackle the
curse of imbalanced datasets in machine learning.
Journal of Machine Learning Research, 18(17):1–5.
López, J. (2021). DSVerif: a formal data set verification
tool. https://github.com/jorgelopezcoronado/DSVerif.
Manna, Z. and Zarba, C. G. (2003). Combining decision
procedures. In Formal Methods at the Crossroads.
From Panacea to Foundational Support, pages 381–
422. Springer.
Manzano, M. (1993). Introduction to many-sorted logic.
In Meinke, K. and Tucker, J. V., editors, Many-sorted
Logic and Its Applications, pages 3–86. John Wiley &
Sons, Inc., New York, NY, USA.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,
Thirion, B., Grisel, O., Blondel, M., Prettenhofer,
P., Weiss, R., Dubourg, V., Vanderplas, J., Passos,
A., Cournapeau, D., Brucher, M., Perrot, M., and
Duchesnay, E. (2011). Scikit-learn: Machine learning
in Python. Journal of Machine Learning Research,
12:2825–2830.
Stump, A., Barrett, C. W., Dill, D. L., and Levitt, J. (2001).
A decision procedure for an extensional theory of ar-
rays. In Proceedings 16th Annual IEEE Symposium
on Logic in Computer Science, pages 29–37. IEEE.