Optimized Linear Imputation
Yehezkel S. Resheff¹,² and Daphna Weinshall¹
¹ School of Computer Science and Engineering, The Hebrew University, Jerusalem, Israel
² Edmond and Lily Safra Center for Brain Sciences, The Hebrew University, Jerusalem, Israel
{heziresheff, daphna}@cs.huji.ac.il
Keywords: Imputation.
Abstract: Often in real-world datasets, especially in high dimensional data, some feature values are missing. Since most data analysis and statistical methods do not gracefully handle missing values, the first step in the analysis requires the imputation of missing values. Indeed, there has been a long-standing interest in methods for the imputation of missing values as a pre-processing step. One recent and effective approach, the IRMI stepwise regression imputation method, fits a linear regression model for each real-valued feature on the basis of all other features in the dataset. However, the proposed iterative formulation lacks a convergence guarantee. Here we propose a closely related method, stated as a single optimization problem, together with a block coordinate-descent solution which is guaranteed to converge to a local minimum. Experiments on both synthetic and benchmark datasets show results comparable to those of the IRMI method whenever it converges. However, while IRMI often diverges in the set of experiments described here, the performance of our method is markedly superior in comparison with the other methods.
1 INTRODUCTION
Missing data imputation is an important part of data preprocessing and cleansing (Horton and Kleinman, 2007; Pigott, 2001), since the vast majority of commonly applied supervised machine learning and statistical methods for classification rely on complete data (García-Laencina et al., 2010). The most common option in many applications is to discard entire records in which any value is missing. This approach is insufficient for several reasons: first, when missing values are not missing at random (Little, 1988; Heitjan and Basu, 1996), discarding these records may bias the resulting analysis (Little and Rubin, 2014). Other limitations include the loss of information incurred by discarding the entire record. Furthermore, when dealing with datasets with either a small number of records or a large number of features, omitting entire records whenever any feature value is missing may result in insufficient data for the required analysis.
Early methods for data imputation include replacing a missing value by the mean or median of the feature across records (Engels and Diehr, 2003; Donders et al., 2006). While these values may indeed provide a "good guess" when no other information is present, this is often not the case. Namely, for each missing feature value there are other non-missing values in the same record. It is likely therefore (or indeed, we assume) that other features contain information regarding the missing feature, and imputation should therefore take into account the known feature values in the same record. This is done by subsequent methods.
Multiple imputation (see (Rubin, 1996) for a de-
tailed review) imputes several sets of missing values,
drawn from the posterior distribution of the missing
values under a given model, given the data. Subse-
quent processing is then to be performed on each ver-
sion of the imputed data, and the resulting multiple
sets of model parameters are combined to produce a
single result. While extremely useful in traditional
statistical analysis and public survey data, it may not
be feasible in a machine learning setting. First, the run-time cost of performing the analysis on several copies of the full data may be prohibitive. Second, being a model-based approach, it depends heavily on the type and nature of the data, and cannot be used as an out-of-the-box pre-processing step. More importantly
though, while traditional model parameters may be
combined between versions of the imputed data (re-
gression coefficients for instance), many modern ma-
chine learning methods do not produce a representa-
tion that is straightforward to combine (consider the
parameters of an Artificial Neural Network or a Random Forest, for example¹).
In (Raghunathan et al., 2001), a method for imputation on the basis of a sequence of regression models was introduced. This method, popularized under the acronym MICE (Buuren and Groothuis-Oudshoorn, 2011; Van Buuren and Oudshoorn, 1999), uses a non-empty set of complete features, known in all the records, as its base, and iteratively imputes one feature at a time on the basis of the features completed up to that point. Since each step produces a single complete feature, the number of iterations needed is exactly the number of features that have a missing value in at least one record. The drawbacks of this method are twofold. First, there must be at least one complete feature to be used as the base. More importantly, the values imputed at the $i$-th step can only use a regression model that includes the features which were originally complete or those imputed in the first $i-1$ steps. Ideally, the regression model for each feature should be able to use all other feature values.
The IRMI method (Templ et al., 2011) goes one step further by building a sequence of regression models in which each feature can use all other feature values as needed. This iterative method is initialized with a simple imputation method such as median imputation. In each iteration it computes for each feature a linear regression model based on all other feature values, and then re-imputes the missing values based on these regression models. The process is terminated upon convergence or after a pre-determined number of iterations (Algorithm 1). The authors state that although they do not have a proof of convergence, experiments show fast convergence in most cases.
In Section 2 we present a novel method of Optimized Linear Imputation (OLI). The OLI method is related in spirit to IRMI in that it performs a linear regression imputation for the missing values of each feature, on the basis of all other features. Our method is defined by a single optimization objective, which we then solve using a block coordinate-descent method. Thus our method is guaranteed to converge, which is its most important advantage over IRMI. We further show that our algorithm may be easily extended to use any form of regularized linear regression.

In Section 3 we compare the OLI method to the IRMI, MICE and Median Imputation (MI) methods.
¹ In this case it would perhaps be more natural to train the model using data pooled over the various copies of the completed data, rather than train separate models and average the resulting parameters and structure. This is indeed done artificially in methods such as denoising neural nets (Vincent et al., 2010), and has been known to be useful for data imputation (Duan et al., 2014).
Using the same simulation studies as in the original IRMI paper, we show that the results of OLI are rather similar to the results of IRMI. With real datasets we show that our method usually outperforms the alternatives MI and MICE in accuracy, while providing results comparable to IRMI. However, IRMI did not converge in many of these experiments, while our method always provided good results.
2 OLI METHOD
2.1 Notation
We start by listing the notation used throughout the
paper.
N: number of samples
d: number of features
$x_{i,j}$: the value of the $j$-th feature in the $i$-th sample
$m_{i,j}$: missing value indicators, with $m_{i,j} = 1$ if $x_{i,j}$ is missing and $0$ otherwise
$m_i$: indicator vector of missing values for the $i$-th feature

The following notation is used in the algorithms' pseudo-code:

A[m]: the rows of a matrix (or column vector) A where the boolean mask vector m is True
A[!m]: the rows of a matrix (or column vector) A where the boolean mask vector m is False
linear_regression(X, y): a linear regression from the columns of the matrix X to the target vector y, having the following fields:
    .parameters: parameters of the fitted model
    .predict(X): the target column y as predicted by the fitted model
2.2 Optimization Problem
We formulate the linear imputation as a single opti-
mization problem. First we construct a design matrix:
$$X = \begin{bmatrix} & 1 \\ \big[\, x_{i,j}(1 - m_{i,j}) \,\big] & \vdots \\ & 1 \end{bmatrix} \qquad (1)$$
Algorithm 1: The IRMI method for imputation of real-valued features (see (Templ et al., 2011) for more details).

input:
    X - data matrix of size N × (d + 1) containing N samples and d features
    m - missing data mask
    max_iter - maximal number of iterations
output:
    Imputation values

1: X̃ := median_impute(X)   {assigns each missing value the median of its column}
2: while not converged and under max_iter iterations do
3:     for i := 1...d do
4:         regression := linear_regression(X̃_{-i}[!m_i], X̃_i[!m_i])
5:         X̃_i[m_i] := regression.predict(X̃_{-i}[m_i])
6:     end for
7: end while
8: return X̃ - X
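For concreteness, the iteration of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under our own naming, not the reference implementation of (Templ et al., 2011); it assumes numpy arrays and scikit-learn's LinearRegression, uses a fixed iteration budget in place of a convergence test, and returns the completed matrix rather than the difference X̃ - X.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def irmi_impute(X, m, max_iter=50):
    """Sketch of the IRMI iteration (Algorithm 1): X is N x d (values at missing
    positions are ignored), m is a boolean N x d mask with True = missing."""
    Xt = X.astype(float)
    for j in range(Xt.shape[1]):                 # step 1: median initialization
        Xt[m[:, j], j] = np.median(Xt[~m[:, j], j])
    for _ in range(max_iter):                    # fixed budget instead of a convergence test
        for j in range(Xt.shape[1]):
            if not m[:, j].any():
                continue
            others = np.delete(np.arange(Xt.shape[1]), j)
            # regress feature j on all other features, using rows where j is known
            reg = LinearRegression().fit(Xt[~m[:, j]][:, others], Xt[~m[:, j], j])
            # re-impute the missing entries of feature j
            Xt[m[:, j], j] = reg.predict(Xt[m[:, j]][:, others])
    return Xt
```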
where, in (1), the constant-1 rightmost column is used for the intercept terms in the subsequent regression models. Multiplying the data values $x_{i,j}$ by $(1 - m_{i,j})$ simply sets all missing values to zero, keeping non-missing values as they are.
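As a small illustration (a sketch with our own naming), the design matrix of equation (1) and the mask notation of Section 2.1 map directly onto numpy operations:

```python
import numpy as np

def design_matrix(X_raw, m):
    """Build the N x (d+1) design matrix of equation (1): missing entries are
    zeroed out and a constant-1 intercept column is appended."""
    X = np.where(m, 0.0, X_raw)                  # x_{i,j} * (1 - m_{i,j})
    return np.hstack([X, np.ones((X.shape[0], 1))])

# The pseudo-code notation of Section 2.1 is plain boolean row selection:
# A[m_i]  ->  A[m[:, i]]      (rows where feature i is missing)
# A[!m_i] ->  A[~m[:, i]]     (rows where feature i is observed)
```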
Our approach aims to find consistent missing
value imputations and regression coefficients as a sin-
gle optimization problem. By consistent we mean that
(a) the imputations are the values obtained by the re-
gression formulas, and (b) the regression coefficients
are the values that would be computed after the impu-
tations. We propose the following optimization for-
mulation:
$$
\begin{aligned}
\min_{A,M}\;\; & \|(X + M)A - (X + M)\|_F^2 \\
\text{s.t.}\;\; & m_{i,j} = 0 \;\Rightarrow\; M_{i,j} = 0 \\
& M_{i,d+1} = 0 \quad \forall i \\
& A_{i,i} = 0 \quad \forall i = 1 \ldots d \\
& A_{i,d+1} = \delta_{i,d+1} \quad \forall i
\end{aligned}
\qquad (2)
$$

where $\|\cdot\|_F$ is the Frobenius norm.
Intuitively, the objective that we minimize measures the square error of reconstruction of the imputed data (X + M), where each feature (column) is approximated by a linear combination of all other features plus a constant (that is, linear regression of the remaining imputed data). The imputation process by which M is defined is guaranteed to leave the non-missing values in X intact, by the first and second constraints, which make sure that only missing entries in X have a corresponding non-zero value in M. Therefore:

$$(X + M) = \begin{cases} M & \text{for missing values} \\ X & \text{for non-missing values} \end{cases}$$
The regression for each feature is further constrained to use only the other features, by setting the diagonal values of A to zero (the third constraint). The fourth constraint makes sure that the constant-1 rightmost column of the design matrix is copied as-is and therefore does not impact the objective.
We note that all the constraints set variables to
constant values, and therefore this can be seen as an
unconstrained optimization problem on the remaining
set of variables. This set includes the non-diagonal el-
ements of A and the elements of M corresponding to
missing values in X. We further note that this is not a convex problem in (A, M), since the objective contains the product MA. In the next section we show a solution to this problem that is guaranteed to converge to a local minimum.
2.3 Block Coordinate Descent Solution
We now develop a coordinate descent solution for the proposed optimization problem. Coordinate descent (and more specifically alternating least squares; see for example (Hope and Shahaf, 2016)) algorithms are extremely common in machine learning and statistics, and while they do not guarantee convergence to a global optimum, they often perform well in practice.
As stated above, our problem is an unconstrained
optimization problem over the following set of vari-
ables:
$$\{A_{i,j} \mid i, j = 1, \ldots, d;\; i \neq j\} \;\cup\; \{M_{i,j} \mid m_{i,j} = 1\}$$
Keeping this in mind, we use the following objective
function:
$$
\begin{aligned}
L(A, M) &= \|(X + M)A - (X + M)\|_F^2 &(3)\\
&= \sum_{i=1}^{d} \|(X + M)_{-i}\,\beta_i - (X + M)_i\|^2 &(4)
\end{aligned}
$$
Algorithm 2: Optimized Linear Imputation (OLI).

input:
    X_0 - data matrix of size N × d containing N samples and d features
    m - missing data mask
output:
    Imputation values

1: X := median_impute(X_0)
2: M := zeros(N, d)
3: A := zeros(d, d)
4: while not converged do
5:     for i := 1...d do
6:         β := linear_regression(X_{-i}, X_i).parameters
7:         A_i := [β_1, ..., β_{i-1}, 0, β_i, ..., β_{d-1}]^T
8:     end for
9:     while not converged do
10:        M := M - α[(X + M)A - (X + M)](A - I)^T
11:        M[!m] := 0
12:    end while
13:    X := X + M
14: end while
15: return M
where $C_{-i}$ denotes the matrix C without its $i$-th column, $C_i$ the $i$-th column, and $\beta_i$ the $i$-th column of A without the $i$-th element (recall that the $i$-th element of the $i$-th column of A is always zero). The term $(X + M)_{-i}\beta_i$ is therefore a linear combination of all but the $i$-th column of the matrix (X + M). The sum in (4) is over the first d columns only, since the term added by the rightmost column is zero (see the fourth constraint in (2)).
We now suggest the following coordinate descent
algorithm for the minimization of the objective (3)
(the method is summarized in Algorithm 2):
1. Fill in missing values using median/mean (or any
other) imputation
2. Repeat until convergence:
(a) Minimize the objective (3) w.r.t. A (compute
the columns of the matrix A)
(b) Minimize the objective (3) w.r.t. M (compute the missing-value entries of the matrix M)
3. Return M²
As we will show shortly, step (a) in the iterative part of the proposed algorithm reduces to calculating the linear regression for each feature on the basis of all other features, essentially the same as the first step in the IRMI algorithm (Templ et al., 2011), see Algorithm 1.
² Alternatively, in order to stay close in spirit to the linear IRMI method, we may prefer to use (X + M)A as the imputed data.
Step (b) can be solved either as a system of linear
equations or in itself as an iterative procedure, by gra-
dient descent on (3) w.r.t M using (5).
First, we show that step (a) reduces to linear regression. Taking the derivatives of (4) w.r.t. the non-diagonal elements of column i of A we have:

$$\frac{\partial L}{\partial \beta_i} = 2(X + M)_{-i}^T\left[(X + M)_{-i}\beta_i - (X + M)_i\right]$$

Setting the partial derivatives to zero gives:

$$(X + M)_{-i}^T\left[(X + M)_{-i}\beta_i - (X + M)_i\right] = 0$$
$$\Rightarrow\quad \beta_i = \left((X + M)_{-i}^T(X + M)_{-i}\right)^{-1}(X + M)_{-i}^T(X + M)_i$$

which is exactly the vector of linear regression coefficients for the $i$-th feature from all other (imputed) features, as claimed.
Next, we obtain the derivative of the objective function w.r.t. M:

$$\nabla_M = \frac{\partial L}{\partial M} = 2\left[(X + M)A - (X + M)\right](A - I)^T \qquad (5)$$

leading to the following gradient descent algorithm for step (b). Repeat until convergence:

(i) $M := M - \alpha \nabla_M L(A, M)$
(ii) $\forall i, j:\; M_{i,j} = M_{i,j} \cdot m_{i,j}$

where $\alpha$ is a predefined step size and the gradient is given by (5). Step (ii) makes sure that only missing values are assigned imputation values³.
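Putting the two steps together, a compact Python sketch of the OLI procedure might look as follows. This is an illustrative implementation under our own naming, not the authors' code: it folds the median initialization into M (equivalently to line 1 of Algorithm 2), omits the intercept column for brevity, and uses fixed iteration counts and a fixed step size alpha in place of convergence tests.

```python
import numpy as np

def oli_impute(X0, m, alpha=0.01, n_outer=50, n_inner=200):
    """Sketch of the OLI method (problem (2) / Algorithm 2).
    X0 is an N x d data matrix, m a boolean mask (True = missing).
    Returns M, the imputation values (non-zero only at missing entries)."""
    N, d = X0.shape
    X = np.where(m, 0.0, X0)          # design matrix: missing entries zeroed out
    M = np.zeros((N, d))
    for j in range(d):                # median initialization of the missing entries
        M[m[:, j], j] = np.median(X0[~m[:, j], j])
    A = np.zeros((d, d))
    I = np.eye(d)
    for _ in range(n_outer):
        Z = X + M                     # current imputed data
        # Step (a): regress each feature on all others; the diagonal of A stays 0
        for i in range(d):
            others = np.delete(np.arange(d), i)
            beta, *_ = np.linalg.lstsq(Z[:, others], Z[:, i], rcond=None)
            A[others, i] = beta
        # Step (b): gradient descent on M, restricted to the missing entries
        for _ in range(n_inner):
            grad = 2 * ((X + M) @ A - (X + M)) @ (A - I).T
            M -= alpha * grad
            M[~m] = 0.0               # only missing entries are free variables
    return M
```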
Our proposed algorithm uses a gradient descent procedure for the minimization of the objective (3) w.r.t. M. Alternatively, one could use a closed-form solution by directly setting the partial derivative to zero. More specifically, let

$$\frac{\partial L}{\partial M} = 0 \qquad (6)$$

Substituting (5) into (6), we get

$$M(A - I)(A - I)^T = -X(A - I)(A - I)^T$$

which we rewrite as:

$$MP = Q \qquad (7)$$

with the appropriate matrices P, Q. Now, since only elements of M corresponding to missing values of X are optimization variables, only these elements must be set to zero in the derivative (6), and hence only these elements must obey the equality (7). Thus, we have:

$$(MP)_{i,j} = Q_{i,j} \qquad \forall\, i, j \;\text{such that}\; m_{i,j} = 1$$

which is a system of $\sum_{i,j} m_{i,j}$ linear equations in $\sum_{i,j} m_{i,j}$ variables.
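Since each row of M multiplies P on the left independently of the other rows, the system (7) decouples row by row: for a sample i with missing index set $S_i$, the sub-vector $M_{i,S_i}$ satisfies $M_{i,S_i} P[S_i, S_i] = Q_{i,S_i}$. The following numpy sketch of this closed-form step is our own decomposition of (7), and assumes each sub-matrix $P[S_i, S_i]$ is invertible:

```python
import numpy as np

def closed_form_M(X, A, m):
    """Solve step (b) in closed form: the restriction of MP = Q (eq. 7)
    to the missing entries. X has zeros at missing positions, m is the mask."""
    d = X.shape[1]
    B = A - np.eye(d)
    P = B @ B.T
    Q = -X @ P                        # Q = -X(A - I)(A - I)^T
    M = np.zeros_like(X, dtype=float)
    for i in range(X.shape[0]):
        S = np.flatnonzero(m[i])      # indices of missing features in sample i
        if S.size == 0:
            continue
        # Row-wise system: M[i, S] @ P[S, S] = Q[i, S]
        M[i, S] = np.linalg.solve(P[np.ix_(S, S)].T, Q[i, S])
    return M
```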
2.4 Discussion
In order to better understand the difference between the IRMI and OLI methods, we rewrite the IRMI iterative method (Templ et al., 2011) using the same notation as used for our method. We start by defining an error matrix:

$$E = (X + M)A - (X + M)$$
E is the error matrix of the linear regression models
on the basis of the imputed data. Unlike our method,
however, IRMI considers the error only in the non-
missing values of the data, leading to the following
objective function:
$$L(M, A) = \sum_{i,j \,\mid\, m_{i,j} = 0} E_{i,j}^2$$

In order to minimize this loss function, at each step the IRMI method (Algorithm 1) optimizes over a single column of A (which in effect reduces to fitting a single linear regression model), and then assigns to the missing values in the corresponding column of M the values predicted for them by the regression model. While this heuristic for choosing M is quite effective, it is not a gradient descent step and it therefore leads to a process with unknown convergence properties. The main motivation for proposing our method was to fix this shortcoming within the same general framework, and to propose a method that is similar in spirit but comes with a convergence guarantee.

³ Note that this is not a projection step. Recall that the optimization problem is only over the elements $M_{i,j}$ where $x_{i,j}$ is a missing value, encoded by $m_{i,j} = 1$. The element-wise multiplication of M by m guarantees that all other elements of M are assigned 0. Effectively, the gradient descent procedure does not treat them as independent variables, as required.
Another advantage of the proposed formulation is the ability to easily extend it to any regularized linear regression. This can be done by re-writing the itemized form of the objective (4) as follows:

$$L(A, M) = \sum_i \left[\, \|(X + M)_{-i}\beta_i - (X + M)_i\|^2 + \Omega(\beta_i) \,\right]$$

where $\Omega(\beta_i)$ is the regularization term.
Now, assuming that the resulting regression prob-
lem can be solved (that is, minimizing each of the
summands in the new objective with a constant M),
and since step (b) of our method remains exactly the
same (the derivative w.r.t M does not change as the ex-
tra term does not depend on M), we can use the same
method to solve this problem as well.
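As a concrete illustration (a sketch under our own assumptions, not part of the original method), using an L2 (ridge) penalty only changes the per-column fit of step (a); step (b) is untouched since the penalty does not depend on M:

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_A_ridge(Z, alpha=1.0):
    """Step (a) with a ridge penalty on each beta_i; the diagonal of A
    stays zero and step (b) of OLI is unchanged."""
    d = Z.shape[1]
    A = np.zeros((d, d))
    for i in range(d):
        others = np.delete(np.arange(d), i)
        reg = Ridge(alpha=alpha, fit_intercept=False).fit(Z[:, others], Z[:, i])
        A[others, i] = reg.coef_
    return A
```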
Another possible extension is to use kernelized linear regression. This may be useful in cases where the dependencies between the features are not linear. Here too we can use the same type of optimization method, but we defer working out the details of the derivative w.r.t. M (which will obviously not remain the same) to future research.
The method of initialization is another issue deserving further investigation. Since our procedure converges to a local minimum of the objective, it may be advantageous to start the procedure from several random initial points and choose the best result. However, since the direct targets (the missing values) are obviously unknown, we would need an alternative measure of the "goodness" of a result. Since the missing values are assumed to be missing at random, it would make sense to use the distance between the distributions of known and imputed values (per feature) as a measure of the appropriateness of an imputation.
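One possible instantiation of such a measure (our suggestion, not specified further in the paper) is the two-sample Kolmogorov-Smirnov statistic between known and imputed values, averaged over features; a smaller value indicates better distributional agreement:

```python
import numpy as np
from scipy.stats import ks_2samp

def imputation_goodness(X0, M, m):
    """Mean KS distance between known and imputed value distributions, per feature.
    Could be used to pick the best of several random restarts of OLI."""
    scores = []
    for j in range(X0.shape[1]):
        if m[:, j].any():
            stat, _ = ks_2samp(X0[~m[:, j], j], M[m[:, j], j])
            scores.append(stat)
    return float(np.mean(scores)) if scores else 0.0
```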
3 EXPERIMENTS
In order to evaluate our method, we compared its per-
formance to other imputation methods using various
types of data. We used complete datasets (real or syn-
thetic), and randomly eliminated entries in order to
simulate the missing data case. To evaluate the suc-
cess of each imputation method, we used the mean
square error (MSE) of the imputed values as a mea-
sure of error. MSE is computed as the mean square
distance between stored values (the correct values for
the simulated missing values) and the imputed ones.
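In code, the evaluation protocol amounts to the following sketch (function and variable names are ours; `impute` stands for any of the compared methods, taking the incomplete data and the mask and returning a completed matrix):

```python
import numpy as np

def evaluate(X_full, impute, frac_missing=0.05, seed=0):
    """Hide a random 5% of entries, impute them, and report the MSE
    between the stored (true) values and the imputed ones."""
    rng = np.random.default_rng(seed)
    m = rng.random(X_full.shape) < frac_missing          # simulated missing mask
    X_obs = np.where(m, np.nan, X_full)                  # observed data with holes
    X_hat = impute(X_obs, m)                             # completed data matrix
    return np.mean((X_hat[m] - X_full[m]) ** 2)
```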
In Section 3.1 we repeat the experimental evalua-
tion from (Templ et al., 2011) using synthetic data, in
order to compare the results of our method to the re-
sults of IRMI. In Section 3.2 we compare our method
to 3 other methods - IRMI, MI and MICE - using
standard benchmark datasets from the UCI repository
(Lichman, 2013). In Section 3.3 we augment the comparisons with an additional new real-life dataset of stork migration data.
For some real datasets in the experiments described below we report that the IRMI method did not converge (and therefore did not return any result). This decision was reached when the MSE of the IRMI method rose by at least 6 orders of magnitude over the allotted 50 iterations, or (when tested with unlimited iterations) when it rose above the maximum valid floating-point number in the system, approximately 1e+308.
3.1 Synthetic Data
The following simulation studies follow (Templ et al.,
2011) and compare OLI to IRMI. All simulations are
repeated 20 times with 10,000 samples. 5% of all val-
ues across records are selected at random and marked
as missing. Values are stored for comparison with im-
puted values. Simulation data is multivariate normal
with mean of 1 in all dimensions. Unless stated oth-
erwise, the covariance matrix has 1 in its diagonal en-
tries and 0.7 in the off-diagonal entries.
The aim of the first experiment is to test the re-
lationship between the actual values imputed by the
IRMI and OLI methods. The simulation is based on
multivariate normal data with 5 dimensions. Results
show that the values imputed by the two methods are
highly correlated (Fig. 1a). Furthermore, the signed error (original - imputed) is also highly correlated (Fig. 1b). Together, these findings point to the similarity of the results these two methods produce.
In the next simulation we test the performance of the two methods as we vary the number of features. The simulation is based on multivariate normal data with 3-20 dimensions. The results (Fig. 2b) show almost identical behavior of the IRMI and OLI algorithms, which also coincides with the results presented for IRMI in (Templ et al., 2011). Median imputation (MI) is also shown for comparison as a baseline. Fig. 3 shows a zoom into a small segment of Fig. 2.
As expected, imputing the median (which is also the mean) of each feature for all missing values results in an MSE equal to the standard deviation of the features (i.e., 1). While very close, the IRMI and OLI methods do not return the exact same imputation values and errors, with an average absolute deviation of 0.053.
Next we test the performance of the two methods as we vary the covariance between the features. The simulation is based on multivariate normal data with 5 dimensions. Non-diagonal elements of the covariance matrix are set to values in the range 0.1-0.9. The results (Fig. 2a) again show almost identical behavior of the IRMI and OLI algorithms. As expected, when the dependency between the feature columns is increased, as measured by the covariance between the columns (X-axis in Fig. 2a), the performance of the regression-based methods IRMI and OLI improves monotonically, while the performance of the MI method remains unchanged.
3.2 UCI Datasets
The UCI machine learning repository (Lichman,
2013) contains several popular benchmark datasets,
some of which have been previously used to compare
methods of data imputation (Schmitt et al., 2015).
In the current experiment we used the following
datasets: Iris (Fisher, 1936), Wine (white) (Cortez et al., 2009), Ecoli (Horton and Nakai, 1996), Boston Housing (Harrison and Rubinfeld, 1978), and Power (Tüfekci, 2014). Each feature of each dataset was
normalized to have mean 0 and standard deviation
of 1, in order to make error values comparable be-
tween datasets. Categorical features were dropped.
For each dataset, 5% of the values were chosen at
random and replaced with a missing value indica-
tor. The procedure was repeated 10 times. For these
datasets we also consider the MICE method (Buuren
and Groothuis-Oudshoorn, 2011) using the winMice
(Jacobusse, 2005) software.
The results are quite good, demonstrating the superior ability of the linear methods to impute missing data in these datasets (Table 1, rows 1-5). In the Iris dataset our OLI method achieved an average error identical to IRMI, which successfully converged in only 9 out of the 10 runs. Both outperformed the standard MI and MICE methods. In the Ecoli dataset both the IRMI and OLI methods performed worse than the alternative methods, with MICE achieving the lowest MSE. In the Wine dataset IRMI failed to converge in all 10 repetitions, while the OLI method outperformed the MI and MICE methods. The IRMI method outperformed all other methods on the Housing dataset, but failed to converge in 7 out of 10 runs on the Power dataset.

Figure 1: (a) Correlation between predicted values for missing data using the IRMI and OLI methods. (b) Correlation between the signed error of the prediction for the two methods.

Figure 2: (a) MSE of the IRMI, OLI and MI methods as a function of the covariance. Data is 5-dimensional multivariate normal. (b) MSE of the IRMI, OLI and MI methods as a function of the dimensionality, with a constant covariance of 0.7 between pairs of features. In both cases error bars represent standard deviation over 20 repetitions.
In summary, in cases where the linear methods were appropriate, with sufficient correlation between the different features (shown in the correlation column of Table 1), the proposed OLI method was com-
parable to the IRMI method with regard to mean
square error of the imputed values when the latter
converged, and superior in that it always converges
and therefore always returns a result. While the IRMI
method achieved slightly better results than OLI in
some cases, its failure to converge in others gives the
OLI method the edge. Overall, better results were
achieved for datasets with high mean correlation be-
tween features, as expected when using methods uti-
lizing the linear relationships between features.
Table 1: Comparison of the imputation results of the IRMI, OLI, MICE and MI methods with 5% missing data. The converged column indicates the number of runs in which the IRMI method converged during testing; the MSE of IRMI was calculated for converged repetitions only.

Dataset   # Features   correlation   IRMI converged   IRMI MSE   OLI    MI     MICE
Iris      4            0.59          9/10             0.20       0.20   1.00   0.33
Ecoli     7            0.18          9/10             8.26       5.75   1.72   1.20
Wine      11           0.18          0/10             -          0.87   1.05   1.10
Housing   11           0.45          10/10            0.28       0.30   1.14   0.56
Power     4            0.45          3/10             0.44       0.47   1.02   0.88
Storks    20           0.24          0/10             -          0.31   1.07   0.42

Figure 3: Zoom into a small part of Figure 2.
3.3 Storks Behavioral Modes Dataset
In the field of Movement Ecology, readings from
accelerometers placed on migrating birds are used
for both supervised (Resheff et al., 2014) and unsupervised (Resheff et al., 2015; Resheff et al., 2016) learning of behavioral modes. In the following experiment we used a dataset of features extracted from
3815 such measurements. As with the UCI datasets,
10 repetitions were performed, each with 5% of the
values randomly selected and marked as missing. Re-
sults (Table 1, final row) of this experiment highlight
the relative advantage of the OLI method. While
the IRMI method failed to converge in all 10 repe-
titions, OLI achieved an average MSE considerably
lower than the MI baseline, and also outperformed the
MICE method.
4 CONCLUSION
Since the problem of missing values often haunts real-world datasets, while most data analysis methods are not designed to deal with this problem, imputation is a necessary pre-processing step whenever discarding entire records is not a viable option. Here
we proposed an optimization-based linear imputation
method that augments the IRMI (Templ et al., 2011)
method with the property of guaranteed convergence,
while staying close in spirit to the original method.
Since our method converges to a local optimum of a
different objective function, the two methods should
not be expected to converge to the same value ex-
actly. However, simulation results show that the re-
sults of the proposed method are generally similar
(nearly identical) to IRMI when the latter does indeed
converge.
The contribution of our paper is twofold. First, we suggest a method for linear imputation based on a single optimization problem, together with an algorithm that is guaranteed to converge. Second, we show how this method can be extended to use any method of regularized linear regression. Unlike matrix completion
methods (Wagner and Zuk, 2015), we do not make a low-rank assumption. Thus, OLI should be pre-
ferred when data is expected to have some linear re-
lationships between features and when IRMI fails to
converge, or alternatively, when a guarantee of con-
vergence is important (for instance in automated pro-
cesses). We leave to future research the kernel exten-
sion of the OLI method.
REFERENCES
Buuren, S. and Groothuis-Oudshoorn, K. (2011). mice:
Multivariate imputation by chained equations in r.
Journal of statistical software, 45(3).
Cortez, P., Cerdeira, A., Almeida, F., Matos, T., and Reis,
J. (2009). Modeling wine preferences by data mining
from physicochemical properties. Decision Support
Systems, 47(4):547–553.
Donders, A. R. T., van der Heijden, G. J., Stijnen, T., and
Moons, K. G. (2006). Review: a gentle introduction
to imputation of missing values. Journal of clinical
epidemiology, 59(10):1087–1091.
Duan, Y., Yisheng, L., Kang, W., and Zhao, Y. (2014). A
deep learning based approach for traffic data impu-
tation. In Intelligent Transportation Systems (ITSC),
2014 IEEE 17th International Conference on, pages
912–917. IEEE.
Engels, J. M. and Diehr, P. (2003). Imputation of missing
longitudinal data: a comparison of methods. Journal
of clinical epidemiology, 56(10):968–976.
Fisher, R. A. (1936). The use of multiple measurements in
taxonomic problems. Annals of eugenics, 7(2):179–
188.
García-Laencina, P. J., Sancho-Gómez, J.-L., and Figueiras-Vidal, A. R. (2010). Pattern classification with missing data: a review. Neural Computing and Applications, 19(2):263–282.
Harrison, D. and Rubinfeld, D. L. (1978). Hedonic housing
prices and the demand for clean air. Journal of envi-
ronmental economics and management, 5(1):81–102.
Heitjan, D. F. and Basu, S. (1996). Distinguishing missing
at random and missing completely at random. The
American Statistician, 50(3):207–213.
Hope, T. and Shahaf, D. (2016). Ballpark learning: Estimat-
ing labels from rough group comparisons. Joint Eu-
ropean Conference on Machine Learning and Knowl-
edge Discovery in Databases, pages 299–314.
Horton, N. J. and Kleinman, K. P. (2007). Much ado about
nothing. The American Statistician, 61(1).
Horton, P. and Nakai, K. (1996). A probabilistic classifi-
cation system for predicting the cellular localization
sites of proteins. In Ismb, volume 4, pages 109–115.
Jacobusse, G. (2005). Winmice users manual. TNO Quality of Life, Leiden. URL http://www.multiple-imputation.com.
Lichman, M. (2013). UCI machine learning repository.
Little, R. J. (1988). A test of missing completely at random
for multivariate data with missing values. Journal of
the American Statistical Association, 83(404):1198–
1202.
Little, R. J. and Rubin, D. B. (2014). Statistical analysis
with missing data. John Wiley & Sons.
Pigott, T. D. (2001). A review of methods for missing data.
Educational research and evaluation, 7(4):353–383.
Raghunathan, T. E., Lepkowski, J. M., Van Hoewyk, J., and
Solenberger, P. (2001). A multivariate technique for
multiply imputing missing values using a sequence of
regression models. Survey methodology, 27(1):85–96.
Resheff, Y. S., Rotics, S., Harel, R., Spiegel, O., and
Nathan, R. (2014). Accelerater: a web application for
supervised learning of behavioral modes from accel-
eration measurements. Movement ecology, 2(1):25.
Resheff, Y. S., Rotics, S., Nathan, R., and Weinshall, D.
(2015). Matrix factorization approach to behavioral
mode analysis from acceleration data. In Data Science
and Advanced Analytics (DSAA), 2015. 36678 2015.
IEEE International Conference on, pages 1–6. IEEE.
Resheff, Y. S., Rotics, S., Nathan, R., and Weinshall, D.
(2016). Topic modeling of behavioral modes using
sensor data. International Journal of Data Science
and Analytics, 1(1):51–60.
Rubin, D. B. (1996). Multiple imputation after 18+
years. Journal of the American statistical Association,
91(434):473–489.
Schmitt, P., Mandel, J., and Guedj, M. (2015). A compari-
son of six methods for missing data imputation. Jour-
nal of Biometrics & Biostatistics, 2015.
Templ, M., Kowarik, A., and Filzmoser, P. (2011). Iterative
stepwise regression imputation using standard and ro-
bust methods. Computational Statistics & Data Anal-
ysis, 55(10):2793–2806.
Tüfekci, P. (2014). Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods. International Journal of Electrical Power & Energy Systems, 60:126–140.
Van Buuren, S. and Oudshoorn, K. (1999). Flexible multi-
variate imputation by mice. Leiden, The Netherlands:
TNO Prevention Center.
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and
Manzagol, P.-A. (2010). Stacked denoising autoen-
coders: Learning useful representations in a deep net-
work with a local denoising criterion. The Journal of
Machine Learning Research, 11:3371–3408.
Wagner, A. and Zuk, O. (2015). Low-rank matrix recov-
ery from row-and-column affine measurements. arXiv
preprint arXiv:1505.06292.