Linear Discriminant Analysis based on Fast Approximate SVD
Nassara Elhadji Ille Gado, Edith Grall-Maës and Malika Kharouf
University of Champagne / University of Technology of Troyes,
Charles Delaunay Institute(ICD) UMR 6281 UTT-CNRS / LM2S, Troyes, France
{nassara.elhadji_ille_gado, edith.grall, malika.kharouf}@utt.fr
Keywords:
LDA, Fast SVD, Dimension Reduction, Large Scale Data.
Abstract:
We present an approach for performing linear discriminant analysis (LDA) in the contemporary challenging context of high dimensionality. The projection matrix of LDA is usually obtained by simultaneously maximizing the between-class covariance and minimizing the within-class covariance. However, this involves a matrix eigendecomposition which is computationally expensive in both time and memory when the number of samples and the number of features are large. To deal with this complexity, we propose to use a recent dimension reduction method. The technique is based on a fast approximate singular value decomposition (SVD), which has deep connections with low-rank approximation of the data matrix. The proposed approach, appSVD+LDA, consists of two stages. The first stage leads to a set of artificial features based on the original data. The second stage is the classical LDA. The foundation of our approach is presented, and its performance in terms of accuracy and computation time is compared with some state-of-the-art techniques on different real data sets.
1 INTRODUCTION
Linear Discriminant Analysis (LDA) is a well-known supervised technique for feature extraction (Friedman, 1989), (Duda et al., 2012), (Welling, 2005). It has been widely used in many applications such as face recognition (Chen et al., 2005), handwritten code classification (Hastie et al., 2001) and text classification (Moulin et al., 2014). The traditional LDA seeks a projection matrix so that data points in different classes are far from each other while those in the same class are close to each other, thus achieving maximum discrimination. To find such an optimal projection matrix, LDA involves an eigendecomposition of the scatter matrices. For face recognition and document classification, for example, the intrinsic structure of the samples can make a scatter matrix singular since the data sets come from a very high-dimensional space. In a high-dimensional context, the singularity problem and the eigendecomposition complexity of the scatter matrices make LDA infeasible.
Many approaches have been proposed to improve LDA in high dimension (Yu and Yang, 2001), (Ye and Li, 2004) and (Ye et al., 2005). A common way to deal with the curse of dimensionality is to determine an intermediate subspace where the optimization problems can be solved efficiently with much smaller matrices. Dimension reduction strategies consist in eliminating irrelevant information. The most popular techniques proposed for dimension reduction with large scale data sets are principal component analysis (PCA) (Lee et al., 2012) and random projection (RP) (Achlioptas, 2003), (Cardoso and Wichert, 2012).
In this paper, we use a dimension reduction strategy based on fast approximate singular value decomposition (SVD) (Menon and Elkan, 2011). This technique was also used in (Boutsidis et al., 2015). The principle is to map a $d$-dimensional feature space onto its best rank-$k$ approximation for some $k \ll d$. After dimension reduction, it becomes practically easy to handle the data in the new reduced feature space. The proposed appSVD+LDA approach deals with a multi-class supervised classification problem. It consists in performing the traditional LDA in a new artificial subspace constructed by fast approximate SVD.
The remainder of this paper is organized as follows: in section 2, we give a brief description of LDA and of fast approximate SVD. In section 3, we describe the proposed approach appSVD+LDA. In section 4, numerical results supporting the performance of the proposed approach compared to some state-of-the-art methods are presented. Finally, in section 5, we conclude the paper.
2 A BRIEF REVIEW OF LDA AND FAST APPROXIMATE-SVD
2.1 Classical Linear Discriminant Analysis (LDA)
In this section, we give a brief overview of LDA. Consider the following supervised multi-class classification problem: we have a set of $N$ labelled data points belonging to $K$ classes $\{C_1, C_2, \ldots, C_K\}$ with class sizes $\{N_1, N_2, \ldots, N_K\}$, where $N_1 + N_2 + \ldots + N_K = N$. Let $X = \{x_1, x_2, \ldots, x_N\}$, where $x_i \in \mathbb{R}^{1 \times d}$ is the $i$-th observed sample, and let $\{y_i\}_{i=1,\ldots,N}$, $y_i \in \{1, \ldots, K\}$, be the given class membership of $x_i$. The goal is to build a classifier based on the training set $X \in \mathbb{R}^{N \times d}$ to predict the class labels of a new unlabelled set $X^u = \{x^u_1, x^u_2, \ldots, x^u_{N_u}\}$.
The LDA objective is to seek a projection matrix $W$ such that the data points in the new space which belong to the same class are very close while data points in different classes are far from each other (Welling, 2005). $W$ maximizes the following ratio:

$$J(W) = \arg\max_W \frac{\det(W^T S_b W)}{\det(W^T S_w W)}. \tag{1}$$
$S_b$ is the between-class scatter matrix and $S_w$ is the within-class scatter matrix, defined by

$$S_b = \sum_{i=1}^{K} N_i (m_i - m)^T (m_i - m), \qquad S_w = \sum_{i=1}^{K} \sum_{x_j \in C_i} (x_j - m_i)^T (x_j - m_i), \tag{2}$$

where $m = \frac{1}{N}\sum_{i=1}^{N} x_i$ is the total sample mean vector and $m_i$ is the mean vector of the $i$-th class.
The optimal discriminative projection matrix $W$ can be obtained by computing the eigenvectors of the matrix $S_w^{-1} S_b$ (Chen et al., 2005). Since the rank of $S_b$ is bounded by $K - 1$, there are at most $K - 1$ eigenvectors corresponding to non-zero eigenvalues. The time complexity and the memory requirement increase with $N$ and $d$. Hence, when $N$ and $d$ are large (or when $d$ alone is large), it is difficult to perform LDA.
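For concreteness, the following is a minimal NumPy sketch of this classical procedure. The paper's experiments use Matlab; the Python code, the function name classical_lda and the assumption that $S_w$ is nonsingular are illustrative only.

```python
import numpy as np

def classical_lda(X, y):
    """Sketch of classical LDA: eigenvectors of S_w^{-1} S_b (assumes S_w nonsingular)."""
    classes = np.unique(y)
    d = X.shape[1]
    m = X.mean(axis=0)                              # total sample mean vector
    S_w, S_b = np.zeros((d, d)), np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)                        # class mean vector
        S_w += (Xc - mc).T @ (Xc - mc)              # within-class scatter, Eq. (2)
        S_b += len(Xc) * np.outer(mc - m, mc - m)   # between-class scatter, Eq. (2)
    # Eigendecomposition of S_w^{-1} S_b; at most K-1 eigenvalues are non-zero.
    evals, evecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    order = np.argsort(-evals.real)
    W = evecs[:, order[:len(classes) - 1]].real     # projection matrix W (d x (K-1))
    return W
```

When $S_w$ is singular, as in the high-dimensional settings discussed above, the linear solve fails; this is precisely the difficulty that motivates the proposed approach.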
2.2 Fast Approximate-SVD
Low-rank approximation, or approximate SVD, is a minimization problem in which the cost function measures the fit between a given matrix (the data) and an approximating matrix (the optimization variable), subject to the constraint that the approximating matrix has a reduced rank. The problem aims to find a low-rank matrix $X_k$ which approximates the matrix $X$, i.e. $\min_{X_k} \| X - X_k \|_F$ s.t. $\operatorname{rank}(X_k) = k$, where $\| \cdot \|_F$ denotes the Frobenius norm.
Approximate SVD can be seen as finding a rank-$k$ approximation, i.e. forcing the original matrix to provide a shrunken description of itself. The problem is used for mathematical modeling and data compression. Let $X \in \mathbb{R}^{N \times d}$ be the data matrix, and let the SVD of $X$ be of the form

$$X = U \Sigma V^T, \tag{3}$$

where $U \in \mathbb{R}^{N \times N}$, $V \in \mathbb{R}^{d \times d}$ and $\Sigma \in \mathbb{R}^{N \times d}$. The matrices $U$ and $V$ are orthogonal. $\Sigma$ is a semi-diagonal matrix with non-negative real entries $\sigma_1 \geq \ldots \geq \sigma_s > 0$ (the singular values), where $s \leq \min\{N, d\}$. Given a value of $k \leq \min\{N, d\}$ and using (3), the truncated form $X_k$ of $X$ is defined by

$$X_k = \sum_{i=1}^{k} \sigma_i\, u_i v_i^T = U_k \Sigma_k V_k^T, \tag{4}$$
where only the first $k$ column vectors of $U$ and $V$ and the $k \times k$ sub-matrix of $\Sigma$ are selected. The form $X_k$ in (4) is mathematically guaranteed to be the optimal rank-$k$ approximation of $X$ (Boutsidis et al., 2015). Due to the orthogonality of $U_k$ and $V_k$, the matrix $X V_k V_k^T$ (resp. $U_k U_k^T X$) has rank at most $k$ and approximates $X$. The computational complexity of (4) is $O(N d \min\{N, d\})$, which makes it infeasible if $\min\{N, d\}$ is large.
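As a point of reference, the exact best rank-$k$ approximation of (4) can be obtained from a full SVD, as in the short NumPy sketch below (illustrative only); the call to np.linalg.svd is precisely the $O(Nd \min\{N, d\})$ step.

```python
import numpy as np

def truncated_svd(X, k):
    """Exact best rank-k approximation X_k = U_k Sigma_k V_k^T of Eq. (4)."""
    # Full SVD: the O(N d min{N, d}) step that becomes infeasible
    # when min{N, d} is large.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```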
To speed up the computation of the best rank-$k$ approximation of $X$, it is possible to use a fast approximate SVD algorithm. This algorithm, recently used in (Boutsidis et al., 2015), relies on random projection. The principle is the following (Menon and Elkan, 2011): we consider the subspace spanned by a random projection $Y = X R$, where $R$ is a $d \times p$ random matrix. It is shown that by projecting $X$ onto the column space of $Y$, and then finding the best rank-$k$ approximation in this new space (i.e. the truncated SVD), we get a good approximation to the best rank-$k$ approximation of $X$ itself. Thus the fast approximate SVD algorithm takes as input the matrix $X$ and integers $k$ and $p$ such that $2 < k < \operatorname{rank}(X)$ and $k \leq p \ll d$. The error in the approximation is directly linked to $p$ (details about the error bound can be found in (Boutsidis et al., 2015)). The fast approximate SVD (Fast-AppSVD) algorithm is the following:

1. Generate a $d \times p$ random matrix $R \sim \mathcal{N}(0, I_p)$,
2. Compute the matrix $Y = XR$,
3. Orthonormalize $Y$ to obtain $Q$ of size $N \times p$,
4. Set $G$ (of size $d \times k$) as the top $k$ right singular vectors of $Q^T X$.

Then $G$ can be used as a projection matrix.
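A direct NumPy transcription of these four steps might look as follows. This is a sketch of our reading of the procedure; the function name fast_app_svd and the random seed handling are not from the paper.

```python
import numpy as np

def fast_app_svd(X, p, k, seed=0):
    """Fast approximate SVD: returns G (d x k), to be used as a projection matrix."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    R = rng.standard_normal((d, p))          # 1. d x p Gaussian random matrix R
    Y = X @ R                                # 2. random projection, Y is N x p
    Q, _ = np.linalg.qr(Y)                   # 3. orthonormalize Y, Q is N x p
    _, _, Vt = np.linalg.svd(Q.T @ X, full_matrices=False)
    return Vt[:k].T                          # 4. top-k right singular vectors of Q^T X
```

The expensive SVD is now applied to the much smaller $p \times d$ matrix $Q^T X$, and $X G G^T$ approximates the best rank-$k$ approximation of $X$.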
3 THE PROPOSED APPROACH
The proposed approach proceeds in two steps. Firstly, we perform a feature selection by applying the fast approximate SVD described in the previous section. Secondly, we perform the classical LDA in the $k$-dimensional space obtained in the first step. The proposed approach thus makes it possible to perform linear discriminant analysis with very large matrices. Algorithm 1 gives the main steps of our method.
Algorithm 1: appSVD+LDA algorithm.

INPUTS: $X$, $Y$, $p$, $k$, and $\mu$
OUTPUT: $\widetilde{W}$

1. Compute $G = \text{Fast-AppSVD}(X, p, k)$,
2. Project $X$ using $G$ to obtain $\widetilde{X} = XG$,
3. Calculate $\widetilde{S}_w$ and $\widetilde{S}_b$ from $\widetilde{X}$,
4. Find $\widetilde{W}$ as the eigenvectors of $\widetilde{S}_w^{-1} \widetilde{S}_b$ if $\widetilde{S}_w$ is not singular, and of $(\widetilde{S}_w + \mu I_p)^{-1} \widetilde{S}_b$ otherwise,
5. Return $\widetilde{W}$.
If the scatter matrix $\widetilde{S}_w$ is singular, we apply a regularization process to solve the singularity problem, i.e. we compute $(\widetilde{S}_w + \mu I_p)^{-1} \widetilde{S}_b$, where $\mu$ is a regularization term. Note that $(\widetilde{S}_w + \mu I)^{-1}$ amounts to adding a diagonal term to $\widetilde{S}_w$ to make sure that very small eigenvalues are bounded away from zero, which ensures numerical stability when computing the inverse of $\widetilde{S}_w$.
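Putting the two stages together, a minimal sketch of Algorithm 1, including the regularized variant of step 4, is given below. It is a NumPy illustration reusing the fast_app_svd sketch above; the rank-based singularity test and the function names are our assumptions, not the paper's code.

```python
import numpy as np

def scatter_matrices(X, y):
    """Within- and between-class scatter matrices of Eq. (2)."""
    classes, d = np.unique(y), X.shape[1]
    m = X.mean(axis=0)
    S_w, S_b = np.zeros((d, d)), np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_w += (Xc - mc).T @ (Xc - mc)
        S_b += len(Xc) * np.outer(mc - m, mc - m)
    return S_w, S_b

def app_svd_lda(X, y, p, k, mu=0.5):
    """appSVD+LDA sketch: fast approximate SVD followed by (regularized) LDA."""
    G = fast_app_svd(X, p, k)                        # step 1 (sketched earlier)
    X_red = X @ G                                    # step 2: reduced data, N x k
    S_w, S_b = scatter_matrices(X_red, y)            # step 3
    if np.linalg.matrix_rank(S_w) < S_w.shape[0]:    # step 4: regularize if singular
        S_w = S_w + mu * np.eye(S_w.shape[0])
    evals, evecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    order = np.argsort(-evals.real)
    K = len(np.unique(y))
    W_tilde = evecs[:, order[:K - 1]].real           # keep the K-1 leading directions
    return G, W_tilde                                # overall projection: G @ W_tilde
```

A new sample $x$ is then mapped by $x G \widetilde{W}$ before being classified in the $(K-1)$-dimensional space.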
It can be shown that the projection matrix $\widetilde{W}$ is a good approximation of $W$. The data covariance matrix in the original $d$-dimensional space is given by

$$S = \frac{1}{N}(X - m)^T (X - m).$$
The Fast-AppSVD algorithm provides $G$ such that $XGG^T$ is a low-rank approximation of $X$. The matrix $\widetilde{X} = XG$ is a new representation of the original data matrix in the reduced feature space. In the new space, the covariance matrix $\widetilde{S}$ can be written as

$$\widetilde{S} = \frac{1}{N}(\widetilde{X} - \widetilde{m})^T (\widetilde{X} - \widetilde{m}) = \frac{1}{N}(XG - mG)^T (XG - mG) = \frac{1}{N} G^T (X - m)^T (X - m) G = G^T S G. \tag{5}$$
Similarly we get:

$$\widetilde{S}_w = G^T S_w G \quad \text{and} \quad \widetilde{S}_b = G^T S_b G. \tag{6}$$
Then

$$\widetilde{W}^T \widetilde{S}_b \widetilde{W} = \widetilde{W}^T G^T S_b G \widetilde{W} = W^T S_b W,$$

with $W = G\widetilde{W}$. The new LDA objective function can thus be rewritten as follows:

$$J(\widetilde{W}) = \frac{\det(\widetilde{W}^T \widetilde{S}_b \widetilde{W})}{\det(\widetilde{W}^T \widetilde{S}_w \widetilde{W})} = \frac{\det(W^T S_b W)}{\det(W^T S_w W)}. \tag{7}$$
The optimal projection matrix $\widetilde{W}$ (for simplicity we do not use a distinct symbol for the optimal value) is formed by the eigenvectors associated with the largest eigenvalues of $\widetilde{S}_w^{-1} \widetilde{S}_b$. The obtained projection matrix $\widetilde{W}$ should therefore be a good approximation of $W$ as long as $\widetilde{X}$ is a good approximation of $X$.
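The identities (5)–(7) are easy to check numerically. The short script below, which reuses the fast_app_svd and scatter_matrices sketches above on synthetic data chosen only for illustration, verifies that the scatter matrices computed in the reduced space coincide with $G^T S_w G$ and $G^T S_b G$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))             # synthetic data, N = 200, d = 50
y = rng.integers(0, 3, size=200)               # K = 3 arbitrary class labels

G = fast_app_svd(X, p=20, k=20)                # projection matrix from the sketch above
S_w, S_b = scatter_matrices(X, y)              # scatters in the original space
S_w_red, S_b_red = scatter_matrices(X @ G, y)  # scatters in the reduced space

print(np.allclose(S_w_red, G.T @ S_w @ G))     # True, as in Eq. (6)
print(np.allclose(S_b_red, G.T @ S_b @ G))     # True, as in Eq. (6)
```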
4 EXPERIMENTAL RESULTS
In this section, the performances of the proposed algorithm appSVD+LDA are given. The experiments are based on real data sets, including face recognition and text classification data, which can be downloaded at http://www.cad.zju.edu.cn/home/dengcai/Data/data.html. All the experiments have been performed on a P4 2.7 GHz Windows 7 machine with 16 GB memory. We have used Matlab for programming.
4.1 Data Sets
Two image data sets, ORL and COIL20, and two text data sets, TDT2 and Reuters21578, have been used in our experiments. The image data have been normalized to have L2-norm equal to 1. For the text data, each document has been represented as a term-frequency vector and normalized to have L2-norm equal to 1. The statistics of these data sets are listed in Table 1.

COIL20. This data set contains 1440 sample images of 20 different subjects. The size of each image is 32 × 32 pixels.

ORL. This data set contains 10 different poses of 40 distinct subjects with dimension 4096 (64 × 64 pixels). The images were taken at different times and range from full right profile to full left profile.

TDT2. (NIST Topic Detection and Tracking corpus) This subset contains 9394 documents in 30 categories with 36771 features.

Reuters21578. These data were originally collected and labeled by Carnegie Group, Inc. and Reuters, Ltd. The corpus contains 8293 documents in 65 categories with 18933 distinct terms.
4.2 Experiments
For the COIL20, TDT2 and Reuters21578 data sets, a subset TN = [10%, 30%, 50%] of labelled samples per class was selected at random to form the training set.
Table 1: Statistics of the data sets and value of the chosen parameter p.

data sets      samples (N)   dim (d)   # of classes   dim-Red (p)
COIL20         1440          1024      20             20
ORL            400           4096      40             50
Reuters21578   8293          18933     65             80
TDT2           9394          36771     30             80
For the ORL data, we randomly selected TN = [2, 4, 6] samples per class for training. The rest of the samples were used for testing.
We set the regularization parameter µ = 0.5 and k = p for the fast approximate SVD, under the assumption that $K - 1 \leq k \leq p \ll d$. Table 1 shows, for each data set, the dimension p that we chose for the intermediate space. Since K − 1 directions can be generated by LDA, we finally retain K − 1 vectors of W and then classify the transformed data in the new space of dimension K − 1.
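For illustration, the sketch below reproduces one random split of this protocol in NumPy, reusing the app_svd_lda sketch from Section 3. The paper does not state which classifier is applied in the (K − 1)-dimensional space, so a nearest-centroid rule is used here purely as a placeholder assumption.

```python
import numpy as np

def run_one_split(X, y, train_fraction, p, mu=0.5, seed=0):
    """One random split of the evaluation protocol (illustrative assumptions only)."""
    rng = np.random.default_rng(seed)
    train = []
    for c in np.unique(y):                                   # sample TN% per class
        idx = np.flatnonzero(y == c)
        n_train = max(1, int(round(train_fraction * len(idx))))
        train.extend(rng.choice(idx, size=n_train, replace=False))
    train = np.array(train)
    test = np.setdiff1d(np.arange(len(y)), train)

    G, W = app_svd_lda(X[train], y[train], p=p, k=p, mu=mu)  # k = p, mu = 0.5
    Z_tr, Z_te = X[train] @ G @ W, X[test] @ G @ W           # (K-1)-dimensional data
    classes = np.unique(y)
    centroids = np.array([Z_tr[y[train] == c].mean(axis=0) for c in classes])
    # Nearest-centroid prediction (assumed classifier, not specified in the paper).
    dists = ((Z_te[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    pred = classes[np.argmin(dists, axis=1)]
    return (pred == y[test]).mean()                          # accuracy on this split
```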
In order to assess the relevance of the proposed method appSVD+LDA, we have compared its performance with three other methods, which are listed below:

- Direct LDA (DLDA) (Friedman, 1989), which solves the LDA problem in the original space.
- LDA/QR (Ye and Li, 2004), which is a variant of LDA that needs to solve the QR decomposition of a small matrix.
- NovRP (Liu and Chen, 2009), which is an approach that uses sparse random projection as dimension reduction before performing LDA. The parameters µ and p have been set in the same way as for our approach.
4.3 Performance
The experimental results are given in Tables 2 to 9 for all the data sets described above. In these tables the results are averaged over 20 random splits for each TN (%); we report the mean as well as the standard deviation. As the running time is nearly constant across splits, we only report its mean value.
Tables 2 and 3 show the performance results on the COIL20 data. DLDA achieves the best accuracy in this case, but its running time is by far the highest. appSVD+LDA gives quite good accuracy and its running time is nearly 100 times smaller than that of DLDA. The running time of NovRP is the lowest in this case, but so is its accuracy. For the ORL data, experimental results are displayed in Tables 4 and 5. As can be seen, appSVD+LDA gives the best accuracy (for 4 and 6 samples) and a low running time. As the dimension is relatively large in this case, the computation time of DLDA is very large (see Table 5).
Reuters21578 and TDT2 are very large data sets. As DLDA needs to store the centered data matrix and the scatter matrices in the original feature space, it is infeasible to apply DLDA in these cases. Tables 6 to 9 therefore display only the performance results for NovRP, LDA/QR and appSVD+LDA. The NovRP method gives the lowest running time (see Tables 7 and 9) whereas its accuracy is by far the lowest. It can be seen that the chosen value of p is sufficient for appSVD+LDA to reach nearly 86% accuracy on Reuters21578 and 95% on TDT2, with a quite small computational time (see Tables 7 and 9). Overall, appSVD+LDA significantly outperforms LDA in running time, and its accuracy supports its effectiveness and efficiency compared to the other methods.
4.4 Parameter Tuning

There are three essential parameters in the proposed method: µ, p and k. µ is used for the regularization of the scatter matrix. k is the dimension of the new feature space where LDA is performed. p is the dimension of the intermediate subspace onto which the original features are randomly mapped. A sensitive point of the proposed appSVD+LDA is the choice of p. This parameter should guarantee a minimum distortion between data points after the random mapping. In the final space each point is represented as a k-dimensional feature vector, which leads to a faster classification process. In our experiments, we chose k = p. To illustrate the impact of this parameter, we take various values of p. The accuracy and the training time as a function of p, averaged over 20 random splits, are plotted in Figures 1 and 2. The methods DLDA and LDA/QR do not depend on p, contrary to appSVD+LDA and NovRP. In Figure 1 (right), the training time of DLDA is not plotted since it is much higher than the others. It can be seen that the accuracy of the proposed method is good for small values of p (p = 80) and increases only slowly with p.
Table 2: Accuracy on COIL20 (Mean ± Std-Dev %).

TN    DLDA          NovRP         LDA/QR        appSVD+LDA
10%   85.88 ± 1.78  73.05 ± 2.55  80.88 ± 1.83  84.37 ± 2.22
30%   94.14 ± 1.02  79.65 ± 2.59  88.28 ± 2.45  90.43 ± 0.97
50%   95.42 ± 0.89  81.42 ± 2.04  90.37 ± 1.84  91.89 ± 1.25

Table 3: Computational time on COIL20 (s).

TN    DLDA    NovRP   LDA/QR   appSVD+LDA
10%   2.152   0.006   0.009    0.017
30%   2.189   0.007   0.030    0.019
50%   2.242   0.008   0.050    0.022

Table 4: Accuracy on ORL (Mean ± Std-Dev %).

TN      DLDA          NovRP         LDA/QR        appSVD+LDA
2 × 40  69.70 ± 3.45  46.52 ± 3.07  74.50 ± 2.11  67.41 ± 2.79
4 × 40  84.65 ± 2.29  75.75 ± 2.50  85.90 ± 2.85  89.35 ± 1.88
6 × 40  90.12 ± 2.50  84.81 ± 2.93  90.69 ± 1.49  92.94 ± 1.87

Table 5: Computational time on ORL (s).

TN      DLDA     NovRP   LDA/QR   appSVD+LDA
2 × 40  93.223   0.041   0.065    0.225
4 × 40  93.340   0.042   0.104    0.228
6 × 40  93.584   0.045   0.145    0.235

Table 6: Accuracy on Reuters21578 (Mean ± Std-Dev %).

TN    DLDA   NovRP         LDA/QR        appSVD+LDA
10%   –      48.77 ± 1.55  75.57 ± 0.73  83.27 ± 0.76
30%   –      48.75 ± 1.99  83.17 ± 0.62  86.72 ± 0.58
50%   –      46.81 ± 1.81  86.52 ± 0.44  86.53 ± 0.55

Table 7: Computational time on Reuters21578 (s).

TN    DLDA   NovRP   LDA/QR   appSVD+LDA
10%   –      0.271   3.038    1.873
30%   –      0.282   10.190   2.286
50%   –      0.296   19.101   2.494

Table 8: Accuracy on TDT2 (Mean ± Std-Dev %).

TN    DLDA   NovRP         LDA/QR        appSVD+LDA
10%   –      58.92 ± 1.40  92.45 ± 0.70  94.07 ± 0.91
30%   –      62.68 ± 1.28  95.29 ± 0.23  95.11 ± 0.66
50%   –      63.33 ± 1.22  95.74 ± 0.28  95.23 ± 0.77

Table 9: Computational time on TDT2 (s).

TN    DLDA   NovRP   LDA/QR   appSVD+LDA
10%   –      0.539   4.066    3.891
30%   –      0.574   11.618   4.552
50%   –      0.582   18.591   4.627
The computation time of appSVD+LDA, in contrast, increases quickly with p. For NovRP, the behaviour is the opposite: the accuracy increases quickly with p whereas the time increases only slowly. For Reuters21578, the best accuracy is obtained with appSVD+LDA for every value of p in the considered range between 80 and 250.
Figure 1: Accuracy vs Dimension p (left) and Time vs Dimension p (right) on the COIL20 data set for TN = 50%.
Figure 2: Accuracy vs Dimension p (left) and Time vs Dimension p (right) on the Reuters21578 data set for TN = 30%.
5 CONCLUSION
This work provides a novel approach to tackle the problems encountered when performing LDA with large scale data sets. It consists in looking for a low-rank approximation of the original feature space and combines fast approximate singular value decomposition with LDA. We show, through experiments on real-world data sets, the effectiveness and the efficiency of the proposed method appSVD+LDA. As can be seen, appSVD+LDA outperforms direct LDA in terms of computational time and achieves competitive performance in comparison with other state-of-the-art methods. appSVD+LDA makes it possible to classify large scale data while holding only a small number of features (k). For example, on the Reuters21578 data set, where the original feature space has dimension d = 18933, it achieves more than 86% accuracy in nearly two seconds, whereas it is infeasible to perform direct LDA in this case. The performance results displayed by appSVD+LDA are very encouraging for learning LDA in both low- and high-dimensional spaces.
ACKNOWLEDGEMENTS
This work is supported by the Champagne-Ardenne region, France, through the APERUL project (Machine Learning).
REFERENCES
Achlioptas, D. (2003). Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687.

Boutsidis, C., Zouzias, A., Mahoney, M. W., and Drineas, P. (2015). Randomized dimensionality reduction for k-means clustering. IEEE Transactions on Information Theory, 61(2):1045–1062.

Cardoso, Â. and Wichert, A. (2012). Iterative random projections for high-dimensional data clustering. Pattern Recognition Letters, 33(13):1749–1755.

Chen, L., Man, H., and Nefian, A. V. (2005). Face recognition based on multi-class mapping of Fisher scores. Pattern Recognition, 38(6):799–811.

Duda, R. O., Hart, P. E., and Stork, D. G. (2012). Pattern Classification. John Wiley & Sons.
Friedman, J. H. (1989). Regularized discriminant analysis. Journal of the American Statistical Association, 84(405):165–175.

Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA.

Lee, Y. K., Lee, E. R., and Park, B. U. (2012). Principal component analysis in very high-dimensional spaces. Statistica Sinica, pages 933–956.

Liu, H. and Chen, W.-S. (2009). A novel random projection model for linear discriminant analysis based face recognition. In 2009 International Conference on Wavelet Analysis and Pattern Recognition, pages 112–117. IEEE.

Menon, A. K. and Elkan, C. (2011). Fast algorithms for approximating the singular value decomposition. ACM Transactions on Knowledge Discovery from Data (TKDD), 5(2):13.

Moulin, C., Largeron, C., Ducottet, C., Géry, M., and Barat, C. (2014). Fisher linear discriminant analysis for text-image combination in multimedia information retrieval. Pattern Recognition, 47(1):260–269.

Welling, M. (2005). Fisher linear discriminant analysis. Department of Computer Science, University of Toronto, 3:1–4.

Ye, J. and Li, Q. (2004). LDA/QR: an efficient and effective dimension reduction algorithm and its theoretical foundation. Pattern Recognition, 37(4):851–854.

Ye, J., Li, Q., Xiong, H., Park, H., Janardan, R., and Kumar, V. (2005). IDR/QR: an incremental dimension reduction algorithm via QR decomposition. IEEE Transactions on Knowledge and Data Engineering, 17(9):1208–1222.

Yu, H. and Yang, J. (2001). A direct LDA algorithm for high-dimensional data with application to face recognition. Pattern Recognition, 34(10):2067–2070.