SPEEDING UP LATENT SEMANTIC ANALYSIS

A Streamed Distributed Algorithm for SVD Updates

Radim

Reh˚uˇrek

NLP lab, Masaryk University in Brno, Brno, Czech Republic

Keywords:

Distributed LSA, Streamed SVD, One-pass SVD, Matrix decomposition, Subspace tracking.

Abstract:

Since its inception 20 years ago, Latent Semantic Analysis (LSA) has become a standard tool for robust,

unsupervised inference of semantic structure from text corpora. At the core of LSA is the Singular Value

Decomposition algorithm (SVD), a linear algebra routine for matrix factorization. This paper introduces a

streamed distributed algorithm for incremental updates, which allows the factorization to be computed rapidly

in a single pass over the input matrix on a cluster of autonomous computers.

1 INTRODUCTION

The purpose of Latent Semantic Analysis (LSA) is to

ﬁnd hidden (latent) structure in a collection of texts

represented in the Vector Space Model (Salton, 1989).

LSA was introduced in (Deerwester et al., 1990) and

has since become a standard tool in the ﬁeld of Nat-

ural Language Processing and Information Retrieval.

At the heart of LSA lies the Singular Value Decom-

position algorithm, which makes LSA (sometimes

also called Latent Semantic Indexing, or LSI) really

just another member of the broad family of applica-

tions that make use of SVD’s robust and mathemat-

ically well-founded approximation capabilities

. In

this way, although we will discuss our results in the

perspectiveand terminology of LSA and Natural Lan-

guage Processing, our results are in fact applicable to

a wide range of problems and domains across much

of the ﬁeld of Computer Science.

1.1 Singular Value Decomposition

Latent Semantic Analysis assumes that each docu-

ment (observation) can be described using a ﬁxed set

of real-valued features (variables). These features

capture the usage frequency of distinct words in the

document, and are typically re-scaled by some TF-

IDF scheme (Salton, 1989). However, no assumption

Another member of that family is the discrete KarhunenLove

Transform, from Image Processing; or Signal Processing, where

SVD is commonly used to separate signal from noise. SVD is

also used in solving shift-invariant differential equations, in Geo-

physics, in Antenna Array Processing, ...

is made on what the particular features are or how

to extract them from raw data—LSA simply repre-

sents the input data collection of n documents, each

described by m features, as a matrix A ∈ R

m×n

, with

documents as columns and features as rows.

In theory, the term-document matrix A can be di-

rectly mapped to physical memory and processed.

However, A is commonly left implicit, as it is often

distributed among many computers, unknown in its

entirety and/or computed partially on-demand. In-

deed, in many cases the implied matrix is even inﬁnite

in size, gradually growing as the collection grows.

Nevertheless, there is a reason why it is worth-

while to view the collection as a single gigantic ma-

trix. It allows us to consider linear algebra decompo-

sition algorithms that provide succint, mathematically

well-founded ways of discovering latent structure of

the matrix and, thereby, of the original data collection.

In case of LSA, the algorithm of choice is the Singu-

lar Value Decomposition, SVD. Assuming n ≫ m, the

SVD of A yields A

m×n

= U

m×m

m×n

, where U,

V are orthogonal matrices (called left and right sin-

gular vectors) and S is a diagonal matrix of singular

values with diagonal entries in decreasing order.

Here and elsewhere, we use the subscripts A

x×y

denote that matrix A has x rows and y columns but

omit the subscripts whenever there is no risk of con-

fusion.

1.2 Related Work

Historically, most research on SVD optimization has

gone into Krylov subspace methods, such as Lanczos-

446

Reh˚u

rek R..

SPEEDING UP LATENT SEMANTIC ANALYSIS - A Streamed Distributed Algorithm for SVD Updates.

DOI: 10.5220/0003191304460451

In Proceedings of the 3rd International Conference on Agents and Artiﬁcial Intelligence (ICAART-2011), pages 446-451

ISBN: 978-989-8425-40-9

 2011 SCITEPRESS (Science and Technology Publications, Lda.)

based iterative solvers (see e.g. (Vigna, 2008) for

a recent large-scale SVD effort). Our problem is,

however, different in that we can only afford a sin-

gle pass over the input corpus (iterative solvers re-

quire O(k) passes in general). Our scenario, where

the decomposition must be updated on-the-ﬂy and

in constant memory, as the stream of observations

cannot be repeated or even stored in off-core mem-

ory, can be viewed as an instance of subspace track-

ing. See (Comon and Golub, 1990) for an excellent

overview on the complexity of various forms of ma-

trix decomposition algorithms in the context of sub-

space tracking.

An explicitly formulated O(m(k + c)

) method

(where c is the number of newly added documents)

for incremental LSA updates is presented in (Zha and

Simon, 1999). They also give formulas for updating

rows of A as well as rescaling row weights. Their al-

gorithm is completely streamed and runs in constant

memory. It can therefore also be used for online sub-

space tracking, by simply ignoring all updates to the

right singular vectors V. The complexity of updates

was further reduced in (Brand, 2006) who proposed

a linear O(mkc) update algorithm by a series of c fast

rank-1 updates. However, in the process, the ability to

track subspaces is lost (despite their tentative claim to

the contrary). Their approach is akin to the k-Nearest

Neighbours (k-NN) method of Machine Learning: the

lightning speed during training is offset by memory

requirements of storing the intermediate model.

These incremental methods concern themselves

with adding new documents to an existing decompo-

sition. What is needed for a distributed version of

LSA is a slightly different task: given two existing

decompositions, merge them together into one. We

did not ﬁnd any explicit, efﬁcient algorithm for merg-

ing decompositions in the literature. In this research

paper we therefore seek to close this gap, provide

such algorithm and use it for computing distributed

LSA. The following section describes the algorithm

and states conditions under which the merging makes

sense when dealing with only truncated rank-k ap-

proximation of the decomposition.

2 DISTRIBUTED LSA

In this section, we derive an algorithm for distributed

computing of LSA over a cluster of autonomous com-

puters.

2.1 Algorithm Overview

Parallel computation will be achieved by column-

partitioning the input matrix A into several

smaller submatrices, called jobs, A

m×n

m×c

,··· ,A

m×c

∑

= n. Since columns

of A correspond to documents, each job A

amounts to

processing a chunk of c

input documents. The sizes

of these chunks are chosen to ﬁt available resources

of the processing nodes: bigger chunks mean faster

overall processing but on the other hand consume

more memory.

Jobs are then distributed among the available clus-

ter nodes, in no particular order, so that each node will

be processing a different set of column-blocksfrom A.

The nodes need not process the same number of jobs,

nor process jobs at the same speed; the computations

are completely asynchronous and independent. Once

all jobs have been processed, the decompositions ac-

cumulated in each node will be merged into a single,

ﬁnal decomposition P = (U,S).

What is needed are thus two algorithms:

1. Find P

= (U

m×c

×c

) eigen decomposition of

a single job A

m×c

such that A

= U

These P

decompositions will form the base case

for merging.

2. Merge two decompositions P

= (U

), P

) of two jobs A

, A

into a single decom-

position P = (U, S) such that







We’d like to highlight the fact that the ﬁrst algo-

rithm will perform decomposition of a sparse input

matrix; the second algorithm will merge two dense

decompositions into another dense decomposition.

This is in contrast to incremental updates discussed

in the literature (Levy and Lindenbaum, 2000; Zha

and Simon, 1999; Brand, 2006), where the existing

decomposition and the new documents are mashed to-

gether into a single matrix, losing any potential bene-

ﬁts of sparsity.

2.2 Solving the Base Case

In LSA (using the same notation introduced in Sec-

tion 1), the truncation factor k is typically in the hun-

dreds or thousands, the number of features m be-

tween 10

to 10

and the number of documents n

tends to inﬁnity, so that k ≪ m ≪ n. Density of the

job matrices is well below 1%, so a sparse solver is

called for, that makes efﬁcient use of the sparse ma-

trix structure. Also, a direct sparse SVD solver of A

is preferable to the roundabout eigen decomposition

of A

, for memory-conserving as well as numeri-

cal accuracy reasons (see e.g. (Golub and Van Loan,

1996)). Finally, because k ≪ m, a partial decompo-

sition is required which only returns the k greatest

SPEEDING UP LATENT SEMANTIC ANALYSIS - A Streamed Distributed Algorithm for SVD Updates

447

factors—computing the full spectrum would be a ter-

rible overkill.

For these reasons, we solve base cases with a

“black-box” in-core sparse partial SVD algorithm.

2.3 Merging Decompositions

No explicit algorithm (as far as we know) exists for

merging two truncated eigen decompositions(or SVD

decompositions) into one. We therefore propose our

own, novel algorithm here, starting with its derivation

and summing up the ﬁnal version in the end.

The problem can be stated as follows. Given two

truncated eigen decompositions P

= (U

m×k

×k

= (U

m×k

×k

), which come from the (by now

lost and unavailable) input matrices A

m×c

, A

m×c

≤ c

and k

≤ c

, ﬁnd P = (U,S) that is the eigen

decomposition of





Our ﬁrst approximation will be the direct naive

U,S

eigen

←−−−







. (1)

This is terribly inefﬁcient, and forming the ma-

trix product of size m × m on the right hand side is

prohibitively expensive. Writing SVD

for truncated

SVD that returns only the k greatest singular numbers

and their associated singular vectors, we can equiva-

lently write:

Algorithm 1: SVD merge.

Input: Truncation factor k, Decay factor γ,

= (U

m×k

×k

), P

= (U

m×k

×k

)

Output: P = (U

m×k

k×k

)

U,S,V

SVD

←−−−



γU



This is more reasonable, with the added bonus of

increased numerical accuracy over the related eigen

decomposition. Note, however, that the computed

right singular vectors V

are not used at all, which

is a sign of further inefﬁciency. This leads us to break

the algorithm into several steps:

Algorithm 2: QR merge.

Input: Truncation factor k, Decay factor γ,

= (U

m×k

×k

), P

= (U

m×k

×k

)

Output: P = (U

m×k

k×k

)

Q,R

←−−



γU



,S,V

SVD

←−−− R2

m×k

← Q

m×(k

)

)×k

On line 1, an orthonormal subspace basis Q is

found which spans both of the subspaces deﬁned by

columns of U

and U

, span(Q) = span(





Multiplications by S

, S

and γ provide scaling for R

only and do not affect Q in any way, as Q will always

be orthogonal (with columns of unit length). Our al-

gorithm of choice for constructing the new basis is

QR factorization, because we can use its other prod-

uct, the upper trapezoidal matrix R, to our advantage.

Now we’re almost ready to declare (Q,R) our target

decomposition (U,S), except R is not diagonal. To

diagonalize R, we perform an SVD on it, on line 2.

This gives us the singular values S we need as well

as the rotation of Q necessary to represent the ba-

sis in this new subspace. The rotation is applied on

line 3. Finally, both output matrices are truncated to

the requested rank k. The costs are O(m(k

+ k

)

O((k

+ k

)

) and O(m(k

+ k

)

) for line 1, 2 and 3

respectively, for a combined total of O(m(k

+ k

)

Although more elegant than Algorithm 1, the

baseline algorithm is only marginally more efﬁcient

than a direct SVD. This comes as no surprise, as the

two algorithms are quite similar and SVD of rectangu-

lar matrices is often internally implemented by means

of QR in exactly this way. Luckily, we can do better.

First, we observe that the QR decomposition

makes no use of the fact that U

and U

are already

orthogonal. Capitalizing on this will allow us to

represent U as an update to the existing basis U

U =



′



, dropping the complexity of the ﬁrst

step to O(mk

). Secondly, the application of rota-

tionU

toU can be rewritten as UU



′



′

, dropping the complexity of the last step

to O(mkk

+ mkk

). Plus, the algorithm can be made

to work by modifying the existing matrices U

place inside BLAS routines, which is a considerable

practical improvement over Algorithm 2, which re-

quires allocating additional m(k

+ k

) ﬂoats.

Algorithm 3: Optimized merge.

Input: Truncation factor k, Decay factor γ,

= (U

m×k

×k

), P

= (U

m×k

×k

)

Output: P = (U

m×k

k×k

)

×k

← U

′

←−− U

−U

,S,V

SVD

←−−−



γS

0 RS



)×(k

)

×k

= U

U ← U

′

The ﬁrst two lines construct orthonormal basis U

′

for the component of U

that is orthogonal to U

ICAART 2011 - 3rd International Conference on Agents and Artificial Intelligence

448

span(U

′

) = span((I −U

) (2)

= span(U

−U

)). (3)

As before, we use QR factorization because the

upper trapezoidal matrix R will come in handy when

determining the singular vectors S.

Line 3 is perhaps the least obvious, but follows

from the requirement that the updated basis



U,U

′



must satisfy







′



X, (4)

so that

X =



′







. (5)

By multiplying,



′T









′T



(6)

and using the equalities R = U

′T

, U

′T

= 0

and U

= I (all by construction) we obtain

X =



0 U

′T





0 RS



. (7)

Line 4 is just a way of saying that on line 5, U

will be multiplied by the ﬁrst k

rows of U

, while U

′

will be multiplied by the remaining k

rows.

Finally, line 5 seeks to avoid realizing the full



′



matrix in memory and is a direct application

of the equality



′



m×(k

)

)×k

= U

′

. (8)

As for complexity of this algorithm, it is again

dominated by the matrix products and the dense

QR factorization, but this time only of a matrix of

size m × k

. The SVD of line 3 is a negligible

O(k

+ k

)

, and the ﬁnal basis rotation comes up to

O(mk max(k

)). Overall, with k

≈ k

≈ k, this is

an O(mk

) algorithm.

2.4 Effects of Truncation

While the equations above are provably correct when

using matrices of full rank (putting aside the techni-

cal question of gradual loss of accuracy due to ﬁnite

machine precision and rounding errors for now), it is

not at all clear how to justify truncating all interme-

diate matrices to rank k in each update. What effect

does this have on the merged decomposition? How do

these effects stack up as we perform several updates

in succession?

In (Zha and Zhang, 2000), the authors did the hard

work and identiﬁed conditions under which operating

with truncated matrices produces exact results. More-

over, they show by way of perturbation analysis that

the results are stable (though no longer exact) even

if the input matrix only approximately satisﬁes this

condition. They investigate matrices of the so-called

low-rank-plus-shift structure, which (approximately)

satisfy

A/m ≈ CWC

+ σ

, (9)

where C, W are of rank k, W positive semi-

deﬁnite, σ

the variance of noise. That is, A

A can

be expressed as a sum of a low-rank matrix and a

multiple of the identity matrix. They show that ma-

trices coming from natural language corpora do in-

deed possess this structure and that in this case, a

rank-k approximation of A can be expressed as a com-

bination of rank-k approximations of its submatrices

A =





without a serious loss of precision.

2.5 Putting It Together

Let N = {N

,...,N

} be the p available cluster nodes.

Each node will be processing incoming jobs sequen-

tially, running the base case decomposition from Sec-

tion 2.2 followed by merging the result with the cur-

rent decomposition by means of Algorithm 1, 2 or 3:

Algorithm 4: LSA Node.

Input: Truncation factor k, Queue of jobs A

,...

Output: P = (U

m×k

k×k

) decomposition of



,...



P = (U,S) ← 0

m×k

k×k

foreach job A

do2

′

= (U

′

) ← BaseCase Decomposition(k,A

P ← Merge Algo{1,2,3}(k, P,P

′

end5

To construct decomposition of the full matrix A,

we let the nodes work in parallel, distributing the jobs

as soon as they arrive, to whichever node seems idle.

We do not describe the technical issues of load bal-

ancing and recovery from node failure here, but stan-

dard practices apply. Once we have processed all the

jobs (or temporarily exhausted the input job queue, in

the inﬁnite streaming scenario), we merge their indi-

vidual results into one:

Algorithm 5: Distributed LSA.

Input: Truncation factor k, Queue of jobs

A =



,...



Output: P = (U

m×k

k×k

) decomposition of A

= (U

) ← LSA Node Algo4(k, subset of jobs1

from A), for i = 1,... , p

P ← Reduce(Merge Algo,



,... , P



)

SPEEDING UP LATENT SEMANTIC ANALYSIS - A Streamed Distributed Algorithm for SVD Updates

449

Here, line 1 is executed in parallel, making use of

all p nodes at once, and is the source of parallelism of

the algorithm. On line two, Reduce applies the func-

tion that is its ﬁrst argument cummulatively to the se-

quence that is its second argument, so that it effec-

tively merges P

with P

, followed by merging that

result with P

, etc.

We note that other divide-and-conquer schemes

are also possible, such as one where the merging in

line 2 of Algorithm 5 happens on pairs of decompo-

sitions coming from approximately the same number

of input documents, so that the two sets of singular

values are of comparable magnitude. Doing so could

lead to improved numerical properties, but we have

not investigated this effect yet.

The algorithm is formulated in terms of a (poten-

tially inﬁnite) sequence of jobs, so that when more

jobs arrive, we can continue updating the decomposi-

tion in a natural way. The whole algorithm can in fact

act as a continuousdaemon service, providing decom-

position of all the jobs processed so far on demand.

3 EXPERIMENTS

In this section, we describe a set of experiments mea-

suring numerical accuracy of the proposed algorithm.

In all experiments, the decay factor γ is set to 1.0,

that is, there is no discounting in favour of new obser-

vation. The number of requested factors is k = 200 in

all cases.

3.1 Set Up

We will be comparing four implementations for par-

tial Singular Value Decomposition:

SVDLIBC A direct sparse SVD implementation due

to Douglas Rohde

. SVDLIBC is based on the

SVDPACK package by Michael Berry (Berry,

1992). We use the LAS2 routine (Lanczos of the

related implicit A

A or AA

matrix with selective

orthogonalizations) to retrieve only the k domi-

nant singular triplets. The implementation works

in-core and therefore doesn’t scale.

ZMS Implementationof the incremental one-pass al-

gorithm from (Zha et al., 1998). The right singular

vectors and their updates are completely ignored

so that our implementation of their algorithm also

realizes subspace tracking.

DLSA Our proposed method. We will be evalu-

ating three different versions of merging, Algo-

rithms 1, 2 and 3, calling them DLSA

, DLSA

and

http://tedlab.mit.edu/∼dr/SVDLIBC/

DLSA

in the results, respectively. Sparse SVD

during base case decomposition is realized by an

adapted LAS2 routine from SVDLIBC, see above.

With the exception of SVDLIBC, all the other al-

gorithms operate in a streaming fashion (ehek and So-

jka, 2010), so that the corpus need not reside in core

memory all at once. Although memory footprint of

all algorithms is independent of the size of the cor-

pus, it is still linear in the number of features, O(m).

It is assumed that the decomposition (U

m×k

k×k

) ﬁts

entirely into core memory.

For the experiments, we will be using a corpus of

3,494 mathematical articles collected from the digi-

tal libraries of NUMDAM, arXiv and DML-CZ. Af-

ter the standard procedure of pruning out word types

that are too infrequent (hapax legomena, typos, OCR

errors, etc.) or too frequent (stop words), we are

left with 39,022 distinct features. The ﬁnal matrix

39,022×3,494

has 1.5 million non-zero entries. This

corpus was chosen so that it ﬁts into core memory of

a single computer and its decomposition can there-

fore be computed directly. This will allow us to es-

tablish the “ground-truth” decomposition and set an

upper bound on achievable accuracy and speed.

3.2 Accuracy

Figure 1 plots the relative accuracy of singular val-

ues found by DLSA, ZMS, SVDLIBC and HEBB al-

gorithms compared to known, “ground-truth” values

. We measure accuracy of the computed singular

values S as r i = |s i − s G i|/s G i, for i = 1, ..., k.

The ground-truth singular values S

are computed di-

rectly with LAPACK’s DGESVD routine.

We observe that the largest singular values are

practically always exact, and accuracy quickly de-

grades towards the end of the returned spectrum. This

leads us to the following reﬁnement: When request-

ing x factors, compute the truncated updates for k > x,

such as k = 2x, and discard the extra x− k factors only

when the ﬁnal projection is actually needed. The er-

ror is then below 5%, which is comparable to the ZMS

algorithm (while DLSA is at least an order of magni-

tude faster even without any parallelization).

4 CONCLUSIONS

We developed and presented a novel single-pass eigen

decomposition method, which runs in constant mem-

ory w.r.t. the number of observations. The method is

embarrassingly parallel, so we also give its distributed

version.

ICAART 2011 - 3rd International Conference on Agents and Artificial Intelligence

450

Figure 1: Accuracy of singular values for various decomposition algorithms.

This method is suited for processing extremely

large (possibly inﬁnite) sparse matrices that arrive as a

stream of observations, where each observation must

be immediately processed and then discarded. It is

therefore best suitable for environments where the in-

put stream cannot be repeated and must be processed

in constant memory.

Future work will include a more thorough eval-

uation of the algorithm’s accuracy and performance,

with a side-by-side comparison to other decomposi-

tion algorithms.

ACKNOWLEDGEMENTS

This study was partially supported by the LC536

grant of MMT CR.

REFERENCES

Berry, M. (1992). Large-scale sparse singular value compu-

tations. The International Journal of Supercomputer

Applications, 6(1):13–49.

Brand, M. (2006). Fast low-rank modiﬁcations of the thin

singular value decomposition. Linear Algebra and its

Applications, 415(1):20–30.

Comon, P. and Golub, G. (1990). Tracking a few extreme

singular values and vectors in signal processing. Pro-

ceedings of the IEEE, 78(8):1327–1343.

Deerwester, S., Dumais, S., Furnas, G., Landauer, T., and

Harshman, R. (1990). Indexing by Latent Semantic

Analysis. Journal of the American society for infor-

mation science, 41(6):391–407.

ehek, R. and Sojka, P. (2010). Software Framework for

Topic Modelling with Large Corpora. In Proceedings

of LREC 2010 workshop on New Challenges for NLP

Frameworks, pages 45–50.

Golub, G. and Van Loan, C. (1996). Matrix computations.

Johns Hopkins University Press.

Levy, A. and Lindenbaum, M. (2000). Sequential

Karhunen–Loeve basis extraction and its application

to images. IEEE Transactions on Image processing,

9(8):1371.

Salton, G. (1989). Automatic text processing: the transfor-

mation, analysis, and retrieval of information by com-

puter. Addison-Wesley Longman Publishing Co., Inc.

Boston, MA, USA.

Vigna, S. (2008). Distributed, large-scale latent semantic

analysis by index interpolation. In Proceedings of the

3rd international conference on Scalable information

systems, pages 1–10. ICST.

Zha, H., Marques, O., and Simon, H. (1998). Large-scale

SVD and subspace-based methods for Information

Retrieval. Solving Irregularly Structured Problems in

Parallel, pages 29–42.

Zha, H. and Simon, H. (1999). On updating problems in

Latent Semantic Indexing. SIAM Journal on Scientiﬁc

Computing, 21:782.

Zha, H. and Zhang, Z. (2000). Matrices with low-rank-

plus-shift structure: Partial SVD and Latent Semantic

Indexing. SIAM Journal on Matrix Analysis and Ap-

plications, 21:522.

SPEEDING UP LATENT SEMANTIC ANALYSIS - A Streamed Distributed Algorithm for SVD Updates

451