Theoretical Notes on Unsupervised Learning in Deep Neural

Networks

Vladimir Golovko

1,2

and Aliaksandr Kroshchanka

1

1

Brest State Technical University, Moskowskaja 267, Brest, Belarus

2

National Research Nuclear University (MEPHI), Moscow, Russia

Keywords: Deep Neural Networks, Deep Learning, Restricted Boltzmann Machine, Data Visualization, Machine

Learning, Cross-entropy.

Abstract: Over the last decade the deep neural networks are the powerful tool in the domain of machine learning. The

important problem is training of deep neural network, because learning of such a network is much

complicated compared to shallow neural networks. This is due to the vanishing gradient problem, poor local

minima and unstable gradient problem. Therefore a lot of deep learning techniques were developed that

permit us to overcome some limitations of conventional training approaches. In this paper we investigate the

unsupervised learning in deep neural networks. We have proved that maximization of the log-likelihood

input data distribution of restricted Boltzmann machine is equivalent to minimizing the cross-entropy and to

special case of minimizing the mean squared error. The main contribution of this paper is a novel view and

new understanding of an unsupervised learning in deep neural networks.

1 INTRODUCTION

Deep neural networks (DNN) currently provide the

best performance to many problems in images,

video, speech recognition, and natural language

processing, etc. (Krizhevsky et al., 2012; Hinton et

al., 2012; Hinton and Salakhutdinov, 2006). In the

general case a deep neural network consists of

multiple layers of neural units and can accomplish a

deep hierarchical representation of their input data.

This kind of neural network has been investigated in

many studies (Hinton et al., 2006; Bengio, 2009;

Bengio et al., 2007.).

This paper deals with an unsupervised learning

technique for restricted Boltzmann machine (RBM),

which can be applied for the training of deep neural

networks. The conventional approach to unsupervi-

sed training the RBM uses an energy-based model

and is based on maximization of the log-likelihood

input data distribution using gradient descent

approach. In this paper we consider the unsupervised

deep learning from another point of view, which

provides a deeper understanding of the nature of

unsupervised learning in deep neural networks. First

of all we use two training criteria, namely square

error and cross-entropy, instead of energy-based

technique. Next, we present the RBM as PCA or

auto-encoder neural network, which consist of three

layers: visible, hidden and visible. Finally, the Gibbs

sampling in order to define mean square error and

cross-entropy loss function is used. As a result we

have proved that maximization of the log-likelihood

input data distribution of restricted Boltzmann

machine is equivalent to minimizing the cross-

entropy and to special case of minimizing the mean

squared error. The rest of the paper is organized as

follows. Section 2 introduces the conventional

approach for restricted Boltzmann machine training

based on an energy model. In Section 3 we propose

the novel techniques for inference of RBM training

rules and finally we give our conclusion.

2 RELATED WORKS

Let us consider the related works in this domain

(Hinton, 2002; Hinton et al., 2006; Erhan et al.,

2010; Mikolov et al., 2011; Bengio et al., 2013).

There are different kinds of deep neural networks:

deep belief neural networks, deep perceptron, deep

convolutional neural networks, deep recurrent neural

networks, deep auto-encoder, deep R-CNN and so

on. It should be noted that the training rules are

identical for different kind of deep neural networks.

Golovko V. and Kroshchanka A.

Theoretical Notes on Unsupervised Learning in Deep Neural Networks.

DOI: 10.5220/0006084300910096

In Proceedings of the 8th International Joint Conference on Computational Intelligence (IJCCI 2016), pages 91-96

ISBN: 978-989-758-201-1

Copyright

c

2016 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved

91

Therefore we will take the many-layered perceptron

as a deep neural network in order to investigate deep

learning rules (Fig.1).

The j-th output unit for k-th layer is given by

)(

k

j

k

j

SFy =

(1)

=

−

+=

1

1

i

k

j

k

i

k

ij

k

j

TyS

ω

(2)

where F is the activation function,

k

j

S

is the

weighted sum of the j-th unit,

k

ij

ω

is the weight from

the i-th unit of the (k-1)-th layer to the j-th unit of

the k-th layer, and

k

j

T

is the threshold of the j-th

unit.

For the first layer

ii

xy =

0

(3)

There exist the two main techniques for learning

of deep neural networks: learning with pre-training

using a greedy layer-wise approach and stochastic

gradient descent approach (SGD) with rectified

linear unit (ReLU) transfer function (LeCun et al.,

2015).

The learning with pre-training consists of two

stages (Hinton et al., 2006). The first stage is the

pre-training of neural network using greedy layer-

wise approach. This procedure is started from the

first layer and performed in unsupervised manner.

The second one is fine-tuning all of parameters of

neural network using back-propagation algorithm.

The training with stochastic gradient descent

approach is the online or mini-batch learning using

conventional backpropagation algorithm (Glorot et

al., 2011). The use of ReLU activation function can

help to avoid of vanishing gradient problem, poor

local minima and unstable gradient problem due to

the greater linearity of such kind of activation

function (LeCun et al., 2015).

At present the following paradigm for DNN

learning is used. If training data set is large then

SGD with ReLU is used for deep neural network

learning. Otherwise pre-training and fine-tuning is

applied. So, for instance, for smaller data sets,

unsupervised pre-training helps to prevent

overfitting (LeCun et al., 2015).

The most important stage of deep neural network

training is the pre-training of each layer of the DNN

in unsupervised manner. There exist two main

techniques for DNN pre-training. As a rule the DNN

pre-training is based on either the restricted

Boltzmann machine (RBM) or auto-encoder

approach (Larochelle et al., 2009). In accordance

with the greedy layer-wise training procedure, in the

beginning the first layer of the DNN is trained using

RBM or auto-encoder training rule and its

parameters are fixed. After this the next layer is

trained, and so on. As a result a good initialization of

the neural network is achieved and we can then use

back-propagation algorithm for fine tuning the

parameters of the whole neural network.

Further we will consider the DNN pre-training

technique based on the restricted Boltzmann

machine. In this case the deep neural network can be

represented as a set of restricted Boltzmann

machines. The traditional approach to RBM training

was proposed by G. Hinton and is based on an

energy model. Let's consider the conventional

restricted Boltzmann machine, which consists of two

layers of units: visible and hidden (Fig. 2).

The restricted Boltzmann machine can represent

any discrete distribution if enough hidden units are

used (Bengio, 2009). Often the binary units are used

(Hinton, 2010). The RBM is a stochastic neural

network and the states of visible and hidden units are

defined using a probabilistic version of the sigmoid

activation function.

Figure 1: Deep perceptron.

NCTA 2016 - 8th International Conference on Neural Computation Theory and Applications

92

Figure: 2. Restricted Boltzmann machine.

The key idea of RBM training is to reproduce as

closely as possible the distribution of the input data

using the states of the hidden units. This is

equivalent to maximizing the likelihood of the input

data distribution P(x) by the modification of synaptic

weights using the gradient of the log probability of

the input data. As a result we can obtain the RBM

training rules. In case of CD-k

))()0(()()1(

))()0(()()1(

))()()0()0((

)()1(

kyytTtT

kxxtTtT

kykxyx

tt

jjjj

iiii

jiji

ijij

−+=+

−+=+

−

+=+

α

α

α

ω

ω

(4)

Here α is the learning rate.

Training an RBM is based on presenting a

training sample to the visible units, then using the

CD-k procedure to compute the binary states of the

hidden units p(y|x), sampling the visible units

(reconstructed states) p(x|y), and so on. After

performing these iterations the weights and biases of

the restricted Boltzmann machine are updated. Then

we stack on another hidden layer to train a new

RBM. This approach is applied to all layers of the

deep neural network (greedy layer-wise training).

Finally, supervised fine-tuning of the whole neural

network is performed.

3 A NEW INSIGHT INTO

UNSUPERVISED LEARNING

OF RBM

In this section we will consider the restricted

Boltzmann machine from another point of view,

namely as auto-encoder or the PCA neural network.

We will use two training criteria in order to obtain

RBM learning rule. As a result we have proposed a

new unsupervised learning rule and the novel

techniques to infer the RBM training rules. It is

based on minimization of the reconstruction mean

square error and cross-entropy error function, which

we can obtain using simple iterations of Gibbs

sampling. In contrast to the traditional energy-based

method, which is based on a linear representation of

neural units, the proposed approach permits us to

take into account the nonlinear nature of neural

units.

Let's examine the restricted Boltzmann machine.

We will represent the RBM using three layers

(visible, hidden and visible) (Golovko et al., 2014)

as shown in Fig. 3. As can be seen such a

representation of RBM is equivalent to PCA neural

network, where the hidden and last visible layer is

respectively compression and reconstruction

(inverse) layer.

Let’s consider the Gibbs sampling using

unfolded representation of RBM.

Then Gibbs sampling will consist of the

following procedure. Let x(0) be the input data,

which arrives at the visible layer at time 0. Then the

output of the hidden layer is defined as follows:

)),0(()0(

jj

SFy =

(5)

+=

i

jiijj

TxS )0()0(

ω

(6)

Figure 3: Unfolded representation of RBM.

Theoretical Notes on Unsupervised Learning in Deep Neural Networks

93

The inverse layer reconstructs the data from the

hidden layer. As a result we can obtain x(1) at time

1:

)),1(()1(

ii

SFx =

(7)

+=

j

ijiji

TyS )0()1(

ω

(8)

After this, x(1) enters the visible layer and we

can obtain the output of the hidden layer the

following way:

)),1(()1(

jj

SFy =

(9)

+=

i

jiijj

TxS )1()1(

ω

(10)

Continuing the given process we can obtain on a

step k, that

.)1()(

)),(()(

+−=

=

j

iji

ii

TkykS

kSFkx

ij

ω

(11)

.)()(

)),(()(

+=

=

i

jij

jj

TkxkS

kSFky

ij

ω

(12)

There exist the different ways for RBM training.

It is based on the use of the different learning

criteria. As mentioned before G. Hinton proposed an

energy-based model, which is based on

maximization of the log-likelihood input data

distribution P(x). We suggest using the two loss

functions for RBM learning. The first training

criterion is based on minimization of mean square

error (MSE). The second one involves the

minimization of cross entropy error function. Both

training criteria have the attractive properties and

have been studied in many papers (Golik, 2013;

Glorot and Bengio, 2010). Our main goal here is to

show, that the use of different training criteria leads

to the same learning rules. In the next subsections

we will study these criteria in more detail.

3.1 MSE Training Criterion

Let’s consider the use of mean square error function

for RBM learning. Then the primary goal of training

RBM is to minimize the reconstruction mean

squared error (MSE) in the hidden and visible layers.

The MSE in the hidden layer is proportional to the

difference between the states of the hidden units at

the various time steps. Then in case of CD-k

2

111

))1()((

2

1

)( −−=

===

pypykE

l

j

L

l

m

j

k

p

l

jh

(13)

Similarly, the MSE in the inverse layer is

proportional to the difference between the states of

the inverse units at the various time steps:

== =

−−=

L

l

n

i

k

p

l

i

l

iv

pxpxkE

11 1

2

))1()((

2

1

)(

(14)

where L is the number of training patterns.

In case of CD-k the common reconstruction

mean squared error is defined as the sum of errors:

)()()( kEkEkE

vhs

+=

(15)

Тheorem 1. Maximization of the log-likelihood

input data distribution P(x) in the space of synaptic

weights of the restricted Boltzmann machine is

equivalent to special case of minimizing the

reconstruction mean squared error in the same space,

if we use linear transfer function for neurons.

This theorem states that if we use identity

activation function for RBM units, then the CD-k

training rule for RBM in order to minimizing

reconstruction mean squared error (15) will be

identical to the conventional RBM training rules

Thus the conventional RBM training rules are linear

in terms of MSE minimization. Therefore we shall

call such a machine linear RBM.

Corollary 1. The training rule for a nonlinear

restricted Boltzmann machine in the case of CD-k is

defined as

)))

,

(()1())1()((

)(()())1()(((

)()1(

1

pSFpypxpx

pSFpxpypy

tt

ijii

k

p

jijj

ijij

′

−−−+

′

−−

−=+

=

α

ωω

(16)

))),(())1()(((

)1(

1

pSFpypy

tT

jjj

k

p

j

′

−−−

=+Δ

=

α

(17)

)))(())1()(((

)1(

1

pSFpxpx

tT

iii

k

p

i

′

−−−

=+Δ

=

α

(18)

In this section we have obtained the novel

unsupervised learning rules for restricted Boltzmann

machines, using MSE training criterion. The

traditional energy-based method is based on

maximization of the log-likelihood input data

distribution and leads to the linear representation of

NCTA 2016 - 8th International Conference on Neural Computation Theory and Applications

94

neural units in terms of minimizing the MSE. The

proposed approach, which can be obtained using

simple iterations of Gibbs sampling is based on

minimization of reconstruction mean square error

and leads to nonlinear and linear representation of

neurons. We will call the proposed approach the

reconstruction error-based approach (REBA). For

the first time, the approach described above has been

proposed in (Golovko et al., 2014) for the CD-1 and

in (Golovko et al., 2015; Golovko, 2015) for CD-k.

3.2 Cross-Entropy Training Criterion

The cross-entropy measure (CE) can be used as an

alternative to mean squared error. Let’s consider a

sigmoid neural network and the cross entropy error

function instead of mean square error. The goal of

training RBM is to minimize the cross-entropy in the

hidden and visible layers. In the case of CD-k the

cross-entropy error function in the inverse layer is

defined as

===

−−−

+−

−

=

L

l

k

p

n

i

l

i

l

i

l

i

l

i

v

pxpx

pxpx

kCE

111

))(1log())1(1(

))(log()1(

)(

(19)

Similarly, the cross-entropy error function in the

hidden layer

===

−−−

+−

−

=

L

l

k

p

m

j

l

j

l

j

l

j

l

j

h

pypy

pypy

kCE

111

))(1log())1(1(

))(log()1(

)(

(20)

The common cross entropy error function in case

of CD-k is defined as the sum of errors:

)()()( kCEkCEkСE

vhs

+=

(21)

Тheorem 2. Maximization of the log-likelihood

input data distribution P(x) in the space of synaptic

weights restricted Boltzmann machine is equivalent

to minimizing the cross-entropy error function.

Proof. Let’s consider the cross entropy for CD-k.

In this case the cross entropy error function for a

single example is

==

==

−−−

+−

−

−−−

+−

−=

k

p

m

j

jj

jj

k

p

n

i

ii

ii

pypy

pypy

pxpx

pxpx

kCE

11

11

))(1log())1(1(

))(log()1(

))(1log())1(1(

))(log()1(

)(

(22)

Then

()

=

−−−−−=

∂

∂

k

p

jiji

ij

pypxpypx

w

kCE

1

)1()()1()1(

)(

()

()

=

=

=−−−

=−−−

k

p

jiij

k

p

ijij

pypxpxpy

pxpypxpy

1

1

)1()1()()(

)()()()1(

).0()0()()()1()1()()(

...)1()1()2()2()0()0()1()1(

jijijiij

jiijjiij

yxkykxkykxkxky

yxxyyxxy

−=−−−

++−+−

Accordingly, for the thresholds

)0()(

)(

ii

i

xkx

T

kCE

−=

∂

∂

(23)

)0()(

)(

jj

j

yky

T

kCE

−=

∂

∂

(24)

The theorem is proved. As follows from theorem

the RBM learning rules can be obtained in a simpler

way compared to the conventional energy-based

approach. Thus using minimization of the cross-

entropy error function and simple iterations of Gibbs

sampling we have received the conventional linear

RBM learning rules.

The obtained results can be summarized in the

following general theorem.

Theorem 3. Maximization of the log-likelihood

input data distribution P(x) in the space of synaptic

weights restricted Boltzmann machine is equivalent

to minimizing the cross-entropy and to special case

of minimizing the mean squared error:

)min()min())(max(ln

ss

ECExP ==

(25)

Theorem 3 represents a generalization of the

previous results in this paper. It follows from the

theorem that the use of various training criteria leads

to the same learning rules. Therefore the nature of

unsupervised learning of RBM is the same, even if

we use different objective function. The

maximization of the log-likelihood input data

distribution and minimization cross-entropy error

function leads to the linear representation of neural

units in terms of minimizing the MSE. It should be

noted, that applying of training criterion, which is

based on minimization of MSE, we can take into

account also nonlinear representation of neurons.

4 CONCLUSIONS

In this paper we have addressed the key aspects of

Theoretical Notes on Unsupervised Learning in Deep Neural Networks

95

unsupervised learning in deep neural networks. We

described both the traditional energy-based method,

which is based on a linear representation of neural

units, and the proposed approach, which is based on

nonlinear representation of neurons. We have proved

that maximization of the log-likelihood input data

distribution of restricted Boltzmann machine is

equivalent to minimizing the cross-entropy and to

special case of minimizing the mean squared error.

Thus using MSE training criterion we can get both

conventional and novel learning rules.

REFERENCES

Hinton, G., Osindero, S., Teh, Y., 2006. A fast learning

algorithm for deep belief nets. Neural Computation,

18, 1527-1554.

Hinton, G., 2002. Training products of experts by

minimizing contrastive divergence. Neural

Computation, 14, 1771-1800.

Hinton, G., Salakhutdinov, R., 2006. Reducing the

dimensionality of data with neural networks. Science,

313 (5786), 504-507.

Hinton, G. E., 2010. A practical guide to training restricted

Boltzmann machines. (Tech. Rep. 2010-000). Toronto:

Machine Learning Group, University of Toronto.

Krizhevsky, A., Sutskever, L., Hinton, G., 2012. ImageNet

classification with deep convolutional neural networs.

In Proc. Advances in Neural information Processing

Systems, 25, 1090-1098.

LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning

Nature, 521 (7553), 436-444.

Mikolov, T, Deoras, A., Povey, D., Burget, L., Cernocky,

J., 2011. Strategies for training large scale neural

network language models. In Automatic Speech

Recognition and Understanding, 195-201.

Hinton, G. at al., 2012. Deep neural network for acoustic

modeling in speech recognition. IEEE Signal

Processing Magazine, 29, 82-97.

Bengio, Y., 2009. Learning deep architectures for AI.

Foundations and Trends in Machine Learning, 2(1), 1-

127.

Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H.,

2007. Greedy layer-wise training of deep networks. In

B. Sch\"olkopf, J. C. Platt, T. Hoffman (Eds.),

Advances in neural information processing systems,

11, pp. 153-160. MA: MIT Press, Cambridge

Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A.,

Vincent, P., Bengio, S., 2010. Why does unsupervised

pre-training help deep learning? Journal of Machine

Learning Research, 11:625-660.

Larochelle H., Bengio Y., Louradour J., Lamblin P., 2009

Exploring strategies for training deep neural

networks//Journal of Machine Learning Research 1, 1-

40.

Bengio, Y., Courville, A., Vincent, P., 2013.

Representation learning a review and new

percpectives. IEEE Trans. Pattern Anal. Machine

Intell. 35, 1798-1828.

Glorot, X., Bordes, A., & Bengio, Y., 2011. Deep sparse

rectifier networks. In Proceedings of the 14th

International Conference on Artificial Intelligence and

Statistics. JMLR W&CP Volume (Vol. 15, pp. 315-

323).

Golovko, V., Kroshchanka A., Rubanau U., Jankowski S.,

2014. A Learning Technique for Deep Belief Neural

Networks. In book Neural Networks and Artificial

Intelligence, Springer, 2014. – Vol. 440.

Communication in Computer and Information

Science. – P. 136-146.

Golovko, V., Kroshchanka, A., Turchenko, V., Jankowski,

S., Treadwell, D., 2015. A New Technique for

Restricted Boltzmann Machine Learning. Proceedings

of the 8th IEEE International Conference IDAACS-

2015, Warsaw 24-26 September 2015. – Warsaw,

2015 –P.182-186.

Golovko, V., From multilayers perceptrons to deep belief

neural networks: training paradigms and application,

Lections on Neuroinformatics, Golovko, V.A., Ed.,

Moscow: NRNU MEPhI, 2015, pp. 47–84 [in

Russian].

Golik, P. Cross-Entropy vs. Squared Error Training: a

Theoretical and Experimental Comparison / P. Golik,

P. Doetsch, H. Ney // In Interspeech. - Lyon, France,

2013. – P. 1756-1760.

Glorot, X. and Bengio, Y.. 2010. Understanding the

difficulty of training deep feed-forward neural

networks. in Proc. of Int. Conf. on Artificial

Intelligence and Statistics, vol. 9, Chia Laguna Resort,

Italy, 2010, pp. 249–256.

NCTA 2016 - 8th International Conference on Neural Computation Theory and Applications

96