Generative vs. Discriminative Deep Belief Netwok for 3D Object

Categorization

Nabila Zrira

, Mohamed Hannat

, El Houssine Bouyakhf

and Haris Ahmad Khan

2,3

LIMIARF Laboratory, Mohammed V University Rabat, Faculty of Sciences, Rabat, Morocco

Le2i, UMR CNRS 6306, Arts et M

etiers, Univ. Bourgogne Franche-Comt

e, Dijon, France

NTNU, Norwegian University of Science and Technology, Gjøvik, Norway

{nabilazrira, mohamedhannat}@gmail.com, bouyakhf@mtds.com, Haris-Ahmad.Khan@u-bourgogne.fr

Keywords:

3D Object Categorization, Point Clouds, Viewpoint Feature Histogram (VFH), DDBN, GDBN, RBM, Joint

Density Model, Bback-propagation.

Abstract:

Object categorization has been an important task of computer vision research in recent years. In this paper, we

propose a new approach for representing and learning 3D object categories. First, We extract the Viewpoint

Feature Histogram (VFH) descriptor from point clouds and then we learn the resulting features using deep

learning architectures. We evaluate the performance of both generative and discriminative deep belief network

architectures (GDBN/DDBN) for object categorization task. GDBN trains a sequence of Restricted Boltzmann

Machines (RBMs) while DDBN uses a new deep architecture based on RBMs and the joint density model.

Our results show the power of discriminative model for object categorization and outperform state-of-the-art

approaches when tested on the Washington RGBD dataset.

1 INTRODUCTION

With the advent of new 3D sensors like Microsoft

Kinect, 3D perception became a fundamental vision

research in mobile robotic applications like scene

manipulation or grasping, scene understanding, and

3D point cloud classiﬁcation. The Point Cloud Li-

brary (PCL) was developed by Rusu et al. (Rusu

and Cousins, 2011) in 2010 and was ofﬁcially pub-

lished in 2011. This open source library, licensed

under Berkeley Software Distribution (BSD) terms,

represents a collection of state-of-the-art algorithms

and tools that operate with 3D point clouds. Sev-

eral studies have been made based on PCL detectors

and descriptors, allowing for 3D object recognition

applications. PCL integrates several 3D detectors as

well as 3D local and global descriptors. In 3D local

descriptors, each point is described by its local ge-

ometry. They are developed for speciﬁc applications

such as object recognition, and local surface catego-

rization. This local category includes Signature of

Histograms of OrienTation (SHOT) (Tombari et al.,

2010), Point Feature Histograms (PFH) (Rusu et al.,

), Fast Point Feature Histograms (FPFH) (Rusu et al.,

2009), SHOTCOLOR (Tombari et al., 2011), and so

on. On the other hand, the 3D global descriptors de-

scribe object geometry and they are not computed for

individual points, but for a whole cluster instead. The

global descriptors are high-dimensional representa-

tions of object geometry. They are usually calculated

for subsets of the point clouds that are likely to be ob-

jects. The global category encodes only the shape in-

formation and includes Viewpoint Feature Histogram

(VFH) (Rusu et al., 2010), Clustered Viewpoint Fea-

ture Histogram (CVFH) (Aldoma et al., 2011), CVFH

(OUR-CVFH) (Aldoma et al., 2012), and Ensemble

of Shape Functions (ESF) (Wohlkinger and Vincze,

2011).

The ability to identify or recognize 3D objects

is highly valuable for performing imperative tasks in

mobile robotics. Machine learning techniques are ap-

plied for this task which include Support Vector Ma-

chines (SVMs) (LeCun et al., 2004) (Zhang et al.,

2006) (Janoch et al., 2013), Nearest Neighbor (NN)

(McCann and Lowe, 2012), Hidden Markov Model

(HMM) (Torralba et al., 2003), and Artiﬁcial Neural

Network (ANN) (Basu et al., 2010).

The origin of ANN dates back to efforts for ﬁnd-

ing a mathematical representation for information

processing in human brains. The brain consists of a

large number of processing units (10

units accord-

ing to (Azevedo et al., 2009)) which operate in paral-

lel and are highly inter-connected. ANN are designed

in a similar manner using a large number of process-

Zrira N., Hannat M., Bouyakhf E. and Ahmad khan H.

Generative vs. Discriminative Deep Belief Netwok for 3D Object Categorization.

DOI: 10.5220/0006151100980107

In Proceedings of the 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2017), pages 98-107

ISBN: 978-989-758-226-4

ing units called perceptrons that operate in the parallel

process. Research on ANN discover some limitations

of the capability of perceptrons and invent multilayer

perceptron (MLP) neural networks (Rumelbart and

McClelland, 1986). Unfortunately, MLP also shows

the limitations for some complex nonlinear functions

that cannot be efﬁciently represented by this type of

networks. In (Serre et al., 2007), authors show the

evidence that the brain of a mammal is organized in

the form of a deep architecture. A speciﬁed input is

characterized at various levels of abstraction, where

every level relates to a diverse area of cortex. Re-

searchers used the deep architecture concept in neu-

ral networks for training new deep multi-layer neural

networks which are stimulated by the biological depth

of brain. Such deep models involve numerous layers

and parameters that require learning through the com-

plex process. To deal with this problem, the authors

in (Hinton et al., 2006a) suggest a deep belief network

(DBN) with multiple layers of hidden units.

DBN is a graphical model consisting of undirected

networks at the top hidden layers and directed net-

works in the lower layers. The learning algorithm

uses greedy layer-wise training by stacking restricted

Boltzmann machines (RBM) which contain hidden

layer for modeling the probability distribution of per-

ceptible variables. The idea of having multiple hid-

den layers is that the preceding hidden layer acts as

the visible layer for the next hidden layer and thus the

model can incrementally learn more complex features

of data.

In general, deep learning architectures can be

broadly classiﬁed into three main categories (Deng,

2014):

1. Generative deep architectures: the aim is to char-

acterize the high-order correlation properties of

the visible data for pattern analysis or synthesis

purposes, and/or characterize the joint statistical

distributions of the visible data and their associ-

ated classes;

2. Discriminative deep architectures: the aim is to

directly provide discriminative power for pattern

classiﬁcation, often by characterizing the poste-

rior distributions of classes conditioned on the vis-

ible data;

3. Hybrid deep architectures: the aim is to com-

bine the power of discrimination with the outputs

of generative architectures via better optimization

or/and regularization.

Our work focuses on 3D object categorization for

mobile robotic grasping. First, we extract Viewpoint

Feature Histogram (VFH) descriptors that encode ge-

ometric features of 3D point clouds followed by learn-

ing the resulting features using effective deep archi-

tectures. We evaluate both generative and discrimina-

tive deep belief network (GDBN/DDBN) using dif-

ferent RBM training techniques which include Con-

trastive Divergence (CD), Persistent Contrastive Di-

vergence (PCD), and Free Energy in Persistent Con-

trastive Divergence (FEPCD).

The main contributions of our paper are:

• We propose a new 3D object categorization

pipeline based on VFH descriptor and deep learn-

ing architectures;

• We compare the extracted features with GDBN

and DDBN architectures in order to show the

difference between generative and discriminative

models for object categorization.

The rest of the paper is organized as follows. In Sec-

tion 2 we review previous works. In Section 3 we give

a brief description of VFH descriptor. In Section 4

we present an overview of our proposed approach. In

Section 5 two different deep learning architectures are

illustrated. And in Section 6, the experimental results

carried out to demonstrate the functionality and us-

ability of this work are presented. Finally, in Section

7 the main conclusions and future works are outlined.

2 PREVIOUS WORK

Most of the recent work on 3D object categoriza-

tion have focused on appearance, shapes, and Bag of

Words (BoW) extracted from certain viewing point

changes of the 3D objects. In (Toldo et al., 2009),

authors introduce Bag of Words (BoW) approach for

3D object categorization. They use spectral cluster-

ing to select seed-regions then compute the geometric

features of the object sub-parts. Vector quantization is

applied to these features in order to obtain BoW his-

tograms for each mesh. Finally, Support Vector Ma-

chine is used to classify different BoW histograms for

3D objects. In (Nair and Hinton, 2009), a top-level

model of Deep Belief Networks (DBNs) is presented

for 3D object recognition. This model is a third-order

Boltzmann machine that is trained using a combina-

tion of both generative and discriminative gradients.

The model performance is evaluated on NORB im-

ages where the dimensionality for each stereo-pair

image is reduced by using a foveal image. The ﬁnal

representation consists of 8976-dimensional vectors

that are learned with a top-level model for Deep Belief

Nets (DBN). In (Bo et al., 2011), a set of kernel fea-

tures is introduced for object recognition. The authors

develop kernel descriptors on depth maps that model

Generative vs. Discriminative Deep Belief Netwok for 3D Object Categorization

size, depth edges, and 3D shape. The main match ker-

nel framework deﬁnes pixel attributes, designs match

kernels in order to measure the similarities of image

patches and then determines low dimensional match

kernels. In (Lai et al., 2011a), the authors build a

new RGBD dataset and propose methods to recognize

and detect RGBD objects. They use SIFT descriptor

to extract visual features and spin image descriptor

to extract shape features that are used for computing

efﬁcient match kernel (EKM). Finally, linear support

vector (LiSVM), gaussian kernel support vector ma-

chine (kSVM) and random forest (RF) are trained to

recognize both the category and the instance of ob-

jects. In (Madry et al., 2012), the authors propose

the Global Structure Histogram (GSH) in order to

describe the point cloud information. The approach

encodes the structure of local feature response on a

coarse global scale to retain low local variations and

keep the advantage of global representativeness. GSH

can be instantiated in partial object views and learned

using complete or incomplete information about an

object. In (Socher et al., 2012), the authors intro-

duce the ﬁrst convolutional-recursive deep learning

model for 3D object recognition. They compute a sin-

gle CNN layer to extract low-level features from both

color and depth images. Then, these representations

are provided as input to a set of RNNs with random

weights that produce high-quality features. Finally,

The concatenation of all the resulting vectors forms

the ﬁnal feature vector for a softmax classiﬁer. The

authors in (Schwarz et al., 2015) develop a meaning-

ful feature set that results from the pre-trained stage of

Convolutional Neural Network (CNN). The depth and

RGB images are processed independently by CNN

and the resulting features are then concatenated to de-

termine the category, instance, and pose of the ob-

ject. In (Alexandre, 2016), author proposes a new

approach for RGBD object classiﬁcation. Four inde-

pendent Convolutional Neural Networks (CNNs) are

trained, one for each depth data and three for RGB

data and then trains these CNNs in a sequence. The

decisions of each network are combined to obtain the

ﬁnal classiﬁcation result. The authors of (Ouadiay

et al., 2016) propose a new approach for real 3D ob-

ject recognition and categorization using Deep Belief

Networks. First, they extract 3D keypoints from point

clouds using 3D SIFT detector and then they com-

pute SHOT/SHOTCOLOR descriptors. The perfor-

mance of the approach is evaluated on two datasets:

Washington RGBD object dataset and real 3D object

dataset.

3 METHOD OVERVIEW

In this work, we use the VFH descriptor to describe

the set of 3D point clouds and then we extract the

geometric features which are considered as the input

layer x of GDBN and DDBN architectures. The input

layer has a number N of units which is equivalent to

the quantity of sample data x (308). Finally, we ﬁx

three hidden layers in both GDBN and DDBN archi-

tectures. Figure 1 summarizes the main steps of our

approach.

Figure 1: Our proposed approach for 3D object categoriza-

tion.

The general approach is achieved as follows:

1. Extract geometric features for all training set us-

ing VFH descriptor;

2. Input layer x takes the extracted geometric fea-

tures;

3. Train RBMs using CD, PCD or FEPCD training

(see Section 5.3):

• training the ﬁrst RBM;

• training the second RBM using the training data

resulting from the ﬁrst RBM learning;

• training the third RBM: for GDBN architecture,

RBM is generative. Whereas for DDBN archi-

tecture, we train a joint density model through

discriminative RBM and then each conceivable

label is tested with a test vector. The label

which contains the least energy is selected as

the best corresponding class.

4. Use the back-propagation technique through the

whole classiﬁer to ﬁne-tune the weights for an op-

timal classiﬁcation.

VISAPP 2017 - International Conference on Computer Vision Theory and Applications

100

4 POINT CLOUD PROCESSING

4.1 Point Clouds

The point cloud is a data structure which represents

three-dimensional data. In 3D cloud, the points are

usually described by their x, y and z geometric coor-

dinates of a sampled surface. When the point cloud

contains the color information, the structure becomes

four-dimensional data. Point clouds can be obtained

using stereo cameras, 3D laser scanners, Time-of-

ﬂight cameras or Kinect.

In 3D space, points are deﬁned in a clockwise ref-

erence frame that is centered at the intersection of the

optical axis with the plane which contains the front

wall of the camera. The reference frame is decom-

posed as follows:

• x-axis: is horizontal and directed to the left;

• y-axis: is vertical and faces up;

• z-axis: coincides with the optical axis of the cam-

era. It is turned towards the object.

4.2 Washington RGBD Dataset

Washington RGBD dataset is a large dataset built

for 3D object recognition and categorization applica-

tions. It is a collection of 300 common household

objects which are organized into 51 categories. Each

object is placed on a turntable and is captured for

one whole rotation in order to obtain all object views

using Kinect camera that records synchronized and

aligned 640x480 RGB and depth images at 30 Hz (Lai

et al., 2011b).

4.3 Viewpoint Feature Histogram

(VFH)

The global descriptors are high-dimensional represen-

tations of object geometry. They are more efﬁcient

in object recognition, geometric categorization, and

shape retrieval. Global descriptors describe object ge-

ometry. They are not computed for individual points,

but for a whole cluster.

The viewpoint feature histogram (VFH) (Rusu

et al., 2010) computes a global descriptor of the point

cloud and consists of two components:

1. a surface shape component;

2. a viewpoint direction component.

VFH aims to combine the viewpoint direction di-

rectly into the relative normal angle calculation in the

FPFH descriptor (Rusu et al., 2009). The viewpoint-

dependent component of the descriptor is a histogram

Figure 2: A sample of selected point clouds from Washing-

ton RGBD dataset.

of the angles between the vector (p

− p

) and each

point’s normal. This component is binned into a 128-

bin histogram. The other component is a simpliﬁed

point feature histogram (SPFH) estimated for the cen-

troid of the point cloud and an additional histogram of

distances of all points in the cloud to the cloud’s cen-

troid.

The three angles (α, φ, θ) with the distance d be-

tween each point and the centroid are binned into a

45-bin histogram. The total length of VFH descrip-

tor is the combination of these two histograms and is

equal to 308 bins.

5 DEEP LEARNING

ARCHITECTURES

In this section, we brieﬂy introduce both Gen-

erative and Discriminative Deep Belief Network

(GDBN/DDBN) architectures. We also illustrate the

difference between the Generative and Discrimina-

tive restricted Boltzmann machine (GRBM/DRBM)

which constitute many layers in DBN architecture.

Generative vs. Discriminative Deep Belief Netwok for 3D Object Categorization

101

Figure 3: (a) 3D point cloud of food box. (b) VFH descriptor of food box point cloud: x-axis represents a number of histogram

bin and y-axis represents a percentage of points falling in each bin.

5.1 Generative Deep Belief Network

(GDBN)

5.1.1 Restricted Boltzmann Machines (RBMs)

Restricted Boltzmann Machines (RBMs) (Smolensky,

1986) are a speciﬁc category of energy based model

which include hidden variables. RBMs are restricted

in the sense so that no hidden-hidden or variable-

variable connections exist. The architecture of a gen-

erative RBM is illustrated in Figure 4.

RBMs are a parameterized generative stochastic

neural network which contain stochastic binary units

on two layers: the visible layer and the hidden layer.

1. Visible units (the ﬁrst layer): they contain visible

units (x) that correspond to the components of an

observation (i.e. VFH descriptors in this case of

study);

2. Hidden units (the second layer): they contain hid-

den units (h) that model dependencies between

the components of observations.

The stochastic nature of RBMs results from the

fact that the visible and hidden units are stochastic.

The units are binary, i.e. x

∈ {0, 1}∀ i and j, and

the joint probability which characterize the RBM con-

ﬁguration is the Boltzmann distribution:

p(x,h) =

−E(x,h)

(1)

The normalization constant is Z =

∑

x,h

−E(x,h)

and the energy function of an RBM is deﬁned as:

E(x,h) = −b

x − c

h − h

W x (2)

where:

• W represents the symmetric interaction term be-

tween visible units (x) and hidden units (h);

• b and c are vectors that store the visible (input)

and hidden biases (respectively).

RBMs are proposed as building blocks of multi-

layer learning deep architectures called deep belief

Figure 4: Generative RBM model. The visible units x and

hidden units h are connected through undirected and sym-

metric connections. There are no intra-layer connections.

networks. The idea behind is that the hidden neu-

rons extract pertinent features from the visible neu-

rons. These features can work as the input to another

RBM. By stacking RBMs in this way, the model can

learn features for a high-level representation.

5.1.2 GDBN Architecture

Deep Belief Network (DBN) is the probabilistic gen-

erative model with many layers of stochastic and hid-

den variables. In (Hinton et al., 2006b), the authors

introduce the motivation for using a deep network ver-

sus a single hidden layer (i.e. a DBN vs. an RBM).

The power of deep networks is achieved by having

more hidden layers. However, one of the major prob-

lems for training deep network is how to initialize the

weights W between the units of two consecutive lay-

ers ( j − 1 and j), and the bias b of layer j. Random

initializations of these parameters can cause poor lo-

cal minima of the error function resulting in low gen-

eralization. For this reason, Hinton et al. introduced

a DBN architecture based on training sequence of

RBMs. DBN train sequentially as many RBMs as the

number of hidden layers that constitute its architec-

ture, i.e for a DBN architecture with l hidden layers,

the model has to train l RBMs. For the ﬁrst RBM,

the inputs consist of the DBN’s input layer (visible

units) and the ﬁrst hidden layer. For the second RBM,

the inputs consist of the hidden unit activations of the

previous RBM and the second hidden layer. The same

holds for the remaining RBMs to browse through the l

layers. After the model performs this layer-wise algo-

VISAPP 2017 - International Conference on Computer Vision Theory and Applications

102

rithm, a good initialization of the biases and the hid-

den weights of the DBN is obtained. At this stage,

the model should determine the weights from the last

hidden layer for the outputs. To obtain a successfully

supervised learning, the model ”ﬁne-tunes” the result-

ing weights of all layers together. Figure 5 illustrates

a generative DBN architecture with one visible layer

and three hidden layers.

Figure 5: Generative DBN architecture (GDBN).

5.2 Discriminative Deep Belief Network

(DDBN)

5.2.1 Discriminative Restricted Boltzmann

Machines (DRBMs)

RBMs are used as generative models for various ap-

plications. They use a layer of hidden variables for

modeling the scattering over visible variables. Those

models are typically trained only for modeling the in-

puts of a classiﬁcation task. They are also capable of

modeling the joint distribution of the inputs and their

associated target classes similar to the last layer of a

DDBN (see 7). We are interested in such joint models

for a classiﬁcation application.

In this work (Larochelle and Bengio, 2008), the

authors propose a Discriminative Restricted Boltz-

mann Machines which aim to train a density model

by means of a particular RBM consisting of two sets

of visible elements. RBM acts as a parametric model

of joint distribution among a layer of hidden variables

h = (h

,..., h

) and the visible variables of the inputs

x = (x

,..., x

) and the target y, that is deﬁned as:

p(y, x,h) ∝ e

−E(y,x,h)

(3)

where

E(y,x, h) = −h

W x − b

x − c

h − d

~y − h

U~y (4)

with Θ = (W, b,c, d,U) is the set of parameters and

~y = (1

y=i

)

i=1

for C classes.

5.2.2 DDBN Architecture

DDBN architectures have been proposed for different

applications (Zhou et al., 2010) (Liu et al., 2011). In

Figure 6: Discriminative RBM model (DRBM). RBM mod-

eling the joint distribution of inputs x and target class y (rep-

resented as one-hot vector by~y ) from (Larochelle and Ben-

gio, 2008). Hidden units are denoted by h.

our work, we use a learning algorithm; discriminative

deep belief network (DDBN) based on discriminative

restricted Boltzmann machine (DRBM) as deﬁned in

(Keyvanrad and Homayounpour, 2014).

DBN aims at letting every RBM model in the

structure to obtain a diverse representation of data. In

other words, after RBM is trained, the activity values

from the hidden units act as the training data for a

higher-level RBM learning.

Figure 7: Discriminative DBN architecture (DDBN). The

last RBM models the joint distribution of inputs x and target

class y. Hidden units are denoted by h.

In DDBN, we need to use a DRBM in the last

layer as a classiﬁer for obtaining labels from the in-

put data. The input layer has a N number of units

which is equivalent to the quantity of sample data x.

The label layer has C representing y as the number of

classes. DDBN trains a joint density model through

discriminative RBM and then each conceivable label

is tested with a test vector. The label which contains

the least energy is selected as the best corresponding

class. Afterward, we use the back-propagation tech-

nique through the entire classiﬁer for ﬁne-tuning the

weights for optimal classiﬁcation.

5.3 Training in GRBM/DRBM

5.3.1 Contrastive Divergence (CD)

CD is the most popular gradient approximation algo-

rithm. CD initializes the Markov chain with a training

data then the binary hidden units are computed. Once

Generative vs. Discriminative Deep Belief Netwok for 3D Object Categorization

103

the method deﬁnes binary hidden unit states then the

visible values are recalculated. Lastly, the probabil-

ity of hidden unit instigations is calculated by means

of hidden and visible unit’s values (Carreira-Perpinan

and Hinton, 2005) .

5.3.2 Persistent Contrastive Divergence (PCD)

As the CD sampling has a few drawbacks and is not

precise, PCD method is proposed so that only the last

chain state is used in the preceding update step (Tiele-

man, 2008).

5.3.3 Free Energy in Persistent Contrastive

Divergence (FEPCD)

Numerous insistent chains can be utilized in paral-

lel during PCD sampling and can mention the present

state as fantasy points in each of these chains. How-

ever, there is a blind chain selection and it’s not nec-

essary that the best one is always selected. Recently,

(Keyvanrad and Homayounpour, 2014) proposed a

new sampling method that deﬁnes a standard for the

goodness of chain. This method employs free energy

as a measure to acquire best samples from the gen-

erative model that are able to precisely calculate the

gradient of log probability from training data.

6 EXPERIMENTAL RESULTS

In this section, we tested our categorization approach

on Washington RGBD dataset. The training and test-

ing sets contain 14800 point clouds that are computed

using a Xeon(R) 3.50 GHz CPU 32 GB RAM and

K2000 Nvidia card on Ubuntu 14.04. Figure 2 shows

the examples of 3D point clouds from our training set.

Table 1: GDBN/DDBN characteristics that are used in our

experiment.

Characteristics Value

Hidden layers 3

Hidden layer units 300-300-1500

Learn rates 0.3

Epochs 200

Input layer units size of VFH descriptor (308)

We use a GDBN/DDBN with one visible layer

that contains the VFH descriptors of all training sets,

as well as three hidden layers in order to deﬁne a

308-300-300-1500 GDBN/DDBN structures. Then,

we train the weights of each layer separately with the

ﬁxed number of epochs equal to 200 (see Table 1).

6.1 Evaluation of Generative Model:

GDBN

GDBN aims at allowing each RBM model in the se-

quence to receive a different representation of the

data. In other words, after RBM is trained, the activ-

ity values of its hidden units are used as the training

data for learning a higher-level RBM. As a compar-

ison, we evaluate the training process of GDBN us-

ing CD, PCD or FEPCD training methods. Table 2

shows that the classiﬁcation error decreases with the

FEPCD training method. We can also remark that

FEPCD training method presents the best accuracy

value. The best training performance indicates the it-

erations at which the validation performance reaches

a minimum mean squared normalized error which is

deﬁned as follows:

MSE =

∑

i=1

(

−Y

)

(5)

With:

is a vector of n predictions, and Y

is the vec-

tor of observed values. As shown in Figure 8, the best

performance is obtained with FEPCD training.

Table 2: Classiﬁcation error and accuracy on Washington

RGBD dataset for 308-300-300-1500 GDBN structure us-

ing different training methods: CD, PCD and FEPCD.

Error Acc.

GDBN-CD 0.6549 34.51%

GDBN-PCD 0.0250 97.5%

GDBN-FEPCD 0.0206 97.9%

6.2 Evaluation of Discriminative Model:

DDBN

The approach trains RBMs one after another and uses

their training data resulting for training stage in the

next RBM using CD, PCD or FEPCD training meth-

ods. The last layer trains a joint density model with

a discriminative RBM. We use the back-propagation

technique through the whole classiﬁer to ﬁne-tune

the weights in order to optimize the classiﬁcation re-

sult. Table 3 illustrates the classiﬁcation error be-

fore and after using back-propagation technique. We

notice that the error decreases after using the back-

propagation technique especially with FEPCD train-

ing. Figure 9 shows the best training performance that

indicates the iterations at which the validation perfor-

mance reached a minimum mean squared normalized

error performance criterion. The best performance is

obtained with FEPCD training.

VISAPP 2017 - International Conference on Computer Vision Theory and Applications

104

0 50 100 150 200

−1

Best Training Performance is 0.10789 at epoch 200

Mean Squared Error (mse)

200 Epochs

Train

Best

(CD)

0 50 100 150 200

−2

−1

Best Training Performance is 0.013614 at epoch 200

Mean Squared Error (mse)

200 Epochs

Train

Best

(PCD)

0 50 100 150 200

−2

−1

Best Training Performance is 0.013379 at epoch 200

Mean Squared Error (mse)

200 Epochs

Train

Best

(FEPCD)

Figure 8: Best training performance on Washington RGBD dataset of 308-300-300-1500 GDBN structure. The minimum

mean squared normalized error 0.013379 is reached at epoch 200 with FEPCD.

0 50 100 150 200

−2

−1

Best Training Performance is 0.065007 at epoch 200

Mean Squared Error (mse)

200 Epochs

Train

Best

(CD)

0 50 100 150 200

−3

−2

−1

Best Training Performance is 0.0095101 at epoch 200

Mean Squared Error (mse)

200 Epochs

Train

Best

(PCD)

0 50 100 150 200

−3

−2

−1

Best Training Performance is 0.0052535 at epoch 200

Mean Squared Error (mse)

200 Epochs

Train

Best

(FEPCD)

Figure 9: Best training performance on Washington RGBD dataset of 308-300-300-1500 DDBN structure. The minimum

mean squared normalized error 0.0052535 is reached at epoch 200 with FEPCD.

Table 3: Classiﬁcation errors and accuracy on Washington

RGBD dataset for 308-300-300-1500 DDBN structure us-

ing different training methods: CD, PCD and FEPCD. After

training each RBM, DDBN is ﬁne-tuned in 200 epochs us-

ing the back-propagation method.

Before After Acc.

DDBN-CD 0.3810 0.3155 68.45%

DDBN-PCD 0.4491 0.0201 97.98%

DDBN-FEPCD 0.4053 0.0111 98.89%

6.3 Comparison to Other Methods

In this subsection, we compare our approach to re-

lated state-of-the-art approaches. Table 4 shows the

main accuracy values and compares our 3D recogni-

tion pipeline to the published results (Lai et al., 2011a;

Bo et al., 2011; Schwarz et al., 2015). Lai et al. (Lai

et al., 2011a) extract a set of features that captures the

shape of the object view using spin images, and an-

other set which captures the visual appearance using

SIFT descriptors. These features are extracted sepa-

rately from both depth and RGB images. In contrast,

we extract the geometric features from a single point

cloud using only the VFH descriptor. A recent work

by Schwarz et al. (Schwarz et al., 2015) uses both col-

orizing depth and RGB images that are processed in-

dependently by a convolutional neural network. CNN

features are then learned using SVM classiﬁer in or-

der to successively determine the category, instance,

and pose. In our approach, we use VFH features for

training GDBN/DDBN with three hidden layers that

model a deep network architecture. The results show

also that our recognition pipeline with DDBN arch-

tecture and FEPCD training works perfectly with the

accuracy rate of 98.89% and outperforms all methods

that are mentioned in the state-of-the-art.

Table 4: The accuracy rates on Washington RGBD dataset

for 308-300-300-1500 DDBN structure using CD,PCD and

FEPCD training methods that are compared with the state-

of-the-art methods.

Methods Accuracy rates

CNN (Schwarz et al., 2015) 89.4%

RGBD dataset (Lai et al., 2011a) 64.7%

Depth Kernel (Bo et al., 2011) 78.8%

DDBN-CD 68.45%

DDBN-PCD 97.98%

DDBN-FEPCD 98.89%

Generative vs. Discriminative Deep Belief Netwok for 3D Object Categorization

105

In general, the use of PCD training is better than

CD training, and FEPCD outperforms PCD training.

This result is pertinent, since FEPCD uses free en-

ergy as a criterion for the goodness of a chain in or-

der to obtain elite samples from the generative model

that can more accurately compute the gradient of

the log probability of training data. FEPCD outper-

forms PCD and CD in terms of accuracy, although

its computational complexity is high and takes rela-

tively longer time in training as compared to the two

methods. Our next goal will be to optimize the per-

formance of FEPCD in order to reduce the computa-

tional complexity.

7 CONCLUSIONS

In this work, we focused on 3D object categorization

using geometric features extracted from Viewpoint

Feature Histogram (VFH) descriptor and learned with

both Generative and Discriminative Deep Belief Net-

work (GDBN/DDBN) architectures. GDBN is the

probabilistic model with many Restricted Boltzmann

Machines (RBMs) which are trained sequentially. On

the other hand, DDBN is constructed from the Dis-

criminative Restricted Boltzmann Machine (DRBM)

which is based on RBM and the joint distribution

model. The experimental results using DDBN are en-

couraging, especially that our approach is able to rec-

ognize 3D objects under different views. In a future

work, we will attempt to embed our algorithm in our

robot TurtleBot2 in order to grasp the real-world ob-

jects. Moreover, we will utilize a hybrid deep archi-

tecture that combines the advantage of generative and

discriminative models.

REFERENCES

Aldoma, A., Tombari, F., Rusu, R., and Vincze, M. (2012).

OUR-CVFH–Oriented, Unique and Repeatable Clus-

tered Viewpoint Feature Histogram for Object Recog-

nition and 6DOF Pose Estimation. Springer.

Aldoma, A., Vincze, M., Blodow, N., Gossow, D., Gedikli,

S., Rusu, R., and Bradski, G. (2011). Cad-model

recognition and 6dof pose estimation using 3d cues. In

Computer Vision Workshops (ICCV Workshops), 2011

IEEE International Conference on, pages 585–592.

IEEE.

Alexandre, L. A. (2016). 3d object recognition using con-

volutional neural networks with transfer learning be-

tween input channels. In Intelligent Autonomous Sys-

tems 13, pages 889–898. Springer.

Azevedo, F. A., Carvalho, L. R., Grinberg, L. T., Farfel,

J. M., Ferretti, R. E., Leite, R. E., Lent, R., Herculano-

Houzel, S., et al. (2009). Equal numbers of neuronal

and nonneuronal cells make the human brain an iso-

metrically scaled-up primate brain. Journal of Com-

parative Neurology, 513(5):532–541.

Basu, J. K., Bhattacharyya, D., and Kim, T.-h. (2010). Use

of artiﬁcial neural network in pattern recognition. In-

ternational Journal of Software Engineering and Its

Applications, 4(2).

Bo, L., Ren, X., and Fox, D. (2011). Depth kernel descrip-

tors for object recognition. In Intelligent Robots and

Systems (IROS), 2011 IEEE/RSJ International Con-

ference on, pages 821–826. IEEE.

Carreira-Perpinan, M. A. and Hinton, G. E. (2005). On

contrastive divergence learning. In Proceedings of the

tenth international workshop on artiﬁcial intelligence

and statistics, pages 33–40. Citeseer.

Deng, L. (2014). A tutorial survey of architectures, algo-

rithms, and applications for deep learning. APSIPA

Transactions on Signal and Information Processing,

3:e2.

Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006a). A

fast learning algorithm for deep belief nets. Neural

computation, 18(7):1527–1554.

Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006b). A

fast learning algorithm for deep belief nets. Neural

computation, 18(7):1527–1554.

Janoch, A., Karayev, S., Jia, Y., Barron, J. T., Fritz, M.,

Saenko, K., and Darrell, T. (2013). A category-level

3d object dataset: Putting the kinect to work. In Con-

sumer Depth Cameras for Computer Vision, pages

141–165. Springer.

Keyvanrad, M. A. and Homayounpour, M. M. (2014).

Deep belief network training improvement using elite

samples minimizing free energy. arXiv preprint

arXiv:1411.4046.

Lai, K., Bo, L., Ren, X., and Fox, D. (2011a). A large-

scale hierarchical multi-view rgb-d object dataset. In

Robotics and Automation (ICRA), 2011 IEEE Interna-

tional Conference on, pages 1817–1824. IEEE.

Lai, K., Bo, L., Ren, X., and Fox, D. (2011b). A large-

scale hierarchical multi-view rgb-d object dataset. In

Robotics and Automation (ICRA), 2011 IEEE Interna-

tional Conference on, pages 1817–1824. IEEE.

Larochelle, H. and Bengio, Y. (2008). Classiﬁcation us-

ing discriminative restricted boltzmann machines. In

Proceedings of the 25th international conference on

Machine learning, pages 536–543. ACM.

LeCun, Y., Huang, F. J., and Bottou, L. (2004). Learning

methods for generic object recognition with invari-

ance to pose and lighting. In Computer Vision and

Pattern Recognition, 2004. CVPR 2004. Proceedings

of the 2004 IEEE Computer Society Conference on,

volume 2, pages II–97. IEEE.

Liu, Y., Zhou, S., and Chen, Q. (2011). Discriminative deep

belief networks for visual data classiﬁcation. Pattern

Recognition, 44(10):2287–2296.

Madry, M., Ek, C. H., Detry, R., Hang, K., and Kragic, D.

(2012). Improving generalization for 3d object cate-

gorization with global structure histograms. In Intelli-

gent Robots and Systems (IROS), 2012 IEEE/RSJ In-

ternational Conference on, pages 1379–1386. IEEE.

VISAPP 2017 - International Conference on Computer Vision Theory and Applications

106

McCann, S. and Lowe, D. G. (2012). Local naive bayes

nearest neighbor for image classiﬁcation. In Computer

Vision and Pattern Recognition (CVPR), 2012 IEEE

Conference on, pages 3650–3656. IEEE.

Nair, V. and Hinton, G. E. (2009). 3d object recognition

with deep belief nets. In Advances in Neural Informa-

tion Processing Systems, pages 1339–1347.

Ouadiay, F. Z., Zrira, N., Bouyakhf, E. H., and Himmi,

M. M. (2016). 3d object categorization and recogni-

tion based on deep belief networks and point clouds.

In Proceedings of the 13th International Conference

on Informatics in Control, Automation and Robotics,

pages 311–318.

Rumelbart, D. and McClelland, J. (1986). Parallel dis-

tributed processing: Explorations in the microstuc-

tures of cognition.

Rusu, R., Blodow, N., Marton, Z., and Beetz, M. Aligning

point cloud views using persistent feature histograms.

In Intelligent Robots and Systems, 2008. IROS 2008.

IEEE/RSJ International Conference on, pages 3384–

3391.

Rusu, R. and Cousins, S. (2011). 3D is here: Point Cloud

Library (PCL). In IEEE International Conference on

Robotics and Automation (ICRA), Shanghai, China.

Rusu, R. B., Blodow, N., and Beetz, M. (2009). Fast

point feature histograms (fpfh) for 3d registration. In

Robotics and Automation, 2009. ICRA’09. IEEE In-

ternational Conference on, pages 3212–3217. IEEE.

Rusu, R. B., Bradski, G., Thibaux, R., and Hsu, J. (2010).

Fast 3d recognition and pose using the viewpoint fea-

ture histogram. In Intelligent Robots and Systems

(IROS), 2010 IEEE/RSJ International Conference on,

pages 2155–2162. IEEE.

Schwarz, M., Schulz, H., and Behnke, S. (2015). Rgb-

d object recognition and pose estimation based on

pre-trained convolutional neural network features. In

Robotics and Automation (ICRA), 2015 IEEE Interna-

tional Conference on, pages 1329–1335. IEEE.

Serre, T., Kreiman, G., Kouh, M., Cadieu, C., Knoblich, U.,

and Poggio, T. (2007). A quantitative theory of imme-

diate visual recognition. Progress in brain research,

165:33–56.

Smolensky, P. (1986). Information processing in dynamical

systems: Foundations of harmony theory.

Socher, R., Huval, B., Bath, B., Manning, C. D., and Ng,

A. Y. (2012). Convolutional-recursive deep learning

for 3d object classiﬁcation. In Advances in Neural

Information Processing Systems, pages 665–673.

Tieleman, T. (2008). Training restricted boltzmann ma-

chines using approximations to the likelihood gradi-

ent. In Proceedings of the 25th international confer-

ence on Machine learning, pages 1064–1071. ACM.

Toldo, R., Castellani, U., and Fusiello, A. (2009). A bag of

words approach for 3d object categorization. In Com-

puter Vision/Computer Graphics CollaborationTech-

niques, pages 116–127. Springer.

Tombari, F., Salti, S., and D. Stefano, L. (2010). Unique

signatures of histograms for local surface descrip-

tion. In Computer Vision–ECCV 2010, pages 356–

369. Springer.

Tombari, F., Salti, S., and Stefano, L. (2011). A com-

bined texture-shape descriptor for enhanced 3d fea-

ture matching. In Image Processing (ICIP), 2011 18th

IEEE International Conference on, pages 809–812.

IEEE.

Torralba, A., Murphy, K. P., Freeman, W. T., and Ru-

bin, M. A. (2003). Context-based vision system for

place and object recognition. In Computer Vision,

2003. Proceedings. Ninth IEEE International Confer-

ence on, pages 273–280. IEEE.

Wohlkinger, W. and Vincze, M. (2011). Ensemble of shape

functions for 3d object classiﬁcation. In Robotics and

Biomimetics (ROBIO), 2011 IEEE International Con-

ference on, pages 2987–2992. IEEE.

Zhang, H., Berg, A. C., Maire, M., and Malik, J. (2006).

Svm-knn: Discriminative nearest neighbor classiﬁca-

tion for visual category recognition. In 2006 IEEE

Computer Society Conference on Computer Vision

and Pattern Recognition (CVPR’06), volume 2, pages

2126–2136. IEEE.

Zhou, S., Chen, Q., and Wang, X. (2010). Discriminative

deep belief networks for image classiﬁcation. In 2010

IEEE International Conference on Image Processing,

pages 1561–1564. IEEE.

Generative vs. Discriminative Deep Belief Netwok for 3D Object Categorization

107