Deep Spatial Pyramid Match Kernel for Scene Classification
Shikha Gupta¹, Deepak Kumar Pradhan², Dileep Aroor Dinesh¹ and Veena Thenkanidiyoor²
¹School of Computing and EE, Indian Institute of Technology, Mandi, H.P., India
²Department of CSE, National Institute of Technology Goa, Ponda, Goa, India
Keywords:
Scene Classification, Dynamic Kernel, Set of Varying Length Feature Map, Support Vector Machine,
Convolutional Neural Network, Deep Spatial Pyramid Match Kernel.
Abstract:
Several works have shown that Convolutional Neural Networks (CNNs) can be easily adapted to different datasets and tasks. However, extracting deep features from these pre-trained CNNs requires a fixed-size (e.g., 227 × 227) input image, whereas state-of-the-art datasets like MIT-67 and SUN-397 contain images of varying sizes. Using CNNs on these datasets therefore forces the user to bring differently sized images to a fixed size by shrinking or enlarging them, which raises an obvious question: is the conversion to a fixed-size image lossy? In this work, we provide a mechanism that avoids such lossy fixed-size images altogether and processes each image in its original form to obtain a set of varying-size deep feature maps, hence remaining lossless. We also propose the deep spatial pyramid match kernel (DSPMK), which amalgamates sets of varying-size deep feature maps and computes a matching score between samples. The proposed DSPMK acts as a dynamic kernel in a support-vector-machine framework for scene classification. We demonstrate the effectiveness of combining the power of varying-size CNN-based sets of deep feature maps with a dynamic kernel by achieving state-of-the-art results for high-level visual recognition tasks such as scene classification on standard datasets like MIT-67 and SUN-397.
1 INTRODUCTION
CNNs have become popular owing to their applicability to a wide range of tasks, such as object recognition (Simonyan and Zisserman, 2014), (Girshick et al., 2014), (Chatfield et al., 2014), image segmentation (Kang and Wang, 2014), image retrieval (Zhao et al., 2015), scene classification (He et al., 2015), (Yoo et al., 2014) and so on. Spectacular results on state-of-the-art tasks are mainly due to the powerful feature representations learnt by CNNs. Scene image classification, a basic and important problem in computer vision, has received a high degree of attention among researchers. Two important issues in scene image classification are intra-class variability, i.e., images of a particular class differ widely in their visual appearance, and inter-class similarity, i.e., images of different classes are easily confused because they are composed of similar concepts. To address these issues, many deep CNNs such as AlexNet (Krizhevsky et al., 2012), GoogLeNet (Szegedy et al., 2015) and VGGNet-16 (Simonyan and Zisserman, 2014) have been trained on datasets like Places-205, Places-365 (Zhou et al., 2017) and ImageNet (Deng et al., 2009) for image classification tasks. These deep CNNs can be adapted and retrained for other datasets and tasks with a little fine-tuning. In all such cases, features obtained from the pre-trained or fine-tuned CNNs are used to build fully connected neural network or SVM-based classifiers. These CNNs have also become popular because they provide base architectures and features for tasks other than the one for which they were trained. For example, AlexNet (Krizhevsky et al., 2012) is trained for object recognition, yet (Mandar et al., 2015) used its features for scene classification by further enhancing them through Fisher encoding. These CNNs require input images of a fixed size; for example, AlexNet accepts images of size 227 × 227. However, state-of-the-art datasets like the SUN-397 (Xiao et al., 2010) and MIT-67 indoor (Quattoni and Torralba, 2009) scene datasets comprise images of varying sizes, which are much larger than 227 × 227. The conventional approach to using these CNNs is to resize the arbitrary-sized images to a fixed size, which discards information from the image before it is fed to the CNN for feature extraction. The performance of classification
Figure 1: Block diagram of our proposed approach.
is compromised by this practice, as is evident from (He et al., 2015). To avoid such information loss, researchers have explored different approaches to feeding varying-sized images (i.e., images in their original size) to a CNN. (He et al., 2015) eliminate the requirement of a fixed-size image by including a spatial pyramid pooling (SPP) layer in the CNN, naming the network SPP-Net. (Gao et al., 2015) follow the same approach, further processing the outputs of the convolutional layer to encode them into a single vector using either a vector of locally aggregated descriptors (VLAD) (Gong et al., 2014) or a Fisher vector (FV) (Yoo et al., 2014) built on a Gaussian mixture model (GMM).
The research work presented in this paper focuses on giving images in their original size as input to the CNN and then taking sets of deep feature maps from the last convolutional layer. The convolutional layers are the indispensable part of a CNN and are responsible for generating discriminative features. These deep feature maps vary in size according to the size of the corresponding original image. They contain more spatial information than the activations of the fully connected layers, since the fully connected layers destroy the spatial content present in the convolutional-layer features. These sets of varying-size deep feature maps are used to build support vector machine (SVM) based classifiers for varying-length pattern classification. There are two approaches to varying-length pattern classification using SVMs. In the first approach, a set of varying-size deep feature maps is first mapped onto a fixed-length pattern, as in (Gao et al., 2015), and then a kernel for fixed-length patterns is used to build the SVM-based classifier. In the second approach, a suitable kernel for sets of varying-size deep feature maps is designed and then used to build the SVM-based classifier. Kernels designed for sets of varying-length feature vectors are called dynamic kernels (Dileep and Chandra Sekhar, 2014). The dynamic kernels in (Lazebnik et al., 2006), (Dileep and Chandra Sekhar, 2014), (Gupta et al., 2016a) and (Gupta et al., 2016b) show promising results for classification of varying-size images and speech signals.
In this paper, we focus on dynamic-kernel-based SVMs for classification of sets of varying-size deep feature maps obtained from a convolutional neural network. We propose the deep spatial pyramid match kernel (DSPMK), inspired by the spatial pyramid match kernel of (Lazebnik et al., 2006), for SVM-based classification of images represented as sets of varying-size deep feature maps. The entire process is shown
in the block diagram of Figure 1. In this block diagram, two images $I_m$ and $I_n$ of arbitrary size are passed as input to the convolutional layers of a deep CNN to obtain sets of deep feature maps. The size of the feature maps obtained for image $I_m$ can be different from the size of the feature maps obtained for $I_n$. These varying-size deep feature maps for the two images are further spatially divided and sum-pooled at each level of a pyramid. The spatially pooled feature maps are normalized to obtain probability vectors. Then, we propose to compute a matching score between the probability vector representations of $I_m$ and $I_n$ using DSPMK. A DSPMK-based SVM classifier is used to learn the association of sets of deep feature maps with the class label. The main contribution of this paper toward exploring sets of varying-size deep feature maps obtained from a deep CNN is in computing DSPMK for building an SVM-based classifier. This is in contrast to (Gao et al., 2015), where varying-size deep feature maps are encoded into a fixed-length Fisher vector and then a linear-kernel-based SVM is used for classification. A salient feature of the proposed DSPMK is that it works for differently sized images by building a spatial pyramid of $L+1$ levels, ranging from 0 to $L$, using sets of varying-size deep feature maps.
This paper is organized as follows: In Section 2, a review of related approaches for scene image classification using CNN-based features is presented. The proposed DSPMK for sets of varying-size deep feature maps is described in Section 3. In Section 4, experimental studies using the proposed approach on scene classification are presented. The conclusion is presented in Section 5.
2 RELATED WORK
In this section, we review the state-of-the-art approaches for classification of varying-length feature vector representations. In the last decade, traditional hand-engineered low-level image descriptors
like the histogram of oriented gradients (HOG) (Dalal and Triggs, 2005) and the scale invariant feature transform (SIFT) (Lowe, 2004) were popular. These descriptors yield a set of local feature vectors for each image. An SVM-based classifier for such a representation can be built using standard kernels such as the Gaussian kernel by mapping the local feature vectors onto a fixed-dimensional representation. One of the commonly used fixed-dimensional vector representations of an image is the bag-of-visual-words (BOVW) representation (Lazebnik et al., 2006), a fixed-dimensional vector whose elements are the frequencies of occurrence of visual words in the image. A limitation of the BOVW representation is the loss of spatial information. Alternatively, SVM-based classifiers can be built for scene classification using dynamic kernels designed for varying-size sets of local feature vectors (Lazebnik et al., 2006), (Thenkanidiyoor et al., 2017). One approach to designing dynamic kernels is the matching-based approach, which computes the kernel value between two sets of local feature vectors by matching the local feature vectors. The spatial pyramid match kernel (Lazebnik et al., 2006), the GMM-based intermediate matching kernel (Dileep and Chandra Sekhar, 2014) and the segment-level pyramid match kernel (Gupta et al., 2016a) are matching-based dynamic kernels for classification of varying-size images and speech signals. With the advancement of deep convolutional neural networks, traditional features and related techniques are being replaced by CNN-based features with linear-kernel-based SVM classifiers. Owing to the strong feature-capturing ability of CNNs trained on large datasets, one can directly use features from the fully connected layers of deep CNNs to build an SVM-based classifier and achieve better performance than traditional methods (Zhou et al., 2014). Some researchers have also encoded CNN-based features into new representations. (Mandar et al., 2015) encoded the features from the fully connected layer of a deep CNN into a Bag-of-Semantics (BoS), which is then summarized in a "semantic Fisher vector" representation.
To extract fully connected features from pre-trained deep CNNs, the input images must be resized to a fixed size. The fully connected layers expect a fixed-length input representation, whereas the convolution operation is not constrained to a fixed-size input. In other words, the necessity of a fixed-size input image for deep CNNs is an indirect requirement of the fully connected layers. Resizing images to a fixed size leads to a loss of information (He et al., 2015). On the other hand, an input image of arbitrary size can be fed into the convolutional layers of a CNN, which results in arbitrary-sized deep feature maps. These deep feature maps correspond to the strongest responses of the convolutional filters and at the same time preserve spatial information. A similar idea can be observed in the works of (Yoo et al., 2015) and (He et al., 2015), which consider image-scale pyramid and spatial pyramid approaches to incorporate the concept information of images into the feature maps. (Yoo et al., 2015) focus on scale characteristics of activation maps: dense activation maps are obtained for an image in a pyramid of seven layers, each layer consisting of activation maps from a differently scaled image, and these multi-scale dense activation maps are aggregated in a Fisher kernel framework. (He et al., 2015) follow a spatial pyramid approach to eliminate the requirement of fixed-size input images in CNNs. Here, the CNN is fed with the original image size, whereas in (Yoo et al., 2015) the CNN is fed with differently scaled images. (Gao et al., 2015) also feed arbitrary-sized images to the CNN, but with a different spatial pyramid approach: a GMM is built on the fixed-size vector representations obtained from spatial pyramid pooling to generate Fisher vectors, and all the Fisher vectors are finally concatenated to form a fixed-dimensional representation. Our work focuses on combining the power of a dynamic kernel with CNN-based sets of varying-size deep feature maps to obtain a matching score between a pair of images of different sizes. We propose to compute DSPMK instead of building a GMM-based dictionary. In the next section, the proposed DSPMK for sets of varying-size deep feature maps is presented.
3 DEEP SPATIAL PYRAMID
MATCH KERNEL
In designing the deep spatial pyramid match kernel (DSPMK), an image represented by a set of deep feature maps is decomposed into a pyramid of increasingly finer spatial regions. The deep feature maps obtained from the last-layer convolutional filters all have the same size for a particular image, but this size varies from one image to another because images are fed to the CNN in their original size. DSPMK between a pair of images is computed by matching the corresponding sum-pooled feature vectors from spatially partitioned deep feature maps at each level of the pyramid. Let $\mathcal{D} = \{I_1, I_2, \ldots, I_m, \ldots, I_N\}$ be the set of all images in the dataset and let $f$ be the number of filters in the last convolutional layer of the pre-trained deep CNN.
Figure 2: Illustration of computing the deep spatial pyramid match kernel between two differently sized images $I_m$ and $I_n$. Here, sets of varying-size deep feature maps $X_m$ and $X_n$ are computed from a pre-trained CNN. $X_m^l$ and $X_n^l$ are sum-pooled deep feature representations of $I_m$ and $I_n$, computed from spatially partitioned deep feature maps at level $l$ ($l = 0$ to $2$). $\hat{X}_m^l$ and $\hat{X}_n^l$ are the probability vector representations of $X_m^l$ and $X_n^l$ obtained using (1). The intermediate similarity score at each level $l$ (i.e., $S_0$, $S_1$ and $S_2$) is computed using equation (2).
Consider the mapping $\mathcal{F}$, which takes the original image as input and transforms it into a set of deep feature maps using the convolutional layers of the CNN. The mapping is given as $X_m = \mathcal{F}(I_m)$.
In the first phase, we feed the images to the CNN in their original size. The images in the database differ in size, and so do their deep feature maps. For image $I_m$, the mapping $\mathcal{F}$ yields a set $X_m = \{\mathbf{x}_{m1}, \mathbf{x}_{m2}, \mathbf{x}_{m3}, \ldots, \mathbf{x}_{mf}\}$ of $f$ deep feature maps, where $\mathbf{x}_{mi} \in \mathbb{R}^{p_m \times q_m}$ and $p_m \times q_m$ is the size of each feature map obtained from the last convolutional layer, which varies with the input image size. This leads to varying-size deep feature maps, as shown in Figure 2 for images $I_m$ and $I_n$.
In the second phase, a deep feature map is spatially partitioned into sub-regions, forming a deep spatial pyramid. At level 0 of the pyramid there is no spatial partition. At level 1, a deep feature map splits into 4 spatial regions corresponding to the 4 quadrants, as shown in Figure 2. Let $L+1$ be the number of levels in the spatial pyramid, ranging from 0 to $L$. At any level $l$, a deep feature map $\mathbf{x}_{mi}$ is spatially split into $2^{2l}$ regions. At level $l$, the activation values of the cells in every spatial region of all the $f$ deep feature maps are sum-pooled and concatenated to form a vector $X_m^l$ of size $f2^{2l} \times 1$, i.e., $X_m^l = [X_{m1}^l, X_{m2}^l, \ldots, X_{mj}^l, \ldots, X_{m(f2^{2l})}^l]^\top$. This process is illustrated in Figure 2 for three levels, $l = 0, 1$ and $2$.
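A minimal NumPy sketch of this level-$l$ sum pooling (our illustration; region boundaries are rounded when the map size is not divisible by the grid size) is:

```python
import numpy as np

def sum_pool_level(fmaps, l):
    """Level-l spatial sum pooling.

    fmaps: array of shape (f, p, q), one image's set of deep feature maps.
    Returns X^l of length f * 2^(2l): per-region sums over a 2^l x 2^l grid.
    """
    f, p, q = fmaps.shape
    g = 2 ** l                                        # grid of g x g regions
    r = np.linspace(0, p, g + 1, dtype=int)           # row boundaries
    c = np.linspace(0, q, g + 1, dtype=int)           # column boundaries
    regions = [fmaps[:, r[i]:r[i + 1], c[j]:c[j + 1]].sum(axis=(1, 2))
               for i in range(g) for j in range(g)]   # each of shape (f,)
    return np.concatenate(regions)
```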
In our proposed approach we consider three such levels. At level 0 (i.e., $l = 0$), the entire feature maps are sum-pooled, giving an $f \times 1$ dimensional vector representation. At level 1 (i.e., $l = 1$), the same feature maps are divided into four equal spatial regions; sum-pooling each spatial region of the $f$ feature maps results in a $4f \times 1$ dimensional vector. Similarly, at level 2 (i.e., $l = 2$), the same feature maps are divided into sixteen equal spatial regions, resulting in a $16f \times 1$ dimensional vector. $X_m^l$ can now be seen as the representation of image $I_m$ at level $l$ of the pyramid. At this stage, we propose to compute the deep spatial pyramid match kernel (DSPMK) to match two images rather than deriving a Fisher vector (FV) representation as in (Gao et al., 2015). Our approach avoids building a GMM to obtain the FV and hence reduces the computational complexity compared to (Gao et al., 2015). The process of computing DSPMK is motivated by the spatial pyramid match kernel (SPMK) (Lazebnik et al., 2006). SPMK uses a histogram intersection function to match the frequency-based image representations of two images at every level of the pyramid (Lazebnik et al., 2006). However, $X_m^l$ is not a probability vector representation of image $I_m$; we sum-normalize $X_m^l$ to transform it into one. Let $X_m^l$ and $X_n^l$ be the deep representations at level $l$ of two images $I_m$ and $I_n$ respectively. The probability vector representations of $X_m^l$ and $X_n^l$ are obtained as:
$$\hat{X}_m^l = \frac{X_m^l}{\sum_{j=1}^{f2^{2l}} X_{mj}^l}, \qquad \hat{X}_n^l = \frac{X_n^l}{\sum_{j=1}^{f2^{2l}} X_{nj}^l} \qquad (1)$$
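In code, equation (1) is a single sum-normalization; a sketch, assuming non-negative (post-ReLU) activations so the sum is a valid normalizer:

```python
import numpy as np

def normalize(X_l):
    """Equation (1): turn the pooled vector X^l into a probability vector."""
    return X_l / X_l.sum()   # assumes non-negative (post-ReLU) activations
```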
A histogram intersection function is used to compute an intermediate matching score $S_l$ between $\hat{X}_m^l$ and $\hat{X}_n^l$ at each level $l$ as

$$S_l = \sum_{j=1}^{f2^{2l}} \min\left(\hat{X}_{mj}^l, \hat{X}_{nj}^l\right) \qquad (2)$$
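Equation (2) is the classical histogram intersection, which in NumPy reduces to an element-wise minimum followed by a sum (a sketch):

```python
import numpy as np

def intersection_score(Xm_hat, Xn_hat):
    """Equation (2): histogram intersection of two probability vectors."""
    return np.minimum(Xm_hat, Xn_hat).sum()
```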
Here, the matching score $S_l$ found at level $l$ also includes all the matches found at the finer level $l+1$. Therefore, the number of new matches found at level $l$ is given by $S_l - S_{l+1}$ for $l = 0, \ldots, L-1$. DSPMK is computed as a weighted sum of the number of new matches at the different levels of the spatial pyramid. The weight associated with level $l$ is set to $\frac{1}{2^{(L-l)}}$, which is inversely proportional to the width of the spatial regions at that level. The DSPMK kernel is computed as

$$K_{\mathrm{DSPMK}}(X_m, X_n) = \sum_{l=0}^{L-1} \frac{1}{2^{L-l}} \left(S_l - S_{l+1}\right) + S_L \qquad (3)$$
The main advantages of DSPMK are, first, that it works for differently sized images and, second, that it combines deep CNN-based sets of varying-size deep feature maps with a dynamic-kernel-based SVM.
4 EXPERIMENTAL STUDIES
In this section, the effectiveness of the proposed dynamic kernel, DSPMK, is studied for the scene classification task using SVM-based classifiers.
4.1 Datasets
We tested our proposed approach on four widely used
scene classification datasets: MIT-8 Scene (Oliva
and Torralba, 2001), Vogel-Schiele (VS) (Vogel
and Schiele, 2004), MIT-67 (Quattoni and Torralba,
2009) and SUN-397 (Xiao et al., 2010).
The MIT-8-scene dataset comprises 2688 scene images belonging to 8 semantic classes, namely 'coast', 'forest', 'mountain', 'open-country', 'highway', 'inside-city', 'tall building' and 'street'. We follow the procedure of (Oliva and Torralba, 2001) and randomly select 100 images per class for training in each of 5 trials, testing on the remaining images in each trial. The results presented correspond to the average classification accuracy over the 5 trials.
The Vogel-Schiele dataset comprises 700 scene images of 6 semantic classes, namely 'coasts', 'forests', 'mountains', 'open-country', 'river' and 'sky-clouds'. The results presented for this dataset correspond to the average classification accuracy under stratified 5-fold cross validation.
The MIT-67 indoor dataset comprises 15,620 images with 67 indoor scene categories. It is a challenging dataset because the inter-class variation is very small. The standard split (Quattoni and Torralba, 2009) for this dataset consists of approximately 80 training and 20 test examples per class.
SUN-397 is a large dataset for scene recognition. It includes 397 categories of indoor, urban and natural scenes, where each category has at least 100 images. The train and test splits are fixed and publicly available from (Xiao et al., 2010); each split has 50 training and 50 testing images per category. We select the first five splits and present the average classification accuracy over the 5 splits.
4.2 Experiment Details
In our studies, we have used three different CNN architectures, namely AlexNet (Krizhevsky et al., 2012), GoogLeNet (Szegedy et al., 2015) and VGGNet-16 (Simonyan and Zisserman, 2014), for deep feature extraction; these networks are pre-trained on the ImageNet (Deng et al., 2009), Places-205 and Places-365 (Zhou et al., 2017) datasets. The reason for using different networks in our study is that the ImageNet dataset contains mainly object-centric images, so a network trained on it gives activations for object-like structures in an image, whereas the Places-205 and Places-365 datasets comprise mostly scene images, and CNNs trained on scene images are selective for landscapes and the natural structure of scenes, with more spatial features. In all these pre-trained CNN models the weights are kept fixed, without fine-tuning. The CNNs are employed without their fully connected layers in our experiments so that input images of arbitrary size can be accepted. As discussed in Section 3, we pass the original image of arbitrary size as input to the deep CNN and extract a set of varying-size deep feature maps from the last convolutional layer. The size of a feature map depends on the number of filters $f$ in the last convolutional layer of the deep CNN architecture and on the image size. The number of filters $f$ in the last convolutional layer of AlexNet, GoogLeNet and VGGNet-16 is 256, 1024 and 512, respectively. The architectures of these CNNs also differ from each other, so the feature-map size varies from image to image and from architecture to architecture.
DSPMK between the varying-size deep feature maps of a pair of images is computed as in Figure 2 using equations (1) to (3). We consider $L + 1 = 3$ as the
Table 1: Comparison of classification accuracy (CA) (in %) with 95% confidence interval for the SVM-based classifier using DSPMK on different datasets. Base features for the proposed approach are extracted from AlexNet (Krizhevsky et al., 2012), GoogLeNet (Szegedy et al., 2015) and VGGNet-16 (Simonyan and Zisserman, 2014), which are deep networks pre-trained on the ImageNet, Places-205 and Places-365 datasets. The highest accuracy in each column is marked in bold.
Different pre-trained deep CNN architectures used to build DSPMK-based SVM
MIT-8 scene  Vogel-Schiele  MIT-67  SUN-397
ImageNet-AlexNet (Krizhevsky et al., 2012) 93.52±0.13 79.46±0.23 62.46 45.46±0.12
Places205-AlexNet (Zhou et al., 2014) 93.56±0.12 82.21±0.25 62.24 53.21±0.23
Places365-AlexNet (Zhou et al., 2017) 94.15±0.11 82.90±0.31 66.67 55.43±0.24
ImageNet-GoogLeNet (Szegedy et al., 2015) 92.02±0.06 82.30±0.25 71.78 50.32±0.31
Places205-GoogLeNet (Zhou et al., 2014) 92.15±0.18 85.84±0.36 75.97 57.43±0.26
Places365-GoogLeNet (Zhou et al., 2017) 93.70±0.16 85.54±0.21 75.60 59.89±0.21
ImageNet-VGG (Simonyan and Zisserman, 2014) 93.90±0.07 84.62±0.31 75.78 53.67±0.25
Places205-VGG (Zhou et al., 2014) 94.54±0.03 86.92±0.26 81.87 61.86±0.24
Places365-VGG (Zhou et al., 2017) 95.09±0.14 84.68±0.28 77.76 62.31±0.25
Table 2: Comparison of classification accuracy (CA) (in %) with 95% confidence interval of proposed approach with state-
of-the-art approaches on MIT-8 scene, Vogel-Schiele, MIT-67 Indoor and SUN-397 dataset. (SIFT: Scale invariant feature
transform, IFK: Improved Fisher kernel, BoP: Bag of part, MOP: Multi-scale orderless pooling, FV: Fisher vector, DSP:
Deep spatial pyramid, MPP: Multi-scale pyramid pooling, DSFL: Discriminative and shareable feature learning). The highest
accuracy in each column is marked in bold.
Methods Description MIT-8-Scene Vogel-Schiele MIT-67 SUN-397
(Lowe, 2004) SIFT+BOVW 79.13±0.13 67.49±0.21 45.86 24.82±0.34
(Juneja et al., 2013) IFK+BoP 85.76±0.12 73.23±0.23 63.18 -
(Gong et al., 2014) MOP-CNN 89.45±0.11 76.81±0.27 68.88 51.98±0.24
(Zhou et al., 2014) Places-CNN-fc7 88.30±0.09 76.02±0.31 68.24 54.32±0.14
(Zhou et al., 2014) Hybrid-CNN-fc7 91.23±0.04 78.56 ±0.21 70.80 53.86±0.21
(Mandar et al., 2015) fc8-FV 88.43±0.08 79.56±0.23 72.86 54.40±0.30
(Gao et al., 2015) VGGNet-16 + DSP 92.34±0.12 81.34±0.27 76.34 57.27±0.34
(Yoo et al., 2015) MPP(Alex-fc7)+DSFL - - 80.78 -
Ours DSPMK + VGGNet 95.09±0.14 86.92±0.26 81.87 62.31±0.25
number of levels in the spatial pyramid. We use the LIBSVM (Chang and Lin, 2011) tool to build the DSPMK-based SVM classifier. Specifically, we use the one-against-the-rest approach for multi-class scene image classification and, for building the classifier, the default value of the trade-off parameter, $C = 1$.
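As an illustration of this classification stage (a sketch only: the paper uses LIBSVM, while scikit-learn's SVC with a precomputed Gram matrix stands in here, and the toy data below is purely hypothetical; the dspmk function is the one sketched in Section 3):

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

def gram_matrix(sets_a, sets_b):
    """K[i, j] = DSPMK between the i-th set in sets_a and j-th in sets_b."""
    return np.array([[dspmk(a, b) for b in sets_b] for a in sets_a])

# Hypothetical stand-in data: six training and one test "image", each a set
# of f = 512 non-negative feature maps of varying spatial size.
rng = np.random.default_rng(0)
train_fmaps = [np.abs(rng.normal(size=(512, 10 + i, 12 + i))) for i in range(6)]
train_labels = np.array([0, 0, 1, 1, 2, 2])
test_fmaps = [np.abs(rng.normal(size=(512, 9, 11)))]

clf = OneVsRestClassifier(SVC(kernel="precomputed", C=1.0))  # one-vs-rest, C = 1
clf.fit(gram_matrix(train_fmaps, train_fmaps), train_labels)
pred = clf.predict(gram_matrix(test_fmaps, train_fmaps))     # rows: test items
```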
4.3 Results on Scene Classification
In this section, we present experimental studies of scene image classification using the proposed DSPMK-based SVM classifier and compare them with state-of-the-art approaches. The scene classification accuracies for DSPMK-based SVMs are given in Table 1, which compares the classification accuracies of SVMs using DSPMK constructed from varying-size deep feature maps obtained from different pre-trained CNN models. It is seen that the performance of the SVM-based classifier with DSPMK obtained using deep features from VGGNet-16 is significantly better than that of SVMs with DSPMK obtained using deep features from GoogLeNet and AlexNet. The reason is that VGGNet-16 is a very deep network compared to the other architectures and learns the hierarchical representation of visual data more efficiently.
Table 2 presents a comparison of the scene image classification accuracy of the proposed DSPMK-based SVM classifier with that of state-of-the-art approaches. From Table 2, it is seen that our proposed approach gives better performance than the traditional feature-based approaches in (Lowe, 2004), (Juneja et al., 2013) and also than the CNN-based approaches in (Zhou et al., 2014), (Gong et al., 2014), (Mandar et al., 2015), (Gao et al., 2015), (Yoo et al., 2015).
(Lowe, 2004) uses scale invariant feature transform (SIFT) descriptors for scene image feature representation: scene images are represented as sets of local feature vectors, which are further converted into a bag-of-visual-words (BOVW) representation for classification using a linear-kernel-based SVM classifier. (Juneja et al., 2013) use a learned bag-of-parts (BoP) representation combined
with an improved Fisher vector for building an SVM-based classifier using a linear kernel. (Gong et al., 2014) extract CNN-based features from multiple scales of an image at different levels and perform orderless vector of locally aggregated descriptors (VLAD) pooling (Jégou et al., 2010) at every scale separately. The representations from the different levels are then concatenated to form a new representation known as multi-scale orderless pooling (MOP), which is used for training a linear-kernel-based SVM classifier. (Zhou et al., 2014) use a more direct approach, where a large-scale image dataset (the Places dataset) is used to train the AlexNet architecture and fully connected (fc7) layer features are extracted from the trained network. The basic architecture of their Places-CNN is the same as that of the AlexNet (Krizhevsky et al., 2012) trained on ImageNet. (Zhou et al., 2014) also train a Hybrid-CNN by combining the training data of the Places dataset with the ImageNet dataset. Here, features from the fully connected (fc7) layer are then used for training a linear SVM-based classifier. (Mandar et al., 2015) obtain a semantic Fisher vector (FV) using standard Gaussian mixture encoding of CNN-based features; a linear-kernel-based SVM classifier is then built using the semantic FV for classification of scene images. (Gao et al., 2015) use a generative-model-based approach to build a dictionary on top of CNN feature maps. An FV representation for the different spatial regions of the activation map is then obtained from the dictionary, power and $\ell_2$ normalization are applied to the combined FV from the different spatial regions, and a linear-kernel-based SVM classifier is then used for scene classification. (Yoo et al., 2015) combine the features from the fc7 layer of AlexNet (Alex-fc7) with their complementary features, named discriminative and shareable feature learning (DSFL). DSFL learns discriminative and shareable filters on a target dataset. The final image representation is used with a linear-kernel-based SVM classifier for the scene classification task.
In contrast to all the approaches briefly described above, our proposed approach takes images of arbitrary size and obtains deep feature maps of varying size without any loss of information. Secondly, we have proposed the deep spatial pyramid match kernel, which handles sets of varying-size deep feature maps and incorporates local spatial information when computing the level-wise matching score. Our proposed approach is simple and discriminative in nature, and it outperforms the other CNN-based approaches without combining any complementary features as in (Yoo et al., 2015). Our study shows that, for scene classification on complicated standard datasets, good classification accuracy is achievable using last-convolutional-layer features and a DSPMK-based SVM. The proposed approach is independent of fully-connected-layer features, processes images at their original size, and is simple, memory efficient and much faster than state-of-the-art approaches.
5 CONCLUSION
In this paper, we proposed DSPMK for enhancing the base features from the last convolutional layer of a CNN. A DSPMK-based SVM classifies varying-size scene images that are represented as sets of varying-size deep feature maps; the better the features, the higher the performance. Our framework is equipped with a dynamic kernel that computes level-wise intermediate matching scores and strengthens the matching of convolutional-layer features. Training a DSPMK-based SVM classifier consumes much less time than training the GMM in (Gao et al., 2015). We have shown how concepts in an image that are not matched at a finer level of the pyramid can still be matched at a coarser level. In our study, we considered last-convolutional-layer features instead of fully-connected-layer features, because the fully connected layers restrict the features to a fixed size and require heavy computation, containing roughly 90% of the total parameters of a CNN. Convolutional-layer features thus proved beneficial, as the images of the SUN-397 and MIT-67 datasets are of varying and large sizes; resizing them would cause a large loss of concept and spatial information, and the convolutional filters would not be able to capture the resized concepts present in the image. In terms of performance, our proposed approach achieves state-of-the-art results on standard scene classification datasets. In future work, a multi-scale deep spatial pyramid match kernel can be explored to capture variations of the activations caused by differently sized concepts in an image.
REFERENCES
Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library
for support vector machines. ACM Transactions on
Intelligent Systems and Technology, 2:27:1–27:27.
Chatfield, K., Simonyan, K., Vedaldi, A., and Zisserman,
A. (2014). Return of the devil in the details: Delv-
ing deep into convolutional nets. arXiv preprint
arXiv:1405.3531.
Dalal, N. and Triggs, B. (2005). Histograms of oriented gra-
dients for human detection. In Computer Vision and
Pattern Recognition (CVPR), 2005 IEEE Conference
on, volume 1, pages 886–893.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei,
L. (2009). Imagenet: A large-scale hierarchical image
database. In Computer Vision and Pattern Recogni-
tion, 2009. CVPR 2009. IEEE Conference on, pages
248–255. IEEE.
Dileep, A. D. and Chandra Sekhar, C. (2014). GMM-
based intermediate matching kernel for classification
of varying length patterns of long duration speech us-
ing support vector machines. IEEE Transactions on
Neural Networks and Learning Systems, 25(8):1421–
1432.
Gao, B.-B., Wei, X.-S., Wu, J., and Lin, W. (2015). Deep
spatial pyramid: The devil is once again in the details.
CoRR, abs/1504.05277.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014).
Rich feature hierarchies for accurate object detec-
tion and semantic segmentation. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 580–587.
Gong, Y., Wang, L., Guo, R., and Lazebnik, S. (2014).
Multi-scale orderless pooling of deep convolutional
activation features. In European conference on com-
puter vision, pages 392–407. Springer.
Gupta, S., Dileep, A. D., and Thenkanidiyoor, V. (2016a).
Segment-level pyramid match kernels for the classi-
fication of varying length patterns of speech using
svms. In Signal Processing Conference (EUSIPCO),
2016 24th European, pages 2030–2034. IEEE.
Gupta, S., Thenkanidiyoor, V., and Dinesh, D. A. (2016b).
Segment-level probabilistic sequence kernel based
support vector machines for classification of varying
length patterns of speech. In International Confer-
ence on Neural Information Processing, pages 321–
328. Springer.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Spatial pyra-
mid pooling in deep convolutional networks for visual
recognition. IEEE transactions on pattern analysis
and machine intelligence, 37(9):1904–1916.
Jégou, H., Douze, M., Schmid, C., and Pérez, P. (2010). Ag-
gregating local descriptors into a compact image rep-
resentation. In Computer Vision and Pattern Recogni-
tion (CVPR), 2010 IEEE Conference on, pages 3304–
3311. IEEE.
Juneja, M., Vedaldi, A., Jawahar, C., and Zisserman, A.
(2013). Blocks that shout: Distinctive parts for scene
classification. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages
923–930.
Kang, K. and Wang, X. (2014). Fully convolutional neu-
ral networks for crowd segmentation. arXiv preprint
arXiv:1411.4464.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Im-
agenet classification with deep convolutional neural
networks. In Advances in neural information process-
ing systems, pages 1097–1105.
Lazebnik, S., Schmid, C., and Ponce, J. (2006). Beyond
bags of features: Spatial pyramid matching for recog-
nizing natural scene categories. In Computer Vision
and Pattern Recognition (CVPR), 2006 IEEE Confer-
ence on, volume 2, pages 2169–2178.
Lowe, D. G. (2004). Distinctive image features from scale-
invariant keypoints. International journal of computer
vision, 60(2):91–110.
Mandar, D., Chen, S., Gao, D., Rasiwasia, N., and Vasconcelos, N. (2015). Scene classification with seman-
tic fisher vectors. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition,
pages 2974–2983.
Oliva, A. and Torralba, A. (2001). Modeling the shape
of the scene: A holistic representation of the spatial
envelope. International journal of computer vision,
42(3):145–175.
Quattoni, A. and Torralba, A. (2009). Recognizing indoor
scenes. In Computer Vision and Pattern Recognition,
2009. CVPR 2009. IEEE Conference on, pages 413–
420. IEEE.
Simonyan, K. and Zisserman, A. (2014). Very deep con-
volutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.,
Anguelov, D., Erhan, D., Vanhoucke, V., and Rabi-
novich, A. (2015). Going deeper with convolutions.
In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 1–9.
Thenkanidiyoor, V., Dileep, A. D., and Chandra Sekhar, C.
(2017). Dynamic kernels based approaches to anal-
ysis of varying length patterns in speech and image
processing tasks. In Amita Pal, S. K. P., editor, Pat-
tern Recognition and Big Data. World Scientific.
Vogel, J. and Schiele, B. (2004). Natural scene retrieval
based on a semantic modeling step. In International
Conference on Image and Video Retrieval, pages 207–
215. Springer.
Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., and Torralba,
A. (2010). Sun database: Large-scale scene recogni-
tion from abbey to zoo. In Computer vision and pat-
tern recognition (CVPR), 2010 IEEE conference on,
pages 3485–3492. IEEE.
Yoo, D., Park, S., Lee, J.-Y., and Kweon, I. S. (2014). Fisher
kernel for deep neural activations. arXiv preprint
arXiv:1412.1628.
Yoo, D., Park, S., Lee, J.-Y., and So Kweon, I. (2015).
Multi-scale pyramid pooling for deep convolutional
representation. In Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition
Workshops, pages 71–80.
Zhao, F., Huang, Y., Wang, L., and Tan, T. (2015). Deep
semantic ranking based hashing for multi-label im-
age retrieval. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages
1556–1564.
Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., and Tor-
ralba, A. (2017). Places: A 10 million image database
for scene recognition. IEEE Transactions on Pattern
Analysis and Machine Intelligence.
Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., and Oliva,
A. (2014). Learning deep features for scene recog-
nition using places database. In Advances in neural
information processing systems, pages 487–495.