Supervised Person Re-ID based on Deep Hand-crafted and CNN Features
Salma Ksibi¹, Mahmoud Mejdoub¹,² and Chokri Ben Amar¹
¹REGIM: Research Groups on Intelligent Machines, University of Sfax, ENIS, 3038, Sfax, Tunisia
²Department of Computer Science, College of AlGhat, Majmaah University, 11952, Riyadh, Saudi Arabia
Keywords:
Person Re-identification, Fisher Vector, Gaussian Weight, Deep Hand-crafted Feature, Deep CNN, XQDA.
Abstract:
Gaussian Fisher Vector (GFV) encoding is an extension of the conventional Fisher Vector (FV) that effectively
discards the noisy background information by localizing the pedestrian position in the image. Nevertheless,
GFV can only provide a shallow description of the pedestrian features. In order to capture more complex
structural information, we propose in this paper a layered extension of GFV that we call LGFV. The rep-
resentation is based on two nested layers that hierarchically refine the FV encoding from one layer to the
next by integrating more spatial neighborhood information. Besides, we present in this paper a new rich
multi-level semantic pedestrian representation built simultaneously upon complementary deep hand-crafted
and deep Convolutional Neural Network (CNN) features. The deep hand-crafted feature is depicted by the
combination of GFV mid-level features and high-level LGFV ones while a deep CNN feature is obtained by
learning in a classification mode an effective embedding of the raw pedestrian pixels. The proposed deep
hand-crafted features produce competitive accuracy with respect to the deep CNN ones without requiring
either pre-training or data augmentation, and the proposed multi-level representation further boosts the re-ID
performance.
1 INTRODUCTION
Person re-identification (re-ID) (Guo et al., 2006;
Zheng et al., 2016) is a challenging task in the cam-
era surveillance field (Wali et al., 2010), since it ad-
dresses the problem of matching people across poten-
tially multiple non-overlapping cameras. Many re-ID
works (Mahmoud Mejdoub and Koubaa, 2017; Ksibi
et al., 2016b; Ksibi et al., 2016a; Ksibi et al., 2016c)
used either shallow or deep representations coupled
with supervised metric learning. Shallow methods
are built upon the low-level and the mid-level appear-
ance features. We can cite as the most successful
low-level appearance features the Local Maximal Oc-
currence (LOMO) (Liao et al., 2015), the Symmetry-
Driven Accumulation of Local Features (SDALF) (Farenzena et al., 2010) and the eSDC (Zhao et al., 2013). Concerning the mid-
level features (Mejdoub et al., 2009; Mejdoub et al.,
2009; Ben Aoun et al., 2014; Mejdoub et al., 2008)
that demonstrated their robustness in the image clas-
sification field (M. El Arbi et al., 2011; Mejdoub
et al., 2015b; Mejdoub et al., 2008), the Bag of visual
Words (BOW) model (Ksibi et al., 2012; Zheng et al.,
2015) that quantifies the low-level features into a dic-
tionary of visual words, has been presented for the
person re-ID task in (Zheng et al., 2015). Locality-
constrained linear coding (LLC) was proposed in (Li
et al., 2015) by using a soft quantization. It considers
the locality information in the feature encoding pro-
cess by taking into account only the k-nearest basis
vectors from each local feature. Fisher Vector (FV)
(Ma et al., 2012; Messelodi and Modena, 2015; Liu
et al., 2015; Wu et al., 2017; Sekma et al., 2015a;
Sekma et al., 2015b; Mejdoub and Ben Amar, 2013)
is another encoding method that learns a Gaussian
Mixture model (GMM) on the local descriptors, in
order to compute the visual words. B. Ma et al. (Ma
et al., 2012) were the first to introduce the FV encod-
ing scheme in the person re-ID task. They employed a
spatial representation that divides the pedestrian im-
age into 4 × 3 fixed regions and used a simple 7-d lo-
cal descriptor. These local descriptors are turned into
FVs and these are employed to measure the similarity
between two persons using the Euclidean distance be-
tween their representations. In (Messelodi and Mod-
ena, 2015), the authors introduced a boosting method
that learns a scoring function taking into account the
likelihood between the local FVs of the same iden-
tity. Regarding the deep representations, most of the
current state-of-the-art methods used a Convolutional
Neural Network (CNN) (M. El Arbi et al., 2006;
Bouchrika et al., 2014) verification model, which takes positive and negative image pairs as input to the
CNN, owing to the lack of training data (Boughrara
et al., 2017; Othmani et al., 2010) per pedestrian
identity. However, the recognition accuracy is generally adversely affected by the absence of intra-class similarity and inter-class dissimilarity informa-
tion. To tackle this problem, the ID-discriminative
Embedding (IDE) deep CNN feature was presented
in (Zhong et al., 2017; Zheng et al., 2016) to learn an
embedded feature space in a classification mode. It
was stated in (Wu et al., 2016a) that IDE performs
better than the previously used verification model.
Pedestrian matching is then operated in the learned
feature space. Wu et al. presented another classifi-
cation CNN model (Wu et al., 2017). They built a
hybrid network by feeding the input FVs to the fully connected layers and enforcing linear discriminant analysis (LDA) as an objective function to pro-
duce embeddings that have low intra-class variance
and high inter-class variance. In (Wu et al., 2016b;
Xiong et al., 2014) low-level hand-crafted features are
combined with high-level CNN features. Afterwards,
metric learning is applied to the obtained combina-
tion. The good re-ID results obtained by the concate-
nation between low-level hand-crafted features and
CNN ones support their complementary nature. To enhance the discriminative ability of the
appearance features, supervised metric learning, such
as the Keep It Simple and Straightforward MEtric
learning (KISSME) (Köstinger et al., 2012), the lo-
cally adaptive decision functions (LADF) (Li et al.,
2013), the Null Space (NS) metric learning (Zhang
et al., 2016), and the Cross-view Quadratic Discrim-
inant Analysis (XQDA) (Liao et al., 2015) are often
applied upon the generated features in order to learn
an optimal distance that increases the intra-class similarity and decreases the inter-class similarity. Among
them, XQDA achieves good re-ID results (Zheng
et al., 2016). This is mainly due to the fact that XQDA
has the ability to simultaneously learn a discrimina-
tive subspace as well as a distance in the low dimen-
sional subspace. In this paper, we propose to encode the appearance features through a rich
histogram representation well adapted to the person
re-ID field. In this sense, an extension of the tra-
ditional FV encoding method is introduced, namely
the Gaussian weighted FV (GFV). This consists in
weighting the histogram encoding process (Dammak
et al., 2014b; Mejdoub et al., 2015a; Dammak et al.,
2014c; Mejdoub et al., 2015b; Dammak et al., 2015;
Dammak et al., 2014a) via the pedestrian Gaussian
template. This latter fosters the locations that lie
nearby the pedestrian in the image. GFV provides
a shallow representation of the pedestrian. To fur-
ther describe the complex spatial structural informa-
tion that can be present in a pedestrian image, we pro-
pose in this paper a layered version of GFV called
layered GFV (LGFV). This latter is based on two
nested layers that hierarchically refine the FV encod-
ing from one layer to the next by integrating more
spatial neighborhood information. We separately ap-
ply LGFV on three low-level local appearance fea-
tures (Color Name (CN), Color Histogram (CHS) and
15-d descriptors). In the first layer, we densely sample
a set of spatial sub-windows on the pedestrian image.
We then perform a local FV encoding within every
sub-window on the low-level features describing the
sub-window patches. Thus we obtain for each sub-
window a local FV that depicts the rich spatial struc-
tural information. The second layer applies GFV on
the set of the local FVs, within the whole image. After
that, by combining the GFV and the LGFV global im-
age signatures, we obtain a deep hand-crafted feature
per pedestrian image. The latter is further combined
with the deep CNN feature (IDE) to provide a rich
multi-level representation. Consequently, we obtain
seven pedestrian global histograms (see Figure 1). Fi-
nally, the pedestrians are matched by combining the XQDA distances learned upon these seven histograms.
It is worth mentioning that all images are pre-treated
with Retinex transform (Liao et al., 2015), to reduce
the illumination variation before the application of the
encoding methods. Besides, both GFV and LGFV are
applied upon a stripe representational scheme in order
to consider the spatial alignment information between
pedestrian parts. Our main contributions in this work
are:
• We propose a new deep hand-crafted feature based on the combination of the GFV and LGFV
representations. LGFV explores how the perfor-
mance of the shallow hand-crafted features can
be improved with increased structural spatial in-
formation depth. The experimental results have
shown that the proposed deep hand-crafted fea-
ture can produce competitive results with the deep
CNN feature: IDE (Zhong et al., 2017; Zheng
et al., 2016), without requiring either pre-training or data augmentation.
• We propose to combine the deep hand-crafted fea-
tures with the well performing IDE deep CNN
features. The resulting multi-level representation
is semantically rich since it exploits the comple-
mentary power between the spatial structural in-
formation generated by the two deep representa-
tions.
Figure 1: Overview of the proposed method pipeline. (a) Image pre-processing by Retinex Transform to deal with illumination
variation. (b) Low-level feature extraction: CN, CHS and 15-d descriptors. (c) Extraction of the Gaussian template weights.
(d) Gaussian weighted FV (GFV) Encoding. (e) Layered Gaussian weighted FV (LGFV) Encoding. (f) GFV and LGFV are
calculated separately on each given low-level color feature (CN, CHS and 15-d), at each stripe. (g) The generated histograms
are concatenated over the stripes producing the final GFV and LGFV representations (one histogram per low-level feature).
(h) Extraction of the deep CNN feature: IDE. (i) Computation of the XQDA distance learned separately on the IDE features
and each kind of the previously generated histograms. (j) Dissimilarity computation: combination of the XQDA distances
learned from the seven histograms.
2 THE PROPOSED METHOD
2.1 Dealing with Illumination
Variations
We apply in this paper the Multi-scale Retinex trans-
form with Color Restoration (MSRCR) (Jobson et al.,
1997) to handle illumination variations. Single Scale
Retinex algorithm (SSR) is the basic Retinex algo-
rithm which uses a single scale. The original image
is processed in the logarithmic space in order to high-
light the relative details. Besides, a 2D convolution
operation with a Gaussian surround function is applied to smooth the image. Afterwards, the smoothed part is subtracted from the image to obtain the final enhanced image. SSR can either provide dynamic range
compression (small scale), or tonal rendition (large
scale), but not both simultaneously. The MSRCR al-
gorithm bridges the gap between color images and the
human observation by combining effectively the dy-
namic range compression of the small-scale Retinex
and the tonal rendition of the large scale with a color
restoration function. In the experiments, we used two
scales of the Gaussian surround function (σ = 5 and σ = 20).
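For illustration, a minimal sketch of SSR and of this MSRCR step is given below using OpenCV and NumPy; the gain and the color-restoration constants (alpha, beta) are illustrative assumptions rather than values taken from the paper.

```python
import cv2
import numpy as np

def single_scale_retinex(img, sigma):
    """SSR: difference between the log image and its Gaussian-smoothed version."""
    img = img.astype(np.float64) + 1.0           # avoid log(0)
    blur = cv2.GaussianBlur(img, (0, 0), sigma)  # Gaussian surround function
    return np.log(img) - np.log(blur)

def msrcr(img, sigmas=(5, 20), alpha=125.0, beta=46.0, gain=1.0):
    """Multi-scale Retinex with a color restoration term (illustrative constants)."""
    img = img.astype(np.float64) + 1.0
    msr = np.mean([single_scale_retinex(img, s) for s in sigmas], axis=0)
    # Color restoration: weight each channel by its relative intensity.
    crf = beta * (np.log(alpha * img) - np.log(img.sum(axis=2, keepdims=True)))
    out = gain * crf * msr
    # Stretch back to the 8-bit range expected by the feature extraction step.
    out = (out - out.min()) / (out.max() - out.min() + 1e-12)
    return (255.0 * out).astype(np.uint8)
```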
2.2 Low-level Feature Extraction
In this work, the pedestrian image is densely sampled into patches of size 4 × 4 pixels with a stride of 4 pixels. For each patch, three kinds of
color low-level descriptors are extracted (CN, CHS
and 15-d). The latter are chosen owing to their good
compromise between efficiency and re-ID accuracy
(Zheng et al., 2015; Ma et al., 2012). Indeed, their
small dimensionality as compared to other state-of-the-art descriptors such as the global LOMO descriptor
(Liao et al., 2015) and the local dColorSift one (Zhao
et al., 2013) makes them well adapted to the efficiency
factor required by the person re-ID task.
2.2.1 Color Names (CN)
The authors in (Kuo et al., 2013) have demonstrated that the color description based on color names ensures good robustness against photometric variations. In
this paper, as was done in (Kuo et al., 2013), we use
the 11 basic color terms of the English language, i.e.
black, blue, brown, gray, green, orange, pink, purple,
red, white, and yellow. First, the CN feature vector
of each pixel is calculated by mapping the HSV pixel values to an 11-dimensional CN vector.
Afterwards, we apply a sum pooling on the CN pixel
features related to each patch. Finally, the resulting
histogram undergoes a square rooting operation fol-
lowed by l1 normalization. The size of the generated
CN descriptor is then equal to 11.
2.2.2 Color Histogram (CHS)
For each patch, a 16-bin color histogram is computed
in each HSV color space channel. For each color
channel, the patch color histogram is square-rooted
and subsequently l1 normalized. The three obtained
histograms are then concatenated, generating a color
descriptor of size 16 × 3.
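As an illustration, a minimal sketch of the CHS computation for one patch follows; the assumed 0-255 value range of the HSV channels depends on the color conversion used.

```python
import numpy as np

def chs_descriptor(patch_hsv, bins=16):
    """48-d CHS descriptor for one HSV patch of shape (h, w, 3) (sketch)."""
    hists = []
    for c in range(3):
        # 16-bin histogram per channel, assuming 8-bit channel values.
        h, _ = np.histogram(patch_hsv[..., c], bins=bins, range=(0, 256))
        h = np.sqrt(h.astype(np.float64))    # square rooting
        h /= (h.sum() + 1e-12)               # l1 normalization
        hists.append(h)
    return np.concatenate(hists)             # 16 x 3 = 48 dimensions
```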
2.2.3 15-d Descriptor
Inspired by (Ma et al., 2012), we design a simple 15-
d descriptor. First, the pedestrian image is split into 3 color channels (HSV). For each channel C, each pixel is converted into a 5-d local feature, which contains the pixel intensity and the first-order and second-order derivatives at this pixel. The descriptor is given by the following equation:
f(x, y, C) = \left(C(x, y),\, C_x(x, y),\, C_y(x, y),\, C_{xx}(x, y),\, C_{yy}(x, y)\right)    (1)
where C(x, y) is the raw pixel intensity at position (x, y), C_x and C_y are the first-order derivatives with respect to the pixel coordinates x and y, and C_{xx}, C_{yy} are the
second-order derivatives. Then, we apply, for each
color channel, a sum-pooling operation over the 5-d descriptors of the pixels located within each patch. Each of the three obtained patch descriptors undergoes a square root operation followed by l1 normalization. Afterwards, we concatenate the three normalized descriptors into a single 15-d signature.
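A minimal sketch of the 15-d descriptor for a single patch is given below; in practice the derivatives would be computed on the full channel image before cropping patches, which this sketch ignores for brevity.

```python
import numpy as np

def descriptor_15d(patch_hsv):
    """15-d descriptor for one HSV patch of shape (h, w, 3) (sketch).

    Per channel: (intensity, dx, dy, dxx, dyy) at every pixel, sum-pooled over
    the patch, square-rooted, l1-normalized, then concatenated (3 x 5 = 15).
    """
    parts = []
    for c in range(3):
        C = patch_hsv[..., c].astype(np.float64)
        cy, cx = np.gradient(C)              # first-order derivatives (rows, cols)
        cyy, _ = np.gradient(cy)             # second-order derivative along y
        _, cxx = np.gradient(cx)             # second-order derivative along x
        feat = np.stack([C, cx, cy, cxx, cyy], axis=-1).reshape(-1, 5)
        pooled = feat.sum(axis=0)            # sum pooling over the patch pixels
        pooled = np.sign(pooled) * np.sqrt(np.abs(pooled))   # signed square root
        pooled /= (np.abs(pooled).sum() + 1e-12)             # l1 normalization
        parts.append(pooled)
    return np.concatenate(parts)
```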
2.3 Extraction of the Gaussian
Template Weights
In (Farenzena et al., 2010), the authors proposed to
separate the foreground from the background of the pedestrian image using segmentation.
However, it was difficult to obtain an aligned bound-
ing box, and an accurate segmentation, especially in
the presence of cluttered backgrounds. This makes
the extraction of reliable features describing the per-
son of interest hard. We propose in this paper a sim-
ple solution by employing a 2-D Gaussian template
on the pedestrian image, in order to remove the noisy
background. Consider p_{h,w} the patch whose spatial center is located at the h-th row and w-th column of the image I = {p_{h,w}, h = 1...H, w = 1...W} of width W and height H. Inspired by (Farenzena et al.,
2010; Zheng et al., 2015), the Gaussian function is defined by N(µ_x, σ), where µ_x is the mean value of the horizontal coordinates, and σ is the standard deviation. We set µ_x to the image center (µ_x = W/2), and σ = W/4. This method uses prior knowledge of
the person position, assuming that the pedestrian lies in the image center. Therefore, the Gaussian template weights the locations near the vertical image center with higher probabilities. This permits discarding the noise surrounding the person's silhouette, thus keeping the meaningful parts of the image and eliminating the needless ones. Explicitly, we en-
dow each patch p_{h,w} with a Gaussian weight G(p_{h,w}), given by:

G(p_{h,w}) = \exp\left(-\frac{(w - \mu_x)^2}{2\sigma^2}\right)    (2)
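The weight of Eq. (2) only depends on the horizontal coordinate of the patch center, as in the short sketch below.

```python
import numpy as np

def gaussian_patch_weights(img_width, patch_center_cols):
    """Gaussian template weight G(p_{h,w}) of Eq. (2) for each patch (sketch).

    patch_center_cols holds the horizontal coordinate w of each patch center;
    mu_x = W / 2 and sigma = W / 4, as set in the paper.
    """
    mu_x = img_width / 2.0
    sigma = img_width / 4.0
    w = np.asarray(patch_center_cols, dtype=np.float64)
    return np.exp(-((w - mu_x) ** 2) / (2.0 * sigma ** 2))
```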
2.4 Proposed Gaussian Weighted Fisher
Vector (GFV) Encoding Method
We propose in this paper a rich extension of the tra-
ditional FV encoding method, which incorporates the Gaussian weight in the encoding process. The proposed encoding operation is performed in three steps. Indeed, as operated
in the traditional FV, the first step consists in learning
a Gaussian Mixture model (GMM) represented by K
components, on the local descriptors extracted from
all training pedestrian images. Then, the second step involves the computation of the mixture weights, means, and diagonal covariances of the GMM, which are respectively denoted as π_k, µ_k, σ_k. For clarity, we omit hereinafter the patch index (h, w) used in the previous notation (subsection 2.3), and we replace it by i. Thus, the Gaussian weight of the image patch p_i is noted G(p_i). Consider M local descriptors d_i corresponding to the M patches p_i
of an image I. The proposed encoding incorporates
the Gaussian weight in the traditional FV encoding,
as given by the following Equation:
u_k = \frac{1}{M\sqrt{2\pi_k}} \sum_{i=1}^{M} G(p_i)\,\alpha_k(d_i)\,\frac{d_i - \mu_k}{\sigma_k}    (3)

v_k = \frac{1}{M\sqrt{2\pi_k}} \sum_{i=1}^{M} G(p_i)\,\alpha_k(d_i)\left[\frac{(d_i - \mu_k)^2}{\sigma_k^2} - 1\right]    (4)
where α_k(d_i) is the soft assignment weight of the i-th descriptor d_i to the k-th Gaussian component, and G(p_i) is the Gaussian weight of patch p_i. For each GMM component, the sum-pooling operation aggregates the M descriptors in the image into a single encoded feature vector, given by the concatenation of u_k and v_k for all K components:
FV = [u_1, \ldots, u_K, v_1, \ldots, v_K]    (5)
Finally, we apply power normalization to each FV
component before normalizing them jointly. Such
normalization has demonstrated good performance in previous works (Sapienza et al., 2014). We note that
the proposed GFV encoding is applied separately to
the three proposed low-level descriptors: CN, CHS
and 15-d descriptor.
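A compact sketch of Eqs. (3)-(5) is shown below. The use of scikit-learn's GaussianMixture (with diagonal covariance) as the GMM codebook and of l2 as the joint normalization are assumptions for illustration, not details stated in the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gfv_encode(descriptors, weights, gmm):
    """Gaussian-weighted Fisher Vector of Eqs. (3)-(5) for one image (sketch).

    descriptors: (M, D) local descriptors, weights: (M,) Gaussian weights G(p_i),
    gmm: a fitted GaussianMixture with covariance_type='diag'.
    """
    descriptors = np.asarray(descriptors, dtype=np.float64)
    weights = np.asarray(weights, dtype=np.float64)
    pi, mu, sigma = gmm.weights_, gmm.means_, np.sqrt(gmm.covariances_)
    alpha = gmm.predict_proba(descriptors)                    # (M, K) soft assignments
    M = descriptors.shape[0]
    w_alpha = alpha * weights[:, None]                        # Gaussian-weighted posteriors

    diff = (descriptors[:, None, :] - mu[None, :, :]) / sigma[None, :, :]   # (M, K, D)
    u = (w_alpha[:, :, None] * diff).sum(axis=0)                            # Eq. (3), unnormalized
    v = (w_alpha[:, :, None] * (diff ** 2 - 1.0)).sum(axis=0)               # Eq. (4), unnormalized
    norm = (M * np.sqrt(2.0 * pi))[:, None]
    fv = np.concatenate([(u / norm).ravel(), (v / norm).ravel()])           # Eq. (5)

    fv = np.sign(fv) * np.sqrt(np.abs(fv))          # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)        # joint l2 normalization (assumed)
```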
2.5 Proposed Deep Hand-crafted
Feature
The deep hand-crafted feature is obtained by combin-
ing the shallow GFV with its layered extension. The
process is described hereinafter:
2.5.1 Proposed Layered GFV (LGFV)
Representation
The proposed shallow GFV robustly encodes the lo-
cal features of a pedestrian, aggregating them by
performing a sum-pooling operation over the en-
tire pedestrian image. The obtained representation
achieves the encoding directly from the flat local fea-
ture space, without considering the complex spatial
structure that can be present in a pedestrian image. In
order to incorporate the spatial information in GFV
encoding, we propose to design a LGFV represen-
tation that describes the pedestrian with higher level
structures extracted from the spatial pedestrian neigh-
borhood. This idea is an adaptation of the layered FV
encoding presented in the context of image and action
recognition (Sekma et al., 2015a; Peng et al., 2014;
Simonyan et al., 2013) to the person re-ID case. The
layered encoding is performed over two layers (see
Figure 2).
First Layer: In the first layer, we perform local FV
encoding within each sub-window. Toward this end,
we first perform the FV encoding (M = 1 in Eqs.
3 and 4, without the Gaussian weighting term) on each image local feature using the same GMM codebook pre-
learned in GFV. This produces for each local feature
a tiny high dimensional FV. After the generation of
these tiny FVs, we aggregate them by applying sum-
pooling within every sub-window. This is compara-
ble to the traditional FV. The difference is that the
encoding is achieved in the sub-window rather than
the whole image. The sub-windows are obtained by
scanning in a dense way the image using a stride of
4 pixels (the size of a patch). Each sub-window cor-
responds to the spatial neighborhood constituted by
3 ×3 patches. The obtained local FVs related to the
sub-windows are subsequently power-l2 normalized.
As a result, instead of a unique FV, representing the
whole image, the latter is described by a set of dense
local FVs, each of which reflects the local spatial
information among spatially adjacent local features.
Therefore, the generated representation can reflect a
rich image structural information.
Second Layer: Since the local FVs are too high-
dimensional to be directly used as inputs to the second encoding layer, we adopt XQDA to reduce the dimensionality of the local FVs in a discriminative supervised manner.
As the only provided annotation is the identity label,
we exploit this information by (1) averaging the lo-
cal FVs over the whole image, (2) applying XQDA
to the resulting intermediate vectors, and (3) project-
ing each local FV on the learned XQDA projection
matrix. This gives a good compromise between ac-
curacy and efficiency, and de-correlates the local FVs
in order to make them suitable to their further FV en-
coding (Sekma et al., 2015a). After learning a GMM
on the set of the reduced local FVs, we perform GFV
encoding upon them. The weight used in GFV for
each reduced local FV is given by the weight of its
corresponding sub-window. The latter is computed
by averaging the weights of each individual patch in
the sub-window. The output vector is subsequently
power-l2 normalized to form the final LGFV of the
image.
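To summarize the two layers, a condensed sketch is given below; it reuses the gfv_encode sketch above, and the names gmm1, gmm2 and xqda_proj are hypothetical placeholders for the pre-learned first-layer codebook, second-layer codebook and XQDA projection matrix.

```python
import numpy as np

def lgfv_encode(patch_descs, patch_weights, grid_shape, gmm1, xqda_proj, gmm2,
                window=3):
    """Two-layer LGFV for one image (condensed sketch, reusing gfv_encode above).

    patch_descs:   (H_p * W_p, D) low-level descriptors on the patch grid.
    patch_weights: (H_p * W_p,)   Gaussian template weights of the patches.
    gmm1 / gmm2:   pre-learned first- and second-layer GMM codebooks.
    xqda_proj:     XQDA projection matrix learned on the averaged local FVs.
    """
    H_p, W_p = grid_shape
    descs = np.asarray(patch_descs).reshape(H_p, W_p, -1)
    weights = np.asarray(patch_weights).reshape(H_p, W_p)

    local_fvs, local_w = [], []
    # Layer 1: one unweighted FV per dense 3x3 sub-window of patches
    # (stride of one patch, i.e. 4 pixels).
    for i in range(H_p - window + 1):
        for j in range(W_p - window + 1):
            block = descs[i:i + window, j:j + window].reshape(-1, descs.shape[-1])
            local_fvs.append(gfv_encode(block, np.ones(len(block)), gmm1))
            local_w.append(weights[i:i + window, j:j + window].mean())

    # Layer 2: reduce the local FVs with XQDA, then GFV-encode them using the
    # averaged sub-window weights.
    reduced = np.asarray(local_fvs) @ xqda_proj
    return gfv_encode(reduced, np.asarray(local_w), gmm2)
```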
2.5.2 Final Deep Representation
The tiny FVs obtained in the first layer can be further
exploited to form a global pedestrian FV by perform-
ing a weighted sum-pooling over the entire pedestrian
image. This is then equivalent to the GFV represen-
tation. Therefore, both global GFV and LGFV his-
tograms are combined to constitute the outputs of the proposed hand-crafted deep representation.
Figure 2: Comparison between the proposed LGFV and GFV methods. Top: the pipeline of the proposed LGFV with two layers. Bottom: the pipeline of GFV. The image representation in LGFV is constructed based on sub-windows of size 3 × 3 patches.
2.6 Histogram Computation based on
Pedestrian Image Partition
In order to benefit from the spatial alignment information among the different body parts in the pedestrian images, appearance modeling typically exploits the
spatial pyramidal model (Zheng et al., 2015; Sekma
et al., 2014) to treat the appearance of different body
parts independently. Inspired by these works, we pro-
pose to sub-divide the pedestrian into a set of stripes.
Since the spatial information along the horizontal x-axis exhibits greater intra-class variance than along the vertical y-axis due to viewpoint and pose variations, we choose to divide the silhouette along the y-axis into horizontal stripes. Indeed, the image is split into N_S = 8 stripes, as this has
shown a good compromise between accuracy and ef-
ficiency in (Zheng et al., 2015). The proposed GFV
method is applied separately in every single stripe
rather than the whole pedestrian image. Afterwards,
histograms corresponding to each stripe are l2 nor-
malized separately prior to stacking. Since we use,
in this paper, three low-level descriptors (CN, CHS
and 15-d) for GFV and LGFV, we obtain six global
histograms. Finally, every global histogram is further
l2 normalized to ensure the linear separability of the
data.
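A sketch of the stripe-based pooling is given below; it reuses the gfv_encode sketch above and assumes every stripe contains at least one patch.

```python
import numpy as np

def stripe_representation(patch_descs, patch_rows, patch_weights, gmm,
                          img_height, n_stripes=8):
    """Concatenate per-stripe GFV histograms into one global histogram (sketch).

    patch_rows holds the vertical coordinate of each patch center; every stripe
    is assumed to contain at least one patch.
    """
    bounds = np.linspace(0, img_height, n_stripes + 1)
    stripe_hists = []
    for s in range(n_stripes):
        mask = (patch_rows >= bounds[s]) & (patch_rows < bounds[s + 1])
        h = gfv_encode(patch_descs[mask], patch_weights[mask], gmm)
        stripe_hists.append(h / (np.linalg.norm(h) + 1e-12))  # per-stripe l2 norm
    global_hist = np.concatenate(stripe_hists)
    return global_hist / (np.linalg.norm(global_hist) + 1e-12)  # final l2 norm
```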
2.7 Deep CNN Feature: IDE
In this paper, the IDE feature introduced in (Zhong
et al., 2017) is employed since it has been shown to
outperform many other deep CNN models. Specif-
ically, we use CaffeNet (Krizhevsky et al., 2012) to
train the CNN in a classification mode. In the train-
ing phase, images are resized to 227×227 pixels, and
they are passed to the CNN model, along with their
respective identities. The CaffeNet network contains
five convolutional layers with the same original ar-
chitecture, and two fully connected layers, each with 1,024 neurons. The number of neurons in the
final fully connected layer is defined by the number
of training identities in each dataset. In the testing
phase, 1,024-dimensional CNN features are extracted, for each pedestrian image, from the 7-th layer of CaffeNet. The CNN features are subsequently l2 normalized.
2.8 Dissimilarity Computation
After the generation of the six global histograms based on GFV and LGFV, as well as the IDE feature,
we separately learn on each of them an XQDA (Liao
et al., 2015) distance in a supervised way. XQDA
learns a reduced subspace from the original training
data, and at the same time learns a distance function in
the resulting subspace for the dissimilarity measure.
Once the distances are learned, they are summed-up
to derive the final dissimilarity function. Given a
probe, dissimilarity scores are assigned to all gallery
items. The gallery set is then ranked according to the
dissimilarity to the probe. It is worth mentioning that,
as performed in (Liao et al., 2015), we select as sub-
space components the eigenvectors corresponding to
the eigenvalues of S_w^{-1} S_b that are larger than 1, where S_w and S_b refer to the within-class and between-class scatter matrices, respectively.
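The subspace-dimension rule and the final score combination can be sketched as follows; the scipy generalized eigensolver is used here as a convenient stand-in, and the per-histogram distance matrices are assumed to be precomputed.

```python
import numpy as np
from scipy.linalg import eigh

def xqda_subspace(Sw, Sb):
    """Keep the eigenvectors of inv(Sw) @ Sb whose eigenvalues exceed 1 (sketch)."""
    # Generalized eigenproblem Sb v = lambda Sw v, i.e. inv(Sw) Sb v = lambda v.
    vals, vecs = eigh(Sb, Sw)
    order = np.argsort(vals)[::-1]
    keep = vals[order] > 1.0
    return vecs[:, order[keep]]              # (D, r) projection matrix

def combined_dissimilarity(distance_matrices):
    """Sum the XQDA distances learned on the seven histograms; each entry is a
    precomputed (num_probe, num_gallery) distance matrix."""
    return np.sum(distance_matrices, axis=0)
```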
2.9 Multiple Queries
When each identity has multiple queries (MultiQ) in
a single camera, we could merge them into a single
query, reformulating the MultiQ problem as a one-query (OneQ) one. In this way, the intra-class variation is
taken into account, and the method will be more ro-
bust to the pedestrian variations over the gallery im-
ages. We apply an average pooling on the GFV and
LGFV related histograms over the multiple queries.
As for the IDE feature, we use max pooling since
the latter has shown better re-ID results than average
pooling in (Zheng et al., 2016). The resulting pooled
vectors are then used to perform the matching process
with the probe set.
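A sketch of this query merging follows, with one row per query image in each input array.

```python
import numpy as np

def pool_multiple_queries(handcrafted_hists, ide_feats):
    """Merge the multiple queries of one identity into a single query (sketch).

    handcrafted_hists: (n_queries, dim_hist) GFV/LGFV histograms -> average pooling.
    ide_feats:         (n_queries, dim_ide)  IDE features        -> max pooling.
    """
    pooled_hist = np.mean(handcrafted_hists, axis=0)
    pooled_ide = np.max(ide_feats, axis=0)
    return pooled_hist, pooled_ide
```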
3 EXPERIMENTS
3.1 Datasets
CUHK03 (Li et al., 2014). This dataset contains 13,164 Deformable Part Model (DPM) (Arandjelovic and Zisserman, 2012) bounding boxes of 1,467 different identities. Each identity is observed through two different cameras, with on average 4.8 images per view and per identity. We follow the experimental protocol of (Zheng et al., 2015). Indeed, we select 100
persons randomly and for each person, all the DPM
bounding boxes are taken as queries in turn. After that, a cross-camera search is performed. This test process is repeated 20 times and the averaged statistics are reported. Note that the dataset comes with manual
(Labelled) and algorithmically (Detected) pedestrian
bounding boxes.
Market-1501 (Zheng et al., 2015). This dataset contains 32,643 fully annotated bounding boxes of 1,501 pedestrians, making it one of the largest person re-ID image datasets. It is captured with 6 different cameras placed in front of a supermarket. Each identity is captured by at most 6 and at least 2 cameras, and may have multiple images under each camera; even images of the same identity captured by the same camera are distinct. Market-
1501 is randomly divided into training and testing
sets, containing respectively 750 and 751 identities.
In the testing phase, for each single identity, there is
one query image selected in each camera. The search
is processed in a cross-camera mode, i.e. we dis-
card from the re-ID process images that belong to the
same camera as the query. Note that there are 3,368 query images, 19,732 gallery images used for testing and 12,936 images for training. We use the provided
fixed training and test set, under both OneQ and Mul-
tiQ evaluation settings.
3.2 Experimental Settings
In this paper, we use a codebook of 256 GMM com-
ponents for GFV and LGFV since it yields a good
compromise between accuracy and efficiency. Unless
otherwise stated, all results generated by our proposed
method are given for the supervised case using XQDA and the one-query setting. We also note that
the Cosine distance is used for the unsupervised case.
3.3 Evaluation Metrics
In this paper, we use the Cumulative Matching Char-
acteristics (CMC) curve in order to evaluate the performance of the proposed methods and compare it with the state-of-the-art re-ID methods on all datasets. Ev-
ery probe image is matched with every gallery one,
and ranks of the correct matches are obtained. In-
deed, the rank-k recognition rate is the expectation of
a correct match at rank k, and the cumulative values of the recognition rates at all ranks are recorded as the one-trial CMC result. We also use the mean average
precision (mAP) to evaluate the performances of the
proposed methods. In fact, for each query, we calcu-
late the area under the Precision-Recall curve, called
average precision (AP). After that, the mean value of the APs over all queries (mAP) is calculated. Since it takes into account both precision and recall, it provides a more comprehensive evaluation.
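For completeness, a simple single-trial sketch of the CMC and mAP computation is given below; cross-camera filtering (removing same-camera gallery images) is assumed to have been applied to the distance matrix beforehand.

```python
import numpy as np

def cmc_and_map(dist, query_ids, gallery_ids):
    """Single-trial CMC curve and mAP (sketch); dist is (num_query, num_gallery),
    smaller means more similar."""
    query_ids = np.asarray(query_ids)
    gallery_ids = np.asarray(gallery_ids)
    num_q, num_g = dist.shape
    cmc = np.zeros(num_g)
    aps = []
    for q in range(num_q):
        order = np.argsort(dist[q])                       # rank the gallery
        matches = (gallery_ids[order] == query_ids[q]).astype(np.float64)
        if matches.sum() == 0:                            # no correct match available
            continue
        cmc[np.argmax(matches):] += 1                     # rank of first correct match
        hits = np.cumsum(matches)
        precision = hits / (np.arange(num_g) + 1)
        aps.append((precision * matches).sum() / matches.sum())  # area under P-R curve
    n_valid = max(len(aps), 1)
    return cmc / n_valid, float(np.mean(aps)) if aps else 0.0
```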
Table 1: Impact of the GFV, LGFV and IDE combination. Results (rank-1 matching rate and mAP) are reported on the CUHK03 and Market-1501 datasets for the different encoding methods.
Methods             | CUHK03 r=1 | CUHK03 mAP | Market-1501 r=1 | Market-1501 mAP
FV                  | 36.43      | 37.86      | 52.12           | 23.38
GFV                 | 43.61      | 45.08      | 58.85           | 31.88
LGFV                | 51.15      | 55.71      | 65.46           | 40.16
GFV + LGFV          | 58.97      | 64.75      | 71.92           | 46.45
IDE(C)              | 58.91      | 64.92      | 57.72           | 35.95
GFV + LGFV + IDE(C) | 63.60      | 69.51      | 75.18           | 53.88
Table 2: Comparison of the proposed unsupervised (GFV U+LGFV U) method with the state-of-the-art methods in the case
of unsupervised (first table part), and the proposed supervised GFV, (GFV+LGFV) and (GFV+LGFV+IDE(C)) methods with
the supervised methods (second table part), on the Market-1501 dataset. Note that (+MultiQ) and (+re) refer to the Multi
Query setting and the re-ranking method, respectively. ’-’ means that corresponding results are not available.
Methods                              | OneQ r=1 | OneQ mAP | MultiQ r=1 | MultiQ mAP
SDALF (Farenzena et al., 2010)       | 20.53    | 8.20     | -          | -
eSDC (Zhao et al., 2013)             | 33.54    | 13.54    | -          | -
BOW (Zheng et al., 2015)             | 34.40    | 14.10    | 42.14      | 19.20
Ours (GFV U+LGFV U+IDE(C) U)         | 65.11    | 40.08    | 71.86      | 46.19
LOMO (Liao et al., 2015)             | 26.07    | 7.75     | -          | -
PersonNet (Wu et al., 2016a)         | 37.21    | 18.57    | -          | -
KISSME(BOW) (Zheng et al., 2015)     | 44.42    | 20.76    | -          | -
KISSME(LOMO) (Liao et al., 2015)     | 40.50    | 19.02    | -          | -
XQDA(LOMO) (Liao et al., 2015)       | 43.79    | 22.22    | 54.13      | 28.41
kLDFA(LOMO) (Liao et al., 2015)      | 51.37    | 24.43    | 52.67      | 27.36
FVdeepLDA (Wu et al., 2017)          | 48.15    | 29.94    | -          | -
SCSP (Chen et al., 2016)             | 51.90    | 26.35    | -          | -
NS(LOMO) (Zhang et al., 2016)        | 55.43    | 29.87    | 67.96      | 41.89
NS(fusion) (Zhang et al., 2016)      | 61.02    | 35.68    | 71.56      | 46.03
NS-CNN (Li et al., 2016)             | 59.56    | 34.44    | 69.95      | 44.82
XQDA(IDE(C)) (Zhong et al., 2017)    | 57.72    | 35.95    | -          | -
XQDA(IDE(C))+re (Zhong et al., 2017) | 61.25    | 46.79    | -          | -
SCNN (Varior et al., 2016)           | 65.88    | 39.55    | 76.04      | 48.45
Ours(GFV)                            | 58.85    | 29.78    | 66.49      | 41.63
Ours(LGFV)                           | 65.46    | 40.16    | 72.78      | 49.12
Ours(GFV+LGFV)                       | 71.92    | 46.45    | 78.90      | 56.26
Ours(GFV+LGFV+IDE(C))                | 75.18    | 53.88    | 82.72      | 60.46
Ours(GFV+LGFV+IDE(C)) + re           | 77.42    | 63.54    | 84.92      | 70.52
3.4 Empirical Analysis of the Proposed
Method
3.4.1 Impact of the Gaussian Weight
As shown in Table 1, weighting the traditional FV via the Gaussian weight (GFV) considerably increases the matching rates on all datasets. In-
deed, when applying the Gaussian template, we elim-
inate the background noise located around the pedes-
trian and its harmful effects on the re-ID accuracy.
3.4.2 Impact of the GFV, LGFV and IDE
Combination
We propose in this paper a rich multi-level represen-
tation, resulting from the combination of the deep hand-
crafted (GFV + LGFV) and the deep CNN (IDE) fea-
tures. We start by studying the impact of the GFV and
LGFV combination (GFV+LGFV). Results in Table 1
show good improvements in accuracy in both datasets
when combining the shallow GFV with the layered
LGFV. This achievement proves the complementarity of the shallow description (GFV) and the layered one (LGFV), which integrates richer structural information.
Table 3: Comparison of the proposed unsupervised methods (GFV U + LGFV U) with the state-of-the-art methods in the case
of unsupervised (first table part), and the proposed supervised methods GFV, (GFV + LGFV) and (GFV + LGFV + IDE(C))
with the supervised methods (second table part), on CUHK03 dataset, on the labelled and detected cases.
Methods                                     | Detected r=1 | r=5   | r=10  | r=20  | Manual r=1 | r=5   | r=10  | r=20
SDALF (Farenzena et al., 2010)              | 4.87  | -     | -     | -     | 5.60  | 23.45 | 36.09 | 51.96
eSDC (Zhao et al., 2013)                    | 7.68  | -     | -     | -     | 8.76  | 24.07 | 38.28 | 53.44
BOW (Zheng et al., 2015)                    | 22.95 | -     | -     | -     | 24.33 | 58.42 | 71.28 | 84.91
Ours (GFV U+LGFV U)                         | 48.58 | 79.51 | 88.92 | 94.73 | 54.43 | 84.65 | 93.76 | 96.82
ITML (Davis et al., 2007)                   | 5.14  | -     | -     | -     | 5.53  | 18.89 | 39.96 | 44.20
LMNN (Sun and Chen, 2011)                   | -     | -     | -     | -     | 7.29  | 21.00 | 32.06 | 48.94
KISSME (Köstinger et al., 2012)             | 11.70 | -     | -     | -     | 14.17 | 41.12 | 54.89 | 70.09
DeepReid (Li et al., 2014)                  | 19.89 | 50.00 | 64.00 | 78.50 | 20.65 | 51.50 | 66.50 | 80.00
Improved Deep (Ahmed et al., 2015)          | 44.96 | 76.01 | 83.47 | 93.15 | 54.74 | 86.50 | 93.88 | 98.10
XQDA(LOMO) (Liao et al., 2015)              | 46.25 | 78.90 | 88.55 | 94.25 | 52.20 | 82.23 | 92.14 | 96.25
NS(LOMO) (Zhang et al., 2016)               | 53.70 | 83.05 | 93.00 | 94.80 | 58.90 | 85.60 | 92.45 | 96.30
NS(fusion) (Zhang et al., 2016)             | 54.70 | 84.75 | 94.80 | 95.20 | 62.55 | 90.05 | 94.80 | 98.10
Metric Ens. (Paisitkriangkrai et al., 2015) | -     | -     | -     | -     | 62.10 | 87.81 | 92.30 | 97.20
FVdeepLDA (Wu et al., 2017)                 | -     | -     | -     | -     | 62.23 | 89.95 | 92.73 | 97.55
PersonNet (Wu et al., 2016a)                | -     | -     | -     | -     | 64.80 | 89.40 | 94.92 | 98.20
IDE(C)+XQDA (Zhong et al., 2017)            | 58.90 | -     | -     | -     | 61.70 | -     | -     | -
IDE(C)+XQDA+re (Zhong et al., 2017)         | 58.50 | -     | -     | -     | 61.60 | -     | -     | -
Ours (GFV+LGFV)                             | 58.97 | 88.08 | 94.26 | 98.84 | 62.38 | 91.82 | 96.95 | 99.88
Ours (GFV+LGFV+IDE(C))                      | 63.60 | 91.59 | 96.71 | 99.73 | 67.77 | 95.05 | 99.78 | 100
Ours (GFV+LGFV+IDE(C))+re                   | 65.88 | 94.78 | 99.61 | 100   | 69.97 | 98.05 | 100   | 100
Besides, while comparing the deep hand-crafted rep-
resentation (GFV + LGFV) with the high-level deep
CNN descriptor IDE(C), we notice that the proposed deep hand-crafted features achieve better results than the deep CNN ones. Finally, when combining the deep hand-crafted and the CNN features, the accuracy rises further, thus proving the complementarity of these features. Indeed, the two representations provide different views of the re-ID process since they exploit the structural arrangements among the hand-crafted features and among the raw pixels, respectively.
3.4.3 Comparison with the State-of-the-art
Methods on Market-1501 Dataset
First, we find out in Table 2 that the proposed unsuper-
vised GFV U+ LGFV U method outperforms all the
recent state-of-the-art unsupervised methods (Faren-
zena et al., 2010; Zheng et al., 2015; Zhao et al., 2013)
in the literature, thus proving the effectiveness of the proposed multi-level representation. Moreover,
the proposed supervised GFV method gives a better
result than (Wu et al., 2016a; Liao et al., 2015; Zheng
et al., 2015; Chen et al., 2016; Zhong et al., 2017;
Zhang et al., 2016) in the Market-1501 dataset. For
example, GFV obviously outperforms the deep Per-
sonNet method (Wu et al., 2016a) (r1=58.85% and mAP=29.78% versus r1=37.21% and mAP=18.57%). This can be due to the verification model em-
ployed in (Wu et al., 2016a) that does not take into
consideration the similarity context i.e. ignores the
intra-similarity and inter-similarity among the differ-
ent identities. GFV also outperforms the FVdeepLDA
(Wu et al., 2017) which passes on the FVs as in-
puts to a deep LDA metric learning and SCSP (Chen
et al., 2016) which imposes spatial constraints on the
pedestrian image. This proves the importance of the
localization of the pedestrian position in the image.
Moreover, the proposed deep hand-crafted representation (GFV+LGFV) achieves rank-1=71.92% and mAP=46.45%. It considerably surpasses the shallow
representation methods NS(LOMO) and NS(fusion)
(Zhang et al., 2016). Besides, (GFV+LGFV) gives
better rank-1 matching rates than the method pro-
posed in (Zhong et al., 2017) that uses the IDE deep
CNN feature, the XQDA and the re-ranking tech-
nique, but the latter achieves a slightly higher mAP (46.79%). Indeed, the re-ranking scheme has also
brought a considerable improvement to the mAP ob-
tained by the proposed method. Moreover, (GFV +
LGFV) notably surpasses NS-CNN (Li et al., 2016)
which consists in the fusion of deep CNN features
and low-level features (LBP, HOG, CN and LOMO)
and achieves comparable results with the SCNN (Var-
ior et al., 2016) which uses deep CNN features.
This proves that deep hand-crafted features can be a
good alternative to deep learning, without requiring either pre-training or data augmentation, and at a lower training computational cost. When combining the deep hand-crafted features (GFV+LGFV) with the deep CNN features (IDE(C)), we considerably outperform all state-of-the-art methods on the Market-1501 dataset (rank-1=75.18% and mAP=53.88%). This proves the
complementarity of these two representations.
3.4.4 Comparison with the State-of-the-art
Methods on CUHK03 Dataset
As shown in Table 3, the proposed unsupervised
GFV U + LGFV U method outperforms the unsupervised state-of-the-art methods (Farenzena et al., 2010; Zhao et al., 2013; Zheng et al., 2015) on CUHK03. Also, the proposed supervised (GFV +
LGFV) method achieves better results than the super-
vised state-of-the-art methods (Sun and Chen, 2011;
Davis et al., 2007; Köstinger et al., 2012; Li et al.,
2014; Ahmed et al., 2015; Liao et al., 2015; Zhang
et al., 2016). This good performance is due to
the effectiveness of the proposed deep hand-crafted
features. We remark that the proposed (GFV +
LGFV) achieves competitive results with NS(fusion)
(Zhang et al., 2016), Metric Ensembles (Paisitkri-
angkrai et al., 2015), FVdeepLDA (Wu et al., 2017)
and PersonNet (Wu et al., 2016a). This can be explained by the small amount of training data provided in this dataset, which can adversely affect the GMM learning process. In contrast, the proposed
multi-level representation (GFV + LGFV + IDE(C))
achieves higher re-ID accuracy (at all ranks, for both the Labelled and Detected settings) with respect to all the state-of-the-art methods, and specifically
outperforms the re-ranking (IDE(C) + XQDA + re)
method (Zhong et al., 2017). This proves the com-
plementarity power of both the deep hand-crafted and
deep CNN features.
4 CONCLUSION
In this paper, we proposed a new deep hand-crafted
feature that exploits the richness of the spatial structural information. The proposed deep hand-crafted feature is competitive with the deep CNN features without requiring either pre-training or data augmentation. Moreover, we designed a multi-level
feature representation built upon the complementary
deep hand-crafted and deep CNN features. The ex-
perimental results on the two benchmark image datasets (Market-1501 and CUHK03) have shown the good performance of the proposed methods. In future research, we will investigate more sophisti-
cated deep CNN architectures.
REFERENCES
Ahmed, E., Jones, M. J., and Marks, T. K. (2015). An
improved deep learning architecture for person re-
identification. In IEEE Conference on Computer Vi-
sion and Pattern Recognition, CVPR 2015, Boston,
MA, USA, June 7-12, 2015, pages 3908–3916.
Arandjelovic, R. and Zisserman, A. (2012). Multiple
queries for large scale specific object retrieval. In
British Machine Vision Conference, BMVC 2012, Sur-
rey, UK, September 3-7, 2012, pages 1–11.
Bouchrika, T., Zaied, M., Jemai, O., and C. Ben Amar
(2014). Neural solutions to interact with computers
by hand gesture recognition. Multimedia Tools Appl.,
72(3):2949–2975.
Boughrara, H., Chtourou, M., and Ben Amar, C. (2017).
MLP neural network using constructive training algo-
rithm: application to face recognition and facial ex-
pression recognition. IJISTA, 16(1):53–79.
Chen, D., Yuan, Z., Chen, B., and Zheng, N. (2016). Sim-
ilarity learning with spatial constraints for person re-
identification. In 2016 IEEE Conference on Computer
Vision and Pattern Recognition, CVPR 2016, Las Ve-
gas, NV, USA, June 27-30, 2016, pages 1268–1277.
Dammak, M., Mejdoub, M., and Amar, C. B. (2014a). Ex-
tended laplacian sparse coding for image categoriza-
tion. In Neural Information Processing - 21st Interna-
tional Conference, ICONIP 2014, Kuching, Malaysia,
November 3-6, 2014. Proceedings, Part III, pages
292–299.
Dammak, M., Mejdoub, M., and Amar, C. B. (2015). His-
togram of dense subgraphs for image representation.
IET Image Processing, 9(3):184–191.
Dammak, M., Mejdoub, M., and Ben Amar, C. (2014b).
Laplacian tensor sparse coding for image categoriza-
tion. In IEEE International Conference on Acoustics,
Speech and Signal Processing, ICASSP 2014, Flo-
rence, Italy, May 4-9, 2014, pages 3572–3576.
Dammak, M., Mejdoub, M., and Ben Amar, C. (2014c).
A survey of extended methods to the bag of visual
words for image categorization and retrieval. In VIS-
APP 2014 - Proceedings of the 9th International Con-
ference on Computer Vision Theory and Applications,
Volume 2, Lisbon, Portugal, 5-8 January, 2014, pages
676–683.
Davis, J. V., Kulis, B., Jain, P., Sra, S., and Dhillon, I. S.
(2007). Information-theoretic metric learning. In Ma-
chine Learning, Proceedings of the Twenty-Fourth In-
ternational Conference (ICML 2007), Corvallis, Ore-
gon, USA, June 20-24, 2007, pages 209–216.
Farenzena, M., Bazzani, L., Perina, A., Murino, V., and
Cristani, M. (2010). Person re-identification by
symmetry-driven accumulation of local features. In
The Twenty-Third IEEE Conference on Computer Vi-
sion and Pattern Recognition, CVPR 2010, San Fran-
cisco, CA, USA, 13-18 June 2010, pages 2360–2367.
Guo, Y., Wu, L., Lu, H., Feng, Z., and Xue, X. (2006).
Null foley-sammon transform. Pattern Recognition,
39(11):2248–2251.
Jobson, D. J., Rahman, Z., and Woodell, G. A. (1997). A
multiscale retinex for bridging the gap between color
images and the human observation of scenes. IEEE
Trans. Image Processing, 6(7):965–976.
Köstinger, M., Hirzer, M., Wohlhart, P., Roth, P. M., and
Bischof, H. (2012). Large scale metric learning from
equivalence constraints. In 2012 IEEE Conference
on Computer Vision and Pattern Recognition, Provi-
dence, RI, USA, June 16-21, 2012, pages 2288–2295.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Im-
agenet classification with deep convolutional neural
networks. In Proceedings of the 25th International
Conference on Neural Information Processing Sys-
tems, NIPS’12, pages 1097–1105, USA. Curran As-
sociates Inc.
Ksibi, A., Dammak, M., Ammar, A. B., Mejdoub, M., and
Amar, C. B. (2012). Flickr-based semantic context
to refine automatic photo annotation. In 3rd Interna-
tional Conference on Image Processing Theory Tools
and Applications, IPTA 2012, 15-18 October 2012, Is-
tanbul, Turkey, pages 377–382.
Ksibi, S., Mejdoub, M., and Ben Amar, C. (2016a).
Extended fisher vector encoding for person re-
identification. In 2016 IEEE International Confer-
ence on Systems, Man, and Cybernetics, SMC 2016,
Budapest, Hungary, October 9-12, 2016, pages 4344–
4349.
Ksibi, S., Mejdoub, M., and Ben Amar, C. (2016b).
Person re-identification based on combined gaussian
weighted fisher vectors. In 13th IEEE/ACS Interna-
tional Conference of Computer Systems and Applica-
tions, AICCSA 2016, Agadir, Morocco, November 29
- December 2, 2016, pages 1–8.
Ksibi, S., Mejdoub, M., and Ben Amar, C. (2016c).
Topological weighted fisher vectors for person re-
identification. In 23rd International Conference on
Pattern Recognition, ICPR 2016, Cancún, Mexico,
December 4-8, 2016, pages 3097–3102.
Kuo, C.-H., Khamis, S., and Shet, V. D. (2013). Person re-
identification using semantic color names and rank-
boost. In IEEE Workshop on Applications of Com-
puter Vision, pages 281–287.
Li, J., Yang, Z., and Xiong, H. (2015). Encoding the
regional features for person re-identification using
locality-constrained linear coding. In 2015 Interna-
tional Conference on Computers, Communications,
and Systems (ICCCS), pages 178–181.
Li, S., Liu, X., Liu, W., Ma, H., and Zhang, H. (2016).
A discriminative null space based deep learning ap-
proach for person re-identification. In 4th Interna-
tional Conference on Cloud Computing and Intelli-
gence Systems, CCIS 2016, Beijing, China, August 17-
19, 2016, pages 480–484.
Li, W., Zhao, R., Xiao, T., and Wang, X. (2014). Deep-
reid: Deep filter pairing neural network for person re-
identification. In 2014 IEEE Conference on Computer
Vision and Pattern Recognition, CVPR 2014, Colum-
bus, OH, USA, June 23-28, 2014, pages 152–159.
Li, Z., Chang, S., Liang, F., Huang, T. S., Cao, L., and
Smith, J. R. (2013). Learning locally-adaptive deci-
sion functions for person verification. In 2013 IEEE
Conference on Computer Vision and Pattern Recog-
nition, Portland, OR, USA, June 23-28, 2013, pages
3610–3617.
Liao, S., Hu, Y., Zhu, X., and Li, S. Z. (2015). Person
re-identification by local maximal occurrence repre-
sentation and metric learning. In IEEE Conference
on Computer Vision and Pattern Recognition, CVPR
2015, Boston, MA, USA, June 7-12, 2015, pages
2197–2206.
Liu, K., Ma, B., Zhang, W., and Huang, R. (2015). A
spatio-temporal appearance representation for video-
based pedestrian re-identification. In 2015 IEEE Inter-
national Conference on Computer Vision, ICCV 2015,
Santiago, Chile, December 7-13, 2015, pages 3810–
3818.
Ma, B., Su, Y., and Jurie, F. (2012). Local descriptors en-
coded by fisher vectors for person re-identification. In
ECCV Workshops, volume 7583, pages 413–422.
Mahmoud Mejdoub, Salma Ksibi, Chokri Ben Amar, and Koubaa,
M. (2017). Person re-id while crossing different
cameras: Combination of salient-gaussian weighted
bossanova and fisher vector encodings. International
Journal of Advanced Computer Science and Applica-
tions(IJACSA), 8(9):399–410.
Mejdoub, M., Dammak, M., and Ben Amar, C. (2015a).
Extending laplacian sparse coding by the incorpora-
tion of the image spatial context. Neurocomputing,
166:44–52.
Mejdoub, M., Fonteles, L. H., Ben Amar, C., and Antonini,
M. (2008). Fast indexing method for image retrieval
using tree-structured lattices. In International Work-
shop on Content-Based Multimedia Indexing, CBMI
2008, London, UK, June 18-20, 2008, pages 365–372.
Mejdoub, M., Fonteles, L. H., Ben Amar, C., and Antonini,
M. (2009). Embedded lattices tree: An efficient in-
dexing scheme for content based retrieval on image
databases. J. Visual Communication and Image Rep-
resentation, 20(2):145–156.
Mejdoub, M. and Ben Amar, C. (2013). Classification im-
provement of local feature vectors over the KNN al-
gorithm. Multimedia Tools Appl., 64(1):197–218.
Mejdoub, M., Ben Aoun, N., and Ben Amar, C. (2015b).
Bag of frequent subgraphs approach for image classi-
fication. Intell. Data Anal., 19(1):75–88.
Messelodi, S. and Modena, C. M. (2015). Boosting
fisher vector based scoring functions for person re-
identification. Image Vision Comput., 44:44–58.
Othmani, M., Bellil, W., Ben Amar, C., and Alimi, A. M.
(2010). A new structure and training procedure for
multi-mother wavelet networks. IJWMIP, 8(1):149–
175.
Paisitkriangkrai, S., Shen, C., and van den Hengel, A.
(2015). Learning to rank in person re-identification
with metric ensembles. CoRR, abs/1503.01543.
Peng, X., Zou, C., Qiao, Y., and Peng, Q. (2014). Action
recognition with stacked fisher vectors. In European
Conference on Computer Vision (ECCV), pages 581–
595.
Ben Aoun, N., Mejdoub, M., and Ben Amar, C. (2014).
Graph-based approach for human action recognition
using spatio-temporal features. J. Visual Communica-
tion and Image Representation, 25(2):329–338.
M. El Arbi, Koubàa, M., Charfeddine, M., and Ben Amar,
C. (2011). A dynamic video watermarking algorithm
in fast motion areas in the wavelet domain. Multime-
dia Tools Appl., 55(3):579–600.
M. El Arbi, C. Ben Amar, and Nicolas, H. (2006). Video
watermarking algorithm based on neural network. In
IEEE International Conference on Multimedia and
Expo (ICME 2006), Toronto Ontario, Canada, July 9-
12, 2006, pages 1577–1580.
Sapienza, M., Cuzzolin, F., and Torr, P. H. S. (2014).
Learning discriminative space-time action parts from
weakly labelled videos. International Journal of Com-
puter Vision, 110(1):30–47.
Sekma, M., Mejdoub, M., and Amar, C. B. (2014). Spatio-
temporal pyramidal accordion representation for hu-
man action recognition. In IEEE International Con-
ference on Acoustics, Speech and Signal Processing,
ICASSP 2014, Florence, Italy, May 4-9, 2014, pages
1270–1274.
Sekma, M., Mejdoub, M., and Ben Amar, C. (2015a). Hu-
man action recognition based on multi-layer fisher
vector encoding method. Pattern Recognition Letters,
65:37–43.
Sekma, M., Mejdoub, M., and Ben Amar, C. (2015b).
Structured fisher vector encoding method for human
action recognition. In 15th International Conference
on Intelligent Systems Design and Applications, ISDA
2015, Marrakech, Morocco, December 14-16, 2015,
pages 642–647.
Simonyan, K., Vedaldi, A., and Zisserman, A. (2013). Deep
fisher networks for large-scale image classification. In
Advances in Neural Information Processing Systems
(NIPS), pages 163–171.
Sun, S. and Chen, Q. (2011). Hierarchical distance metric
learning for large margin nearest neighbor classifica-
tion. IJPRAI, 25(7):1073–1087.
Varior, R. R., Haloi, M., and Wang, G. (2016). Gated
siamese convolutional neural network architecture for
human re-identification. In Computer Vision - ECCV
2016 - 14th European Conference, Amsterdam, The
Netherlands, October 11-14, 2016, Proceedings, Part
VIII, pages 791–808.
Wali, A., N. Ben Aoun, Karray, H., Ben Amar, C., and Al-
imi, A. M. (2010). A new system for event detec-
tion from video surveillance sequences. In Advanced
Concepts for Intelligent Vision Systems - 12th Inter-
national Conference, ACIVS 2010, Sydney, Australia,
December 13-16, 2010, Proceedings, Part II, pages
110–120.
Wu, L., Shen, C., and van den Hengel, A. (2016a). Person-
net: Person re-identification with deep convolutional
neural networks. CoRR, abs/1601.07255.
Wu, L., Shen, C., and van den Hengel, A. (2017). Deep
linear discriminant analysis on fisher networks: A hy-
brid architecture for person re-identification. Pattern
Recognition, 65:238–250.
Wu, S., Chen, Y., Li, X., Wu, A., You, J., and Zheng, W.
(2016b). An enhanced deep feature representation for
person re-identification. In 2016 IEEE Winter Con-
ference on Applications of Computer Vision, WACV
2016, Lake Placid, NY, USA, March 7-10, 2016, pages
1–8.
Xiong, F., Gou, M., Camps, O. I., and Sznaier, M.
(2014). Person re-identification using kernel-based
metric learning methods. In Computer Vision - ECCV
2014 - 13th European Conference, Zurich, Switzer-
land, September 6-12, 2014, Proceedings, Part VII,
pages 1–16.
Zhang, L., Xiang, T., and Gong, S. (2016). Learning a dis-
criminative null space for person re-identification. In
2016 IEEE Conference on Computer Vision and Pat-
tern Recognition, CVPR 2016, Las Vegas, NV, USA,
June 27-30, 2016, pages 1239–1248.
Zhao, R., Ouyang, W., and Wang, X. (2013). Unsupervised
salience learning for person re-identification. In IEEE
Conference on Computer Vision and Pattern Recogni-
tion, pages 3586–3593.
Zheng, L., Bie, Z., Sun, Y., Wang, J., Su, C., Wang, S.,
and Tian, Q. (2016). Mars: A video benchmark for
large-scale person re-identification. In European Con-
ference on Computer Vision. Springer.
Zheng, L., Shen, L., Tian, L., Wang, S., Bu, J., and Tian, Q.
(2015). Person re-identification meets image search.
CoRR, abs/1502.02171.
Zhong, Z., Zheng, L., Cao, D., and Li, S. (2017). Re-
ranking person re-identification with k-reciprocal en-
coding. CoRR, abs/1701.08398.