CRN: End-to-end Convolutional Recurrent Network Structure Applied
to Vehicle Classification
Mohamed Ilyes Lakhal¹, Sergio Escalera² and Hakan Cevikalp³
¹ Queen Mary University of London, London, U.K.
² University of Barcelona and Computer Vision Center UAB, Barcelona, Spain
³ Eskisehir Osmangazi University, Eskisehir, Turkey
Keywords: Vehicle Classification, Deep Learning, End-to-end Learning.
Abstract: Vehicle type classification is considered a central part of Intelligent Traffic Systems. In recent years, deep learning methods have emerged as the state-of-the-art in many computer vision tasks. In this paper, we present a novel yet simple deep learning framework for the vehicle type classification problem. We propose an end-to-end trainable system that combines a convolutional neural network for feature extraction and a recurrent neural network as a classifier. The recurrent network structure is used to handle various types of feature inputs, and at the same time allows producing either a single class prediction or a set of class predictions. In order to assess the effectiveness of our solution, we have conducted a set of experiments on two public datasets, obtaining state-of-the-art results. In addition, we also report results on the newly released MIO-TCD dataset.
1 INTRODUCTION
The vehicle classification task is an important vision problem, with applications to illegal vehicle type recognition, traffic surveillance, and autonomous navigation, among others. Traffic surveillance camera systems are an essential component of an Intelligent Traffic System. They include automatic monitoring digital cameras that record high-resolution static images of passing vehicles and other moving objects (Tang et al., 2017). This source of information is highly valuable for data mining and pattern classification, and thus a good classification procedure is crucial to this task. Classical machine learning tools do not provide high-performance solutions in this case. On the other hand, deep learning techniques are the current state-of-the-art in many Computer Vision and Machine Learning problems. Following this trend, we address the vehicle type classification problem by combining a convolutional neural network (CNN) structure with recurrent neural networks (RNNs). The idea is simple and intuitive, and can be easily adapted to other application scenarios, such as multi-label learning, without any additional structure. To the best of our knowledge, this is the first attempt to provide an end-to-end deep learning solution combining a CNN and an RNN for the vehicle classification task.
To summarize, the main contributions of our paper are:
- Merging two deep learning models into a single framework.
- Learning rich high-dimensional feature vectors in an end-to-end fashion.
- The first attempt to apply such a model to the vehicle classification task, achieving state-of-the-art results on two datasets and highly competitive results on a large real-world traffic surveillance dataset.
The rest of the paper is structured as follows:
Section 2 provides a brief overview of the back-
ground literature on the topic, Section 3 describes our
CRN model, and the experimental results are given
in Section 4. Finally, we summarize our work with a
conclusion in Section 5.
2 RELATED WORK
Most of the vision-based methods for vehicle classi-
fication fall into two categories: model based met-
hods and appearance-based methods (Chen and El-
lis, 2011). In model based methods (Gupte et al.,
2002; Hsieh et al., 2006; Lai et al., 2001; Messelodi
et al., 2005; Zhang et al., 2012), geometric measure-
ments such as length, width, and height are used to
recover the vehicle’s 3D parameters. In (Nieto et al.,
2011), a 3D vehicle modeling approach has been proposed for
detection and classification by means of the integra-
tion of temporal information and model priors within
Figure 1: Proposed CRN model. First, the three-channel input image is processed through a series of convolutions and pooling. Then, after reaching the fifth stage, we flatten the resulting set of features into a single high-dimensional representation φ(I) ∈ R^f. Finally, we feed the feature vector to the LSTM that learns the corresponding class label.
a Markov Chain Monte Carlo (MCMC) framework. A 3D model, along with 3DHOG (an extension of the HOG (Dalal and Triggs, 2005) feature obtained by applying 3D spatial modelling), has been successfully applied to detect and classify vehicles (Buch et al., 2009).
The appearance-based methods (Hasegawa and Ka-
nade, 2005; Ma and Grimson, 2005; Zhang et al.,
2007) rely on the extraction of appearance features
(e.g., SIFT (Lowe, 2004), Sobel edges (Sobel, 1970))
from either frontal or side views of vehicle images to
classify them (Dong et al., 2014). A PCA-based, integrated vehicle classification framework is presented in (Zhang et al., 2006). It consists of segmenting and normalizing the vehicles from the input video stream; after this step, a PCA-based classifier (Eigenvehicle and PCA-SVM) is applied to the resulting segments.
In (Morris and Trivedi, 2006), the authors present
a tracking system with the ability to classify vehicles
into three classes. A 10-feature measurement vector
is extracted and its size is reduced by either princi-
pal component analysis (PCA) or linear discriminant
analysis (LDA), followed by a weighted K-nearest
neighbors (KNN) classifier. Another approach (Tang et al., 2017) uses a more hand-engineered solution called the Local Gabor Binary Pattern Histogram Sequence (Huang et al., 2011). In (Xiang et al., 2016), a surveillance-video-based vehicle classification method is presented. It uses local and structural features with sparse coding, and multi-scale spatial max pooling is applied to obtain more discriminative and representative features.
More recently, deep learning approaches have attracted attention in various computer vision tasks, including the vehicle classification problem, and many works have been proposed in this direction. In (Zhou et al., 2016), two methods are used: the first fine-tunes the AlexNet (Krizhevsky et al., 2012) architecture, while in the second the authors extract features from the fully connected layer of an AlexNet pre-trained on ImageNet (Russakovsky et al., 2014), followed by an SVM
as a classifier. In (He et al., 2015), the authors conducted a set of experiments to compare CNN features against other types of feature descriptors, but the experiments were conducted on a small subset of the ImageNet dataset. Moreover, a semi-unsupervised Convolutional Neural Network has been proposed in (Dong et al., 2014). The weights of the network are learned in an unsupervised manner via sparse filtering, while the final classifier is trained in a supervised way using the collected labeled dataset. The problem with such pre-training is that it does not scale well with large convolutional networks. In (Wang et al., 2016), deep learning is also applied to the traffic surveillance video problem: the authors first use a CNN detector to select region proposals, then obtain features through a fully connected network, and finally apply K-means to cluster those proposals. In (Jiang and Zhang, 2016), the authors propose to use a CNN for vehicle detection and recognition from video streams in a weakly-supervised manner. Research on multi-label classification using deep learning has also been conducted; e.g., (Huo et al., 2016) presents a Region-based CNN (RCNN) solution for the vehicle recognition problem.
In this paper, we present an appearance-based
vehicle type classification method. We combine CNN
and RNN into a single structure called CRN.
3 PROPOSED ARCHITECTURE
In this section, we describe the proposed CRN model. The framework offers a general way to approach the vehicle classification problem. We also provide details of a typical implementation of such a model, named ESOGU, and highlight the importance of the feature learning part of this model.
3.1 CRN Model
Most of the successful deep learning models for object recognition are built by stacking multiple convolutional layers and other operations such as batch normalization (Ioffe and Szegedy, 2015).
Figure 2: 3D scatter-plot of the features obtained from the test set of the BITVehicle dataset: (left) pre-trained features; (right) ESOGU features.
Moreover, some recent works on image cap-
tioning (Vinyals et al., 2017; Xu et al., 2015) have proven the effectiveness of using a recurrent neural network as a pipeline for handling different types of modalities (Vinyals et al., 2017). The idea is to use features extracted from a pre-trained deep model; for example, in (Vinyals et al., 2017) the authors consider GoogLeNet (Szegedy et al., 2014) features. While CNNs are the state-of-the-art models for image classification (He et al., 2016; Krizhevsky et al., 2012; Simonyan and Zisserman, 2014; Szegedy et al., 2014), we want a model that learns rich high-level features and, at the same time, benefits from the flexibility of the RNN. This key observation is the main motivation for our proposed solution. Our
CRN framework (see Figure 1) combines a convolutional neural network with a recurrent structure. This allows the RNN to act as a classifier while the features are learned in an end-to-end fashion by the convolutional neural network. It also gives us a very flexible framework to work with, i.e., the recurrent structure can handle variable-length inputs and produce variable-length outputs. As an example, to tackle multi-label image classification (Li et al., 2014), the RNN will learn to produce a sequence of labels without any further structure or processing added to the model. Also, on the input side, depending on the application, we can either work with a set of feature vectors from the convolutional layer or flatten them into a single vector.
In this study, we propose two implementations of the CRN framework: ESOGU and ESOGU_fc, where the difference lies only in the number of fully connected layers. In the ESOGU model, we directly flatten the last convolutional layer and give it as input to the RNN, whereas in the ESOGU_fc model we add extra fully connected layers to reduce the dimensionality of the features. Details of the architecture of the latter model are given in Table 1. The images are re-scaled to 224 × 224 pixels, which serve
Table 1: ESOGU_fc, implementation of the CRN model for traffic vehicle classification, where each ConvBlock corresponds to the sequence 'Conv-Conv-ReLU-Pool'.

Module | Layers      | Output Size
CNN    | Conv        | [224 × 224]
       | Conv        | [224 × 224]
       | Pool        | [112 × 112]
       | ConvBlock_1 | [56 × 56]
       | ConvBlock_2 | [28 × 28]
       | ConvBlock_3 | [14 × 14]
       | ConvBlock_4 | [7 × 7]
       | FC_1        | [25088]
       | FC_2        | [4096]
       | FC_3        | [4096]
RNN    | LSTM        | nb_classes
as input to our model. After the convolution stage, we flatten the last layer, which acts as a feature descriptor of size 4096 for ESOGU_fc and 25088 for ESOGU. Finally, we feed the descriptors to the recurrent network that performs the classification. In this framework, we choose to work with an LSTM (Hochreiter and Schmidhuber, 1997).
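To make the architecture concrete, the following is a minimal PyTorch-style sketch of the ESOGU variant, i.e., a convolutional stack whose 25088-dimensional flattened output is fed to an LSTM classifier. The channel widths follow a VGG-like layout consistent with Table 1, but the names and hyper-parameters (e.g., the LSTM hidden size) are illustrative assumptions rather than the exact training configuration.

import torch
import torch.nn as nn

class CRN(nn.Module):
    """Sketch of the CRN idea: a CNN feature extractor followed by an LSTM classifier."""
    def __init__(self, num_classes, hidden_size=512):  # hidden_size is an assumed value
        super().__init__()
        def conv_block(c_in, c_out):
            # one 'Conv-Conv-ReLU-Pool' block, as in Table 1
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, padding=1),
                nn.Conv2d(c_out, c_out, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2))
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1),   # 224 x 224
            nn.Conv2d(64, 64, 3, padding=1),  # 224 x 224
            nn.MaxPool2d(2),                  # 112 x 112
            conv_block(64, 128),              # 56 x 56
            conv_block(128, 256),             # 28 x 28
            conv_block(256, 512),             # 14 x 14
            conv_block(512, 512))             # 7 x 7  ->  512 * 7 * 7 = 25088
        self.rnn = nn.LSTM(input_size=25088, hidden_size=hidden_size, batch_first=True)
        self.fc_out = nn.Linear(hidden_size, num_classes)

    def forward(self, images):
        feats = self.cnn(images)               # (B, 512, 7, 7)
        feats = feats.flatten(1).unsqueeze(1)  # (B, 1, 25088): a single LSTM time step
        out, _ = self.rnn(feats)
        return self.fc_out(out[:, -1, :])      # class scores

# usage sketch: logits = CRN(num_classes=11)(torch.randn(4, 3, 224, 224))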
3.2 Feature Learning
It is known that the performance of a classifier depends heavily on the choice of the feature represen-
tation (Bengio et al., 2013). In (Donahue et al., 2014),
the authors have shown that features extracted from
the activation of a convolutional network trained in a
fully supervised fashion can be in fact used as a gene-
ric descriptor. Empirical validations have been carried
out on small standard benchmark object recognition
tasks, including Caltech-101 (Fei-Fei et al., 2007).
In this study, we further investigate the use of deep
features as a generic descriptor for object classifica-
tion task. Figure 2 shows a 3D-PCA projection of features extracted from a model pre-trained on ImageNet (Deng et al., 2009) and from a CRN model trained on the target set. In the CRN model, when working with only the RNN part on top of learned features, the fully connected layer is a good choice of feature extractor for small datasets. However, for large-scale real-world datasets, we argue that this choice is not appropriate, because such images exhibit a wide range of geometric shapes and illumination conditions that cannot be captured by a small feature vector. We thus have to work with other types of features, such as the upper convolutional layers of a CNN. This idea has been successfully applied to other problems such as action recognition (Sharma et al., 2015), where the authors use pre-trained convolutional features and train an attention model.
As can be seen from our architecture, using the convolutional layer without flattening as feature extractor is straightforward, since we only have to connect the last convolutional layer to the RNN directly. To support our hypothesis, we conduct two experiments on a more realistic dataset, the MIO-TCD. In the first, feature vectors are extracted directly from a trained CRN model, and we train an RNN to classify them. In the second experiment, convolutional features are obtained from the last convolutional layer of a model pre-trained on ImageNet (Deng et al., 2009), and an attention-based model is applied for classification. We find that the second model performs considerably better than the one that uses only a single vector as feature input. These results confirm our earlier hypothesis: for the CRN model, training on a set of features when the dataset is large helps to achieve better generalization.
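As a rough sketch of this idea (reusing the assumed 7 × 7 × 512 output of the convolutional stage from the sketch above), the last feature map can be reshaped into a set of 49 feature vectors of size 512, which the recurrent part consumes one spatial location per step instead of a single flattened 25088-dimensional vector; the attention-based variant mentioned above would additionally learn weights over these locations.

import torch
import torch.nn as nn

conv_map = torch.randn(8, 512, 7, 7)             # batch of last-layer CNN feature maps
tokens = conv_map.flatten(2).transpose(1, 2)     # (8, 49, 512): 49 vectors of size 512

rnn = nn.LSTM(input_size=512, hidden_size=256, batch_first=True)
out, _ = rnn(tokens)                             # the LSTM visits one spatial location per step
class_scores = nn.Linear(256, 11)(out[:, -1, :]) # predict from the final hidden state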
3.3 Loss Function
To train our model, we use the cross-entropy loss
function defined as follows:
L(X, Y) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right],

where y_i is the true label of the i-th sample, \hat{y}_i is the class probability predicted by the model, n is the number of training samples, X is the training set, and Y its corresponding labels.
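As a small numerical illustration of the loss above (not the authors' training code), the binary form can be evaluated directly; for the multi-class datasets used later, the same objective corresponds to the categorical cross-entropy over the network outputs.

import torch
import torch.nn.functional as F

# Binary case, matching the formula: y in {0, 1}, y_hat a predicted probability.
y = torch.tensor([1., 0., 1.])
y_hat = torch.tensor([0.9, 0.2, 0.7])
loss = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat)).mean()

# Multi-class case (e.g., the 11 MIO-TCD classes): cross-entropy over raw scores.
logits = torch.randn(3, 11)        # model outputs for 3 samples
labels = torch.tensor([0, 4, 10])  # integer class labels
loss_mc = F.cross_entropy(logits, labels)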
4 EXPERIMENTAL RESULTS
We have conducted a set of experiments on three da-
tasets: The Road vehicle dataset (Zhou et al., 2016),
Figure 3: Example images from the vehicle classification dataset: (a)-(b) Passenger; (c)-(d) Other.
Table 2: Results obtained on the Road vehicle dataset.

Method              | Acc_bal | Mean Recall | Cohen Kappa
Im-RNN              | 98.94%  | 98.94%      | 97.88%
ESOGU               | 97.92%  | 97.93%      | 95.76%
Fts-RNN             | 97.39%  | 97.39%      | 94.11%
(Zhou et al., 2016) | 97.35%  | -           | -
BIT-Vehicle (Dong et al., 2014), and the MIO-TCD dataset. Because of the limited number of samples in the first two datasets (the Road vehicle dataset (Zhou et al., 2016) and BIT-Vehicle (Dong et al., 2014)), we ran all the experiments using the ESOGU model, which has only one fully connected layer of dimension 25088. The two other models are Fts-RNN and Im-RNN. In the Fts-RNN model, we take vectors from the fully connected layer of the ESOGU model as descriptors, and the RNN is used as a classifier. The other model, Im-RNN, uses an RNN as a classifier on the features extracted by a pre-trained model (VGG-16 in this study). For all three benchmarks, we train ESOGU and ESOGU_fc from scratch, i.e., we do not have a pre-training phase on ImageNet. The experiments were carried out on a machine equipped with 32 GB of RAM and an NVIDIA GTX 1080 GPU with 8 GB of memory.
4.1 Metrics
To assess the proposed model, we use the following metrics: Accuracy, Mean Precision, Mean Recall, and Cohen Kappa. Definitions are given below:
Accuracy = \frac{TP + TN}{TP + TN + FP + FN}    (1)

Prec_i = \frac{TP}{TP + FP}    (2)

Rec_i = \frac{TP}{TP + FN}    (3)

MeanPrecision = \frac{1}{C} \sum_{i=1}^{C} Prec_i    (4)

MeanRecall = \frac{1}{C} \sum_{i=1}^{C} Rec_i    (5)
Figure 4: Samples from the MIO-TCD: (a) Articulated truck; (b) Background; (c) Bicycle; (d) Bus; (e) Car; (f) Motorcycle;
(g) Non-motorized vehicle; (h) Pedestrian; (i) Pickup truck; (j) Single unit truck; (k) Work van.
where TP is true positives, FN false negatives, TN true negatives, FP false positives, Rec_i is the per-class recall, and C is the total number of classes.
Cohen kappa (Cohen, 1960) is a statistic that measures inter-annotator agreement, as defined in Equation 6, where p_o is the empirical probability of agreement with the label assigned to any sample, and p_e is the expected agreement when both annotators assign labels randomly:

κ = \frac{p_o - p_e}{1 - p_e}    (6)
This function is used on a classification problem, and
the obtained scores express the level of agreement be-
tween two annotators.
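For reference, all of the above can be computed from a confusion matrix. The sketch below (using numpy and the scikit-learn implementation of Cohen's kappa) only illustrates the definitions, generalizing Equation 1 to its multi-class form; it is not the evaluation script used for the reported numbers.

import numpy as np
from sklearn.metrics import confusion_matrix, cohen_kappa_score

def report(y_true, y_pred, num_classes):
    cm = confusion_matrix(y_true, y_pred, labels=range(num_classes))
    tp = np.diag(cm).astype(float)
    prec = tp / np.maximum(cm.sum(axis=0), 1)   # per-class precision, Eq. (2)
    rec = tp / np.maximum(cm.sum(axis=1), 1)    # per-class recall, Eq. (3)
    return {
        "accuracy": tp.sum() / cm.sum(),        # multi-class form of Eq. (1)
        "mean_precision": prec.mean(),          # Eq. (4)
        "mean_recall": rec.mean(),              # Eq. (5)
        "cohen_kappa": cohen_kappa_score(y_true, y_pred),  # Eq. (6)
    }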
4.2 Road Vehicle Dataset
The Road vehicle dataset (Zhou et al., 2016) (see Figure 3) contains images taken by a static camera along an expressway. The original dataset contains 300 images of vehicles on multiple lanes; after some pre-processing, 983 images are obtained. Among these, 940 are valid images, i.e., images that contain a whole vehicle, and 43 are invalid images, most of which contain overlapping vehicles. The dataset can be used for either vehicle detection or classification. For the classification task, there are two classes: the passenger class and the other class. The passenger class includes SUVs and MPVs, whereas the other class contains vans, trucks, and other vehicle types. After some initial processing, i.e., cropping and segmentation, the classification dataset contains 1,442 images for the passenger class and 985 for the other class.
In order to compare our model against other state-of-the-art methods, we employ the metric defined in Equation 7, as suggested in (Zhou et al., 2016):

Acc_bal = \frac{1}{2}\left(\frac{Correct(pass)}{Size(pass)} + \frac{Correct(other)}{Size(other)}\right)    (7)

where Correct(pass) represents the number of correct predictions in the passenger class, and Size(pass) is its size.
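Equation 7 is straightforward to evaluate from per-class counts; a minimal illustration with hypothetical counts follows.

# Hypothetical counts of correct predictions and class sizes on a test split.
correct = {"passenger": 1380, "other": 912}
size = {"passenger": 1442, "other": 985}

acc_bal = (correct["passenger"] / size["passenger"]
           + correct["other"] / size["other"]) / 2
print(f"Acc_bal = {acc_bal:.4f}")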
Figure 5: ROC over the test set for the Road vehicle dataset.
Figure 6: Samples from the BITVehicle dataset: (a) Bus;
(b) Microbus; (c) Minivan; (d) Sedan; (e) SUV; (f) Truck.
The test results are shown in Table 2. The performance of ESOGU was slightly below that of the Im-RNN model. This is explained by the fact that the dataset is fairly small and we did not use data augmentation. Figure 5 shows the ROC curves for the Im-RNN and Fts-RNN models. Another remark is that, due to the limited number of train/test samples, the results are in general close to each other.
4.3 BIT-Vehicle
The BIT-Vehicle dataset (Dong et al., 2014) is a classification dataset that contains 900 vehicle images divided into six categories: Bus, Microbus, Minivan, Sedan, SUV, and Truck (see Figure 6). Each category contains 150 image samples of either 1600 × 1200 or 1920 × 1080 pixels. The images exhibit changes in illumination conditions, scale, vehicle surface color, and
Table 3: Model results over the BIT-Vehicle test set.

Method              | Accuracy | Mean Recall | Cohen Kappa
Fts-RNN             | 93.40%   | 88.01%      | 89.16%
Im-RNN              | 93.20%   | 88.73%      | 88.74%
(Dong et al., 2014) | 92.89%   | -           | -
ESOGU               | 92.08%   | 86.18%      | 86.95%

Table 4: Classification results on the MIO-TCD challenge.

Method    | Mean Recall | Accuracy | Cohen Kappa | Mean Precision
VGG16_FT  | 85.02%      | 96.16%   | 94.03%      | 88.02%
ESOGU_fc  | 84.77%      | 93.62%   | 90.24%      | 79.37%
ESOGU     | 83.86%      | 93.74%   | 90.37%      | 77.72%
AlexNet   | 75.83%      | 93.30%   | 89.57%      | 77.29%
Figure 7: Per-class precision for the three described models
on the BITVehicle dataset.
viewpoint. This dataset was captured at different pla-
ces and different times.
On this dataset, our three models generalize well on the test set. In Table 3 we give the accuracy, mean recall, and Cohen Kappa score. Figure 7 shows the per-class precision of each model. Again, the general performance of the presented solutions is similar.
4.4 MIO-TCD Dataset
The MIO-TCD dataset (http://podoce.dinf.usherbrooke.ca/challenge/dataset/) consists of a total of 786,702 images, 648,959 in the classification dataset and 137,743 in the localization dataset, acquired at different times of the day and different periods of the year by thousands of traffic cameras deployed all over Canada and the United States. The dataset is thus divided into two parts, the "classification dataset" and the "localization dataset". For the classification task, the dataset is divided into 11 classes, as shown in Figure 4. It is worth noticing that for the classification task the original set is highly unbalanced, that is, the class samples are not equally distributed. For example, there are 260,518 samples for the most represented class (the car class) and 1,751 examples for the under-represented
one (the non-motorized vehicle class). So, if we feed the dataset directly, as it is, to our classifier, we would learn only models of the classes that have the largest number of training samples. To overcome this, we use data augmentation (Krizhevsky et al., 2012) only on the under-represented classes in order to have more examples for the training phase. Also, for each epoch we draw random samples per class of a fixed size, as shown below:
Batch[class_i] = \{ z_{ij} \mid z_{ij} \sim R(X_i),\ 1 \le i \le C,\ 1 \le j \le m \}
Here, we denote by z_{ij} the randomly chosen element, R(.) the distribution that returns an element from a set where each element has the same probability of being chosen, C the number of classes (11 in our case), and m a fixed integer which represents the size of each class per epoch.
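A minimal sketch of this per-epoch balanced sampling (assuming per-class lists of sample indices after augmenting the under-represented classes; the fixed per-class size m is a parameter) could look as follows.

import random

def epoch_batches(samples_per_class, m):
    """samples_per_class: dict mapping class name -> list of sample indices
    (after augmentation of the under-represented classes).
    Returns a shuffled, class-balanced set of (class, sample) pairs for one epoch."""
    epoch = []
    for cls, samples in samples_per_class.items():
        # R(X_i): draw m samples from class i, each with equal probability
        drawn = (random.choices(samples, k=m) if len(samples) < m
                 else random.sample(samples, m))
        epoch.extend((cls, s) for s in drawn)
    random.shuffle(epoch)
    return epoch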
Table 4 shows our results compared to some of the submissions; we refer the reader to the official ranking (http://podoce.dinf.usherbrooke.ca/results/classification) for a complete comparison. Here we denote by VGG16_FT the fine-tuned VGG16. Our method was able to achieve competitive results, with good generalization to unseen, challenging and realistic data.
5 CONCLUSIONS
We have introduced a simple yet robust deep architecture for the vehicle classification problem. Our solution differs from other state-of-the-art approaches in the sense that we propose to use a recurrent neural network as the classifier, while the feature learning part is performed by a CNN. The whole system is trained in an end-to-end manner. We have demonstrated that deep features are a proper choice of representation. We finally suggested an extension of this framework when
dealing with more challenging datasets and supported it with further experiments. The extension is to use M feature vectors of size p as input to the recurrent structure instead of a single feature vector. The proposed framework can easily adapt itself to other scenarios like multi-label image classification without adding extra layers to the network architecture.
ACKNOWLEDGEMENTS
This work has been partially supported by the Spa-
nish project TIN2016-74946-P (MINECO/FEDER,
UE) and CERCA Programme / Generalitat de Cata-
lunya.
REFERENCES
Bengio, Y., Courville, A., and Vincent, P. (2013). Represen-
tation learning: A review and new perspectives. IEEE
Trans. Pattern Anal. Mach. Intell., 35(8):1798–1828.
Buch, N., Orwell, J., and Velastin, S. A. (2009). 3d extended
histogram of oriented gradients (3dhog) for classifica-
tion of road users in urban scenes. In Proceedings of
the British Machine Vision Conference, pages 15.1–
15.11. BMVA Press. doi:10.5244/C.23.15.
Chen, Z. and Ellis, T. (2011). Multi-shape descriptor vehi-
cle classification for urban traffic. In 2011 Internatio-
nal Conference on Digital Image Computing: Techni-
ques and Applications, pages 456–461.
Cohen, J. (1960). A coefficient of agreement for nominal
scales. Educational and Psychological Measurement,
20(1):37–46.
Dalal, N. and Triggs, B. (2005). Histograms of oriented
gradients for human detection. In Proceedings of the
2005 IEEE Computer Society Conference on Compu-
ter Vision and Pattern Recognition (CVPR’05) - Vo-
lume 1 - Volume 01, CVPR ’05, pages 886–893, Wa-
shington, DC, USA. IEEE Computer Society.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-
Fei, L. (2009). ImageNet: A Large-Scale Hierarchical
Image Database. In CVPR09.
Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N.,
Tzeng, E., and Darrell, T. (2014). Decaf: A deep con-
volutional activation feature for generic visual recog-
nition. In International Conference in Machine Lear-
ning (ICML).
Dong, Z., Pei, M., He, Y., Liu, T., Dong, Y., and Jia, Y.
(2014). Vehicle type classification using unsupervi-
sed convolutional neural network. In 2014 22nd In-
ternational Conference on Pattern Recognition, pages
172–177.
Fei-Fei, L., Fergus, R., and Perona, P. (2007). Lear-
ning generative visual models from few training ex-
amples: An incremental bayesian approach tested on
101 object categories. Comput. Vis. Image Underst.,
106(1):59–70.
Gupte, S., Masoud, O., Martin, R. F., and Papanikolopou-
los, N. P. (2002). Detection and classification of vehi-
cles. Trans. Intell. Transport. Sys., 3(1):37–47.
Hasegawa, O. and Kanade, T. (2005). Type classification,
color estimation, and specific target detection of mo-
ving targets on public streets. Machine Vision and Ap-
plications, 16(2):116–121.
He, D., Lang, C., Feng, S., Du, X., and Zhang, C. (2015).
Vehicle detection and classification based on convolu-
tional neural network. In Proceedings of the 7th In-
ternational Conference on Internet Multimedia Com-
puting and Service, ICIMCS ’15, pages 3:1–3:5, New
York, NY, USA. ACM.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep re-
sidual learning for image recognition. In The IEEE
Conference on Computer Vision and Pattern Recogni-
tion (CVPR).
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term
memory. Neural Comput., 9(8):1735–1780.
Hsieh, J.-W., Yu, S.-H., Chen, Y.-S., and Hu, W.-F.
(2006). Automatic traffic surveillance system for vehi-
cle tracking and classification. IEEE Transactions on
Intelligent Transportation Systems, 7(2):175–187.
Huang, D., Shan, C., Ardabilian, M., Wang, Y., and Chen,
L. (2011). Local binary patterns and its application to
facial image analysis: A survey. IEEE Transactions on
Systems, Man, and Cybernetics, Part C (Applications
and Reviews), 41(6):765–781.
Huo, Z., Xia, Y., and Zhang, B. (2016). Vehicle type classi-
fication and attribute prediction using multi-task rcnn.
In 2016 9th International Congress on Image and Sig-
nal Processing, BioMedical Engineering and Infor-
matics (CISP-BMEI), pages 564–569.
Ioffe, S. and Szegedy, C. (2015). Batch normalization:
Accelerating deep network training by reducing in-
ternal covariate shift. In Proceedings of the 32nd In-
ternational Conference on Machine Learning, ICML
2015, Lille, France, 6-11 July 2015, pages 448–456.
Jiang, C. and Zhang, B. (2016). Weakly-supervised vehi-
cle detection and classification by convolutional neu-
ral network. In 2016 9th International Congress on
Image and Signal Processing, BioMedical Engineer-
ing and Informatics (CISP-BMEI), pages 570–575.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012).
Imagenet classification with deep convolutional neu-
ral networks. In Pereira, F., Burges, C. J. C., Bottou,
L., and Weinberger, K. Q., editors, Advances in Neu-
ral Information Processing Systems 25, pages 1097–
1105. Curran Associates, Inc.
Lai, A. H. S., Fung, G. S. K., and Yung, N. H. C.
(2001). Vehicle type classification from visual-based
dimension estimation. In ITSC 2001. 2001 IEEE In-
telligent Transportation Systems. Proceedings (Cat.
No.01TH8585), pages 201–206.
Li, X., Zhao, F., and Guo, Y. (2014). Multi-label image
classification with a probabilistic label enhancement
model. In Proceedings of the Thirtieth Conference on
Uncertainty in Artificial Intelligence, UAI’14, pages
430–439, Arlington, Virginia, United States. AUAI
Press.
Lowe, D. G. (2004). Distinctive image features from scale-
invariant keypoints. Int. J. Comput. Vision, 60(2):91–
110.
Ma, X. and Grimson, W. E. L. (2005). Edge-based rich
representation for vehicle classification. In Tenth
IEEE International Conference on Computer Vision
(ICCV’05) Volume 1, volume 2, pages 1185–1192 Vol.
2.
Messelodi, S., Modena, M., and Zanin, M. (2005). A com-
puter vision system for the detection and classification
of vehicles at urban road intersections. Pattern Anal.
Appl., 8(1):17–31.
Morris, B. and Trivedi, M. (2006). Improved vehicle clas-
sification in long traffic video by cooperating tracker
and classifier modules. In 2006 IEEE International
Conference on Video and Signal Based Surveillance,
pages 9–9.
Nieto, M., Unzueta, L., Cortes, A., Barandiaran, J., Otaegui,
O., and Sanchez, P. (2011). Real-time 3d modeling of
vehicles in low-cost mono camera systems. In Proc.
Int. Conf. on Computer Vision Theory and Applicati-
ons VISAPP, pages 459–464.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S.,
Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bern-
stein, M. S., Berg, A. C., and Li, F. (2014). Image-
net large scale visual recognition challenge. CoRR,
abs/1409.0575.
Sharma, S., Kiros, R., and Salakhutdinov, R. (2015).
Action recognition using visual attention. CoRR,
abs/1511.04119.
Simonyan, K. and Zisserman, A. (2014). Very deep con-
volutional networks for large-scale image recognition.
CoRR, abs/1409.1556.
Sobel, I. E. (1970). Camera Models and Machine Percep-
tion. PhD thesis, Stanford, CA, USA. AAI7102831.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S. E., An-
guelov, D., Erhan, D., Vanhoucke, V., and Rabinovich,
A. (2014). Going deeper with convolutions. CoRR,
abs/1409.4842.
Tang, Y., Zhang, C., Gu, R., Li, P., and Yang, B. (2017).
Vehicle detection and recognition for intelligent traffic
surveillance system. Multimedia Tools and Applicati-
ons, 76(4):5817–5832.
Vinyals, O., Toshev, A., Bengio, S., and Erhan, D.
(2017). Show and tell: Lessons learned from the 2015
MSCOCO image captioning challenge. IEEE Trans.
Pattern Anal. Mach. Intell., 39(4):652–663.
Wang, S., Liu, F., Gan, Z., and Cui, Z. (2016). Vehicle type
classification via adaptive feature clustering for traffic
surveillance video. In 2016 8th International Confe-
rence on Wireless Communications Signal Processing
(WCSP), pages 1–5.
Xiang, Z. Q., Huang, X. L., and Zou, Y. X. (2016). An
effective and robust multi-view vehicle classification
method based on local and structural features. In 2016
IEEE Second International Conference on Multimedia
Big Data (BigMM), pages 68–73.
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhu-
dinov, R., Zemel, R., and Bengio, Y. (2015). Show,
attend and tell: Neural image caption generation with
visual attention. In Blei, D. and Bach, F., editors, Pro-
ceedings of the 32nd International Conference on Ma-
chine Learning (ICML-15), pages 2048–2057. JMLR
Workshop and Conference Proceedings.
Zhang, C., Chen, X., and bang Chen, W. (2006). A pca-
based vehicle classification framework. In 22nd Inter-
national Conference on Data Engineering Workshops
(ICDEW’06), pages 17–17.
Zhang, L., Li, S. Z., Yuan, X., and Xiang, S. (2007). Real-
time object classification in video surveillance based
on appearance learning. In 2007 IEEE Conference on
Computer Vision and Pattern Recognition, pages 1–8.
Zhang, Z., Tan, T., Huang, K., and Wang, Y. (2012).
Three-dimensional deformable-model-based localiza-
tion and recognition of road vehicles. IEEE Transacti-
ons on Image Processing, 21(1):1–13.
Zhou, Y., Nejati, H., Do, T. T., Cheung, N. M., and Cheah,
L. (2016). Image-based vehicle analysis using deep
neural network: A systematic study. In 2016 IEEE In-
ternational Conference on Digital Signal Processing
(DSP), pages 276–280.