A Unified Framework for Coarse-to-Fine Recognition of Traffic Signs
using Bayesian Network and Visual Attributes
Hamed Habibi Aghdam, Elnaz Jahani Heravi and Domenec Puig
Department of Computer Engineering and Mathematics, University Rovira i Virgili, Tarragona, Spain
Keywords:
Traffic Sign Recognition, Visual Attributes, Bayesian Network, Most Probable Explanation, Sparse Coding.
Abstract:
Recently, impressive results have been reported for recognizing traffic signs. Yet, these methods are still far from
real-world applications. To the best of our knowledge, all methods in the literature have focused on numerical
results rather than applicability. First, they are not able to deal with novel inputs such as the false-positive
results of the detection module. In other words, if the input of these methods is a non-traffic-sign image, they
will classify it into one of the traffic sign classes. Second, adding a new sign to the system requires retraining
the whole system. In this paper, we propose a coarse-to-fine method using visual attributes that is easily
scalable and, importantly, is able to detect novel inputs and transfer its knowledge to newly observed
samples. To correct the misclassified attributes, we build a Bayesian network that considers the dependencies
between the attributes and find their most probable explanation given the observations. Experimental results
on the benchmark dataset indicate that our method outperforms the state-of-the-art methods and it also
possesses three important properties: novelty detection, scalability and the provision of semantic information.
1 INTRODUCTION
Traffic sign detection and recognition is one of the
major tasks in advanced driver assistance systems and
intelligent cars. A traffic sign detection and recognition
system is composed of two modules, namely detection
and recognition. The input of the detection
module is the image of the scene and its output is the
set of image areas that contain a traffic sign. Then,
the recognition module analyses the images of these
areas and recognizes the type of each traffic sign.
One of the important characteristics of traffic signs
is their design simplicity, which facilitates their detection
and recognition by a human driver. First, they
have a simple geometric shape such as a circle, triangle,
polygon or rectangle. Second, they are distinguishable
from most objects in the scene by their
color. To be more specific, traffic signs are usually
composed of a few basic colors such as red, green,
blue, black, white and yellow. Finally, the meaning of
the traffic sign is conveyed by the pictograph in its
center. Even though this design is clear and discriminative
for a human, there are challenging problems
in real-world applications, such as shadows, camera
distance, weather conditions, perspective and the age
of the sign, that need to be addressed by traffic sign
detection and recognition systems.
Moreover, two difficulties must be
tackled by the recognition module in real-world
applications. First, traffic sign recognition is a
multi-category classification problem that can include
hundreds of classes. Second, given that the detection
module is likely to produce some false-positive outputs,
the recognition module must discard these false-positive
inputs. In other words, the recognition module must
deal with novel inputs that have not been observed
during training.
To the best of our knowledge, most work on
the recognition module has focused only on increasing
the performance of the system under more realistic
conditions and on a limited number of classes.
Further, none of the methods in the literature has
tried to recognize traffic signs in a coarse-to-fine
fashion. Despite the impressive results obtained
by different groups in the German traffic sign benchmark
competition (Stallkamp et al., 2012), all of these
methods suffer from some common problems.
First, none of the methods in the literature is able
to deal with novel inputs. For example, given the image
of a non-traffic-sign object (e.g. a false-positive result
of the detection module), the state-of-the-art methods
classify the novel input into one of the traffic sign
classes. Second, they are not easily scalable. On the
one hand, adding a new class to the recognition
module might require retraining the whole system. On
the other hand, they use the conventional classification
approach, in which all classes are assumed to be
well separated in the same feature space and, under
this assumption, a single model is trained for all
classes. While this assumption may hold for a small
number of classes, it is probable that classes will
overlap as their number increases.
Third, they do not take into account the
attributes of the traffic signs.
Attributes are high-level concepts that provide
useful information about objects. For example,
if we observe that the input image “has red
margin” and “is triangle” and its pictograph depicts
an object that “is pointing to the left”, then with high
probability the input image is a “dangerous curve to
the right” traffic sign. In this case, we could recognize
the traffic sign using three attributes. As a
second example, assume the attributes “has red rim”,
“is circle” and “contains a two-digit number” have
been observed. These attributes reveal that the input
image shows a “speed limit” traffic sign. Considering
that there are at most 10 speed limit traffic
signs, we only need to do a 10-class classification instead
of a hundreds-class classification (we consider that
there are at most 100 traffic signs to be recognized) if
we observe the mentioned attributes before the final
classification. In sum, we believe a successful and
applicable traffic sign recognizer must have the following
characteristics: 1) the cost of adding a new class to the
system should be low (scalability); 2) novel inputs must
be rejected; and 3) it should follow a coarse-to-fine
classification approach.
In this paper, we propose a coarse-to-fine method
for recognizing a large number of traffic signs with
the ability to identify novel inputs. In addition, adding
a new class to the system requires updating only a few
models instead of the whole system. It should be
noted that our goal is not to notably improve the numerical
results of the state-of-the-art methods, since the
current performance is 99%, but to propose a more
scalable and applicable method with better performance
which is also able to detect novel inputs
and provide high-level information about
any input. To achieve this goal, we first do a coarse
classification of the input image using semantic visual
attributes and classify it into one of the possible
object categories. Then, a fine-grained classification
is done on the objects of the detected category.
However, because the attributes of the object are
detected using one-versus-all classifiers, it is possible
that some attributes of the object are not detected and
some irrelevant attributes are detected for the same
object. To deal with this problem, we take into account
the correlation between the different attributes
as well as the uncertainty in the observations and build
a Bayesian network. Next, we enter our observations
into the Bayesian network and select the most probable
explanation of the attributes. Finally, the refined
attributes are used to find the category of the traffic
sign or to ascertain whether it is a novel input.
Contribution: one of the important aspects of the
proposed method is that all objects in the same category
share the same attributes. For example, all speed
limit traffic signs are circular, have a red rim and contain
a two-digit or three-digit number. In our proposed
method, the input image is in the category of
speed limit traffic signs only if it possesses all three of
these attributes. Otherwise, it does not belong to this
category. Using this property, we are able to identify
novel inputs. More precisely, if the input image
does not belong to any of the coarse categories,
it is classified as a novel input. Our second contribution
is proposing a scalable method. This means
that the proposed framework can be effectively extended
to hundreds of classes. Our third contribution
is dividing the hundreds of classes into fewer categories
and building separate fine-grained classifiers
for every category. For instance, the category “speed
limit” may contain 10 classes, including 8 signs with
different two-digit numbers and 2 signs with different
three-digit numbers. Clearly, there are subtle differences
between these signs. For example, the traffic
sign “speed limit: 70Km/h” is visually very similar
to the “speed limit: 20Km/h” sign. As a result,
the classification approach must take into account these
subtle differences rather than more abstract characteristics.
Another advantage of dividing the problem
into smaller problems is that, when adding a
new sign to the system, we need to find its relevant
category and update only the classification model of
this category. Last but not least, in the case that
our system cannot find the category of the object or it
is not confident about the classification result, it provides
more abstract semantic information which can
be fused with context and temporal information
for inference.
The rest of this paper is organized as follows: Section
2 reviews the state-of-the-art methods for recognizing
traffic signs as well as methods for detecting
the attributes of objects. Then, the proposed method
is described in Section 3, where we explain the feature
extraction method and the Bayesian network model.
Next, we show the experimental results in Section 4
and, finally, the paper concludes in Section 5.
VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications
88
2 RELATED WORK
Traffic sign recognition has been extensively studied
and some impressive results in uncontrolled environments
have been reported. In general, the methods for
recognizing traffic signs can be divided into three
categories, namely template matching, classification
and deep networks.
Template Matching: in early works, a traffic
sign is considered a rigid and well-defined object
whose image is stored in a database. Then, a
new input image is compared with all the templates in
the database to find the best match. The methods
based on template matching usually differ in terms of
the similarity measure or template selection. Obviously,
these methods are not stable and accurate in uncontrolled
environments. For more details the reader can
refer to (Piccioli et al., 1996) and (Paclik et al., 2006).
Classification: recently, classification approaches
have achieved highly accurate results on
more realistic databases. These approaches consist
of two major stages. In the first stage, features of
the image are extracted and, then, they are classified
using machine learning approaches. Stallkamp et
al. (Stallkamp et al., 2012) achieved 95% classification
accuracy on the German traffic sign benchmark
database (Houben et al., 2013) by extracting HOG
features and classifying the images into 43 classes
using linear discriminant analysis. Zaklouta and
Stanciulescu (Zaklouta and Stanciulescu, 2011)-(Zaklouta
and Stanciulescu, 2014) extracted the same
HOG features on the same database as in (Stallkamp
et al., 2012) and classified them using a random
forest model. They increased the performance
up to 97.2%. Similarly, Sun et al. (Sun et al.,
2014) utilized the extreme learning machine method
for classification of the HOG features and achieved
97.19% accuracy on the same database.
In another study, Maldonado-Bascon et al. (Maldonado-
Bascon et al., 2007) (Bascón et al., 2010) recognized
traffic signs by recognizing their pictographs using
support vector machines. Most recently, Liu et al.
(Liu et al., 2014) extracted SIFT features of the
image after transforming it to the log-polar coordinate
system and found the visual words using k-means
clustering. Then, the feature vectors were obtained
using a novel sparse coding method and, finally, the
traffic signs were recognized using support vector
machines.
Different from the previous approaches, Wang et
al. (Wang et al., 2013) employed a two-step classification.
In the first step, the input image is classified
into 5 super-classes using HOG features and a support
vector machine. In the second stage, the final classification
is done using HOG features and a support vector
machine after applying a perspective adjustment to the
image that takes into account the information from the
super-class. For more detailed information about
classification-based methods the reader can refer to
(Mogelmose et al., 2012).
Deep Networks: deep networks have outperformed
human performance by correctly classifying more than
99% of the images. Ciresan et al. (Cireşan
et al., 2012) (Ciresan et al., 2011) developed a bank of
7-layer deep networks whose inputs are transformed
versions of the input image. In addition, Sermanet and
LeCun (Sermanet and LeCun, 2011) proposed a 7-layer
deep network for recognizing traffic signs
and obtained 99% accuracy in their experiments.
Discussion: despite the impressive results
achieved by both deep networks and classification
methods, they are still far from real applications.
First, a deep network is slow and cannot
currently be used in real-time applications. Second,
finding the optimal structure of a deep network is a
time-consuming task that depends on the number of
classes. In other words, if the number of
classes changes, the whole network needs to be trained
again. Third, neither deep networks nor the above
classification methods are able to deal with novel
inputs; they will classify every input image into
one of the traffic sign classes. To address all these
problems, in this paper we formulate the traffic
sign recognition problem in terms of visual attributes
and fine-grained classification.
Visual attributes were first proposed by Ferrari
and Zisserman (Ferrari and Zisserman, 2007) and
have since been successfully used for defining
objects (Russakovsky and Fei-Fei, 2012). Cheng
and Tan (Cheng and Tan, 2014) classified flowers
by learning attributes using sparse representation.
Farhadi et al. (Farhadi et al., 2009) described
objects using semantic and discriminative attributes.
Semantic attributes are more comprehensible and they
are the ones which humans use to describe objects.
They can include shape, material and parts. In contrast,
discriminative attributes are the ones that do
not have a specific meaning for humans but are
utilized to better separate the objects. One important
advantage of visual attributes is their ability to
transfer knowledge to new classes of objects
and learn them without examples. This is called zero-shot
learning and it is illustrated in fig.1. Here, 7 different
attributes are learned and can be identified
in the input images. As shown in this figure, by
detecting the correct attributes of the input image we
are able to recognize 11 signs without observing them
during the training phase.
AUnifiedFrameworkforCoarse-to-FineRecognitionofTrafficSignsusingBayesianNetworkandVisualAttributes
89
Figure 1: Zero-shot learning using a set of attributes.
This is an important property which can help us
extend our models with little effort by transferring
knowledge from observed classes to new classes
(Rohrbach et al., 2010) (Lampert et al., 2009).
3 PROPOSED METHOD
Traffic sign recognition is a multi-category classification
problem with hundreds of classes. Also, it is
not trivial to collect a large number of real-world images
of every sign. Further, some signs occur more
frequently than others. For example, it is more
probable to see a “curve” sign than a “beware
of snow” sign. For this reason, the collected
database might be highly unbalanced. Consequently,
the models trained for the signs with fewer data can be
less accurate than those with more data. One feasible
remedy to this problem is to update the models
over time. However, if we build a single model for
the classification of all signs, re-training this model
will be a time-consuming task. But, if we can group the
N traffic signs into M < N categories (a category may
contain more than one traffic sign), then we can
train a different model for each category and, in the
case of adding new signs, we only need to find the relevant
category and re-train the model of that category.
On the other hand, temporal information plays an
important role in the human inference system. For example,
if we observe a “no passing” sign at time t1, we
expect to see an “end of no passing zone” sign after a while,
at time t2. Assume the sign “end of no passing zone”
is impaired because of its age and it is hard to see its
pictograph and the stripes crossing it. In this case, if
we follow the classification approaches mentioned
in the previous section, the “end of no passing
zone” sign can be incorrectly classified. However, if
we provide some more abstract information, such as
“the input image has a circular shape and black-white
color”, the traffic sign recognition system can infer
that the image is related to the previously observed
“no passing” sign. Hence, it probably indicates the
“end of no passing zone” traffic sign.
In this paper, we propose a coarse-to-fine classification
approach using the semantic attributes of the
object. Fig.2 shows the overview of the proposed
algorithm. In the first stage, the image is divided
into several regions and each region is coded using a
sparse coding method. Then, the feature vector is obtained
by concatenating the locally pooled coded vectors
(Section 3.1). Next, the feature vector is fed to
each of the attribute classifiers and the classification
score of each attribute is computed (Section
3.2). Then, the actual state of each attribute
is estimated by plugging the scores into the Bayesian
network and calculating the most probable explanation
of the attributes (Section 3.3). In the next step,
the category of the image is found using the attribute
configuration (Section 3.4). Once the sign category is
found, the fine-grained classifier of this category is
used to perform the final classification (Section 3.5).
3.1 Feature Extraction
In order to train the attribute classifiers, we first need
to extract the features of the traffic sign. The extracted
features must be able to encode the color, the shape and
the content of the traffic sign in the same vector. One
of the characteristics of traffic signs is that they are
rigid and their geometrical features (e.g. shape, size
and orientation) as well as their appearance (e.g. color
and content) remain relatively unchanged. From this
point of view, a simple template matching approach
could be useful for the recognition task. However,
important issues such as motion blur, weather conditions
and occlusion cause the template matching approach
to fail.
Nonetheless, it is possible to divide the image of
the traffic sign into smaller blocks and learn the most
dominant exemplars of each block independently.
Then, we can reconstruct the original block by linearly
combining the exemplars. This is the idea behind
the sparse coding approach (Lee et al., 2007). More
specifically, as shown in fig.3, we divide the input
image into 5 different regions and each region is
divided into a few smaller blocks. For example, the
region indicated by number 1 is divided into 3 blocks.
Then, in order to learn the templates of region r,
we first collect the images of the blocks of this region
from all training images and, then, learn the most
dominant exemplars by solving the following equation:
$$\min_{D^r,\ \alpha^r}\ \frac{1}{n}\sum_{i=1}^{n}\frac{1}{2}\left\|x^r_i - D^r\alpha^r_i\right\|_2^2 \quad \text{subject to}\quad \left\|\alpha^r_i\right\|_1 \le \lambda \qquad (1)$$
In this equation, $x^r_i \in \mathbb{R}^M$ is an $M$-dimensional vector representing the RGB values of a block in region $r$, $D^r \in \mathbb{R}^{M\times K}$ is a matrix storing the $K$ dominant templates of region $r$ in the training images, $\alpha^r_i \in \mathbb{R}^K$ is a $K$-dimensional sparse vector indicating the templates which have been selected to reconstruct the block $x^r_i$, and $\lambda$ is the value which controls the sparsity. The value of $\lambda$ is determined empirically by the user.

Figure 2: Overview of the proposed method (best viewed in color).

Figure 3: Feature extraction scheme.
After training the dictionaries, we can use them to
extract the features of the input images. To this end,
we divide the input image into regions and blocks in
the same way as shown in fig.3. Then, we take
the blocks of each region r separately and minimize
(1) with the values of $D^r$ fixed in order
to compute the vector $\alpha^r_i$ of each block. At this step,
we have a few K-dimensional vectors; for example,
we obtain four vectors from region 5. Then, the
feature vector of region r is computed by pooling the
vectors in that region:
$$f^r = \sum_{i=1}^{n_r} \alpha^r_i \qquad (2)$$

In this equation, $n_r$ is the number of blocks in region $r$. Finally, the feature vector of the image is obtained by concatenating the vectors $f^r$, $r = 1 \ldots 5$, into a single vector and normalizing it using the $L_1$ norm.
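A minimal sketch of the coding and pooling step, assuming learners holds the five fitted region dictionaries and regions the corresponding per-region block arrays (both names are ours):

```python
import numpy as np

def region_feature(learner, blocks_r):
    """Sparse-code the blocks of one region and sum-pool them (Eq. 2)."""
    alphas = learner.transform(blocks_r)   # one sparse code per block
    return alphas.sum(axis=0)              # f^r

def image_feature(learners, regions):
    """Concatenate the five pooled region vectors and L1-normalize."""
    f = np.concatenate([region_feature(l, b)
                        for l, b in zip(learners, regions)])
    return f / (np.abs(f).sum() + 1e-12)
```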
3.2 Attribute Classifier
A traffic sign can be defined using three sets of visual
attributes. These are illustrated in fig.4, where dashed
arrows show soft dependency relations, which we discuss
in the next section. In fact, there is a causal relationship
between these attributes and the traffic signs. In other
words, we can verify the validity of this relationship
using the concept of ancestral sampling: given the color,
shape and content attributes, we can randomly generate
new traffic signs using the probability distribution
function $p(\textit{traffic sign} \mid \textit{color}, \textit{shape}, \textit{content})$. For
instance, while $p(\textit{traffic sign} = \textit{curve left} \mid \textit{color} = \textit{red}, \textit{shape} = \textit{circle}, \textit{content} = \textit{has number})$
might be close to zero, $p(\textit{traffic sign} = \textit{speed limit 60} \mid \textit{color} = \textit{red}, \textit{shape} = \textit{circle}, \textit{content} = \textit{has number})$ is high.
Figure 4: Causal relationship between the attributes and the
traffic signs.
Taking this causal relationship into account, we
have defined three sets of attributes comprising color
(4 attributes), shape (3 attributes) and content (12 attributes).
These attributes are listed in table 1. Each
traffic sign in our experiments can be described using
these attributes. Moreover, the sets can be easily extended
with more attributes without affecting the general model
we propose in this paper.
Detecting the attributes of the input image is done
by the attribute classifiers. For this reason, we
need to train 19 binary classifiers as follows. For each
attribute, we select the images having that attribute
as the positive samples and the rest of the images as
the negative samples.
Table 1: Sets of attributes for describing the traffic signs.

Content: has human (a1), danger road (a2), pointing up (a3), end of (a4), 2-digit number (a5), pointing right (a6), has car (a7), 3-digit number (a8), pointing left (a9), has truck (a10), irregular object (a11), is blank (a12)
Color: red (a13), blue (a14), yellow (a15), black-white (a16)
Shape: circle (a17), triangle (a18), polygon (a19)
Then, we train a random forest model on the collected
data for each attribute. At the end, we have 19 random
forest models for finding the attributes of the input image.
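The following sketch illustrates this one-versus-all training scheme with scikit-learn random forests. Here X, attribute_matrix and the tree count are illustrative assumptions, and the 0-100 score is approximated by the forest's class-1 probability:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_attribute_classifiers(X, attribute_matrix, n_trees=100):
    """Train one binary random forest per attribute column.

    X : (n_samples, d) sparse-coding features (Section 3.1).
    attribute_matrix : binary (n_samples, 19) matrix of attribute labels.
    """
    models = []
    for j in range(attribute_matrix.shape[1]):
        clf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
        clf.fit(X, attribute_matrix[:, j])  # positives: images with attribute j
        models.append(clf)
    return models

def attribute_scores(models, x):
    """Score in [0, 100] per attribute for one feature vector x."""
    return np.array([100.0 * m.predict_proba(x.reshape(1, -1))[0, 1]
                     for m in models])
```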
3.3 Bayesian Network Model
Fig.5 shows the general model for the classification of images using attributes, where $x$ denotes the feature vector, $a_i$, $i = 1 \ldots N$, is a binary value indicating the presence or absence of the $i$-th attribute, and $y_k$, $k = 1 \ldots K$, is the class label.
Figure 5: General classification model using attributes.
Based on this model, it is easy to show that the classification is done by finding the maximum a posteriori class label:

$$y^{*} = \arg\max_{k=1\ldots K} \sum_{a \in \{0,1\}^{N}} p(a \mid x)\, p(y_k \mid a) \qquad (3)$$

where $a = \{a_i \mid i = 1 \ldots N\}$ is a binary vector.
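As a toy illustration of (3), the following snippet evaluates the posterior for K = 2 classes and N = 2 attributes with invented probability tables:

```python
import numpy as np
from itertools import product

# Invented tables for illustration only.
N, K = 2, 2
p_a_given_x = dict(zip(product([0, 1], repeat=N),
                       [0.02, 0.08, 0.10, 0.80]))      # p(a | x), sums to 1
p_y_given_a = {(0, 0): [0.5, 0.5], (0, 1): [0.7, 0.3],
               (1, 0): [0.3, 0.7], (1, 1): [0.1, 0.9]}  # p(y_k | a)

# Eq. (3): marginalize over all attribute configurations, then argmax over k.
scores = [sum(p_a_given_x[a] * p_y_given_a[a][k] for a in p_a_given_x)
          for k in range(K)]
y_star = int(np.argmax(scores))
print(scores, y_star)
```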
There are two important issues with this model. First, it does
not take into account the causal relationships between
the attributes and considers them completely independent.
This means that, in this model, the attribute
“danger in road” no longer depends on the shape
attributes, even though danger is only indicated on red,
triangular signs. Suppose that we observe the attributes
“is blue”, “is triangle” and “pointing left”. Obviously,
there is no traffic sign with this configuration. However,
if the shape had been detected as “is circle” or the color
had been detected as “is red”, the configuration would
have been valid. With the model of fig.5, it is difficult
to find which attribute has been falsely classified, because
the model takes into account neither the dependency between
attributes nor the uncertainty of the observations. The
second issue is that, using this model, detecting
novel inputs is not a trivial task. In order to detect
novel inputs, we would need to define a threshold to
compare against the maximum a posteriori value.
However, determining the value of this
threshold is an empirical task and it highly depends on
the conditional distribution models of each attribute.
Moreover, if one of the models changes, we
need to find the threshold value again.
As mentioned in fig.4, the image of a traffic
sign can be described in terms of color, shape and
content (pictograph). However, there is also a soft
dependency between the content and the other attributes
(dashed lines). This is because some attributes can
occur regardless of the shape and color. For example,
the content attribute “is blank” can occur with every
possible combination of the color and shape
attributes; in other words, the attribute “is blank” can
be independent of the other attributes. In addition,
there is also an intra-dependency between the content
attributes. For example, if we observe the “has truck”
attribute, it is probable to observe the “has car” attribute
as well (e.g. the “no passing” traffic sign). To find the
dependencies between all the attributes in table 1, we
calculated the co-occurrence matrix of the attributes.
This is illustrated in fig.6.
Figure 6: Co-occurrence matrix of the attributes.
The co-occurrence matrix is a 19 × 19 matrix
whose element (i, j) indicates the probability of
observing the i-th and j-th attributes at the same time
among the whole set of traffic sign classes. Using the
co-occurrence matrix, we create our Bayesian network
by discarding the relations whose probability in the
co-occurrence matrix is less than a threshold T. Fig.7
shows the obtained Bayesian network.
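A small sketch of this construction, assuming attribute_matrix is the binary sample-by-attribute matrix from the training data and the threshold value is illustrative:

```python
import numpy as np

def cooccurrence_edges(attribute_matrix, T=0.05):
    """Return the co-occurrence matrix and the attribute pairs kept as edges."""
    A = attribute_matrix.astype(float)
    C = (A.T @ A) / len(A)          # C[i, j]: p(attributes i and j co-occur)
    n = C.shape[0]
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if C[i, j] >= T]       # discard relations below the threshold T
    return C, edges
```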
VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications
92
Figure 7: Bayesian network without observation nodes.
The nodes in this Bayesian network depict the
states of the attributes. In other words, all of the nodes in
this network are binary nodes and they represent the
conditional probabilities of the child attributes given
their parents, which are acquired from the training data.
Our goal is to find the optimal states of the attributes
using the evidence identified in the image. For
this reason, we add another 19 observation nodes to
the network. This is illustrated in fig.8, where the solid
gray circles are the hidden nodes and the white circles
are the evidence nodes. Our observations are the
scores of the attribute classifiers (random forests), each
of which is a number between 0 and 100. Given the
evidence from the attribute classifiers, our goal is to
maximize the following function:
$$a^{*}_1 \ldots a^{*}_{19} = \arg\max_{a_1 \ldots a_{19} \in \{0,1\}^{19}} p(a_1 \ldots a_{19},\ O_{a_1} \ldots O_{a_{19}}) \qquad (4)$$

where $O_{a_i}$, $i = 1 \ldots 19$, is the score of the $i$-th attribute obtained from the $i$-th random forest model. According to this equation, we look for the states of the hidden variables (the actual states of the attributes) such that the joint probability of the hidden variables and the observed attribute scores is maximized. This is called the most probable explanation problem.
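The following toy sketch finds the most probable explanation by brute force for a two-attribute network; the CPTs and the Gaussian observation parameters are invented for illustration (the paper's network has 19 attribute nodes with distributions estimated from data):

```python
import itertools
import numpy as np

# Hidden binary attributes: "triangle" (t) and "danger road" (d), where d
# depends on t (danger pictographs appear on triangular signs).
p_t = np.array([0.5, 0.5])                 # p(t)
p_d_given_t = np.array([[0.99, 0.01],      # p(d | t=0)
                        [0.60, 0.40]])     # p(d | t=1)

def obs_likelihood(score, state, sigma=15.0):
    """Gaussian likelihood of a classifier score (0..100) given the
    attribute state; means 80 (present) and 20 (absent) are assumed."""
    mu = 80.0 if state == 1 else 20.0
    return np.exp(-0.5 * ((score - mu) / sigma) ** 2)

def mpe(score_t, score_d):
    """Brute-force search over all hidden states, maximizing Eq. (4)."""
    best, best_p = None, -1.0
    for t, d in itertools.product([0, 1], repeat=2):
        p = (p_t[t] * p_d_given_t[t, d]
             * obs_likelihood(score_t, t)
             * obs_likelihood(score_d, d))
        if p > best_p:
            best, best_p = (t, d), p
    return best, best_p

# A noisy "danger road" score on a non-triangular sign is explained away.
print(mpe(score_t=15, score_d=55))
```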
3.4 Category Finding
Given the set of 19 attributes, we can cluster the traffic
signs into smaller categories. To this end, we
manually specify which attributes are active for each
class of traffic signs. For example, only the attributes
red, circle and 2-digit number are active for the traffic
sign “speed limit 60”. Then, we cluster the classes with
exactly the same active attributes into one category.
We applied this procedure to the German traffic sign
database and reduced the 43 classes to 29 categories.
Figure 8: Bayesian network after adding observation nodes.
Fig.9 shows the statistics of the different categories as
well as the traffic signs inside each category. As is clear
from the figure, there are 6 categories with more than one
traffic sign, and the other categories contain only one
traffic sign each. This means that 22 traffic signs can be
recognized using their visual attributes alone and do not
need a finer classification. In contrast, the traffic signs
inside the other 6 categories are recognized by the fine
classification models.
Having estimated the optimal states of the attributes
using (4), we can find the category of a new
image using the parsing tree illustrated in fig.10. In
this tree, the orange nodes are the starting points and the
white nodes are the leaf nodes.
Novelty Detection: one interesting property of
the parsing tree in fig.10 is that we can use it for
detecting novel inputs. To achieve this, we start by
comparing the optimal states of the attributes obtained
from (4) with the starting nodes. For example, assume
the state of the attribute blue is active and the states of
the other color attributes are inactive. According to the
parsing tree, there is only one outgoing path from the
node blue, which leads to the circle attribute. If it is
active, we continue parsing; otherwise the input is novel,
because the node blue is not a leaf node and the parsing
has not succeeded. In sum, an input image is novel
if there is no active path from a starting point to a
leaf node.
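A minimal sketch of this idea, implemented as an exact lookup over hand-written category signatures rather than an explicit tree; the category names and signatures below are invented for illustration:

```python
# Which attributes are active for each category, written by hand as in
# Section 3.4 (illustrative subset only).
CATEGORY_SIGNATURES = {
    "speed_limit_2digit": frozenset({"red", "circle", "2-digit number"}),
    "no_passing":         frozenset({"red", "circle", "has car", "has truck"}),
    "danger":             frozenset({"red", "triangle", "danger road"}),
}

def find_category(active_attributes):
    """Return the category whose signature exactly matches the refined
    active attribute set; otherwise report the input as novel."""
    active = frozenset(active_attributes)
    for name, signature in CATEGORY_SIGNATURES.items():
        if active == signature:
            return name
    return "novel"  # no path through the parsing tree succeeds

print(find_category({"red", "circle", "2-digit number"}))    # a known category
print(find_category({"blue", "triangle", "pointing left"}))  # novel
```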
3.5 Fine Classification
We saw that a few categories contain more than one
traffic sign. If the input image belongs to one of these
categories, then the actual class of the image is found
using a fine classification model trained on the
images of that category. In other words, we create
an individual model for every object category with
more than one class inside it. We follow
the same method as in (Zaklouta and Stanciulescu,
2011) for classifying the objects within the same category.
Moreover, using our method, we reduce the
number of classes from 43 to 6 in the German traffic
sign benchmark database, which is about a 7-fold
reduction in the number of classes. Therefore, we only
need to do a 6-class classification instead of a 43-class
classification, which can be more accurate and flexible.
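A sketch of training one such per-category fine classifier is given below; the HOG parameters and data handling are placeholders rather than the exact configuration of (Zaklouta and Stanciulescu, 2011):

```python
import numpy as np
from skimage.feature import hog
from sklearn.ensemble import RandomForestClassifier

def hog_features(images):
    """Compute HOG descriptors for a batch of 40x40 grayscale images."""
    return np.array([hog(img, orientations=9, pixels_per_cell=(5, 5),
                         cells_per_block=(2, 2)) for img in images])

# One fine-grained model per multi-class category, e.g. "speed limits".
fine_models = {}

def train_fine_classifier(category, images, class_labels):
    """Fit a random forest on the HOG features of one category's images."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(hog_features(images), class_labels)
    fine_models[category] = clf
```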
4 EXPERIMENTS
We applied our proposed method to the German
traffic sign benchmark database (Stallkamp et al.,
2012). This database consists of 43 classes and includes
two different sets for training and testing. We
resized all the images to 40 × 40 pixels before
applying any feature extraction method. We use two
sets of feature vectors: the first set, obtained
by the sparse coding method described in this paper, is
used for recognizing the attributes of the image, and the
second set consists of HOG features for the fine classification.
For the sparse coding approach, we applied our proposed
method to both the RGB image and the distance transform of
the edge image, and then concatenated the pooled
vectors to build the final feature vector. For the HOG
features, we utilized the same configuration as in (Zaklouta
and Stanciulescu, 2011). Next, we trained two sets of
random forest models, one for the attribute classification
(19 classifiers) and one for the fine classification
of the categories with more than one traffic sign (6
classifiers), using only the training set.
It is worth mentioning that the conditional probabilities
of the hidden variables of the Bayesian network
are modeled using conditional probability
tables, and the conditional probabilities of the observations
are modeled using Gaussian distributions of the
attribute scores.
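For example, the per-attribute Gaussian observation model could be fitted from held-out classifier scores as in the following sketch (the data and function name are ours):

```python
import numpy as np

def fit_observation_model(scores, labels):
    """Return (mu, sigma) of the attribute scores for each binary state,
    i.e. the Gaussian observation model of one attribute node."""
    params = []
    for state in (0, 1):
        s = scores[labels == state]
        params.append((s.mean(), s.std() + 1e-6))
    return params

scores = np.array([12., 18., 25., 75., 82., 90.])  # toy validation scores
labels = np.array([0, 0, 0, 1, 1, 1])              # true attribute states
print(fit_observation_model(scores, labels))
```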
Table 2 shows the results of the attribute classification on the test dataset.

Figure 10: Parsing tree for finding the category of the image.
Table 2: Precision, recall and F1 measure of the attribute classifiers.

Attribute precision recall F1-measure
red 0.993 0.990 0.992
blue 0.988 0.992 0.990
yellow 0.971 0.948 0.959
black-white 1.0 0.992 0.996
triangle 0.985 0.995 0.990
circle 0.997 0.987 0.992
polygon 0.991 0.988 0.990
pointing left 0.967 0.947 0.975
pointing right 0.982 0.935 0.957
pointing up 0.994 0.954 0.973
end of 1.0 0.992 0.996
has car 0.993 0.983 0.988
has truck 1.0 0.977 0.988
2-digit number 0.983 0.973 0.978
3-digit number 0.959 0.965 0.962
has human 0.988 0.896 0.940
danger in road 0.932 0.960 0.946
irregular object 0.977 0.913 0.944
is blank 0.991 0.993 0.992
As the table shows, the attribute classifiers
achieve high accuracy in detecting the
attributes of the input images. Next, we found
the category of the test images using both the attribute
classification model depicted in fig.5 and our proposed
method. Table 3 and Table 4 show the results of the
category classification using the proposed method and
the general model, respectively. Clearly, our method
outperforms the general attribute classification
model. The reason is that our method is
able to model the uncertainties of the observations
and correct the mistakes. Consequently, the number
of samples that pass the tests in the parsing tree
increases.
In addition, fig.9 revealed that some categories contain
only one class. However, to compare our results
with other methods, we also applied the fine classification
model to the categories with more than one
class. Then, to be consistent with the state-of-the-art
results, we computed the mean accuracy of the
classifications. Table 5 shows the results.
VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications
94
Table 3: The result of category classification using the proposed method. The categories are indexed according to fig.9.

Category   C0    C1    C2    C3    C4    C5    C6    C7    C8    C9    C10   C11   C12   C13
precision  0.998 1.0   0.995 0.993 0.996 0.992 1.0   0.996 0.995 0.993 1.0   0.967 1.0   0.960
recall     0.995 1.0   0.987 1.0   0.996 0.992 1.0   0.998 1.0   1.0   0.980 0.991 0.936 0.941
accuracy   0.994 1.0   0.983 0.993 0.993 0.984 1.0   0.994 0.995 0.993 0.980 0.960 0.936 0.905

Category   C14   C15   C16   C17   C18   C19   C20   C21   C22   C23   C24   C25   C26   C27   C28
precision  1.0   0.976 0.985 1.0   0.988 0.961 1.0   0.998 0.992 0.991 0.965 0.965 0.961 1.0   1.0
recall     0.985 0.984 0.987 0.969 0.988 1.0   1.0   0.992 0.977 0.986 0.988 0.965 1.0   1.0   1.0
accuracy   0.985 0.960 0.973 0.969 0.977 0.961 1.0   0.990 0.970 0.978 0.955 0.933 0.961 1.0   1.0
Table 4: The result of category classification using the general model in fig.5. The categories are indexed according to fig.9.

Category   C0    C1    C2    C3    C4    C5    C6    C7    C8    C9    C10   C11   C12   C13
precision  0.998 1.0   0.981 0.996 1.0   0.972 1.0   0.998 1.0   0.993 1.0   0.725 0.943 1.0
recall     0.958 0.895 0.896 0.972 0.946 0.925 0.899 0.966 0.979 0.979 0.981 0.952 0.688 0.365
accuracy   0.956 0.895 0.881 0.969 0.946 0.901 0.899 0.964 0.979 0.973 0.981 0.699 0.660 0.365

Category   C14   C15   C16   C17   C18   C19   C20   C21   C22   C23   C24   C25   C26   C27   C28
precision  1.0   1.0   0.991 1.0   0.929 0.814 1.0   0.996 0.962 0.996 0.977 1.0   0.886 1.0   1.0
recall     0.718 0.944 0.844 0.715 0.824 0.946 0.778 0.930 0.933 0.982 0.977 0.966 0.984 0.840 0.947
accuracy   0.718 0.944 0.838 0.715 0.775 0.778 0.778 0.927 0.899 0.978 0.955 0.966 0.873 0.840 0.947
Table 5: Mean classification accuracy (%) of the proposed method compared with state-of-the-art methods on the sign groups.

Method          Speed limits  Other prohibitions  De-restriction  Mandatory  Danger  Unique
Our method      97.01         99.25               100             97.09      96.31   98.76
Random forests  95.95         99.13               87.50           99.27      92.08   98.73
LDA             95.37         96.80               85.83           97.18      93.73   98.63
As is clear, there is a significant improvement in the
recognition of de-restriction signs (the reader can refer to
(Stallkamp et al., 2012) for definitions of the different
sign groups). This is because the shape of the de-restriction
signs is very similar to some of the signs in other classes. For
example, the “end of no-passing” sign has very similar
edge features to the “no passing” signs. On the
other hand, the two other methods reported in this paper
utilize HOG features for classification. Obviously,
because of the shape similarity of the other signs
with the de-restriction signs, their feature vectors will
be similar as well. For this reason, there is an overlap
between the feature vectors of these signs and other
signs in the feature space, which causes the misclassification.
However, because our method utilizes the attributes
of the image, it is able to model the color,
shape and content of each sign explicitly. For this
reason, when an image from the de-restriction group
is given to our method, it is able to distinguish it
from other signs simply using the color
attributes. As a result, it is able to improve the
accuracy of the classification.
5 CONCLUSION
In this paper, we proposed a method based on visual
attributes and a Bayesian network for recognizing
traffic signs. Our method differs from the state-of-the-art
methods for various reasons. First, it is more
scalable, and in some cases it is possible to learn
new classes without any training samples (zero-shot
learning). Further, when zero-shot learning
is not applicable, the system only requires updating
the models locally instead of re-training them all.
Second, it is able to detect novel inputs. In
other words, if there are some false-positive results
from the detection module, our method is able to discard
these novel inputs instead of classifying them as
one of the traffic signs. We believe this is the first
time that novelty detection has been introduced for the
traffic sign recognition problem. Third, because it uses
visual attributes, our method is able to provide
high-level semantic information about the input
image. Fourth, the system is easily extensible
to hundreds of classes of traffic signs, since it breaks
the hundreds of classes into categories with far fewer
traffic signs, which makes them more tractable to classify
without affecting the accuracy of the system. Our
experiments on the German traffic sign benchmark
dataset indicate that, in addition to improving on
the results of the state-of-the-art methods,
our modeling framework is closer to real-world
applications.
REFERENCES
Bascón, S. M., Rodríguez, J. A., Arroyo, S. L., Caballero,
A. F., and López-Ferreras, F. (2010). An optimization
on pictogram identification for the road-sign recognition
task using SVMs. Computer Vision and Image
Understanding, 114(3):373–383.
Cheng, K. and Tan, X. (2014). Sparse representations based
attribute learning for flower classification. Neurocomputing,
145:416–426.
Cireşan, D., Meier, U., Masci, J., and Schmidhuber, J.
(2012). Multi-column deep neural network for traffic
sign classification. Neural Networks, 32:333–338.
Selected Papers from IJCNN 2011.
Ciresan, D., Meier, U., Masci, J., and Schmidhuber, J.
(2011). A committee of neural networks for traffic
sign classification. In Neural Networks (IJCNN), The
2011 International Joint Conference on, pages 1918–
1921.
Farhadi, A., Endres, I., Hoiem, D., and Forsyth, D. (2009).
Describing objects by their attributes. In Computer
Vision and Pattern Recognition, 2009. CVPR 2009.
IEEE Conference on, pages 1778–1785.
Ferrari, V. and Zisserman, A. (2007). Learning visual at-
tributes. In Advances in Neural Information Process-
ing Systems.
Houben, S., Stallkamp, J., Salmen, J., Schlipsing, M., and
Igel, C. (2013). Detection of traffic signs in real-world
images: The German Traffic Sign Detection Bench-
mark. In International Joint Conference on Neural
Networks, number 1288.
Lampert, C., Nickisch, H., and Harmeling, S. (2009).
Learning to detect unseen object classes by between-
class attribute transfer. In Computer Vision and Pat-
tern Recognition, 2009. CVPR 2009. IEEE Confer-
ence on, pages 951–958.
Lee, H., Battle, A., Raina, R., and Ng, A. Y. (2007). Efficient
sparse coding algorithms. In Schölkopf, B.,
Platt, J., and Hoffman, T., editors, Advances in Neural
Information Processing Systems 19, pages 801–808.
MIT Press.
Liu, H., Liu, Y., and Sun, F. (2014). Traffic sign recognition
using group sparse coding. Information Sciences,
266:75–89.
Maldonado-Bascon, S., Lafuente-Arroyo, S., Gil-Jimenez,
P., Gomez-Moreno, H., and Lopez-Ferreras, F. (2007).
Road-sign detection and recognition based on support
vector machines. Intelligent Transportation Systems,
IEEE Transactions on, 8(2):264–278.
Mogelmose, A., Trivedi, M., and Moeslund, T. (2012).
Vision-based traffic sign detection and analysis for in-
telligent driver assistance systems: Perspectives and
survey. Intelligent Transportation Systems, IEEE
Transactions on, 13(4):1484–1497.
Paclik, P., Novovicova, J., and Duin, R. P. W. (2006).
Building road sign classifiers using trainable similarity
measure. IEEE Transactions on Intelligent Transportation
Systems, 7(3):309–321.
Piccioli, G., Micheli, E. D., Parodi, P., and Campani, M.
(1996). A robust method for road sign detection and
recognition.
Rohrbach, M., Stark, M., Szarvas, G., Gurevych, I., and
Schiele, B. (2010). What helps where – and why? se-
mantic relatedness for knowledge transfer. In IEEE
Conference on Computer Vision and Pattern Recogni-
tion (CVPR).
Russakovsky, O. and Fei-Fei, L. (2012). Attribute learning
in large-scale datasets. In Proceedings of the 11th Eu-
ropean Conference on Trends and Topics in Computer
Vision - Volume Part I, ECCV’10, pages 1–14, Berlin,
Heidelberg. Springer-Verlag.
Sermanet, P. and LeCun, Y. (2011). Traffic sign recogni-
tion with multi-scale convolutional networks. In Neu-
ral Networks (IJCNN), The 2011 International Joint
Conference on, pages 2809–2813.
Stallkamp, J., Schlipsing, M., Salmen, J., and Igel, C.
(2012). Man vs. computer: Benchmarking machine
learning algorithms for traffic sign recognition. Neural
Networks, 32:323–332. Selected Papers from IJCNN 2011.
Sun, Z.-L., Wang, H., Lau, W.-S., Seet, G., and Wang, D.
(2014). Application of BW-ELM model on traffic sign
recognition. Neurocomputing, 128:153–159.
Wang, G., Ren, G., Wu, Z., Zhao, Y., and Jiang, L. (2013).
A hierarchical method for traffic sign classification
with support vector machines. In Neural Networks
(IJCNN), The 2013 International Joint Conference on,
pages 1–6.
Zaklouta, F. and Stanciulescu, B. (2011). Warning traffic
sign recognition using a hog-based k-d tree. In In-
telligent Vehicles Symposium (IV), 2011 IEEE, pages
1019–1024.
Zaklouta, F. and Stanciulescu, B. (2014). Real-time traffic
sign recognition in three stages. Robotics and Autonomous
Systems, 62(1):16–24. New Boundaries of Robotics.
VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications
96