ON THE IMPORTANCE OF THE GRID SIZE FOR GENDER
RECOGNITION USING FULL BODY STATIC IMAGES
Carlos Serra-Toro (1,3), V. Javier Traver (1,3), Raúl Montoliu (2,3) and José M. Sotoca (1,3)
(1) DLSI, (2) DICC & (3) iNIT, Universitat Jaume I, 12071 Castellón, Spain
Keywords: Gender recognition, Soft biometrics, Machine learning.
Abstract:
In this paper we present a study on the importance of the grid configuration in gender recognition from
whole body static images. Using a simple classifier (AdaBoost) and the well-known Histogram of Oriented
Gradients features, we test several grid configurations. Compared with previous approaches, which use more
complicated classifiers or feature extractors, our approach outperforms them in the case of frontal view
recognition and almost equals them in the case of the mixed view (i.e. frontal and back views combined
without distinction).
1 INTRODUCTION
The characterization of people according to some cri-
terion (e.g. age, ethnicity, or gender) in digital im-
ages and videos is relevant for many applications of
scientific and social interest. Some recent research has addressed the classification of people according
to their gender. Most of this work follows a facial approach (Moghaddam and Yang, 2002) while other studies
try to classify the gender of people according to their gait (Yu et al., 2009). However, little work has
been done on gender classification using whole body static images of standing people, and the first
contribution is as recent as (Cao et al., 2008).
In this paper we study the gender recognition of
a single, standing person using just one whole body
static image. This is a complex problem: gender recognition from the body is a difficult task even for
human beings because, although there are a number of heuristics that can partially guide the design of an
algorithm, there are many exceptions that make them unreliable in general. Furthermore, detecting those
exceptional cases is a problem in itself.
The tendency in the machine learning community seems to be to place more importance on finding complicated
or newer descriptors, or on combining several existing descriptors or classifiers, to achieve higher
accuracy. This tendency sometimes leads to overlooking other, simpler aspects of the problem that can
impact the overall accuracy. In this paper we show that,
in the case of gender recognition from whole body
static images, choosing an optimal grid for classifica-
tion can be as important as, and sometimes more im-
portant than, choosing a complicated feature extractor
or classifier to perform the recognition. We present a study on the importance of the grid configuration for
gender classification, achieving results comparable to
those obtained by previous works (Section 2).
2 STATE OF THE ART
To the best of our knowledge, there are currently as
few as four published papers addressing the problem
of gender recognition from whole body static images.
This section is a review of these works.
The first documented approach to gender recogni-
tion from static images was (Cao et al., 2008). They
manually labeled (see Section 4.1) the CBCL pedes-
trian database (Oren et al., 1997), releasing the first
publicly available dataset for the evaluation of gen-
der classification. They created a classifier inspired
by AdaBoost (Freund and Schapire, 1995), based on
a part-based representation of the body, named Part-
based Gender Recognition (PBGR), in which every
part provided a clue about the gender of the person in the
image. In each round of the algorithm, they first selected the best patch of the image and then
trained a learner using only the Histogram of Ori-
ented Gradients (HOG) features (Dalal and Triggs,
2005) corresponding to that part of the image. They
achieved a recognition rate of 75.0% for the mixed
view, and 76.0% and 74.6% when considering just the
frontal or the back view images, respectively.
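To make the flavour of such a part-based boosting round concrete, the following Python sketch fuses patch selection and stump training in a single round. It is a generic illustration under our own assumptions (exhaustive threshold search, ±1 labels, a dictionary of per-patch HOG features), not a reimplementation of the actual PBGR algorithm of (Cao et al., 2008).

import numpy as np

def pbgr_like_round(patch_features, labels, weights):
    """One boosting round in the spirit of a part-based approach (illustrative
    sketch only, NOT the actual PBGR algorithm of (Cao et al., 2008)).
    patch_features: dict mapping patch id -> (n_samples, n_dims) HOG features.
    labels: array of +1/-1 gender labels. weights: current AdaBoost weights."""
    best = None  # (weighted error, patch id, feature dimension, threshold, sign)
    for patch_id, X in patch_features.items():
        for d in range(X.shape[1]):                  # try each feature dimension
            for thr in np.unique(X[:, d]):           # and each candidate threshold
                for sign in (+1, -1):
                    pred = sign * np.where(X[:, d] > thr, 1, -1)
                    err = np.sum(weights * (pred != labels))
                    if best is None or err < best[0]:
                        best = (err, patch_id, d, thr, sign)
    return best  # the stump (and the patch) with the lowest weighted error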
The second contribution was (Collins et al., 2009), who investigated a number of feature extractors and
noticed that the best results in gender recognition were achieved by combining a shape extractor (based on
an edge map) and a colour extractor (based on hue histograms (van de Sande et al., 2008)) using a
linear-kernel support vector machine (SVM) classifier. They focused only on frontal view images and
randomly balanced the CBCL dataset
so that there were 123 images of each gender. Also,
they cropped each image so that its size was approx-
imately the bounding box of the person represented
in it. They reported an accuracy of 76.0% (the same
rate as (Cao et al., 2008) for the frontal views) with
a good balance between the male and the female ac-
curacy. (Collins et al., 2009) repeated the experiment
with the VIPeR dataset (Gray et al., 2007) (a larger
dataset with almost 300 images for each gender) and
achieved an accuracy of 80.6% on frontal view images.
In a newer contribution, (Collins et al., 2010)
combined the VIPeR and CBCL images to create
a database of 413 images of each gender, all in
frontal view. They obtained “eigenbody” images, computed by applying principal component analysis
(PCA) (Jolliffe, 2005) to the raw image data and to the edge maps of the images. They observed that gender
discrimination from whole body images seems to be encoded by a combination of several PCA components. They
chose the top components from both the raw and edge-map data and combined them using an SVM, resulting in a
recognition accuracy of 66%.
Very recently, (Guo et al., 2010) reported an accuracy of 80.6% using only the CBCL dataset with the manual
labelling of (Cao et al., 2008), without balancing and without cropping the images, and considering both
frontal and back views combined (mixed view). They represented each image with biologically-inspired
features (BIF) (Serre et al., 2007) combined with several manifold learning techniques, and designed a
classification framework where the type of view was taken into account. Their mixed-view accuracy of 80.6%
is, to the best of our knowledge, currently the best gender recognition rate published with the CBCL dataset.
In this paper we explore the impact of the grid configuration on the gender recognition task, an aspect
that has not yet been taken into consideration, by using a simple classifier.
3 APPROACH
We consider images showing a whole body picture of
a single, still standing person in frontal or back view.
The persons shown in all the pictures are approxi-
mately aligned and scaled so that different persons in
different images all take up a similar space.
Each gray-scale image I is described using a feature vector, v_I. Each image is divided into smaller
rectangles, called cells, whose size is defined by an r × c grid applied over the picture, with r being the
number of windows across the height of the image, and c being the number of windows across its width.
There is an overlap of 50% between adjacent cells, both vertically and horizontally. For each cell, a
Rectangular HOG (R-HOG) (Dalal and Triggs, 2005) feature vector is obtained, so that the resulting feature
vector v_I for the image I is the concatenation of all the feature vectors obtained for each of its cells:
v_I = (v_I^1, v_I^2, ..., v_I^rc), with v_I^γ being the feature vector corresponding to cell γ of the image I.
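As an illustration of this feature-extraction scheme, the Python sketch below computes per-cell orientation histograms over an r × c grid of roughly 50%-overlapping cells and concatenates them. The exact cell geometry, the number of orientation bins (9) and the gradient computation are our own assumptions and only approximate the R-HOG descriptor actually used.

import numpy as np

def grid_hog_features(image, r, c, n_bins=9):
    """Concatenated per-cell gradient-orientation histograms over an r x c grid
    of roughly 50%-overlapping cells (a simplified, unnormalized stand-in for
    the R-HOG features described in the text; geometry and bin count assumed)."""
    img = image.astype(np.float64)
    gy, gx = np.gradient(img)                      # vertical and horizontal gradients
    mag = np.hypot(gx, gy)                         # gradient magnitude
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # unsigned orientation in [0, 180)

    H, W = img.shape
    ch, cw = max(2 * H // (r + 1), 1), max(2 * W // (c + 1), 1)  # ~50% overlap
    ys = np.linspace(0, H - ch, r).astype(int)     # evenly spaced cell origins
    xs = np.linspace(0, W - cw, c).astype(int)

    feats = []
    for y0 in ys:
        for x0 in xs:
            m = mag[y0:y0 + ch, x0:x0 + cw]
            a = ang[y0:y0 + ch, x0:x0 + cw]
            hist, _ = np.histogram(a, bins=n_bins, range=(0.0, 180.0), weights=m)
            feats.append(hist)
    return np.concatenate(feats)                   # v_I = (v_I^1, ..., v_I^rc)

For a CBCL image (128 rows × 64 columns) and, for example, a 21 × 12 grid, this sketch yields a 21 · 12 · 9 = 2268-dimensional vector (under the assumed 9 bins per cell).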
In order to classify each image as representing
a male or a female we use AdaBoost (Freund and
Schapire, 1995). We use decision stumps as the weak
learner in the same way as (Cao et al., 2008) do, and use the same variant of AdaBoost as they do (see Algorithm 1
of (Cao et al., 2008) for the details).
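A minimal sketch of a stump-based AdaBoost classifier follows, using scikit-learn's generic AdaBoostClassifier (whose default weak learner is a depth-1 decision tree, i.e. a decision stump) rather than the specific variant of (Cao et al., 2008); the synthetic data, the feature dimension and the variable names are placeholders.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Placeholder data: in practice X would hold the grid R-HOG vectors v_I and
# y the manual gender labels of (Cao et al., 2008); here both are random.
rng = np.random.default_rng(0)
X = rng.normal(size=(888, 2268))      # 888 CBCL images; feature dimension assumed
y = rng.integers(0, 2, size=888)      # 0 = female, 1 = male (random here)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Stump-based AdaBoost, 400 rounds (the number of iterations used in Section 4.4).
clf = AdaBoostClassifier(n_estimators=400, random_state=0)
clf.fit(X_tr, y_tr)
print("held-out accuracy: %.3f" % clf.score(X_te, y_te))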
4 EXPERIMENTATION
4.1 Image Dataset
We use the CBCL pedestrian image database (Oren et al., 1997), available at
http://cbcl.mit.edu/cbcl/software-datasets/PedestrianData.html, which was not initially designed for gender
recognition but has been used by other authors (Section 2) for this
purpose. Images are all 64 × 128 pixels in size, showing
one pedestrian standing in frontal or back view, hor-
izontally and vertically aligned so that their height is
about 80 pixels from their shoulders to their feet.
We use (Cao et al., 2008)’s publicly available
manual labelling of the CBCL database according to
pedestrians’ gender. This labelling consists of 600
men and 288 women. The views are also classified:
51% (frontal) and 49% (back) for male images, and
39% (frontal) and 61% (back) for female images.
4.2 Implementation Details
The experiments were executed using our implementation of a simplified version of the R-HOG descriptor
(Dalal and Triggs, 2005). In our implementation we consider only gray-level images and therefore the colour
information is lost, since we believe that the gender information is primarily encoded in the shape of the figure.
Since our purpose was to study the importance of the grid configuration, we have simplified the block
scheme used by (Dalal and Triggs, 2005): we do not group adjacent cells into blocks, and therefore no
normalization is performed between the cells within a block. Given that (Dalal and Triggs, 2005) report
their best pedestrian detection results when block normalization is applied, it is possible that gender
recognition would also benefit from this scheme. Because our focus was the grid configuration itself and
not the grouping of its cells, we leave the study of clustering cells into blocks as future work.
4.3 Experiments with the Grid Size
In order to verify our hypothesis (i.e. that choosing an optimal grid is as important as using a complex
algorithm for gender recognition from static images), we have tested several grid configurations over the
image. The idea is to find an optimal grid for gender recognition and thereby achieve an accuracy similar
to those reported in the literature (Section 2) using a simpler algorithm (AdaBoost, in our case).
As stated in Section 3, our grid consists of a uniformly sampled Cartesian r × c grid with 50% of cell
overlap. We decided to test a wide range of grid configurations, resulting in cell sizes ranging from
considerably big (about 21 × 21 pixels) to quite small (about 3 × 3 pixels), and therefore we tested
r ∈ S_r, with S_r = {6, 9, ..., 42}, and c ∈ S_c, with S_c = {3, 6, ..., 24}. Therefore,
|S_r| × |S_c| = 13 × 8 = 104 possible
grids were explored. In all cases we train and classify
using the AdaBoost described in Section 3.
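A sketch of this sweep is given below, assuming the illustrative grid_hog_features extractor sketched after Section 3, a stump-based scikit-learn AdaBoost and 5-fold cross-validation as used in our experiments; images and labels are placeholder names for the CBCL images of the chosen view and their gender labels.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

S_r = range(6, 43, 3)    # r in {6, 9, ..., 42}  (13 values)
S_c = range(3, 25, 3)    # c in {3, 6, ..., 24}  (8 values)

def evaluate_grid(images, labels, r, c):
    """Mean 5-fold cross-validated accuracy for one r x c grid, using the
    illustrative grid_hog_features extractor and a stump-based AdaBoost."""
    X = np.stack([grid_hog_features(img, r, c) for img in images])
    clf = AdaBoostClassifier(n_estimators=400, random_state=0)
    return cross_val_score(clf, X, labels, cv=5).mean()

def best_grid(images, labels):
    """Sweep the 13 x 8 = 104 grid configurations and return the best one."""
    scores = {(r, c): evaluate_grid(images, labels, r, c)
              for r in S_r for c in S_c}
    return max(scores, key=scores.get), scores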
The results are averaged using a 5-fold cross vali-
dation, in the same way as the previous contributions
do (Section 2). Our results are shown in Tables 1, 2
and 3 for frontal, back and mixed view, respectively.
We highlight the optimal grid (the one with best ac-
curacy) for each view. According to our experimen-
tation, the optimal grid configurations for each view
are: 21 × 12 for the frontal view, 36 × 21 for the back
view and 15 × 15 for the mixed view.
It is interesting to notice that the optimal recognition grid is denser than the one used by (Dalal and
Triggs, 2005) in their pedestrian detector: they propose a grid with cells of 6 × 6 pixels, resulting in a
grid of 210 cells for a 128 × 64 image, while our optimal grid has 252 cells for the frontal view, 756
cells for the back view and 225 cells for the mixed view. We think this suggests that gender recognition
depends more on certain parts of the silhouette than on the silhouette of the whole body, since finer grids
allow the classifier to focus on particular aspects of the shape better than grids with fewer divisions can.
This finding is in agreement with those recently reported by (Collins et al., 2010).
Figures 1 and 2 summarize the results of Tables 1, 2 and 3, showing the mean accuracy for each value of the
number of windows across the height (r) or across the width (c) of the image, for each view. As can be
seen, in general the results improve as the grid becomes denser, up to a certain point at which the
accuracy degrades gradually. It is interesting to notice that the peak for the frontal view lies more to
the left than the peak for the back view. This indicates that denser grids are needed to recognize gender
from back view images, probably because this view is more difficult, even for human beings. The highest
accuracy for the mixed view also lies towards the left but, contrary to what happens with the other views,
high values of r or c result in a high variance of the results, and thus the recognition behaviour is more
unstable with denser grids. This is probably the reason why the optimal recognition grid for the mixed
view, 15 × 15, is the one with the lowest cell density.
4.4 Study of the Overfitting of AdaBoost
The results shown in Tables 1, 2 and 3 are obtained
using 400 iterations for AdaBoost. There is some controversy about whether AdaBoost should be stopped early
so as not to overfit the data, as (Zhang and Yu, 2005) claim, or run for a large number of iterations so
that overfitting is reduced, as (Mease and Wyner, 2008) experimentally show. Results probably depend on the
nature of each problem, so we have studied the evolution of the accuracy as the number of AdaBoost
iterations increases from 100 to 1500, in steps of 100, for each view.
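One convenient way to carry out such a sweep (a sketch under our assumptions, not necessarily how the experiment was run) is to train a single 1500-round scikit-learn AdaBoostClassifier and read off its staged scores; X_train, y_train, X_test and y_test are placeholder names for one cross-validation split of the features of the optimal grid.

from sklearn.ensemble import AdaBoostClassifier

# Accuracy as a function of the number of boosting rounds, read every 100
# rounds up to 1500. staged_score yields the accuracy after each round, so a
# single 1500-round model covers the whole sweep without retraining.
clf = AdaBoostClassifier(n_estimators=1500, random_state=0)
clf.fit(X_train, y_train)
accuracies = list(clf.staged_score(X_test, y_test))   # one entry per round
for n_rounds in range(100, 1501, 100):
    print(n_rounds, "rounds -> accuracy %.3f" % accuracies[n_rounds - 1])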
The results obtained, using the optimal grid found
in Section 4.3 for each view, are shown in Figure 3.
As can be seen, increasing the number of weak learners up to 300 and 400 in the case of the frontal and
mixed views, respectively, and up to 600 in the case of the back view, increases the accuracy, and beyond
that the recognition rate remains more or less stable in the three cases. We think that in our case
AdaBoost is not overfitting the data because, if that were the case, the accuracy would decrease as the
number of iterations increases.
4.5 Comparison with Other Proposed Methods
We compare our results with those reported by the other existing published approaches (Section 2) in Table 4.
(a) Frontal view (b) Back view (c) Mixed view
Figure 1: Mean accuracies (%) and standard deviations for each value of the number of windows across the height of the
image (r) for each view.
(a) Frontal view (b) Back view (c) Mixed view
Figure 2: Mean accuracies (%) and standard deviations for each value of the number of windows across the width of the
image (c) for each view.
Table 1: Accuracies (%) obtained using only the frontal view images, for several grids of size r × c. Rows
give the number of windows across the height (r); columns give the number of windows across the width (c).
r \ c    3        6        9        12       15       18       21       24
6 68.6 ±3.4 69.8 ± 3.0 72.9 ±4.8 75.0 ± 4.8 75.7 ±7.3 76.4 ± 4.2 75.2 ±4.4 71.7 ± 3.7
9 66.7 ±2.4 71.9 ± 2.5 74.0 ±3.9 74.0 ± 4.2 73.8 ±5.5 75.0 ± 2.2 75.7 ±2.5 71.4 ± 1.9
12 70.5 ± 1.8 75.5 ± 3.6 80.0 ± 5.2 76.7 ± 2.5 78.8 ± 1.0 78.1 ±3.5 75.5 ± 2.7 75.2 ±3.4
15 71.9 ± 5.4 75.2 ± 6.0 77.4 ± 4.5 79.0 ± 2.5 79.0 ± 2.5 76.9 ±3.6 77.1 ± 5.2 75.2 ±5.9
18 72.6 ± 5.0 78.6 ± 4.9 79.0 ± 2.7 79.0 ± 3.1 77.6 ± 4.1 74.5 ±3.9 76.2 ± 3.5 75.2 ±4.6
21 74.5 ± 4.6 78.1 ± 3.9 78.3 ± 3.9 81.9 ± 5.0 81.0 ± 5.3 80.0 ± 4.9 77.9 ± 3.4 76.0 ± 4.3
24 71.9 ± 6.0 76.9 ± 5.0 79.3 ± 1.8 78.1 ± 3.9 79.0 ± 4.8 75.2 ±3.3 77.9 ± 2.7 73.1 ±4.3
27 72.6 ± 3.0 76.7 ± 2.9 76.4 ± 3.7 77.1 ± 3.3 79.3 ± 3.7 76.9 ±5.4 78.3 ± 1.6 74.3 ±5.0
30 75.2 ± 2.8 76.9 ± 5.2 76.0 ± 2.7 78.1 ± 1.8 78.3 ± 2.4 77.6 ±5.4 76.4 ± 3.3 72.6 ±3.6
33 71.4 ± 6.2 75.7 ± 6.3 79.0 ± 2.6 80.5 ± 3.8 79.5 ± 1.8 74.0 ±5.0 74.8 ± 3.4 73.6 ±6.3
36 72.9 ± 4.3 74.3 ± 4.5 75.7 ± 1.8 79.3 ± 3.0 79.3 ± 3.8 76.0 ±7.2 74.5 ± 5.3 71.7 ±6.0
39 68.3 ± 3.4 75.5 ± 4.7 76.4 ± 1.6 77.1 ± 3.9 74.8 ± 4.7 70.0 ±7.6 74.8 ± 5.7 67.1 ±6.4
42 73.1 ± 2.0 76.9 ± 3.0 77.4 ± 4.8 76.2 ± 2.7 76.2 ± 2.7 76.2 ±3.0 72.6 ± 5.5 67.4 ±4.3
We leave the work by (Collins et al., 2010) out of the comparison since they do not use the same image
dataset as the other works (and this paper), and therefore the results are not directly comparable.
As can be seen, we achieve the highest published accuracy for the frontal view (+2.4%), nearly match the
highest rate for the mixed view (−1.2%) and stay below the highest accuracy for the back view (−3.2%),
always using a simpler classifier (AdaBoost, Section 3) and a reduced implementation of a simple feature
extractor (R-HOG, Section 3).
5 CONCLUSIONS
We have shown that denser grids than those originally proposed for pedestrian detection by (Dalal and
Triggs, 2005) are needed for gender recognition.
Table 2: Accuracies (%) obtained using only the back view images, for several grids of size r × c. Rows
give the number of windows across the height (r); columns give the number of windows across the width (c).
r \ c    3        6        9        12       15       18       21       24
6 64.5 ±6.7 67.3 ± 2.7 67.3 ±6.3 69.0 ± 2.5 74.2 ±3.7 69.5 ± 2.4 70.9 ±3.5 68.0 ± 2.9
9 67.1 ±3.4 69.9 ± 3.1 69.9 ±4.7 75.4 ± 3.8 72.7 ±3.8 70.3 ± 5.6 70.5 ±3.1 71.2 ± 1.4
12 73.9 ± 5.6 74.6 ± 3.2 74.2 ± 3.3 74.6 ± 1.9 77.1 ± 4.7 71.6 ± 1.5 75.6 ± 2.9 75.4 ±3.7
15 68.2 ± 4.4 70.7 ± 5.4 73.9 ± 3.1 73.3 ± 5.1 76.3 ± 2.4 77.6 ± 2.6 76.3 ± 4.4 76.3 ±4.3
18 72.2 ± 2.1 71.8 ± 2.3 72.9 ± 2.3 77.3 ± 4.2 75.7 ± 5.1 75.4 ± 2.1 80.6 ± 2.9 75.7 ±3.4
21 72.7 ± 2.1 73.1 ± 3.7 73.9 ± 1.1 77.6 ± 5.5 76.7 ± 6.9 75.2 ± 3.6 79.9 ± 1.7 78.4 ±2.2
24 66.2 ± 2.8 73.3 ± 2.5 76.1 ± 3.7 78.8 ± 2.9 76.3 ± 3.5 77.6 ± 6.3 78.4 ± 2.7 76.5 ±3.9
27 70.7 ± 4.4 73.1 ± 1.9 78.4 ± 2.2 79.3 ± 2.4 77.1 ± 2.1 78.4 ± 2.4 78.2 ± 3.6 73.9 ±4.0
30 70.7 ± 3.4 75.6 ± 4.4 75.9 ± 0.9 79.5 ± 3.2 74.6 ± 3.7 79.7 ± 3.4 79.9 ± 2.5 76.3 ±2.8
33 72.0 ± 2.2 74.3 ± 3.9 74.4 ± 2.0 76.9 ± 4.9 78.2 ± 3.2 79.3 ± 3.3 78.8 ± 1.6 72.5 ±6.9
36 72.4 ± 5.4 72.2 ± 2.3 73.9 ± 2.5 77.8 ± 2.2 75.4 ± 5.7 76.1 ± 2.9 80.8 ± 2.3 71.4 ± 8.8
39 68.2 ± 3.5 69.0 ± 4.8 74.6 ± 4.4 76.3 ± 3.4 78.6 ± 2.0 75.6 ± 4.4 71.4 ± 6.1 69.4 ±1.4
42 68.2 ± 3.5 72.4 ± 7.0 73.7 ± 3.1 77.6 ± 3.5 75.4 ± 3.8 73.1 ± 4.9 72.8 ± 3.7 63.7 ±7.1
Table 3: Accuracies (%) obtained using both the frontal and back view images (i.e. considering both views
without distinction between them), for several grids of size r × c. Rows give the number of windows across
the height (r); columns give the number of windows across the width (c).
r \ c    3        6        9        12       15       18       21       24
6 63.1 ±2.9 68.0 ± 1.5 70.7 ±0.5 67.8 ± 2.9 71.5 ±4.7 69.9 ± 1.8 69.9 ± 3.6 70.5 ± 2.7
9 68.9 ±4.0 73.3 ± 2.2 70.4 ±2.9 72.4 ± 2.8 72.4 ±1.1 72.6 ± 4.9 73.1 ± 3.4 73.9 ± 1.6
12 69.6 ± 2.3 73.1 ± 3.1 75.0 ± 2.2 73.0 ± 1.6 76.1 ± 5.3 74.8 ±2.7 75.9 ± 4.3 71.6 ±3.3
15 69.6 ± 1.8 73.7 ± 2.0 75.0 ± 2.5 76.2 ± 2.3 79.4 ± 1.4 73.5 ± 2.4 74.2 ± 2.2 78.3 ±1.6
18 71.2 ± 3.8 74.3 ± 3.2 74.7 ± 1.1 77.0 ± 4.0 75.9 ± 1.1 76.8 ±2.8 77.8 ± 2.2 74.2 ±1.6
21 73.5 ± 3.0 76.9 ± 2.6 77.1 ± 1.9 75.9 ± 3.5 76.6 ± 2.3 77.0 ±2.1 77.9 ± 3.1 76.1 ±4.1
24 70.1 ± 4.0 72.2 ± 1.2 75.1 ± 2.1 75.8 ± 2.1 75.6 ± 2.7 76.9 ±4.0 77.8 ± 4.0 69.6 ±3.9
27 70.4 ± 2.3 73.4 ± 3.7 74.2 ± 3.5 78.0 ± 1.7 77.0 ± 4.4 75.1 ±4.4 75.1 ± 2.5 70.9 ±3.2
30 73.3 ± 2.4 74.5 ± 2.5 74.9 ± 2.8 75.8 ± 2.7 76.0 ± 2.5 72.4 ±2.3 72.9 ± 3.2 72.0 ±3.9
33 71.8 ± 2.0 73.8 ± 4.3 77.1 ± 2.4 77.7 ± 2.9 76.6 ± 4.3 74.4 ±2.4 73.9 ± 5.5 62.9 ±8.9
36 69.5 ± 3.3 74.3 ± 4.0 77.5 ± 3.1 74.3 ± 2.9 77.0 ± 3.6 70.3 ±4.2 67.9 ± 6.0 66.2 ±0.8
39 72.4 ± 0.6 72.2 ± 2.5 75.0 ± 3.5 78.2 ± 3.7 75.6 ± 2.2 62.9 ±6.8 64.0 ± 1.8 62.5 ±4.9
42 70.7 ± 2.2 73.0 ± 2.3 76.2 ± 3.2 77.4 ± 2.8 72.3 ± 3.1 66.7 ±5.1 63.6 ± 2.5 62.8 ±2.3
Table 4: Comparison between our approach and previously published works addressing gender recognition from
whole body static images (Section 2) that report results on the same dataset as ours (the CBCL pedestrian
database (Oren et al., 1997)).

Method                 | Balanced dataset? | Uses (Cao et al., 2008) manual labelling? | Frontal view accuracy | Back view accuracy | Mixed view accuracy
(Cao et al., 2008)     | No  | Yes | 76.0 ± 1.2 | 74.6 ± 3.4 | 75.0 ± 2.9
(Collins et al., 2009) | Yes | No  | 76.0 ± 8.1 | Not reported | Not reported
(Guo et al., 2010)     | No  | Yes | 79.5 ± 2.6 | 84.0 ± 3.9 | 80.6 ± 1.2
Ours                   | No  | Yes | 81.9 ± 5.0 | 80.8 ± 2.3 | 79.4 ± 1.4
The optimal grid varies with the point of view of the figure, resulting in a grid 3.6 times as dense as
theirs (756 vs. 210 cells) in the case of the back view, 1.2 times as dense (252 cells) in the case of the
frontal view, and of comparable density (225 cells) for the mixed view. This variation leads us to believe
that the approach of (Guo et al., 2010) (i.e. first detecting the view and then recognizing the gender
using an optimal classifier for that view) is a promising direction that requires further study.
The importance of the grid is evidenced by the state-of-the-art results, which we outperform in the case of
the frontal view and nearly equal in the case of the mixed view, using a classifier simpler than those
proposed in the literature (Section 2) and a simple feature extractor (R-HOG, Section 3).
We think there is a need for a dataset specifically created to test gender recognition algorithms, large
enough to allow the use of a separate test set, different from that used in the validation scheme, in order
to obtain a more realistic accuracy rate (Alpaydın, 2010).
Figure 3: Evolution of the accuracy (%) as the number of
iterations for AdaBoost increases, obtained for the optimal
grid found in Section 4.3 for each view: 21 × 12 for the
frontal, 36 × 21 for the back, and 15 × 15 for the mixed.
Another aspect to be considered in the future is
the unbalanced distribution of the classes. One way
of managing the unbalanced nature of a dataset is the
method proposed by (Kang and Cho, 2006), used for
example in a recent work dealing with gender recog-
nition through gait (Martín-Félez et al., 2010).
ACKNOWLEDGEMENTS
The authors acknowledge the Spanish research pro-
gramme Consolider Ingenio-2010 CSD2007-00018,
and Fundació Caixa-Castelló Bancaixa under project
P1·1A2010-11. Carlos Serra-Toro is funded by
Generalitat Valenciana under the “VALi+d program
for research personnel in training” with grant code
ACIF/2010/135.
We thank Liangliang Cao (Cao et al., 2008) for
his help in understanding and reproducing their al-
gorithm PBGR (Section 2). This research uses the
CBCL pedestrian database (Oren et al., 1997) col-
lected by the Center for Biological & Computational
Learning (CBCL) at MIT.
REFERENCES
Alpaydın, E. (2010). Introduction to Machine Learning.
The MIT Press, Cambridge, Massachusetts.
Cao, L., Dikmen, M., Fu, Y., and Huang, T. S. (2008). Gen-
der recognition from body. In MM ’08: Proceeding of
the 16th ACM international conference on Multime-
dia, pages 725–728, New York, NY, USA.
Collins, M., Zhang, J., Miller, P., and Wang, H. (2009). Full
body image feature representations for gender profil-
ing. In IEEE ICCV Workshops, pages 1235–1242.
Collins, M., Zhang, J., Miller, P., Wang, H., and Zhou, H.
(2010). Eigenbody: Analysis of body shape for gender
from noisy images. In International Machine Vision
and Image Processing Conference.
Dalal, N. and Triggs, B. (2005). Histograms of oriented gra-
dients for human detection. In IEEE CVPR, volume 1,
pages 886–893.
Freund, Y. and Schapire, R. (1995). A decision-theoretic
generalization of on-line learning and an application
to boosting. In Computational Learning Theory, vol-
ume 904 of LNCS, pages 23–37. Springer Berlin / Hei-
delberg.
Gray, D., Brennan, S., and Tao, H. (2007). Evaluating ap-
pearance models for recognition, reacquisition, and
tracking. Proc. IEEE International Workshop on Per-
formance Evaluation for Tracking and Surveillance
(PETS).
Guo, G., Mu, G., and Fu, Y. (2010). Gender from body:
A biologically-inspired approach with manifold learn-
ing. In Computer Vision ACCV 2009, volume 5996 of
LNCS, pages 236–245. Springer Berlin / Heidelberg.
Jolliffe, I. (2005). Principal Component Analysis. Encyclo-
pedia of Statistics in Behavioral Science. John Wiley
& Sons, Ltd.
Kang, P. and Cho, S. (2006). EUS SVMs: Ensemble of
under-sampled SVMs for data imbalance problems.
In Neural Information Processing, volume 4232 of
LNCS, pages 837–846. Springer Berlin / Heidelberg.
Martín-Félez, R., Mollineda, R. A., and Sánchez, J. S.
(2010). A gender recognition experiment on the CA-
SIA gait database dealing with its imbalanced na-
ture. In VISAPP, volume 2, pages 439–444, Angers
(France).
Mease, D. and Wyner, A. (2008). Evidence contrary to
the statistical view of boosting. Journal of Machine
Learning Research, 9:131–156.
Moghaddam, B. and Yang, M.-H. (2002). Learning gender
with support faces. IEEE PAMI, 24(5):707–711.
Oren, M., Papageorgiou, C., Sinha, P., Osuna, E., and Pog-
gio, T. (1997). Pedestrian detection using wavelet
templates. In IEEE CVPR, pages 193–99.
Serre, T., Wolf, L., Bileschi, S., Riesenhuber, M., and Pog-
gio, T. (2007). Robust object recognition with cortex-
like mechanisms. IEEE PAMI, 29(3):411–426.
van de Sande, K. E. A., Gevers, T., and Snoek, C. G. M.
(2008). Evaluation of color descriptors for object
and scene recognition. In IEEE CVPR, Anchorage,
Alaska, USA.
Yu, S., Tan, T., Huang, K., Jia, K., and Wu, X. (2009). A
study on gait-based gender classification. IEEE TIP,
18(8):1905–1910.
Zhang, T. and Yu, B. (2005). Boosting with early stopping:
Convergence and consistency. The Annals of Statis-
tics, 33(4):1538–1579.