Robust Human Detection using Bag-of-Words and Segmentation

Yuta Tani and Kazuhiro Hotta

Meijo University, 1-501 Shiogamaguchi, Tenpaku, Nagoya, 468-0051, Japan

Keywords: Human Detection, Bag-of-Words, Pose Variation, Occlusion, GrabCut and Color Names.

Abstract: Bag-of-Words (BoW) has been reported to be effective for detecting humans with large pose changes and occlusions in still images, because it yields a consistent representation even when a human changes pose or is partly occluded. However, the conventional method represents all information within a bounding box as positive data. Since the bounding box is a rectangle enclosing a human, the background region is also included in the BoW representation, which degrades the detection accuracy. In this paper, we therefore segment the region by GrabCut or Color Names, so that the influence of the background is reduced and the BoW histogram is obtained from the human region only. Comparison with the deformable part model (DPM) and the conventional BoW method demonstrates the effectiveness of our method.

1 INTRODUCTION

In recent years, people have come to handle a large number of images owing to the spread of digital cameras and camera-equipped cell phones. It is therefore desirable that computers recognize the semantic contents of images automatically. Many pictures show humans performing actions, so recognizing actions in still images is important for image understanding. It is reported that using both the foreground and background regions is effective for object recognition in still images (Russakovsky et al., 2012). For action recognition in still images, humans performing actions must be detected so that the foreground and background can be represented independently.

Humans performing actions in real images often exhibit large pose changes and occlusions. In such cases, detection is difficult even with the Deformable Part Model (DPM) (Felzenszwalb et al., 2010). The conventional method (Tani and Hotta, 2014) uses a BoW representation to cope with large pose changes and occlusions, and it achieved higher accuracy than DPM. However, the conventional method represents all information within a bounding box as positive data. Some bounding boxes are shown as red rectangles in Figure 1; they also include the background region. In addition, the background area becomes large when a person raises a hand or spreads the legs. Thus, the conventional method is influenced by the background region.

Figure 1: The top row shows positive training images. The red rectangle is the bounding box, which also includes the background region. The bottom row shows the segmentation results obtained by GrabCut. If only the segmented human is represented by BoW, the adverse influence of the background is reduced and the accuracy is expected to improve further.

Furthermore, in the test step of the conventional method, the test image is divided into a grid, and combinations of the grid cells are represented by BoW. The background region therefore also affects the detection result.

In this paper, we use only the human region segmented by GrabCut (Rother et al., 2004) as positive training data in order to reduce the influence of the background. GrabCut does not segment a human region perfectly, but it gives good segmentation results, as shown in Figure 1, where the background is removed well. For negative training data, we randomly crop regions outside the bounding box and represent them by BoW. We train the SVM using those positive and negative data.


Tani Y. and Hotta K..

Robust Human Detection using Bag-of-Words and Segmentation.

DOI: 10.5220/0005354705040509

In Proceedings of the 10th International Conference on Computer Vision Theory and Applications (VISAPP-2015), pages 504-509

ISBN: 978-989-758-090-1

© 2015 SCITEPRESS (Science and Technology Publications, Lda.)

However, GrabCut requires a bounding box that contains a human. The bounding box is available in the training phase but not in the test phase. Thus, in the test phase we use Color Names (Weijer and Schmid, 2007) instead of GrabCut. First, we convert the RGB image into a Color Names image. If adjacent pixels are classified as the same color, we assign them the same region label. We represent various combinations of the labeled regions by BoW histograms and feed them into the SVM. We detect a human as the combination of labeled regions with the maximum score. The details are described in Section 2.

In the experiments, we use the Stanford 40 dataset (Yao et al., 2011). The dataset contains images in which people appear with large pose changes and occlusions. The accuracy is measured by the average overlapping rate between the detected region and the ground truth attached to the test images. DPM, evaluated as the baseline, achieves 28.42%, while our method achieves 52.07%. The method without segmentation by GrabCut or Color Names achieves 48.45%. These results demonstrate that our method is effective for human detection under partial occlusion and pose changes.

This paper is organized as follows. Our human detection method is explained in Section 2. The experimental results on the Stanford 40 dataset, including a comparison with DPM and the conventional method, are shown in Section 3. Finally, the conclusion and future work are described in Section 4.

2 DETECTION USING BAG-OF-WORDS OF SEGMENTED REGIONS

In the conventional method (Tani and Hotta, 2014), a BoW representation is used to cope with large pose changes and occlusions. A codebook is built by k-means clustering of RootSIFT descriptors (Arandjelović and Zisserman, 2012). The method was superior to DPM. However, it represented all information within the bounding box by BoW and was therefore influenced by the background region. Furthermore, in the test step, the image is divided into a grid and combinations of grid cells are represented by BoW, so the background region also affects the detection result.

In this paper, GrabCut is used to segment a human in training. In the test phase, we segment the region using Color Names, and combinations of the segmented regions are fed into the SVM. By using only the segmented region without the background, the accuracy is improved.

Figure 2: Overview of the proposed method.

An overview of our detection method using BoW of segmented regions is shown in Figure 2. As described previously, we use the GrabCut result for the bounding box as the positive training data. The negative samples are generated by randomly cropping regions outside the human regions. We train the SVM using the BoW representations of those positive and negative samples.
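The negative sampling step can be sketched as follows. This is a minimal illustration and not the authors' code: the paper only states that regions are cropped randomly outside the human region, so the rejection-sampling strategy, the `min_size` and `max_tries` parameters, the (x0, y0, x1, y1) box convention, and the rule that a crop must not overlap the bounding box at all are assumptions.

```python
import random


def boxes_overlap(a, b):
    """True if half-open boxes a and b, given as (x0, y0, x1, y1), share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def sample_negatives(img_w, img_h, human_box, n, min_size=32, max_tries=1000, seed=0):
    """Draw up to n random crops that do not overlap human_box (rejection sampling).

    min_size, max_tries and seed are hypothetical parameters for this sketch;
    min_size must not exceed the image width or height.
    """
    rng = random.Random(seed)
    crops = []
    for _ in range(max_tries):
        if len(crops) == n:
            break
        w = rng.randint(min_size, img_w)
        h = rng.randint(min_size, img_h)
        x0 = rng.randint(0, img_w - w)
        y0 = rng.randint(0, img_h - h)
        box = (x0, y0, x0 + w, y0 + h)
        if not boxes_overlap(box, human_box):
            crops.append(box)
    return crops


if __name__ == "__main__":
    for crop in sample_negatives(640, 480, (200, 100, 400, 400), n=5):
        print(crop)
```

Each returned crop would then be represented by BoW in the same way as a positive sample.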

In the test phase, the bounding box is unavailable. Thus, we use Color Names instead of GrabCut. We segment the image into regions based on Color Names and compute BoW histograms of various combinations of the segmented regions. We feed them into the SVM, and the combination of regions with the maximum output is used as the detection result.

2.1 Bag-of-Words Representation

We use a standard BoW representation. A codebook is built by k-means clustering of RootSIFT descriptors, as in the conventional method. In general, the location information of local features is helpful in object categorization (Lazebnik et al., 2006). However, when a human is occluded or changes pose, location information makes the feature vector inconsistent. Thus, we use the standard, orderless BoW representation to cope with such cases.
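As a rough illustration of this representation, the sketch below applies the RootSIFT transform (L1 normalization followed by an element-wise square root) and builds an orderless histogram by assigning each descriptor to its nearest visual word. The function names, the tiny 2-dimensional descriptors and the toy codebook are assumptions made for readability; real SIFT descriptors have 128 dimensions and the paper's codebook has 1000 words.

```python
import math


def root_sift(desc, eps=1e-12):
    """RootSIFT: L1-normalize a non-negative descriptor, then take element-wise sqrt."""
    total = sum(desc) + eps
    return [math.sqrt(v / total) for v in desc]


def nearest_word(desc, codebook):
    """Index of the codeword with the smallest squared Euclidean distance."""
    best, best_dist = 0, float("inf")
    for i, word in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(desc, word))
        if dist < best_dist:
            best, best_dist = i, dist
    return best


def bow_histogram(descriptors, codebook):
    """Count how many descriptors fall on each visual word; feature locations are ignored."""
    hist = [0] * len(codebook)
    for desc in descriptors:
        hist[nearest_word(root_sift(desc), codebook)] += 1
    return hist


if __name__ == "__main__":
    codebook = [[1.0, 0.0], [0.0, 1.0]]           # toy codebook in RootSIFT space
    descs = [[9.0, 1.0], [1.0, 9.0], [8.0, 2.0]]  # toy raw descriptors
    print(bow_histogram(descs, codebook))
```

Because the histogram only counts words, reordering the descriptors (for example when a limb moves to a different image location) leaves it unchanged.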

In the experiments, to be independent of the image size, we extract RootSIFT features with a grid spacing of 2% and patch sizes of 1%, 3% and 5% of the smaller of the width and height of each image. The number of visual words is set to 1000. These parameters are the same as in the conventional method (Tani and Hotta, 2014), which allows a fair comparison. We train an SVM with a Hellinger kernel (Vedaldi and Zisserman, 2010). By applying a linear SVM after taking the square root of each element of the BoW histogram, this nonlinearity is obtained without increasing the computation time.
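The reason the square root suffices can be checked numerically. For L1-normalized histograms p and q, the Hellinger kernel is the sum over bins of sqrt(p_i q_i), which equals the ordinary inner product of the element-wise square roots. The sketch below verifies this identity; the L1 normalization step and the function names are assumptions, since the paper does not spell out its normalization.

```python
import math


def hellinger_map(hist):
    """Explicit feature map: L1-normalize the histogram, then take element-wise sqrt."""
    total = sum(hist)
    if total == 0:
        return [0.0] * len(hist)
    return [math.sqrt(h / total) for h in hist]


def hellinger_kernel(p, q):
    """Hellinger kernel between two non-negative histograms after L1 normalization."""
    sp, sq = sum(p), sum(q)
    if sp == 0 or sq == 0:
        return 0.0
    return sum(math.sqrt((a / sp) * (b / sq)) for a, b in zip(p, q))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


if __name__ == "__main__":
    p, q = [3, 1, 0, 4], [1, 1, 2, 4]
    print(hellinger_kernel(p, q))
    print(dot(hellinger_map(p), hellinger_map(q)))  # same value up to rounding
```

A linear SVM trained on `hellinger_map(hist)` therefore behaves like a Hellinger-kernel SVM while keeping the cost of a linear classifier at test time.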

Figure 3: The details of filtering. We convert the input image (a) into Color Names (b). (c) is the enlarged view of the orange rectangle in (b). We consider the local 3×3 region shown in yellow, and the center pixel shown in red is replaced by the most frequent Color Name in the 3×3 region, as in (d). We carry out this process for all pixels in the image, as shown in (e). The result after applying several filters is shown in (f).

2.2 Segmentation using GrabCut

GrabCut is reported to give good segmentation results (Gavves et al., 2013). Furthermore, it can segment a region from a bounding box without human interaction. Since the Stanford 40 dataset used in the experiments provides a bounding box for each human, we segment the human region from the bounding box by GrabCut. Some segmentation results are shown in Figure 1. The segmentation is not perfect, but GrabCut extracts the human region roughly.

Figure 4: The left image is the result of applying a 3×3 filter ten times. Applying the small filter many times does not decrease the number of regions much. The right image is the result of applying a 21×21 filter once. The large filter loses fine edges. For example, the sky and the window of the truck are merged.

2.3 Segmentation using Color Names

Since the bounding box is unavailable in the test phase, we cannot use GrabCut. Thus, we use Color Names instead of GrabCut to segment regions. Since Color Names uses a fixed set of 11 basic colors, it is suited to rough segmentation. First, we convert the RGB image (Figure 3 (a)) into Color Names (Figure 3 (b)) and divide the image into regions based on the labels assigned by Color Names. If adjacent pixels have the same color, the same label is attached to them. However, when the number of regions is large, the number of combinations becomes huge and the computational cost is high. Thus, we apply a filter to reduce the number of regions and the computation time. We consider a 3×3 region as shown in Figure 3 (c), and the center pixel is replaced by the most frequent Color Name in the 3×3 region as shown in Figure 3 (d). This process is carried out for all pixels in the image as shown in Figure 3 (e).
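The labelling rule above (adjacent pixels with the same Color Name receive the same region label) amounts to connected-component labelling. A minimal sketch follows; it assumes the Color Names image is already available as a 2D list of color names, and it assumes 4-connectivity, which the paper does not specify.

```python
from collections import deque


def label_regions(color_map):
    """Assign one integer label per connected region of identical color.

    color_map is a 2D list of color names. 4-connectivity is an assumption
    of this sketch. Returns (labels, number_of_regions).
    """
    h, w = len(color_map), len(color_map[0])
    labels = [[-1] * w for _ in range(h)]
    n_regions = 0
    for y in range(h):
        for x in range(w):
            if labels[y][x] != -1:
                continue
            color = color_map[y][x]
            labels[y][x] = n_regions
            queue = deque([(y, x)])
            while queue:  # breadth-first flood fill of one region
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and labels[ny][nx] == -1
                            and color_map[ny][nx] == color):
                        labels[ny][nx] = n_regions
                        queue.append((ny, nx))
            n_regions += 1
    return labels, n_regions


if __name__ == "__main__":
    toy = [["blue", "blue", "red"],
           ["green", "blue", "red"],
           ["green", "green", "blue"]]
    labels, n = label_regions(toy)
    print(n)
    for row in labels:
        print(row)
```

In the toy image the lower-right blue pixel forms its own region because it touches the other blue pixels only diagonally, which shows how isolated pixels inflate the region count before filtering.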

If a small filter is applied to the image many times, the number of regions does not decrease much, as shown in Figure 4. If a large filter is used, fine edges are lost. To avoid both problems, we apply filters of sizes 3×3, 5×5, 7×7, 9×9 and 11×11 pixels. The result after applying these filters is shown in Figure 3 (f). By applying filters of different sizes, we can decrease the number of regions while maintaining fine edges.
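The filtering described above is a mode (majority) filter applied at several window sizes. The sketch below is one possible reading: applying the sizes in increasing order, truncating the window at the image border, and keeping the original label when it ties for most frequent are all assumptions not stated in the paper.

```python
from collections import Counter


def mode_filter(labels, size):
    """Replace each pixel by the most frequent label in its size x size window.

    Windows are truncated at the image border. On ties the original label is
    kept if it is among the most frequent, otherwise the smallest tied label
    is used (both tie rules are assumptions of this sketch).
    """
    h, w = len(labels), len(labels[0])
    r = size // 2
    out = [row[:] for row in labels]
    for y in range(h):
        for x in range(w):
            counts = Counter(labels[yy][xx]
                             for yy in range(max(0, y - r), min(h, y + r + 1))
                             for xx in range(max(0, x - r), min(w, x + r + 1)))
            best = max(counts.values())
            if counts[labels[y][x]] != best:
                out[y][x] = min(k for k, v in counts.items() if v == best)
    return out


def smooth(labels, sizes=(3, 5, 7, 9, 11)):
    """Apply the mode filter once per window size, smallest window first."""
    for size in sizes:
        labels = mode_filter(labels, size)
    return labels


if __name__ == "__main__":
    speck = [["sky"] * 5 for _ in range(5)]
    speck[2][2] = "red"  # an isolated pixel that forms its own region
    for row in smooth(speck):
        print(row)
```

In this toy case the isolated pixel is absorbed by its surroundings, while a straight boundary between two large regions would be left in place.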

Next, we reduce the number of regions further. To simplify the description, we treat the image in Figure 5 (a) as the result after applying the filters. We search for the smallest region (region V in (a)) and merge it into its smallest adjacent region. In the example, that region is III, so III and V are merged. We then search for the smallest region again, and the same process is repeated until the number of regions is equal to or less than a threshold. In the experiments, we set the threshold to 20.
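The merging loop can be sketched on a region adjacency graph. The data layout (a dictionary of pixel counts and a dictionary of neighbor sets), the function name and the tie-breaking by region id are assumptions of this sketch; the toy sizes and adjacencies are hypothetical and are not taken from Figure 5.

```python
def merge_small_regions(sizes, adjacency, max_regions=20):
    """Repeatedly merge the smallest region into its smallest neighbor.

    sizes: {region_id: pixel_count}; adjacency: {region_id: set of neighbor ids}.
    Stops when at most max_regions remain. Returns (sizes, adjacency, merged_into),
    where merged_into maps each removed region to the region that absorbed it.
    """
    sizes = dict(sizes)
    adj = {r: set(n) for r, n in adjacency.items()}
    merged_into = {}
    while len(sizes) > max_regions:
        smallest = min(sizes, key=lambda r: (sizes[r], r))
        if not adj[smallest]:
            break  # isolated region: nothing to merge with
        target = min(adj[smallest], key=lambda r: (sizes[r], r))
        sizes[target] += sizes.pop(smallest)
        for n in adj.pop(smallest):
            adj[n].discard(smallest)
            if n != target:  # former neighbors of the merged region now touch target
                adj[n].add(target)
                adj[target].add(n)
        merged_into[smallest] = target
    return sizes, adj, merged_into


if __name__ == "__main__":
    # Hypothetical five-region layout: region 5 is smallest and touches 3 and 4.
    sizes = {1: 40, 2: 30, 3: 15, 4: 25, 5: 5}
    adjacency = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 5}, 4: {1, 5}, 5: {3, 4}}
    print(merge_small_regions(sizes, adjacency, max_regions=4))
```

With this layout, region 5 is merged into region 3, the smaller of its two neighbors, mirroring the V and III example in the text.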

Figure 5: We search the smallest region in (a). In this

example, it is V. The region is merged to the smallest

adjacent region III as shown in (b).

2.4 Detection Method

As described previously, in the test step, we segment

the region using Color Names. After that, we

compute the score of all combinations of the labeled

VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications

506

regions. In the case of Figure 5, all combinations are

shown in Figure 6 (a). The non-adjacent

combinations such as [II, IV] are not used. We

represent all combinations by BoW. For example, in

the case of combination [I, IV], the shaded region

shown in Figure 6 (b) is represented by BoW. We

feed those representations into SVM. Finally, we

detect the combined region with the maximum score

as a human.
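The search over combinations can be sketched as follows. Two assumptions are made here: a valid combination is taken to be any connected subset of the region adjacency graph, and the BoW histogram of a combination is taken to be the sum of the per-region histograms (which holds when each local feature is assigned to exactly one region). `score_fn` stands in for the trained SVM and is hypothetical, as are the toy histograms and the ring-shaped adjacency.

```python
from itertools import combinations


def is_connected(subset, adjacency):
    """True if the regions in subset form one connected piece of the adjacency graph."""
    subset = set(subset)
    start = next(iter(subset))
    seen, stack = {start}, [start]
    while stack:
        region = stack.pop()
        for n in adjacency[region] & subset:
            if n not in seen:
                seen.add(n)
                stack.append(n)
    return seen == subset


def connected_combinations(adjacency):
    """Yield every non-empty connected subset of regions as a sorted tuple."""
    regions = sorted(adjacency)
    for k in range(1, len(regions) + 1):
        for combo in combinations(regions, k):
            if is_connected(combo, adjacency):
                yield combo


def detect(region_hists, adjacency, score_fn):
    """Return the connected combination with the highest score and that score."""
    best_combo, best_score = None, float("-inf")
    for combo in connected_combinations(adjacency):
        # Histogram of the union of regions = element-wise sum of region histograms.
        hist = [sum(col) for col in zip(*(region_hists[r] for r in combo))]
        score = score_fn(hist)
        if score > best_score:
            best_combo, best_score = combo, score
    return best_combo, best_score


if __name__ == "__main__":
    adjacency = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}  # regions 2 and 4 do not touch
    region_hists = {1: [5, 0], 2: [0, 3], 3: [0, 4], 4: [4, 1]}
    toy_svm = lambda h: 1.0 * h[0] - 1.0 * h[1]               # hypothetical linear scorer
    print(detect(region_hists, adjacency, toy_svm))
```

Exhaustive enumeration grows exponentially with the number of regions; with the threshold of 20 regions there are at most 2^20 - 1 subsets to test, which is one motivation for reducing the region count beforehand.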

3 EXPERIMENTAL RESULTS

In the experiments, we use the Stanford 40 dataset

(Yao et al., 2011). The dataset is originally made for

action recognition in still images and contains 40

daily human actions. The bounding boxes including

a human are already given, and we use them for

training and evaluating the detection results.

For training, we made positive images by applying GrabCut to the training images and negative images by randomly cropping regions outside the bounding box. Consequently, the training set consists of 3,136 positive and 15,703 negative images. We represent them by BoW and train the SVM with LIBLINEAR (Fan et al., 2008).

Figure 6: We compute the score of all combinations of the

labeled regions. In the case of Figure 5, all combinations

are shown in (a). We do not use the non-adjacent

combination such as [II, IV]. We represent all

combinations by BoW. For example, in the case of

combination [I, IV], the shaded region shown in (b) is

represented by BoW, and we compute the SVM score. The

combination of region with the maximum score is detected

as a human.

We use 4,345 test images from the Stanford 40

dataset. Test images also include humans with

various poses and occlusions.

The proposed method is compared with DPM

and the conventional method using BoW (Tani and

Hotta, 2014). We explain each method.

(A) Deformable Part Model:

As the baseline method, we detect humans by DPM. The only annotations available in the Stanford 40 dataset are the bounding boxes indicating a human. Since it is difficult to train DPM using only those annotations, we use the publicly available source code with a pre-trained model obtained from (http://cs.brown.edu/~pff/latent-release4/). The model is trained on the INRIA Person Dataset. Because the training samples differ from those of our method, a direct comparison with DPM is difficult, but we show the result as a reference.

(B) Conventional method using BoW:

This method represents all information within the bounding box by BoW. Thus, the method is influenced by the background region contained in the bounding box. Furthermore, in the test step, the image is divided into a grid and combinations of the grid cells are represented by BoW, so the background region also affects the detection result.

Our method (C) segments the region by GrabCut or Color Names to reduce the influence of the background. We compute BoW histograms of combinations of the segmented regions and feed them into the SVM. We detect the combination of segmented regions with the maximum SVM score.

We evaluate the detection results by overlapping

rate R defined as

∩

∪

× 100[%]

(1)

where T is the area of the ground truth and D is the area of the detection result. The value of R increases as the overlap between T and D becomes larger. If they match perfectly, the overlapping rate becomes 100%. We use this measure because we want to evaluate how well the detection result matches the ground truth.
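For rectangular regions, the overlapping rate of Equation (1) can be computed as in the sketch below. Treating T and D as axis-aligned boxes in (x0, y0, x1, y1) form is an assumption that matches the rectangles drawn in the result figures; the function names are illustrative.

```python
def box_area(box):
    """Area of a box (x0, y0, x1, y1); empty or inverted boxes have area 0."""
    return max(0, box[2] - box[0]) * max(0, box[3] - box[1])


def overlap_rate(truth, detection):
    """Overlapping rate R in percent: area of intersection over area of union."""
    inter_box = (max(truth[0], detection[0]), max(truth[1], detection[1]),
                 min(truth[2], detection[2]), min(truth[3], detection[3]))
    inter = box_area(inter_box)
    union = box_area(truth) + box_area(detection) - inter
    return 100.0 * inter / union if union > 0 else 0.0


if __name__ == "__main__":
    print(overlap_rate((0, 0, 10, 10), (0, 0, 10, 10)))   # perfect match
    print(overlap_rate((0, 0, 10, 10), (5, 0, 15, 10)))   # half of each box overlaps
    print(overlap_rate((0, 0, 10, 10), (20, 20, 30, 30))) # disjoint boxes
```

Note that a detection shifted by half its width already drops to one third: the intersection is 50 and the union is 150, so R = 33.3%. This puts the average rates reported below into perspective.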

Table 1: Average overlapping rate for each method.

Method             (A) DPM    (B) Conventional BoW    (C) Proposed
Overlapping rate   28.42%     48.45%                  52.07%

Table 1 shows the average overlapping rate over the test images for each method. DPM (A) is inferior to the other two methods. Using BoW instead of DPM makes the detector robust to partial occlusion and pose changes. This result shows that BoW is effective for detection tasks in which the appearance of a human varies greatly.

Comparing (B) and (C), we see that the proposed method (C) is better than the conventional method (B). The improvement is 3.62 percentage points (52.07% versus 48.45%). This result shows that the background region influences BoW histograms and that using combinations of segmented regions is effective for improving the accuracy.

RobustHumanDetectionusingBag-of-WordsandSegmentation

507

Figure 7: Comparison results. The blue rectangle is the ground truth and the red rectangle is the detection result.

Some detection results by each method are shown in Figure 7. The blue rectangle shows the ground truth and the red rectangle shows the detection result. We see that the proposed method (C) detects humans correctly. In contrast, DPM (A) gives poor results on this dataset. One reason is that a different dataset is used for training: since the INRIA Person Dataset was made for pedestrian detection, the resolution of the human images is low. Another reason is that DPM cannot handle large pose changes even if the whole body appears, which decreases the accuracy. However, when a human shows neither large pose changes nor occlusions, DPM (A) detects humans well.

The conventional method (B) represents all information within the bounding box. Since the background region within the bounding box is also trained as positive data, method (B) tends to detect a human together with a large background area. By using segmentation, the background is reduced and the proposed method (C) can detect humans with higher accuracy.

4 CONCLUSION AND FUTURE WORK

In this paper, we proposed a method for detecting a human using Bag-of-Words of combinations of regions segmented by Color Names. BoW is robust to partial occlusions and pose changes. Furthermore, we can reduce the influence of the background region by combining the segmented regions. This improves the detection accuracy.

In the proposed method, merging the smallest region with an adjacent region in the test phase is a rather forced approach. If another segmentation method is used, the detection accuracy may improve. It is worth trying SLIC, which gives good segmentation quality (Achanta et al., 2010). This is a subject for future work.

ACKNOWLEDGEMENTS

This work was partially supported by KAKENHI

Grant Number 24700178.

REFERENCES

Russakovsky, O., Lin, Y., Yu, K. and Fei-Fei, L., 2012. Object-centric spatial pooling for image classification, European Conference on Computer Vision.

Felzenszwalb, P., Girshick, R., McAllester, D. and Ramanan, D., 2010. Object Detection with Discriminatively Trained Part Based Models, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 32, No. 9, Sep.

Csurka, G., Dance, C., Fan, L., Willamowski, J. and Bray, C., 2004. Visual Categorization with Bags of Keypoints, Proc. of ECCV Workshop on Statistical Learning in Computer Vision, pp. 59-74.

Arandjelović, R. and Zisserman, A., 2012. Three things everyone should know to improve object retrieval, In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2911-2918.

Discriminatively trained deformable part models. http://cs.brown.edu/~pff/latent-release4/

Lazebnik, S., Schmid, C. and Ponce, J., 2006. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories, In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2169-2178.

Fan, R., Chang, K., Hsieh, C., Wang, X. and Lin, C., 2008. LIBLINEAR: A library for large linear classification, Journal of Machine Learning Research 9, pp. 1871-1874.

Yao, B., Jiang, X., Khosla, A., Lin, A.L., Guibas, L.J. and Fei-Fei, L., 2011. Human Action Recognition by Learning Bases of Action Attributes and Parts, International Conference on Computer Vision.

INRIA Person Dataset. http://pascal.inrialpes.fr/data/human/

Vedaldi, A. and Zisserman, A., 2010. Efficient Additive Kernels via Explicit Feature Maps, In IEEE Conference on Computer Vision and Pattern Recognition, Vol. 34, No. 3, pp. 480-492.

Tani, Y. and Hotta, K., 2014. Robust Human Detection to Pose and Occlusion Using Bag-of-Words, International Conference on Pattern Recognition, pp. 4376-4381.

Rother, C., Kolmogorov, V. and Blake, A., 2004. GrabCut: Interactive foreground extraction using iterated graph cuts, The ACM Special Interest Group on Computer Graphics, Vol. 23, pp. 309-314.

Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P. and Süsstrunk, S., 2010. SLIC superpixels, Technical report, EPFL.

Weijer, J. and Schmid, C., 2007. Applying Color Names to Image Description, In IEEE Conference on Computer Vision and Pattern Recognition, Vol. 3, pp. 493-496.

Gavves, E., Fernando, B., Snoek, C.G.M., Smeulders, A.W.M. and Tuytelaars, T., 2013. Fine-Grained Categorization by Alignments, In IEEE International Conference on Computer Vision.

RobustHumanDetectionusingBag-of-WordsandSegmentation

509