Video Shot Boundary Detection using Visual Bag-of-Words
Jukka Lankinen¹ and Joni-Kristian Kämäräinen²
¹ Machine Vision and Pattern Recognition Laboratory, Lappeenranta University of Technology (LUT Kouvola), Kouvola, Finland
² Department of Signal Processing, Tampere University of Technology, Tampere, Finland
Keywords: Bag of Words, Bag of Features, Shot Boundary Detection.
Abstract:
Recently, a convergence of the techniques used in image analysis and video processing has occurred. Many computation and memory intensive image analysis methods have become available for per-frame processing of videos due to the increased computing power of desktop computers and efficient implementations on multiple cores and graphics processing units (GPUs). As our main contribution in this work, we solve the problem of shot boundary detection using a popular image analysis (object detection) approach: visual bag-of-words (BoW). The baseline approach for shot boundary detection has been the colour histogram, which is at the core of many top methods, but our BoW method of similar complexity in terms of parameters clearly outperforms colour histograms. Interestingly, an "AND-combination" of the colour and BoW histogram detections is clearly superior, indicating that colour and local features provide complementary information for video analysis.
1 INTRODUCTION
Problem settings in image and video processing/analysis are almost equivalent, but the adopted approaches have diverged due to the per-frame processing required in many video processing tasks, such as video shot boundary detection. For example, one hour of video contains approximately 100,000 frames, so a processing time of one second per frame would amount to roughly 27 hours in total. In tasks of this kind, typically "fast-to-compute" features, such as colour histograms, have been used.
On the other hand, benchmark databases for image analysis have also become very large. For example, there are nearly 15 million annotated images in ImageNet¹. This has set new demands for approaches, and development has produced not only new techniques but also more efficient implementations of the existing ones.

¹ http://www.image-net.org/
Thus, in this work, we adopt the state-of-the-art BoW method for processing massive amounts of images: dense SIFT for feature detection and representation, k-means clustering for codebook generation, L1 normalisation of codebook histograms, and the Euclidean distance for code matching. Our main contribution is to apply this method to shot boundary detection. In addition, we compare video-specific codebooks, generated from the local features extracted from an input video, to a "general codebook" generated from the ImageNet descriptors used in (Deng et al., 2010). Moreover, we study the effect of varying the codebook size, which is the most important parameter of BoW. The experiments are performed using the TRECVid 2007 shot boundary detection competition data. We compare our approach to the frame windows method (Tahaghoghi et al., 2005), which was among the top performers in (Smeaton et al., 2010) and can be considered the baseline method for shot boundary detection. Our main contributions are:
- An efficient bag-of-features method for detecting shot boundaries. In the experiments, our method performed better than the baseline (note that colour histograms are used by many state-of-the-art methods).
- We investigate the effect of the codebook size and whether the codebook should be video specific or general, both being important computational considerations.
- We show how the combination of colour histograms and local feature histograms provides clearly superior results, indicating that colour and local features provide complementary information for video processing.
2 PREVIOUS WORK
In many applications, such as video abstraction (Truong and Venkatesh, 2007) and content-based retrieval (Sivic and Zisserman, 2003), video shot boundary detection is the first step before higher-level processing. For analysis, the shots are usually considered the basic units, and thus the success of boundary detection affects the whole processing pipeline. Shot detection has been studied both within specific applications and as a problem in its own right, and a wide variety of methods have been proposed.
A good introduction to the subject and an analysis of the best approaches on the common benchmark data were provided in a TRECVid survey (Smeaton et al., 2010), which summarised the findings over seven years of the TRECVid shot boundary competition. The vast majority of the best performing methods utilise colour histograms and machine learning algorithms, such as GMMs (Gaussian Mixture Models) (Kang and Hua, 2005) or HMMs (Hidden Markov Models) (Pruteanu-Malinici and Carin, 2008). It is noteworthy that the colour histogram difference, which is considered the baseline method, performs notably well and is virtually parameter-free except for the difference detection threshold (Gargi et al., 2000). The histogram-based methods utilise a distance metric between the histograms of two consecutive frames which measures how the content has changed: high difference peaks on the time line may denote hard cuts, and sequences of smaller consecutive changes may denote fade-outs. The colour histogram based shot boundary detectors are fast and accurate when accompanied by heuristics for all transition types (Tahaghoghi et al., 2005; Joyce and Liu, 2006; Mas and Fernandez, 2003). For our experiments, we selected the colour histogram variant in (Tahaghoghi et al., 2005).
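To make the idea concrete, below is a minimal sketch of a colour-histogram-difference detector (not the exact frame-windows variant of (Tahaghoghi et al., 2005)); the function names, the 8-bins-per-channel quantisation, and the use of OpenCV are our assumptions:

```python
import cv2
import numpy as np

def colour_histogram(frame, bins=8):
    # 3-D colour histogram with 'bins' bins per channel, L1-normalised.
    hist = cv2.calcHist([frame], [0, 1, 2], None,
                        [bins] * 3, [0, 256] * 3).flatten()
    return hist / max(hist.sum(), 1e-12)

def detect_cuts(video_path, tau=0.5):
    # Mark a hard cut wherever the L1 distance between the colour
    # histograms of two consecutive frames exceeds the threshold tau.
    cap = cv2.VideoCapture(video_path)
    cuts, prev_hist, i = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = colour_histogram(frame)
        if prev_hist is not None and np.abs(hist - prev_hist).sum() > tau:
            cuts.append(i)
        prev_hist, i = hist, i + 1
    cap.release()
    return cuts
```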
2.1 Visual Categorisation using BoW
The seminal works on the visual bag-of-words (BoW) are (Sivic and Zisserman, 2003) and (Csurka et al., 2004). In BoW, salient local image features (interest points) are extracted with a special detector (e.g. SIFT), or fixed-size patches are selected using dense sampling on a regular grid. Then, these "keypoints" are described with a descriptor, the SIFT descriptor being the most popular. In the training phase, a codebook is generated by clustering the descriptors into a fixed number of codes. In the matching phase, the best matching code is assigned to each descriptor. An image feature is generated by computing the histogram of the codes appearing in the image. Matching can then be performed via histogram similarity between two images (frames).
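A minimal sketch of this pipeline is given below, assuming a precomputed codebook (a K×128 array of SIFT code vectors); the grid parameters, function names, and the use of OpenCV and SciPy are our assumptions rather than the implementation of (Deng et al., 2010):

```python
import cv2
import numpy as np
from scipy.spatial import cKDTree

def bow_histogram(image_gray, codebook, step=8, size=16):
    # Dense sampling: fixed-size keypoints on a regular grid.
    h, w = image_gray.shape
    keypoints = [cv2.KeyPoint(float(x), float(y), size)
                 for y in range(step, h - step, step)
                 for x in range(step, w - step, step)]
    # Describe every grid point with SIFT.
    _, descriptors = cv2.SIFT_create().compute(image_gray, keypoints)
    # Assign each descriptor to its nearest code in the codebook.
    _, codes = cKDTree(codebook).query(descriptors)
    hist = np.bincount(codes, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1e-12)  # L1 normalisation

def frame_distance(hist1, hist2):
    # Frame similarity: Euclidean distance between code histograms.
    return np.linalg.norm(hist1 - hist2)
```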
There are numerous variants and extensions of the baseline method (e.g., (Lazebnik et al., 2006; Leibe et al., 2008; Cao et al., 2010)), but often the basic method performs best (Tuytelaars et al., 2010), and for large-scale problems the most efficient discriminative methods are no longer feasible (Deng et al., 2010). For this work, we adopt the recent implementation in (Deng et al., 2010). For feature detection the method uses dense sampling on a regular grid, which has lately replaced interest point detection in most visual object classification methods (Everingham et al., 2011). The descriptor of choice is SIFT, the codebook is generated using k-means clustering, and the feature histograms are L1-normalised.
Only a few shot boundary detection methods use local features. (Li et al., 2010) computed SIFT regions and descriptors, but did not utilise a codebook; they directly searched for SIFT matches between consecutive frames. A similar approach for content analysis was proposed by Sivic and Zisserman (Sivic and Zisserman, 2003), who used a codebook. Both techniques, however, are extremely slow due to random-sampling-based matching. Sivic and Zisserman ran the matching only for the key frames of every shot, as their application was content retrieval, and Li et al. did not report computation times for their method. To the authors' best knowledge, our work is the first to propose the bag-of-words approach for video shot boundary detection.
3 SHOT BOUNDARY DETECTION
The main parts of our own implementation are similar to the approach presented by Deng et al. This is particularly the case with the general codebook, which is generated from two million features extracted from ImageNet. It is interesting to study how well the general codebook performs compared to a specific codebook, which is re-generated for every input video. Specific codebooks are generated using features extracted from selected frames (one frame per second in our implementation) and the k-means clustering method; a sketch of this step is given below. The shot boundary detection algorithm itself is given in Alg. 1. It is noteworthy that the only parameter of our method is the detection threshold τ, which is equivalent to the colour histogram detection threshold. The other inputs are the video and a pre-computed codebook.
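The specific codebook generation could be sketched as follows; the use of OpenCV SIFT interest points (instead of the paper's dense grid, for brevity) and scikit-learn k-means, as well as all names and parameters, are our assumptions:

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def build_specific_codebook(video_path, n_codes=100):
    # Sample one frame per second, pool the SIFT descriptors of the
    # sampled frames and cluster them into n_codes codebook vectors.
    cap = cv2.VideoCapture(video_path)
    fps = int(cap.get(cv2.CAP_PROP_FPS)) or 25
    sift = cv2.SIFT_create()
    pool, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % fps == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            _, desc = sift.detectAndCompute(gray, None)
            if desc is not None:
                pool.append(desc)
        i += 1
    cap.release()
    km = KMeans(n_clusters=n_codes, n_init=4).fit(np.vstack(pool))
    return km.cluster_centers_
```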
VideoShotBoundaryDetectionusingVisualBag-of-Words
789
Algorithm 1: Video shot boundary detection (BoW).
1: Load codebook cb.
2: for all frames i in video do
3:   Init v(i) ← 0.
4:   Extract dense interest points and form their descriptors.
5:   Search the best matches in the codebook cb for every extracted descriptor using a fast KD-tree search.
6:   Form the code histogram h using the codes.
7:   L1-normalise the histogram h and compute the Euclidean distance d_curr to the previous frame histogram.
8:   Calculate the distance difference (derivative) d' = d_curr − d_prev.
9:   if d' ≥ τ then mark a shot boundary at the current frame: v(i) ← 1.
10: end for
11: Return the vector of shot boundaries v.
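A minimal Python sketch of Alg. 1 is given below; it reuses the hypothetical bow_histogram routine sketched in Section 2.1, and the function name and the iterable-of-frames interface are our assumptions:

```python
import numpy as np

def detect_shot_boundaries(frames, codebook, tau):
    # frames: iterable of greyscale frames; bow_histogram is the
    # hypothetical dense-SIFT code histogram routine sketched earlier.
    boundaries = []                                    # v in Alg. 1
    prev_hist, d_prev = None, 0.0
    for i, frame in enumerate(frames):
        hist = bow_histogram(frame, codebook)          # steps 4-7
        if prev_hist is not None:
            d_curr = np.linalg.norm(hist - prev_hist)  # Euclidean distance
            if d_curr - d_prev >= tau:                 # steps 8-9: derivative
                boundaries.append(i)                   # mark v(i) = 1
            d_prev = d_curr
        prev_hist = hist
    return boundaries
```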
4 EXPERIMENTS
The experiments were conducted with the TRECVid 2007 Shot Boundary data set, which contains almost 7 hours of human-annotated videos: 637,805 frames with 2,317 transitions. In our evaluation, we used the TRECVid protocol, data and ground truth, and the functionality provided in the available toolkit. For our method, the operating point is set by the difference threshold τ. Low values result in high recall but low precision, and vice versa. The precision-recall evaluation curves were computed by iteratively testing all possible values of the threshold τ.
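For illustration, the threshold sweep could be traced as sketched below; this simplified version matches detected and annotated boundary frames exactly, whereas the actual evaluation used the TRECVid toolkit, so all names here are our assumptions:

```python
def precision_recall_points(detections_by_tau, ground_truth):
    # detections_by_tau: dict mapping a threshold tau to the set of
    # detected boundary frames; ground_truth: set of annotated frames.
    # Exact frame matching is a simplification of the TRECVid protocol.
    points = []
    for tau, detected in sorted(detections_by_tau.items()):
        tp = len(detected & ground_truth)
        precision = tp / max(len(detected), 1)
        recall = tp / max(len(ground_truth), 1)
        points.append((tau, precision, recall))
    return points
```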
4.1 Optimal Codebook Size
The size of the codebook (the number of clusters in k-means) is one of the computational bottlenecks. In object classification, codebook sizes vary between 1,000 and 100,000, but in our case a codebook as small as possible is preferred. The precision-recall curves for our method with varying codebook sizes are shown in Fig. 1. It is evident that boundary detection is a low-level task which requires only moderate discrimination power from the codebook. As few as 100 codes performed very well, and increasing the size did not improve the results. The method started to collapse with codebooks smaller than 10.

[Figure 1: TRECVid precision-recall curves using our BoW method and different codebook sizes.]
4.2 General vs. Specific Codebook
Based on the previous experiment, a generated codebook of size 100 performs well in shot boundary detection. Such a codebook does not produce a significant computational burden, since the local features need to be extracted anyway. However, this could be improved further if a general codebook performed well, since the codebook generation step could then be omitted completely. A general codebook was generated using the two million ImageNet features used in (Deng et al., 2010). The shot boundary detection results and a comparison to a specific codebook (of size 100) are shown in Fig. 2. The results provide clear evidence that the general codebook does not perform well in this application, and changing the codebook size does not help. This result is quite surprising from the object class detection point of view, but for shot boundary detection, video-specific codebooks should be used.
[Figure 2: TRECVid results for general and specific codebooks (right: general codebook results with varying size).]
4.3 Baseline Comparison
In the last experiment, we compared our method to the tailored colour histogram method in (Tahaghoghi et al., 2005). We also investigated the complementarity of the two, colour histograms and BoW histograms. This was achieved by first running both methods and then combining their output binary vectors (1's denoting a cut and 0's no cut). For the combination, the logical AND and OR rules were tested; a sketch is given below. The AND rule should mainly improve precision and the OR rule recall. The results are shown in Fig. 3. For the combinations, we tested all possible combinations of the thresholds τ_BoW and τ_RGB and, for each recall point, selected the highest precision. Our BoW method clearly outperforms the baseline method using colour histograms. However, it is evident that combining the two still improves detection remarkably.
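A minimal sketch of the combination step, under the assumption that both detectors emit per-frame binary vectors (all names are ours):

```python
import numpy as np

def combine_detections(v_bow, v_rgb, rule="and"):
    # v_bow, v_rgb: per-frame binary vectors (1 = cut, 0 = no cut).
    # AND should mainly improve precision, OR should improve recall.
    v_bow = np.asarray(v_bow, dtype=bool)
    v_rgb = np.asarray(v_rgb, dtype=bool)
    combined = (v_bow & v_rgb) if rule == "and" else (v_bow | v_rgb)
    return combined.astype(int)
```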
[Figure 3: TRECVid comparison with the baseline method and with the hybrid of the two methods.]
5 CONCLUSIONS
In this work, we adopted visual bag-of-words (BoW), a popular approach for object class detection, for the low-level video processing task of video shot boundary detection. To the authors' best knowledge, our work is the first to use the BoW approach for video shot boundary detection. We utilised the available efficient implementations, and our method, which has equal complexity in terms of the number of parameters, achieved clearly superior performance compared to the baseline. This is an interesting result, since the baseline (colour histogram difference) is at the core of many top performing methods. Our method runs at half the frame rate on standard PC hardware without special optimisation. Moreover, our results showed that the two, BoW feature histograms and colour histograms, provide complementary information, and their combination achieved the best performance. In future work, we will investigate other low-level video processing tasks using the BoW approach and optimise our implementation to run at least at frame rate.
REFERENCES
Cao, Y., Wang, C., Li, Z., Zhang, L., and Zhang, L. (2010). Spatial bag-of-features. In CVPR.
Csurka, G., Dance, C., Willamowski, J., Fan, L., and Bray, C. (2004). Visual categorization with bags of keypoints. In ECCV Workshop on Statistical Learning in Computer Vision.
Deng, J., Berg, A., Li, K., and Fei-Fei, L. (2010). What does classifying more than 10,000 image categories tell us? In ECCV.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. (2011). The PASCAL Visual Object Classes Challenge 2011 (VOC2011) Results. http://www.pascal-network.org/challenges/VOC/voc2011/workshop/index.html.
Gargi, U., Kasturi, R., and Strayer, S. H. (2000). Performance characterization of video-shot-change detection methods. IEEE Trans. Circuits Syst. Video Techn., 10(1):1-13.
Joyce, R. A. and Liu, B. (2006). Temporal segmentation of video using frame and histogram space. IEEE Transactions on Multimedia, 8(1):130-140.
Kang, H.-W. and Hua, X.-S. (2005). To learn representativeness of video frames. In ACM International Conference on Multimedia.
Lazebnik, S., Schmid, C., and Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR.
Leibe, B., Ettlin, A., and Schiele, B. (2008). Learning semantic object parts for object categorization. Image and Vision Computing, 26(1):15-26.
Li, J., Ding, Y., Shi, Y., and Li, W. (2010). A divide-and-rule scheme for shot boundary detection based on SIFT. Int. J. of Digital Content Technology and Its Applications, 4(3).
Mas, J. and Fernandez, G. (2003). Video shot boundary detection based on color histogram. In TRECVid Workshop.
Pruteanu-Malinici, I. and Carin, L. (2008). Infinite hidden Markov models for unusual-event detection in video. IEEE Trans. on Image Processing, 17(5):811-822.
Sivic, J. and Zisserman, A. (2003). Video Google: A text retrieval approach to object matching in videos. In ICCV.
Smeaton, A., Over, P., and Doherty, A. (2010). Video shot boundary detection: Seven years of TRECVid activity. Computer Vision and Image Understanding, 114:411-418.
Tahaghoghi, S., Williams, H., Thom, J., and Volkmer, T. (2005). Video cut detection using frame windows. In Australasian Computer Science Conference.
Truong, B. and Venkatesh, S. (2007). Video abstraction: A systematic review and classification. ACM Trans. on Multimedia Computing, Communications and Applications (ACM TOMCCAP), 3(1).
Tuytelaars, T., Lampert, C., Blaschko, M., and Buntine, W. (2010). Unsupervised object discovery: A comparison. Int. J. Comput. Vis., 88(2).
VideoShotBoundaryDetectionusingVisualBag-of-Words
791