Automatic Attendance Rating of Movie Content using
Bag of Audio Words Representation
Avi Bleiweiss
Architecture Group, Intel Corporation, Santa Clara, U.S.A.
Keywords:
MFCC, Vector Quantization, Bag of Words, Support Vector Machines, Ranked Information Retrieval.
Abstract:
The sensory experience of watching a movie links input from both sight and hearing modalities. Yet traditionally, the motion picture rating system largely relies on the visual content of the film to inform its decisions to parents. The current rating process is fairly elaborate. It requires a group of parents to attend a full screening, manually prepare and submit their opinions, and vote on the appropriate audience age for viewing. In contrast, our work explores the feasibility of classifying the age attendance of a movie automatically, by solely analyzing the movie auditory data. Our high performance software records the audio content of the shorter movie trailer, and builds a labeled training set of original and artificially distorted clips. We use a bag of audio words to effectively represent the film sound track, and demonstrate robust and closely correlated classification accuracy, exploiting both boolean discrimination and ranked retrieval methods.
1 INTRODUCTION
Classifying multimedia scenes into semantic cate-
gories is considered a central information retrieval
(IR) problem, and has attracted great interest in both research and practice. This is primarily motivated by the ever increasing scale of online content-based data, both visual and auditory, becoming freely available. For movies, the Internet Movie Database (IMDb, 1990) is the most comprehensive and authoritative source, providing millions of titles in its repository, publicly accessible to users in the form of full length films, trailers and clips. Of greatest relevance to our work is the IMDb high quality trailer gallery that
is kept current with both upcoming and recent re-
leases. Each trailer is believed to faithfully capture
the essence of the fully featured film, by portraying
a short plot preview with key scenes, and running for one to two minutes. Moreover, many
of the trailer videos are already ground-truth labeled,
and embed the content ratings for appropriate viewing
audience. Contrary to the explicit approach that re-
quires human perception experience, in attending an
entire film screening, we propose a system that im-
plicitly learns from the vast aggregate audio content,
provided by the IMDb trailer sound tracks, to clas-
sify admissible content to an age group. Our claim
is founded on the basis that the depiction of language and violence, two of the principal content rating criteria, is more immediately identified from the trailer acoustic features. We clearly recognize the challenge involved in a learning system with no visual sensing, but contend the implicit approach is more scalable, has the potential to be less error prone, and is, arguably, more economically sound.
One of the simpler and most effective text retrieval models is the bag of words (Baeza-Yates and Ribeiro-Neto, 1999). Here, documents are represented as a histogram of words from a prescribed vocabulary; however, the model completely ignores word dependency and syntactic structure. Retrieval relevance is therefore reduced to evaluating the similarity of words in a document with words in a query. Inspired by the prospect of producing superior classification performance, researchers have evolved the model to efficiently represent large scale multimedia con-
tent. The bag of visual words formulation was suc-
cessfully adapted to scene image classification (Yang
et al., 2007), and image retrieval (Wu et al., 2009).
Similarly, the bag of audio words modality emerged
in content based audio retrieval (Chechik et al., 2008),
video copy detection (Liu et al., 2010), and a higher
level structure, bag of system words, for semantic an-
notation of musical pieces (Ellis et al., 2011). Our
system leverages established results of the model con-
vention, and expresses each of the movie trailers with
an enhanced bag of audio words representation that
fits both a discriminatory classifier like SVM, and
follows similarity calculations directly from the well known Vector Space Model (Salton et al., 1975).
The main contribution of this paper is high performance software that retrieves movie trailer sound tracks from a large scale and continually growing online audio archive, and automatically rates a film's content to an appropriate viewing age group, unlike the current manual and more involved rating process that studies every movie individually. Our feature abstraction interface provides the flexibility to utilize both discriminative and ranked information retrieval processes for classification, subscribing to a unique and robust correlation based validation of system performance. The rest of the paper is organized as follows. We provide a brief overview of the movie rating system in Section 2, and describe our methodology for acoustic feature extraction, end-to-end, in Section 3. We then detail the algorithms and supporting theory of our multi modal classification approach in Section 4. We analyze quantitative results of our experiments to demonstrate rating effectiveness in Section 5, and conclude with a short summary and future prospect remarks in Section 6.
2 MOVIE RATING SYSTEM
The movie rating system is governed by an independent body, comprised of parents, with the purpose of giving advance warnings to parents so that they can make informed decisions about which films their children see. The movie rating system evolved both as a useful and valued tool for parents, and also became an essential guardian of the freedom of artistic, creative and political expression. In the United States,
the Motion Picture Association of America (MPAA,
1922), through the Classification and Rating Admin-
istration (CARA, 1968), issues ratings for the movies.
The modern system was instituted in late 1968 and is
entirely voluntary. However, most major studios have
agreed to submit all titles for rating prior to theatri-
cal release. Correspondingly, most movie theater chains avoid showing unrated domestic films. After screening a film, the personal opinions of the group of parents in attendance are used to arrive at one of five film ratings (Table 1). For some ratings, the MPAA adds a brief explanation as to why a particular film received a certain rating. By convention, most film trailers will
have the MPAA rating right at the beginning, and a
fully featured film will have the MPAA logo at the
end of the closing credits.
The motion picture rating system classifies the content of a film primarily in terms of its depiction of mature themes, language, violence, and adult activities. For a General Audiences (G) rated movie,
all ages are admitted. The movie contains nothing to
offend parents for viewing by children. A Parental
Guidance Suggested (PG) film class identifies some
material that is inappropriate for children. Parents are
urged to give guidance to their children, before let-
ting them view the film. Parents Strongly Cautioned
(PG-13) rating implies some material is unsuitable for
children under 13. A PG-13 rank holds more of a se-
rious warning to parents to determine whether their
children should attend this motion picture. The Re-
stricted (R) rating indicates that children under 17 are
not allowed to attend, unless they are accompanied by
a parent or an adult guardian. Finally, the NC-17 rating is considered by most parents too adult oriented, and no one 17 and under is admitted. NC-17 rated movies are rare and have little to no success at the box
office. Often, studios and distributors edit the NC-17
film to qualify for an R rating. Accordingly, the NC-
17 movie rating is outside the scope of our study. If a
film is not submitted for rating or is an uncut version
of a film that was submitted, the label NR (Not Rated)
is often used.
3 ACOUSTIC FEATURES
Our basic framework for extracting the bag of audio words from a movie trailer proceeds in several stages. First, the sound track of its original digital audio is converted into a high quality WAV file. Next, the captured signal is divided into overlapping, short time segments, and primitive audio feature vectors are derived from each. This is followed by a vector quantization (VQ) (Gersho and Gray, 1992) process that maps the original acoustic feature vectors into a set of k clusters. The collection of clusters constitutes a vocabulary, and each cluster then corresponds to a unique audio word, or term. Effectively, this reduces the trailer representation to a highly compressed, single word vector of dimensionality k, with each element retaining a count of audio word occurrences in a WAV file.
Table 1: United States motion picture attendance ratings.
Rating Description
G General Audiences
PG Parental Guidance Suggested
PG-13 Parents Strongly Cautioned
R Restricted
NC-17 No One 17 and Under Admitted
AutomaticAttendanceRatingofMovieContentusingBagofAudioWordsRepresentation
143
3.1 Audio Feature Extraction
Models for human perception of sound are based upon frequency analysis performed in the inner ear (Rabiner and Schafer, 2007). The cepstrum, the power spectrum of the logarithm of the spectrum of a speech signal, was found to be an effective indicator for pitch detection (Noll, 1964). On this basis, Davis and Mermelstein (Davis and Mermelstein, 1980) formulated the mel-frequency cepstrum coefficients (MFCC), which represent the short term power spectrum of a sound and capture the nonlinearity of human hearing. MFCC is one of the more compact acoustic feature representations, and is widely used in appli-
cation domains that include automatic speech recog-
nition, speaker identification, audio-video misalign-
ment detection (Perelygin and Jones, 2011), and mul-
timedia event classification (Pancoast and Akbacak,
2012). In our work, we use MFCC features with the
parametric representation of the Fourier transformed
cepstrum, derived based on a mel scale. The subjective, nonlinear mel scale is defined in Equation 1, where f is the linear frequency in Hz:

\[ f_{mel} = 1125 \ln\left(1 + \frac{f}{700}\right) \tag{1} \]
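For concreteness, a minimal C++ sketch of the mapping in Equation 1 is shown below; the helper names are ours, for illustration only, and are not part of our library interface.

```cpp
#include <cmath>
#include <cstdio>

// Hz to mel, per Equation 1: f_mel = 1125 ln(1 + f / 700).
double hzToMel(double hz) { return 1125.0 * std::log(1.0 + hz / 700.0); }

// Inverse mapping, useful for placing triangular filters linearly in mel scale.
double melToHz(double mel) { return 700.0 * (std::exp(mel / 1125.0) - 1.0); }

int main() {
    // Example: the Nyquist frequency of a 44,100Hz recording.
    double nyquist = 22050.0;
    std::printf("%.1f Hz -> %.1f mel\n", nyquist, hzToMel(nyquist));
    return 0;
}
```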
The computational flow for extracting the MFCC
feature vector, from a discrete time speech signal, is
depicted in Algorithm 1. First, the signal for each movie trailer is pre-emphasized with α = 0.95. Then, we segment the time data into frames at a rate of 100Hz, or 10ms duration, using a Hamming window with 50% overlap. Our movie clips are consistently recorded in 16-bit mono, using a 44,100Hz sampling rate that yields 512 zero padded samples per frame. Power-of-2 frame padding warrants an efficient successive computation of the Fourier power spectrum, using the discrete Fourier transform (DFT) algorithm. Then, we apply a set of 48 triangular filters, spaced linearly in mel scale, on the spectrum obtained previously, and compute the logarithmic filter bank energy coefficients. Finally, mel cepstral coefficients are
derived, by employing the discrete cosine transform
(DCT) on the mel bank, energy spectra. The features
extracted for each frame, consist of the standard 12
MFCCs, as well as the log energy of the frame, result-
ing in a 13-dimensional feature vector. In our exper-
iments, trailer sound tracks are recorded for approx-
imately one minute, generating about 12,000 MFCC
vectors on average, per WAV file.
Algorithm 1: MFCC.
  Input: time data s, sample rate fs, filter bank fn
  x = pre-emphasize(s)
  F = frames(x)
  for f in F do
    w = hamming(f)
    spectrum = DFT(w)
    melspectra = log10(melbank(spectrum, fs, fn))
    melcepstral = DCT(melspectra)
    append the log energy of f to melcepstral
  end for

3.2 Bag of Audio Words

The VQ step performs clustering in the 13-dimensional MFCC space. We apply K-means
(Lloyd, 1982) to a set of MFCC features, extracted
from each movie trailer, and identify k dense regions
that collectively constitute the audio words. For ev-
ery MFCC vector, K-means computes the nearest Eu-
clidean distance to an iterated cluster centroid, and
assigns to each MFCC feature, a cluster index in
{1,2,...,k}. Then, a sound track is mapped to a k-dimensional vector $[f_1, f_2, \ldots, f_k]$ that encodes the frequency of each audio word, or term, $f_t$, in the trailer. This histogram representation makes equivalent segments that are acoustically similar but whose MFCC vectors vary slightly. Figure 1 depicts the audio word
histogram of four movie trailers, one for each of the
MPAA ratings. The choice of the parameter k is an
important system performance trade-off. Large k in-
creases computation time and results in a more dis-
criminative model. On the other hand, while more ef-
ficient and of higher vocabulary consistency, a small
set of clusters is less separable. In our evaluation,
we explore the parameter k in a wide range of 100 to
3000, and study the vocabulary size impact on movie
rating accuracy.
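As an illustrative fragment, the quantization and histogram construction just described might be sketched as follows, assuming the k cluster centroids have already been learned by K-means; the container and function names are ours.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

using Mfcc = std::vector<double>;  // one 13-dimensional MFCC feature vector

// Index of the centroid nearest to v in Euclidean distance.
std::size_t nearestCentroid(const Mfcc& v, const std::vector<Mfcc>& centroids) {
    std::size_t best = 0;
    double bestDist = std::numeric_limits<double>::max();
    for (std::size_t c = 0; c < centroids.size(); ++c) {
        double dist = 0.0;
        for (std::size_t d = 0; d < v.size(); ++d) {
            double diff = v[d] - centroids[c][d];
            dist += diff * diff;
        }
        if (dist < bestDist) { bestDist = dist; best = c; }
    }
    return best;
}

// Bag of audio words: a k-dimensional count vector of audio word occurrences.
std::vector<int> audioWordHistogram(const std::vector<Mfcc>& features,
                                    const std::vector<Mfcc>& centroids) {
    std::vector<int> histogram(centroids.size(), 0);
    for (const Mfcc& v : features)
        ++histogram[nearestCentroid(v, centroids)];
    return histogram;
}
```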
In a bag of audio words model, the order of the features in a movie sound track m is ignored. Rather, each trailer audio content is represented as a count vector in $\mathbb{N}^{|V|}$, where $|V|$ is the total number of words in the vocabulary. Borrowing from IR, we define the term frequency, $tf_{t,m}$, of term t in a movie feature vector m, as the number of times t occurs in m. The relevance of a movie to a query, however, does not increase proportionally with the term frequency. Hence, to ensure a less than linear growth of the matching score, IR introduces instead a log-frequency weighting, which we label $w_{t,m}$:

\[ w_{t,m} = \begin{cases} 1 + \log_{10} tf_{t,m}, & \text{if } tf_{t,m} > 0 \\ 0, & \text{otherwise} \end{cases} \tag{2} \]
Figure 1: Audio word histogram for a sample trailer from each movie rating class. G (Aladdin), PG (Animals United) and PG-13 (10 Years) are fairly distinct, while R (A Late Quartet) is more pronounced at high word indices. Shown for a 1000-cluster configuration, with the number of occurrences plotted on a logarithmic scale.

Similar to stop words in IR, frequent MFCC features in a trailer auditory sequence are less informative than rare terms. To capture this notion, we introduce the movie frequency weight, $mf_t$, which amounts to the number of trailers that contain the term t; $mf_t$ is therefore an inverse measure of the informativeness of t. We then formally define the inverse movie frequency, $imf$, of t as
\[ imf_t = \log_{10}(N / mf_t), \tag{3} \]

with N the size of our trailer training set. Finally, analogous to the IR tf.idf scheme, the term and inverse movie frequency measures are combined to yield the tf.imf weighting for a given term, which is the product of the tf and imf weights:

\[ w_{t,m} = (1 + \log_{10} tf_{t,m}) \cdot \log_{10}(N / mf_t). \tag{4} \]

tf.imf increases with the number of feature occurrences within a trailer acoustic model, and also with the rarity of a feature in the trailer training collection. Each movie trailer auditory content is now represented by a real-valued vector of tf.imf weights in $\mathbb{R}^{|V|}$. Either an unnormalized or a normalized version of the weight vector is passed on to a classifier of choice.
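A minimal sketch of the weighting of Equation 4, assuming the raw count vector and the per-term movie frequencies are already in hand, could read as follows; the function name is ours, for illustration only.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Turn a raw audio word count vector into tf.imf weights (Equation 4):
// w = (1 + log10 tf) * log10(N / mf), with zero weight for absent terms.
std::vector<double> tfImfWeights(const std::vector<int>& counts,
                                 const std::vector<int>& movieFrequency,
                                 int numTrailers) {
    std::vector<double> weights(counts.size(), 0.0);
    for (std::size_t t = 0; t < counts.size(); ++t) {
        if (counts[t] > 0 && movieFrequency[t] > 0) {
            double tf = 1.0 + std::log10(static_cast<double>(counts[t]));
            double imf = std::log10(static_cast<double>(numTrailers) / movieFrequency[t]);
            weights[t] = tf * imf;
        }
    }
    return weights;
}
```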
4 CROSS CLASSIFICATION
The task of rating movie content constitutes a multi class classification problem. Decomposing the problem into a series of two-class, binary subproblems, using a one-against-all discriminative process, is one intuitive path of implementation. On the other hand, with a generic and orthogonal bag of audio words rep-
resentation, the problem can be viewed as a search
into a dataset of trailers that returns a ranked list of
the most relevant, auditory content, to match a sound
track query. Our design incorporates both boolean
and indexed classification modalities, using support
vector machines (SVM) and ranked information re-
trieval (RIR) methods, respectively. This cross clas-
sification formulation, merits a unique correlation di-
mension that ensures a more robust validation of our
system rating performance.
4.1 Support Vector Machines
The bag of words feature representation lends itself well to a discriminative classifier, such as support vector machines (SVM). SVM effectively models non-
linear decision boundaries, by using a kernel func-
tion. Respectively, we pass the vectors of weights,
extracted from the auditory content of the movie trail-
ers, for both training, and testing the learned model
with new examples. For our classifier, we selected
SVM-Light (Joachims, 1999), owing to its robust,
large scale SVM training, and implemented a C++
wrapper on top, to seamlessly communicate with our
movie rating, software components. Both the linear
and radial basis function (RBF) kernels are studied in
our work, specifically to compare rating performance
impact, as a function of varying vocabulary size.
With four movie rating classes, we train four SVM
models, each separating one group of trailers from
the rest. The i-th SVM trains one of G, PG, PG-
AutomaticAttendanceRatingofMovieContentusingBagofAudioWordsRepresentation
145
13, and R class of trailers, all labeled as ground-truth
true, and the remaining movie rating classes are la-
beled false. At the classification step, an unlabeled
test trailer is assigned to a rating class that produces
the largest value of hyper-plane distance, in feature
space. Implementation wise, our software controls
the one-against-all outer loop, invoking SVM-Light
repeatedly four times, once for each rating class vs.
rest configuration.
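The one-against-all decision step can be sketched as below. The scoring abstraction stands in for the trained SVM-Light models and is not the SVM-Light API; the class with the largest hyper-plane distance wins.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// A trained binary model is represented here only by its scoring function:
// given a tf.imf weight vector, it returns the signed hyper-plane distance.
using BinaryScorer = std::function<double(const std::vector<double>&)>;

// One-against-all classification: the i-th scorer separates rating class i
// (G, PG, PG-13, R) from the rest; the class with the largest distance wins.
int classifyOneAgainstAll(const std::vector<BinaryScorer>& scorers,
                          const std::vector<double>& weights) {
    int best = 0;
    double bestScore = scorers[0](weights);
    for (std::size_t i = 1; i < scorers.size(); ++i) {
        double score = scorers[i](weights);
        if (score > bestScore) { bestScore = score; best = static_cast<int>(i); }
    }
    return best;
}
```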
4.2 Ranked Information Retrieval
The Vector Space Model is the model most commonly used in IR. It ranks more relevant documents higher than less relevant ones, with respect to a query that is comprised of a collection of terms. Both the
query and the documents are represented as vectors
in the space, and documents are ranked based on their
proximity to the query. Proximity is the similarity
measure of two vectors, and is roughly a function of
the inverse of the distance between them. The notion, adopted by the IR community, of ranking documents by the cosine of the angle between the query and the document vectors, stems from the cosine function being monotonically decreasing in the interval [0°, 180°].
A vector is length normalized by dividing each of its weighting components by its length. For normalization, IR uses the L2-norm, expressed as $\lVert \vec{x} \rVert_2 = \sqrt{\sum_i x_i^2}$. Dividing a vector by its L2-norm makes it a unit vector, and thus short and long trailer sequences of scaled terms now have comparable weights. We define cosine similarity as the dot product of the query sound track vector, $\vec{q}$, and a training trailer audio vector, $\vec{m}$, both length normalized:

\[ \cos(\vec{q}, \vec{m}) = \vec{q} \cdot \vec{m} = \sum_{i=1}^{|V|} q_i m_i. \tag{5} \]
We compute the cosine similarity score for the
query trailer and each of the training trailers in our
dataset. Training trailers, with respect to the query,
are ranked for relevancy by their score, and the top M (M = 10) are returned for further analysis.
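A minimal sketch of this ranking step, under the assumption that query and training vectors carry tf.imf weights as described above, is shown below; the container and function names are ours.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// L2-normalize a weight vector so that cosine similarity reduces to a dot product.
std::vector<double> l2Normalized(std::vector<double> v) {
    double norm = 0.0;
    for (double x : v) norm += x * x;
    norm = std::sqrt(norm);
    if (norm > 0.0)
        for (double& x : v) x /= norm;
    return v;
}

// Cosine similarity of two length-normalized vectors (Equation 5).
double cosineSimilarity(const std::vector<double>& q, const std::vector<double>& m) {
    double dot = 0.0;
    for (std::size_t i = 0; i < q.size(); ++i) dot += q[i] * m[i];
    return dot;
}

// Rank training trailers against a query and return the indices of the top M.
std::vector<std::size_t> topRanked(const std::vector<double>& query,
                                   const std::vector<std::vector<double>>& trainers,
                                   std::size_t M = 10) {
    std::vector<double> q = l2Normalized(query);
    std::vector<std::pair<double, std::size_t>> scored;
    for (std::size_t i = 0; i < trainers.size(); ++i)
        scored.emplace_back(cosineSimilarity(q, l2Normalized(trainers[i])), i);
    std::sort(scored.begin(), scored.end(),
              [](const std::pair<double, std::size_t>& a,
                 const std::pair<double, std::size_t>& b) { return a.first > b.first; });
    std::vector<std::size_t> top;
    for (std::size_t i = 0; i < std::min(M, scored.size()); ++i)
        top.push_back(scored[i].second);
    return top;
}
```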
5 EMPIRICAL EVALUATION
To evaluate our system in practice, we have imple-
mented a Direct2D audio application that reads raw
WAV files, splits each into a property header and a data section, and loads the normalized time signal and its sampling rate parameter into our movie rating C++ li-
brary. Our library operates on the raw audio, com-
mences feature extraction followed by vector quan-
tization, and performs discrimination and similarity
calculations. We use the hold out method with cross validation to rank the performance of our system. Formally, our library sets up one of random or 10-fold resampling modes, and each rating class becomes a two-way data split of trailers, with train and test sets owning 80/20 percent shares, respectively.
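The random 80/20 hold out split per rating class can be sketched as follows (a simplified fragment; the 10-fold resampling mode mentioned above is not shown, and the function name is ours).

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Split the trailer indices of one rating class into train and test sets,
// owning 80 and 20 percent shares, respectively.
void holdOutSplit(std::size_t numTrailers, unsigned seed,
                  std::vector<std::size_t>& train, std::vector<std::size_t>& test) {
    std::vector<std::size_t> indices(numTrailers);
    for (std::size_t i = 0; i < numTrailers; ++i) indices[i] = i;
    std::mt19937 rng(seed);
    std::shuffle(indices.begin(), indices.end(), rng);
    std::size_t trainCount = (numTrailers * 8) / 10;
    train.assign(indices.begin(), indices.begin() + trainCount);
    test.assign(indices.begin() + trainCount, indices.end());
}
```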
5.1 Experimental Setup
Aimed at matching movie content to audience age, we assess the productivity of the bag of audio words representation. We build a labeled base set of 25 trailers for each of the G, PG, PG-13, and R ratings. The base set is drawn randomly from previously rated movie trailers off IMDb (IMDb, 1990). Our collection incorporates more recent and modern titles that are readily accessible online, but also subscribes to a fair share of productions that span over two decades of movie making. Each high quality, 16-bit mono, 44.1KHz WAV file, produced from a minute long recording, is about 5MB in size, setting the base set to a total of 100 minutes of footage and a combined size of 0.5GB. With
an average of nearly 12,000 MFCC feature vectors,
extracted from the audio sequence of a base trailer,
our system processes a total of 1.2 million features.
We then apply signal distortion artificially to each of the labeled recordings, and augment our learning data set by a factor of ten, to yield an aggregate of 250 auditory samples per rating class. This leads to a combined 1000-trailer system set that subscribes to 5GB of footage of both physical and virtual audio. Rather than deforming the source signal (Riley et al., 2008), a step that is computationally intensive, we obtain an identical effect by simply perturbing the histogram of the base word vector, and randomly modulating the term frequency of a word in the interval [−5%, +5%]. This slightly warped version of a word vector conforms to both discriminatory and similarity margins, to rule rating class inclusion.
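The augmentation step can be sketched as below: each count in the base histogram is randomly modulated within ±5%. This is a rough illustrative fragment; the rounding and random number details are our assumptions and may differ from the library's exact scheme.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Produce one artificially distorted copy of a base audio word histogram by
// randomly modulating each term frequency within +/-5% of its original count.
std::vector<int> perturbHistogram(const std::vector<int>& base, std::mt19937& rng) {
    std::uniform_real_distribution<double> jitter(-0.05, 0.05);
    std::vector<int> distorted(base.size());
    for (std::size_t t = 0; t < base.size(); ++t) {
        double scaled = base[t] * (1.0 + jitter(rng));
        distorted[t] = scaled < 0.0 ? 0 : static_cast<int>(scaled + 0.5);  // round to a count
    }
    return distorted;
}

// Augment a labeled base histogram by a user supplied amplification factor (ten in our study).
std::vector<std::vector<int>> augment(const std::vector<int>& base, int factor,
                                      unsigned seed = 1) {
    std::mt19937 rng(seed);
    std::vector<std::vector<int>> samples;
    for (int i = 0; i < factor; ++i)
        samples.push_back(perturbHistogram(base, rng));
    return samples;
}
```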
Scalability to a dynamically grown, synthetic auditory data set, for attaining higher classification performance, is a principal guideline in the design of our
proposed system. Here we discuss three implementa-
tion considerations that ensure the efficiency of ma-
jor computational steps. First, the task of process-
ing over a million feature vectors, is a critical com-
pute section in our implementation. We find the exe-
cution of feature extraction and vector quantization,
in sequentially iterating the base set, prohibitively
inefficient. Rather, we exploit concurrency, using
the latest C++11 futures and asynchronous launching
methodology. In parallel processing independent au-
SIGMAP2013-InternationalConferenceonSignalProcessingandMultimediaApplications
146
Table 2: Vocabulary statistical data, collected for a discrete set of cluster counts, and for each movie rating.
Clusters  Rating  Avg. MFCC Vectors  Max tf(t,m)  Median tf(t,m)  Mean tf(t,m)  Std. Dev. tf(t,m)  Avg. Single Word Clusters
100 G 11,679 4292 53.20 116.79 192.82 3.04
PG 12,045 3687 49.60 120.45 201.26 3.48
PG-13 11,507 2953 49.48 115.08 198.84 3.00
R 11,955 3740 51.84 119.56 219.08 1.60
500 G 11,679 2256 12.00 23.36 39.82 42.88
PG 12,045 2916 12.00 24.09 42.48 43.48
PG-13 11,507 754 12.16 23.02 34.21 31.68
R 11,955 1860 11.28 23.91 43.84 40.52
1000 G 11,679 1579 6.80 11.68 19.26 129.76
PG 12,045 1162 6.60 12.05 19.72 131.80
PG-13 11,507 435 6.88 11.51 15.60 109.00
R 11,955 809 5.84 11.96 21.73 154.56
1500 G 11,679 626 4.80 7.79 11.10 248.60
PG 12,045 1162 4.64 8.03 13.00 266.04
PG-13 11,507 357 4.84 7.67 9.90 238.80
R 11,955 450 3.84 7.97 14.71 331.40
2000 G 11,679 554 3.76 5.84 7.63 408.16
PG 12,045 1159 3.68 6.02 9.47 442.00
PG-13 11,507 357 3.56 5.75 7.32 408.32
R 11,955 440 3.12 5.98 10.65 552.20
2500 G 11,679 554 3.08 4.67 5.92 624.16
PG 12,045 579 2.92 4.81 6.35 649.08
PG-13 11,507 357 3.04 4.60 5.68 621.88
R 11,955 359 2.52 4.78 7.96 792.32
3000 G 11,679 554 2.60 3.89 4.90 886.92
PG 12,045 578 2.64 4.02 5.13 876.96
PG-13 11,507 357 2.56 3.83 4.60 884.60
R 11,955 359 2.32 3.99 6.03 1038.96
dio streams, and generating MFCC vectors, followed by constructing a histogram of words for each, we achieved a close to linear performance gain of about 3.5X, compared to the serial implementation, on a four core, 2nd generation, 2.8GHz Intel Core processor. Second, the automatic construction of syntheti-
cally distorted histogram vectors at runtime, not only
bypasses the extensive MFCC processing of a source
audio signal, it furthermore strips down memory foot-
print for temporary buffers, substantially. More im-
portantly, the automatic augmenting of our auditory
data set, from a relatively small, manual trailer pro-
duction set, adds flexibility to gracefully enhance the
performance of our rating system. Note the data am-
plification factor, set in our study to ten, is a system
controlled parameter, supplied by the user. Third, the computation of the inverse movie frequency (imf) weighting is independent of any query vector, and is therefore computed only once in our system, at training time. We use the well established and more efficient, standard IR weighting scheme, lnc.ltc, with imf computed for the test vectors but set to one for the entire training partition.
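The concurrency pattern described above follows the C++11 futures model; a simplified sketch launches one asynchronous task per trailer and collects the word histograms. The extractor callback stands in for our MFCC and VQ pipeline and is an illustrative placeholder.

```cpp
#include <functional>
#include <future>
#include <string>
#include <vector>

// Per-trailer pipeline, passed in by the caller: read the WAV file, extract
// MFCC vectors, quantize against the vocabulary, return the word histogram.
using HistogramExtractor = std::function<std::vector<int>(const std::string&)>;

// Process independent audio streams concurrently with C++11 futures,
// launching one asynchronous task per trailer and collecting the results.
std::vector<std::vector<int>> extractAll(const std::vector<std::string>& wavPaths,
                                         const HistogramExtractor& extract) {
    std::vector<std::future<std::vector<int>>> futures;
    for (const std::string& path : wavPaths)
        futures.push_back(std::async(std::launch::async, extract, path));
    std::vector<std::vector<int>> histograms;
    for (auto& f : futures)
        histograms.push_back(f.get());
    return histograms;
}
```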
5.2 Experimental Results
To understand how terms are distributed across our
collection of base and distorted auditory content, we
use Zipf's law. The law states that the collection frequency $cf_i$ of the $i$-th most common term is proportional to $1/i$. For vocabularies of different sizes, we
plot the frequency of an audio word against its fre-
quency rank in Figure 2, on a log-log scale. A straight line with a slope of −1, corresponding to the ideal Zipf function, is also depicted for reference (Ref). Our data is shown to consistently fit the law, with the exception of the extremely low frequency terms. This is likely a side effect of our K-means implementation producing rare words, which we attribute to sparse, noisy audio frames. The slopes of the vocabulary curves are, however, less steep than the line predicted by the law, indicating a more even distribution of audio
AutomaticAttendanceRatingofMovieContentusingBagofAudioWordsRepresentation
147
words. Table 2 further provides broader quantitative statistics on term weights, for each vocabulary size we use in our study, per movie rating. An interesting observation is the outstandingly larger count of average single word clusters, indicative of content sparseness, for the movie rating R, specifically for richer vocabularies.
Figure 2: Zipf law distribution as a function of increased
vocabulary size, plotted in a log-log scale.
We then study the impact of parametrically vary-
ing the bag of audio word representation, on our
movie rating performance. Our multi modal classi-
fication methodology, evaluates system performance,
for each of G, PG, PG-13, and R attendance classes,
as a function of increased vocabulary size. For SVM,
we use the F1-Score measure, and compare linear to
RBF kernels, reporting our best results in Figures 3
and 4, respectively. Clearly, the vocabulary size greatly affects the classification score; however, the behavior is not necessarily monotonic. Performance
rises first, might reach an optimal peak, then either
remains flat or declines and rises again to reach a sim-
ilar, or an extended local maximum. G, PG, and PG-
13 ratings, are strongly grouped and follow a simi-
lar path trend, to closely intersect at an F1-Score of
0.67, for an optimal vocabulary size of 2000. How-
ever, the R class performs distinctively different, with
an average classification rate of 0.575, peaking at a
lexicon size of 3000 words, for a 0.64 score. We con-
tend this one-from-rest scoring disparity is fairly ex-
plicable. The R sound track contains only sparsely
spread, unique audio content to be classified as a re-
stricted theme, a claim further supported statistically
in Table 2. But the remainder of its content is a mix-
ture of audio, relevant in the other ratings, thus mak-
ing R less discriminatory. For the RBF kernel, we
observe better performance for small vocabularies of
highly correlated words, when compared to the linear
kernel. But for larger vocabularies of linearly separa-
ble words, the RBF advantage is less obvious. More
compelling for RBF, is the even tighter lumping of
the G, PG, and PG-13 rating functions, extended to
small vocabularies, and more importantly, the R class
is further aligned with the rest.
Figure 3: Linear SVM: F1-Score classification performance
as a function of increased vocabulary size, shown for each
movie rating class.
Figure 4: RBF SVM: F1-Score classification performance
as a function of increased vocabulary size, shown for each
movie rating class.
For RIR, a baseline method of finding nearest neighbors using cosine distance depicted incommensurate and inconsistent, spiky performance. Rather, we use the mean average precision (MAP) measure, which is sensitive to the entire ranking of a set of queries. A query is a trailer member of our held out test data partition. First, we compute the average precision (AP) score for an individual query. Each time one of the top ranked training trailers, by way of similarity, is relevant, and hence matches the query label, we accumulate its precision score at the current, non-interpolated recall, and average out the scores. Seeking the performance of both common and rare terms in our bag of audio words representation, we weigh each query equally when computing the arithmetic average of all the query APs, to yield the desired MAP measure. Figure 5 depicts our sys-
tem MAP as a function of increased vocabulary size.
The G rating function rises almost monotonically, and
flattens out at an optimal lexicon size of 2000 words,
with an exceptionally high MAP score of 0.995. For
the remaining rating classes, the score changes more
mildly. PG peaks at a MAP of 0.93 and a 1000 word
vocabulary, whereas PG-13 and R follow an almost
identical performance path, tracing along a 0.92 MAP
score. Figure 6 further illustrates the precision-recall
SIGMAP2013-InternationalConferenceonSignalProcessingandMultimediaApplications
148
curve for each rating, at a fixed vocabulary size of
100. For each of the precision-recall curves, we report
commensurate area-under-curve (AUC) measures, in
Figure 7, that closely parallel our MAP figures.
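For reference, the non-interpolated AP and MAP measures described above can be sketched as below, assuming the per-query ranked relevance flags have already been computed; AP is averaged over the relevant hits retrieved, as in our description, and the function names are ours.

```cpp
#include <cstddef>
#include <vector>

// Non-interpolated average precision for one query: relevant[i] is true when
// the i-th ranked training trailer matches the query's rating label.
double averagePrecision(const std::vector<bool>& relevant) {
    double hits = 0.0, sum = 0.0;
    for (std::size_t i = 0; i < relevant.size(); ++i) {
        if (relevant[i]) {
            hits += 1.0;
            sum += hits / static_cast<double>(i + 1);  // precision at this recall point
        }
    }
    return hits > 0.0 ? sum / hits : 0.0;
}

// Mean average precision: each query is weighted equally.
double meanAveragePrecision(const std::vector<std::vector<bool>>& perQueryRelevance) {
    if (perQueryRelevance.empty()) return 0.0;
    double total = 0.0;
    for (const std::vector<bool>& relevant : perQueryRelevance)
        total += averagePrecision(relevant);
    return total / perQueryRelevance.size();
}
```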
Figure 5: Ranked Retrieval: Mean average precision (MAP)
classification performance as a function of increased vocab-
ulary size, shown for each movie rating class.
Figure 6: Ranked Retrieval: Precision-Recall curves for vo-
cabulary size of 100 audio words, shown for each movie
rating class.
Figure 7: Ranked Retrieval: area-under-curve (AUC)
measures, drawn from Precision-Recall curves (Figure 6),
shown for each movie rating class.
Owing to a common and effective weight vector representation, our research benefited markedly
from comparing classification performance of a
boolean discriminatory process against a discretized,
ranked search. While classification modalities largely
concur on performance results, there are however dif-
ferences that warrant a brief review. Notable is the tendency for each model to distinguish one boundary rating class from the rest, but in seemingly inexplicable, opposing directions of performance scores. For SVM, the R rating stands relatively less
optimal, potentially implying bounded generalization,
due to over-fitting. We suspect that by increasing the
amplification factor of our distorted, audio data set,
over-fitting can be greatly mitigated. On the other
hand, RIR attributes higher MAP performance to the
G rated movies, compared to the rest of the rating
classes. Subscribing to the kids gamut of movie con-
tent, the G theme signifies more consistent and rela-
tively flat acoustic features. Hence, similarity based
RIR, is more successful in assigning relevancy to the
majority of the top ranked, query results.
We compare our movie rating performance to the
conceptually resembling task of Multimedia Event
Detection (MED). Similarly, MED searches multimedia corpora to identify user defined events, based on pre-
computed metadata. In their state-of-the-art work,
Wang et al. (Wang et al., 2012) present their MED
approach, employing bag of words representation for
an enhanced combination of both static and dynamic
visual descriptors, and MFCC audio features. They
report an event specific MAP performance that ranges
from 0.2 to 0.7, showing a moderate rate increase in
augmenting the feature set selection. While the pos-
sible extension of our feature set merits further evalu-
ation, at first order, our system MAP scores appear to
justify a design with the sole use of audio features.
6 CONCLUSIONS
We have demonstrated the apparent potential in au-
tomating the rating process of age admittance to
movie viewing, by classifying the auditory content
of the movie sound track. Our training data set is comprised of an extremely small seed of high quality, hand recorded trailers that is artificially augmented by a dynamically grown distorted set, computed off our low dimensionality histogram vectors. This leads to efficient and scalable software that executes on a compact memory footprint. The bag of audio words representation proves effective for both discriminative and retrieval classification modalities, whereas exploiting term weighting to achieve plausible classification accuracy serves as a sound empirical basis for other challenging applications in real world audio content understanding. Optimizing our system
performance, as a function of the manual to synthetic
data division, is one vital area for future research. We
perceive our current unigram representation evolving into a more generic n-gram model, by building bigram and trigram constructs to better capture adjacent, temporal relations of audio features at the trailer level.
ACKNOWLEDGEMENTS
We would like to thank the anonymous reviewers for
their constructive and helpful feedback.
REFERENCES
Baeza-Yates, R. and Ribeiro-Neto, B., editors (1999). Mod-
ern Information Retrieval. ACM Press Series/Addison
Wesley, Essex, UK.
CARA (1968). Classification and Rating Administration.
http://www.filmratings.com/.
Chechik, G., Ie, E., Rehn, M., Bengio, S., and Lyon, R. F.
(2008). Large scale content-based audio retrieval from
text queries. In ACM International Conference on
Multimedia Information Retrieval (MIR), Vancouver,
Canada.
Davis, S. B. and Mermelstein, P. (1980). Comparison
of parametric representations for monosyllabic word
recognition in continuously spoken sentences. IEEE
Transactions on Acoustics, Speech, and Signal Pro-
cessing, 28(4):357–366.
Ellis, K., Coviello, E., and Lanckriet, G. R. (2011). Se-
mantic annotation and retrieval of music using a bag
of systems representation. In International Society for
Music Information and Retrieval Conference (ISMIR),
pages 723–728, Miami, FL.
Gersho, A. and Gray, R. M., editors (1992). Vector Quan-
tization and Signal Compression. Kluwer Academic
Publishers, Boston, MA.
IMDb (1990). Internet Movie Database.
http://www.imdb.com/.
Joachims, T. (1999). Making large-scale svm learning prac-
tical. In Advances in Kernel Methods: Support Vector
Learning, pages 169–184. MIT-Press.
Liu, Y., Zhao, W., Ngo, C., Xu, C., and Lu, H. (2010). Co-
herent bag-of audio words model for efficient large-
scale video copy detection. In Proceedings of the
ACM International Conference on Image and Video
Retrieval (CIVR), pages 89–96, Xi’an, China.
Lloyd, S. P. (1982). Least squares quantization in
pcm. IEEE Transactions on Information Theory,
28(2):129–137.
MPAA (1922). Motion Picture Association of America.
http://www.mpaa.org/.
Noll, A. M. (1964). Short-time spectrum and cepstrum tech-
niques for vocal-pitch detection. Acoustical Society
of America, 36(2):296–302.
Pancoast, S. and Akbacak, M. (2012). Bag-of-audio-words
approach for multimedia event classification. In Con-
ference of the International Speech Communication
Association, Portland, OR.
Perelygin, A. and Jones, M. R. (2011). Detecting audio-
video asynchrony. Machine Learning , Stanford,
http://cs229.stanford.edu/projects2011.html.
Rabiner, L. R. and Schafer, R. W., editors (2007). Introduc-
tion to Digital Speech Processing. Now Publishers
Inc., Hanover, MA.
Riley, M., Heinen, E., and Ghosh, J. (2008). A text re-
trieval approach to content-based audio retrieval. In
International Society for Music Information and Re-
trieval Conference (ISMIR), pages 295–300, Philadel-
phia, PA.
Salton, G., Wong, A., and Yang, C. S. (1975). A vector
space model for automatic indexing. Communication
of the ACM, 18(11):613–620.
Wang, F., Sun, Z., Zhang, D., and Ngo, C. (2012). Se-
mantic indexing and media event detection: ECNU at
TRECVID 2012. In TREC Video Retrieval Evaluation
Workshop (TRECVID), Gaithersburg, MD.
Wu, Z., Ke, Q., Sun, J., and Shum, H. Y. (2009). A multi-
sample, multi-tree approach to bag-of-words image
representation for image retrieval. In IEEE Interna-
tional Conference on Computer Vision, (ICCV), pages
1992–1999, Kyoto, Japan.
Yang, J., Jiang, Y., Hauptmann, A. G., and Ngo, C. (2007).
Evaluating a bag-of-visual-words representations in
scene classification. In ACM International Workshop
on Multimedia Information Retrieval (MIR), pages
197–206, Bavaria, Germany.
SIGMAP2013-InternationalConferenceonSignalProcessingandMultimediaApplications
150