COMBINING NOVEL ACOUSTIC FEATURES USING SVM TO DETECT SPEAKER CHANGING POINTS

Haishan Zhong, David Cho, Vladimir Pervouchine and Graham Leedham

Nanyang Technological University, School of Computer Engineering, N4 Nanyang Ave, Singapore 639798
Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613
Keywords: Speaker recognition, Feature extraction, Feature evaluation.
Abstract:
Automatic speaker change point detection separates different speakers in a continuous speech signal by utilising speaker characteristics. It is often a necessary step before using a speaker recognition system. Acoustic features of the speech signal such as Mel Frequency Cepstral Coefficients (MFCC) and Linear Prediction Cepstral Coefficients (LPCC) are commonly used to represent a speaker. However, these features are affected by speech content, environment, type of recording device, etc. So far, no features have been discovered whose values depend only on the speaker. In this paper, four novel feature types recently proposed in journals and conference papers for the speaker verification problem are applied to the problem of speaker change point detection. The features are also combined using an SVM classifier. The results show that the proposed scheme improves the performance of speaker changing point detection compared to a system that uses MFCC features only. Some of the novel features of low dimensionality give speaker change point detection accuracy comparable to the high-dimensional MFCC features.
1 INTRODUCTION
The aim of speaker changing point detection (speaker segmentation) is to find acoustic events within an audio stream (e.g. finding the speaker changing points in continuous speech files according to different speakers' characteristics). Automatic segmentation of an audio stream according to speaker identities and environmental conditions has gained increasing attention. Since some speech files are obtained from telephone conversations or recorded during meetings, there is often more than one person speaking in the audio recordings. In such cases, before performing speaker recognition it is necessary to separate the audio signal according to the different speakers. Features extracted from a speech waveform are used to represent the characteristics of the speech and the speaker. Among these features, acoustic features are those based on spectrograms of short-term speech segments. However, the feature values that represent a speaker also vary with speech content, environment, type of recording device, etc. So far no features have been discovered whose values depend only on the speaker. Also, different speech features contain different information about a speaker: some features reflect a person's vocal tract shape while others may characterise the vocal tract excitation source.
Generally, there are three main techniques for detecting speaker changing points: decoder-guided, metric-based and model-based. In this paper, a method using Support Vector Machines (SVM) to find speaker changing points in a continuous audio file is presented. The SVM is a binary classifier that constructs a decision boundary to separate two classes. It has gained much attention since experimental results indicate that it can achieve a generalisation performance greater than or equal to that of other classifiers, while requiring less training data to achieve such an outcome (Wan and Campbell, 2000). Speaker segmentation can be treated as a binary decision task: the system must decide whether or not a speech frame contains a speaker changing point. This study uses the SVM to seek speaker changing points by combining commonly used acoustic features with several
novel acoustic features that have recently been proposed by different researchers for the problem of speaker recognition.
The paper is organised as follows: Section 2 describes the speaker segmentation method based on the Bayesian Information Criterion. In Section 3 the feature extraction is described for each feature type. In Section 4 the structure of the SVM speaker segmentation system is explained. Section 5 presents the experimental results and draws conclusions.
2 SPEAKER SEGMENTATION WITH BIC
A speaker changing point detection algorithm using the Bayesian Information Criterion (BIC) was proposed in (Chen and Gopalakrishnan, 1998). A speech signal
is divided into partially overlapping frames of around
30 ms length using a Hamming window. Extraction of
acoustic features is performed for each speech frame.
A sliding window with minimum size $W_{min}$ and maximum size $W_{max}$, shifted by F frames, is used to group several consecutive frames. For details of the grouping algorithm the reader may refer to (Chen and Gopalakrishnan, 1998).
ishnan, 1998). Each segment contains a number of
frames and is represented by the corresponding acous-
tic feature vectors. A segment can be modelled as a
single Gaussian distribution. The distance between
consecutive segments is calculated based on variances
of the Gaussian distributions that model the segments
in the feature space. The variance BIC (Nishida and Kawahara, 2003) was developed from the BIC and is used to measure the distance between two speech segments represented by their feature vectors. The variance
BIC is formulated as:

$$\mathrm{BIC}_{\mathrm{variance}} = \frac{n_1 + n_2}{2}\log|\Sigma_0| - \frac{n_1}{2}\log|\Sigma_1| - \frac{n_2}{2}\log|\Sigma_2| - \alpha\,\frac{1}{2}\left(d + \frac{1}{2}d(d+1)\right)\log(n_1 + n_2) \qquad (1)$$
where $\Sigma_0$, $\Sigma_1$ and $\Sigma_2$ are the covariance matrices of the whole segment, the first segment and the second segment respectively, $n_i$ is the number of frames in the $i$-th segment, and $d$ is the dimensionality of the acoustic feature vectors. The larger the variance BIC of two segments, the higher the probability that a speaker changing point lies between them. A sliding window is used to calculate the variance BIC value over the whole speech file (Chen and Gopalakrishnan, 1998). Local maxima of the variance BIC values over the whole speech are marked as the speaker changing points.
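As an illustration, the following is a minimal sketch of the variance BIC computation for a pair of consecutive segments, assuming full-covariance Gaussian models estimated with numpy. The function name variance_bic and the default penalty weight alpha are illustrative choices, not taken from the cited papers.

import numpy as np

def variance_bic(X1, X2, alpha=1.0):
    """Variance BIC (Eq. 1) between two segments of feature vectors.
    X1, X2: (n_frames, d) arrays of acoustic feature vectors."""
    n1, n2 = len(X1), len(X2)
    d = X1.shape[1]
    X0 = np.vstack([X1, X2])                      # the whole segment
    # log-determinant of the sample covariance matrix of a segment
    logdet = lambda X: np.linalg.slogdet(np.cov(X, rowvar=False))[1]
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n1 + n2)
    return (0.5 * (n1 + n2) * logdet(X0)
            - 0.5 * n1 * logdet(X1)
            - 0.5 * n2 * logdet(X2)
            - alpha * penalty)

A large value of variance_bic at a window position then indicates a likely speaker changing point between the two segments.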
When different acoustic features are used, different variance BIC values are generated for a speech file. These values can be used as features, forming feature vectors for determining speaker changing points. Fig. 1 shows the process of generating a variance BIC value after acoustic feature extraction: once the features are extracted, the feature vector of each frame is used to calculate the variance BIC values.
Figure 1: Generating variance BIC values for a single type of acoustic features.
Figure 2: Combination of variance BIC values generated from different acoustic features into feature vectors.
Most of the acoustic features are of high dimensionality, and simple concatenation of the feature vectors would result in a feature vector of even higher dimensionality, which, in turn, would require too many training samples to train reliably. Instead, in the current study the variance BIC value is calculated from each of the acoustic features as described above. The BIC values calculated for the same frames using different acoustic features form new feature vectors, which are then used with the SVM classifier (Fig. 2).
3 EXTRACTION OF FEATURES
The features were extracted from speech sampled
at 16 kHz. Mel Frequency Cepstral Coefficients
(MFCC) (Oppenheim and Schafer, 2004) features
were used to calculate the variance BIC values.
MFCC vectors were extracted from 30 ms frames
without overlap. The feature values were normalised
by subtracting the mean and dividing by the standard
deviation. First-order difference features were added.
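As a rough sketch, such an MFCC front end can be reproduced with the librosa library as below; the choice of 13 base coefficients (26 with the deltas, matching the dimensionality in Table 1) and the input file name are illustrative assumptions.

import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)       # hypothetical input file
frame_len = int(0.030 * sr)                        # 30 ms frames

# Non-overlapping frames: hop length equals the frame length.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=frame_len, hop_length=frame_len)

# Normalise each coefficient by subtracting its mean and dividing
# by its standard deviation across frames.
mfcc = (mfcc - mfcc.mean(axis=1, keepdims=True)) / mfcc.std(axis=1, keepdims=True)

# Append first-order difference (delta) features.
features = np.vstack([mfcc, librosa.feature.delta(mfcc)])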
In addition, several novel acoustic features were used to calculate variance BIC values: Mel Line Spectrum Frequencies (MLSF) (Cordeiro and Ribeiro, 2006), Hurst parameter features (pH) (Sant’Ana et al., 2006), Haar Octave Coefficients of Residue (HOCOR) (Zheng and Ching, 2004), and MFCC based on the fractional Fourier transform (FrFTMFCC).
FrFTMFCC_p features are extracted similarly to MFCC, the only difference being that the fractional Fourier transform of order p is used in place of the integer-order one. Features of various orders p were tried, and FrFTMFCC_0.9 were chosen because they gave the next highest speaker segmentation accuracy after FrFTMFCC_1.0, which are the conventional MFCC, as measured by the F-score (see below).
MLSF are similar to Line Spectrum Frequencies calculated from LP coefficients and were proposed in the context of the speaker verification problem. A mel spectrum was generated by applying a Fast Fourier Transform (FFT) and a mel filter bank to 30 ms frames. The inverse Fourier transform was applied to calculate the mel autocorrelation of the signal, from which the MLSF features were then calculated via the Levinson-Durbin recursion. LP of order 10 was used. The feature values were normalised by subtracting the mean and dividing by the standard deviation.
The Hurst parameter is calculated for frames of a speech signal via the Abry-Veitch estimator using the discrete wavelet transform (Veitch and Abry, 1998). In the current study a frame length of 60 ms was used, and Daubechies wavelets with four, six, and twelve coefficients were tried, giving rise to pH_4, pH_6, and pH_12 features. The depth of wavelet decomposition was chosen to be 5, 4, and 3 for pH_4, pH_6, and pH_12 respectively, thus resulting in 5-, 4-, and 3-dimensional feature vectors (Sant’Ana et al., 2006).
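A simplified wavelet-based Hurst estimate in the spirit of the Abry-Veitch estimator is sketched below with PyWavelets. The unweighted regression of log2 detail-coefficient energy against octave is a simplification of the estimator, and the scalar output stands in for the per-scale pH feature vector of (Sant’Ana et al., 2006); note that a Daubechies wavelet with four filter coefficients is named "db2" in PyWavelets.

import numpy as np
import pywt

def hurst_wavelet(frame, wavelet="db2", levels=5):
    """Rough Hurst estimate for one frame (e.g. 60 ms = 960 samples at 16 kHz)."""
    coeffs = pywt.wavedec(frame, wavelet, level=levels)
    details = coeffs[:0:-1]                        # detail coefficients, finest octave first
    energies = [np.mean(d ** 2) for d in details]  # mean energy per octave
    j = np.arange(1, levels + 1)
    slope = np.polyfit(j, np.log2(energies), 1)[0]  # scaling exponent
    return (slope + 1.0) / 2.0                     # H for fractional Gaussian noise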
While LP coefficients are aimed at characterising the person's vocal tract shape, information about the glottal excitation source can be extracted from the residual signal

$$e_n = s_n + \sum_{k=1}^{p} a_k s_{n-k}$$

where $s_n$ is the speech signal and $a_k$ are the LP coefficients of order $p$. Haar Octave Coefficients of Residue (HOCOR) features are extracted by applying the Haar transform to the residual signal. In the current study LP of order 12 was applied to 30 ms frames. HOCOR_α features of order α = 1, 2, 3, and 4 were extracted (Zheng and Ching, 2004).
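The residual extraction and a Haar octave decomposition can be sketched as follows, assuming librosa's LPC implementation and PyWavelets; the full HOCOR features of (Zheng and Ching, 2004) involve further pitch-synchronous processing of the octave coefficients that is omitted here.

import numpy as np
import librosa
import pywt
from scipy.signal import lfilter

def lp_residual(frame, order=12):
    """Inverse-filter a frame with its LP polynomial A(z) to obtain the residual."""
    a = librosa.lpc(frame.astype(float), order=order)   # [1, a_1, ..., a_p]
    return lfilter(a, [1.0], frame)                     # e_n as in the equation above

def haar_octaves(residual, levels=4):
    """Haar wavelet detail coefficients of the residual, finest octave first."""
    coeffs = pywt.wavedec(residual, "haar", level=levels)
    return coeffs[:0:-1]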
4 SVM SPEAKER SEGMENTATION
Fig. 3 shows the structure of the SVM speaker segmentation system. For use in the SVM, frames which contain a speaker changing point are labelled as +1, and frames without a speaker changing point are labelled as -1. The acceptable error range for a found speaker changing point was chosen to be 1 second (Ajmera et al., 2004), which means that the frames within half a second before and after a speaker changing point are all labelled as +1. The variance BIC values obtained from different acoustic features are of different orders of magnitude. To use them as features in the SVM, a linear scaling is applied:
$$\hat{f}_i^{\,j} = \frac{f_i^{\,j} - \langle f_i \rangle}{\sigma_i} \qquad (2)$$

where $i$ indexes the different features, $j$ is the frame number of the $i$-th feature, $\langle f_i \rangle$ is the mean value of $f_i^{\,j}$ over all frames $j$, and $\sigma_i$ is its standard deviation.
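In numpy terms, the scaling of Eq. (2) amounts to standardising each variance BIC feature across frames, for example:

import numpy as np

def scale_features(F):
    """F: (n_frames, n_features) matrix of variance BIC values.
    Centre each feature on its mean and divide by its standard deviation."""
    return (F - F.mean(axis=0)) / F.std(axis=0)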
Figure 3: SVM speaker segmentation system.
The SVM classifier returns two values for each frame that are related to the distance to the separating hyperplane (either of them can be monotonically mapped into the conditional class probability). The values sum to one and indicate to what extent the frame belongs to class +1 (or -1). The +1 class value is analysed to determine the true speaker changing points. A peak search algorithm was used to determine the local maxima of the +1 class value as we move along the frames. The peak searching algorithm uses an adaptive threshold in an attempt to eliminate small peaks due to noise and find only true local maxima.
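A rough sketch of this stage with scikit-learn and scipy is given below. The RBF kernel, the synthetic stand-in data, and the use of scipy's find_peaks with a fixed height and minimum peak distance in place of the adaptive-threshold peak search are all illustrative assumptions.

import numpy as np
from sklearn.svm import SVC
from scipy.signal import find_peaks

# Synthetic stand-ins for the variance BIC feature vectors and labels:
# +1 marks frames near a labelled changing point, -1 all other frames.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 10))
y_train = np.where(rng.random(200) < 0.1, 1, -1)
X_test = rng.normal(size=(300, 10))

# SVM with probability outputs (two values per frame that sum to one).
clf = SVC(kernel="rbf", probability=True).fit(X_train, y_train)
p_change = clf.predict_proba(X_test)[:, list(clf.classes_).index(1)]

# Local maxima of the +1 class probability are candidate changing points.
peaks, _ = find_peaks(p_change, height=0.5, distance=30)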
5 RESULTS AND CONCLUSION
The NIST HUB-4E Broadcast News Evaluation data
set was used in this study. The data was obtained
from the audio component of a variety of television
and broadcast news sources; each audio file consists of approximately one hour of speech in English
and includes the speech of several speakers in one audio channel (Hub, 1997). To evaluate the performance of the speaker changing point detection, two criteria were used: the precision of the speaker changing points that were found and the number of missed changing points. The precision indicates the percentage of true turning points among the total number of turning points that were found. The recall indicates how many of the true turning points were found, and hence how many were missed. These two are combined into an F-score. The F-score indicates how good a system is: it is high when both precision and recall are high, and low when either of them is low (Nishida and Kawahara, 2003).
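For reference, the F-score used here is assumed to be the standard harmonic mean of precision and recall, with which the values in Table 1 are consistent:

$$F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$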
Table 1: F-score, precision and recall for different features and their combination via SVM. d is the dimensionality of the acoustic feature vectors.

Feature         d    F-score  Precision  Recall
MFCC            26   0.62     0.61       0.62
MLSF            10   0.42     0.29       0.80
pH_4            5    0.52     0.67       0.43
pH_6            4    0.53     0.67       0.44
pH_12           3    0.55     0.68       0.46
HOCOR_1         6    0.42     0.54       0.35
HOCOR_2         5    0.37     0.47       0.30
HOCOR_3         4    0.31     0.39       0.26
HOCOR_4         3    0.30     0.38       0.25
FrFTMFCC_0.9    12   0.61     0.73       0.56
SVM_1           10   0.64     0.72       0.58
SVM_2           6    0.65     0.75       0.58
Table 1 (except for the two bottom rows) shows the speaker changing point detection results achieved when different acoustic features were used to calculate the variance BIC and the peak detection algorithm was used to detect speaker changing points from the BIC values. It is worth noting that using pH features gives F-scores comparable to those obtained with MFCC features, even though the dimensionality of the pH feature vectors is far lower than that of MFCC. This suggests that pH features may be a better choice when the training data set is small.
The features used for SVM combination 1 (SVM_1) are the 10 variance BIC values resulting from the 10 acoustic features. The results in Table 1 show that the proposed SVM speaker changing point detection scheme improves the detection performance compared to each of the individual acoustic features, with a higher F-score of 0.64. This means that acoustic features which were originally proposed for the speaker recognition problem can be used for the problem of speaker segmentation as well. Because of the low precision and recall values achieved with the HOCOR features, a combination of the acoustic features was attempted without the HOCOR features. The results (SVM_2 in Table 1) were comparable with those of SVM_1. However, elimination of any other acoustic feature from the combination degraded the speaker segmentation performance.
This study demonstrates that the new features do carry information about speaker differences additional to that of MFCC features, and some of them are also attractive because of their low dimensionality. Further study may find better ways to integrate the complementary information about speaker differences contained in the new features with traditional features such as MFCC and LPCC.
REFERENCES
(1997). NIST HUB-4E Broadcast News Evaluation.
Ajmera, J., McCowan, I., and Bourlard, H. (2004). Robust
speaker change detection. IEEE Signal Process. Lett.,
11(8).
Chen, S. and Gopalakrishnan, P. (1998). Speaker, environ-
ment and channel change detection and clustering via
the Bayesian Information Criterion. In DARPA Speech
Recognition Workshop, pages 127–132.
Cordeiro, H. and Ribeiro, C. (2006). Speaker characteriza-
tion with MLSF. In Odyssey 2006: The Speaker and
Language Recognition Workshop, San Juan, Puerto
Rico.
Nishida, M. and Kawahara, T. (2003). Unsupervised
speaker indexing using speaker model selection based
on Bayesian Information Criterion. In Proc. IEEE
ICASSP, volume 1, pages 172–175.
Oppenheim, A. and Schafer, R. (2004). From frequency to
quefrency: a history of the cepstrum. Signal Process-
ing Magazine, IEEE, (5):95–106.
Sant’Ana, R., Coelho, R., and Alcaim, A. (2006). Text-independent speaker recognition based on the Hurst parameter and the multidimensional fractional Brownian motion model. IEEE Trans. Audio, Speech, Lang. Process., 14(3):931–940.
Veitch, D. and Abry, P. (1998). A wavelet-based joint estimator of the parameters of long-range dependence. IEEE Trans. Inf. Theory, 45(3):878–897.
Wan, V. and Campbell, W. M. (2000). Support vector machines for speaker verification and identification. In Proc. IEEE Workshop on Neural Networks for Signal Processing X, pages 775–784.
Zheng, N. and Ching, P. (2004). Using Haar transformed
vocal source information for automatic speaker recog-
nition. In IEEE ICASSP, pages 77–80, Montreal,
Canada.