Experiments on Adaptation Methods to Improve Acoustic Modeling for
French Speech Recognition
Saeideh Mirzaei³, Pierrick Milhorat⁴, Jérôme Boudy¹, Gérard Chollet² and Mikko Kurimo³
¹Department of Electronics and Physics, Telecom SudParis, Evry, France
²CNRS - Laboratoire Traitement et Communication de l'Information, Telecom ParisTech, Paris, France
³Department of Signal Processing and Acoustics, Aalto University, Espoo, Finland
⁴Media Archiving Research Laboratory, Kyoto University, Kyoto, Japan
Keywords:
Speech Recognition, Speaker Adaptation, Linear Regression, Vocal Tract.
Abstract:
To improve the performance of Automatic Speech Recognition (ASR) systems, the models must be retrained
in order to better adjust to the speaker’s voice characteristics, the environmental and channel conditions or
the context of the task. In this project we focus on the mismatch between the acoustic features used to train
the model and the vocal characteristics of the front-end user of the system. To overcome this mismatch,
speaker adaptation techniques have been used. A significant performance improvement has been shown using constrained Maximum Likelihood Linear Regression (cMLLR) model adaptation, while fast adaptation is guaranteed by linear Vocal Tract Length Normalization (lVTLN). We have achieved a relative gain of approximately 9.44% in the word error rate with unsupervised cMLLR adaptation. We also
compare our ASR system with the Google ASR and show that, using adaptation methods, we exceed its
performance.
1 INTRODUCTION
Automatic Speech Recognition (ASR) systems can
play a great role in today’s Human-Machine Interac-
tive (HMI) systems. As ASR systems are introduced
to a wide range of applications, their accuracy becomes increasingly important. It depends on a number of factors: whether the system recognizes continuous speech, the extent of the task domain, the type of speech (planned or spontaneous), and so on.
Our task is large-vocabulary continuous speech recognition on data from different broadcast programs, containing both planned and spontaneous speech; within it, we focus on the speaker dependency of the acoustic models. A Speaker Dependent (SD) system
is trained using data from only one speaker, whereas
a Speaker Independent (SI) system contains features
from a large number of speakers. SD systems have been shown to perform better than SI systems, but since training models demands a large amount of data, SD systems are not feasible in practice. Hence, adaptation
methods are deployed to improve SI models using a
small amount of data from a new user. Without labels
provided for the adaptation data, estimating parame-
ters here is done in an unsupervised manner.
Previous work on the French broadcast news data
reported a word error rate between 12 and 26 percent
in the campaign held between 2007 and 2009 (Gal-
liano et al., 2009). Since the data setup in this work differs from that of the campaign, our results are not directly comparable with those reported in that paper. The objective of our work is to investigate the
improvement achieved by speaker adaptation and so
the data has been rearranged based on the speakers.
This paper is organized as follows: Section 2 gives an overview of the adaptation methods, with details on those implemented in this work, namely vocal tract length normalization and linear regression methods for estimating the transformation parameters. Section 3 presents the tools used to carry out the experiments. Section 4 describes the data set and how it is arranged for training and testing. Section 5 introduces the evaluation metrics used in our experiments. Section 6 presents the results, with precise information on the settings used in the experiments. We conclude the paper in Section 7 and provide a perspective on possible future work.
2 ACOUSTIC MODEL ADAPTATION
Acoustic model adaptation techniques, in general, are used to reduce the mismatch between the trained model parameters and the test data conditions, such as channel and environmental effects or the characteristics of a new speaker's voice. This is achieved by transforming the feature set or by adjusting the model parameters, which here are the Gaussian Mixture Model parameters (GMM means and covariances) within Hidden Markov Models (HMMs).
The desired qualities of adaptation techniques are to
be fast, to require a small amount of adaptation data
and to asymptotically converge to the maximum like-
lihood estimate of the parameter.
Maximum A Posteriori (MAP) (Gauvain and Lee,
1994) estimates each parameter after a sample con-
taining that parameter is observed. Although converg-
ing to maximum likelihood estimates, adaptation us-
ing MAP is very slow considering the large number
of existing parameters in a model. In fact, a large
amount of adaptation data is required for MAP to
be effective. This motivated methods that tie parameters together so that a single estimate applies to a whole class of parameters. Structural MAP (Shinoda and Lee, 1997) and Regression-based Model Prediction (Ahadi and Woodland, 1997) are techniques based on MAP.
Some other proposed methods use regression analy-
sis or pooling techniques to accelerate adaptation. In
this section the three methods used in this work are
explained.
2.1 Vocal Tract Length Normalization
Differences in vocal tract shape and length among speakers cause the formant frequencies to vary from one speaker to another. This can be noticed between a male and a female speaker: the formants of a typical female speaker are higher than those of a typical male speaker. VTLN normalizes the perceived voice during feature extraction, based on the positions of the formants, to reduce this variation. A piece-wise linear approach has been used, i.e. the transformation function is linear (Eide and Gish, 1996), and only one warping factor has to be estimated:
$$\hat{f} = A f + b \qquad (1)$$

where $\hat{f}$ and $f$ are the warped and unwarped frequencies, respectively. Only one parameter has to be estimated, the warping factor, within a defined range, e.g. between 0.8 and 1.2.
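As a concrete illustration, below is a minimal Python sketch of a piece-wise linear warp of this kind. The cutoff frequency, the compensating upper segment and the grid of candidate warping factors are assumptions for the example, not the exact implementation used here:

```python
import numpy as np

def vtln_warp(freqs, alpha, f_high=8000.0, f_cut=7000.0):
    """Piece-wise linear VTLN warp (illustrative, hypothetical cutoffs).

    Below the break point the warp is the linear map f_hat = alpha * f
    of equation (1) with b = 0; above it, a second linear segment maps
    the remaining range onto [alpha * f_b, f_high] so the band edge
    stays fixed and the warp remains monotone.
    """
    freqs = np.asarray(freqs, dtype=float)
    f_b = min(f_cut, f_cut / alpha)  # break point, kept inside [0, f_high]
    upper = alpha * f_b + (freqs - f_b) * (f_high - alpha * f_b) / (f_high - f_b)
    return np.where(freqs <= f_b, alpha * freqs, upper)

# In practice the warp is applied to the mel filterbank frequencies
# during feature extraction, and the warping factor is picked from a
# grid by maximizing the likelihood of the adaptation data.
centres = np.linspace(100.0, 7900.0, 23)
for alpha in (0.8, 0.9, 1.0, 1.1, 1.2):
    warped = vtln_warp(centres, alpha)
    print(f"alpha={alpha:.1f}  first centre -> {warped[0]:.1f} Hz")
```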
2.2 Maximum Likelihood Linear Regression
The Maximum Likelihood Linear Regression
(MLLR) (Leggetter and Woodland, 1995) method
uses a linear transformation matrix to re-estimate
the model parameters. The transformation is applied to either the Gaussian parameters or the features, defining unconstrained or constrained MLLR, respectively. Different variations of MLLR
estimate transformation matrices for means only, for
means and variances, etc. With constrained MLLR
(cMLLR) the same transformation matrix is used
to transform both means and covariances of the
Gaussians.
$$\hat{\mu} = A \mu + b \qquad (2)$$
$$\hat{\Sigma} = A \Sigma A^{T} \qquad (3)$$

The transformed mean vector $\hat{\mu}$ and covariance matrix $\hat{\Sigma}$ are obtained by applying the transformation matrix $A$ to the original values $\mu$ and $\Sigma$.
To make adaptation robust, Gaussians that are close together in the acoustic space, or Gaussians in the same state, can be grouped into regression classes, with the same transformation matrix applied to every member of a class. This is essential when the amount of adaptation data is small and the probability of not observing the effect of some parameters is high. When a large amount of adaptation data is available, finer transformations can be applied to smaller groups of Gaussians. Special attention must be paid to the ratio between the amount of adaptation data and the size of the transformation matrix, to prevent overfitting the model parameters.
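As an illustration of equations (2) and (3), the following sketch applies a single, already-estimated cMLLR transform to a set of Gaussian parameters; the iterative maximum-likelihood estimation of $A$ and $b$ itself is more involved and is omitted:

```python
import numpy as np

def apply_transform(means, covs, A, b):
    """Apply one shared transform to Gaussian parameters.

    means: (n, d) mean vectors;  covs: (n, d, d) covariances
    A:     (d, d) transformation matrix;  b: (d,) bias
    Implements mu_hat = A mu + b and Sigma_hat = A Sigma A^T.
    """
    new_means = means @ A.T + b
    new_covs = np.einsum("ij,njk,lk->nil", A, covs, A)
    return new_means, new_covs

# Toy example: four Gaussians in a 39-dimensional feature space,
# transformed with a near-identity matrix (all values made up).
rng = np.random.default_rng(0)
d, n = 39, 4
means = rng.normal(size=(n, d))
covs = np.stack([np.eye(d)] * n)
A = np.eye(d) + 0.01 * rng.normal(size=(d, d))
b = 0.1 * rng.normal(size=d)
mu_hat, sigma_hat = apply_transform(means, covs, A, b)
print(mu_hat.shape, sigma_hat.shape)  # (4, 39) (4, 39, 39)
```

In a regression-class setup, one such $(A, b)$ pair would be estimated per class and applied to all Gaussians of that class.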
2.3 Speaker Adaptive Training
Speaker Adaptive Training (SAT) is used to train the
speaker independent acoustic model on the average
voice (Anastasakos et al., 1996). Either VTLN or
MLLR can be used to eliminate the inter-speaker vari-
ability during the estimation of the HMM parameters. The speaker transformation matrices must be estimated jointly with the HMM parameters:
$$(\hat{\lambda}, \hat{G}) = \operatorname*{argmax}_{\lambda, G} \prod_{r} \mathcal{L}\left(O^{r}; G^{r}(\lambda)\right) \qquad (4)$$
where $\lambda$ is the vector of HMM parameters and $G$ is a block matrix containing the speaker-specific transformation matrices. Training maximizes the likelihood of the observations $O^{r}$ from each speaker $r$, given the transformation matrix $G^{r}$ specified for that speaker and the model parameters. SAT gives
better results when used with adaptation methods. Its
main drawback is the large memory requirement to
store all the transformation matrices.
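The alternation behind equation (4) can be sketched on a one-dimensional toy problem: per-speaker affine transforms and a canonical Gaussian are re-estimated in turn, each step holding the other fixed. This is a schematic stand-in for the real procedure, in which the transforms are cMLLR matrices and the model is the full HMM:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: one canonical Gaussian seen through per-speaker affine
# distortions x = a_r * z + b_r (speakers and distortions made up).
speakers = {r: a * rng.normal(size=300) + b
            for r, (a, b) in enumerate([(1.1, 0.5), (0.9, -0.3), (1.0, 0.8)])}

mu, var = 0.0, 1.0                         # canonical model parameters
trans = {r: (1.0, 0.0) for r in speakers}  # per-speaker (a, b)

for _ in range(10):
    # Step 1: model fixed, re-estimate each speaker's transform
    # (moment matching as a closed-form stand-in for ML estimation).
    for r, x in speakers.items():
        a = float(np.sqrt(x.var() / var))
        b = float(x.mean() - a * mu)
        trans[r] = (a, b)
    # Step 2: transforms fixed, update the model on normalized data.
    z = np.concatenate([(x - trans[r][1]) / trans[r][0]
                        for r, x in speakers.items()])
    mu, var = float(z.mean()), float(z.var())

print(mu, var)  # canonical model with speaker variation factored out
```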
3 TOOLS
The tools used to perform the experiments were all
open source. The language models were built using
SRILM (Stolcke et al., 2002). The same tool was used
to assess these language models. The selected lan-
guage model in ARPA format was then transformed
to FST format by OpenFST (Allauzen et al., 2007)
to be readable by Kaldi (Povey et al., 2011). Kaldi,
a powerful ASR tool, was used to build the acoustic
models, perform adaptation methods and produce the
outputs for the final evaluation of the ASR system.
4 DATA
The data from Ester - ISLRN: 110-079-844-983-7; ELRA-E0021, Catalogue ELRA (Évaluation des systèmes de transcription enrichie d'émissions radiophoniques) (Galliano et al., 2006) and Etape - ANR ANR-09-CORD-009-05 (Évaluations en Traitement Automatique de la Parole) (Gravier et al., 2012) were combined to form our data set for training and testing.
The data is from French TV and Radio broadcasts.
Etape, compared to Ester, contains more spontaneous
speech and has more multiple-speaker segments, and
so it is more challenging for speech recognition tasks.
The sampling rate of the audio files is 16 kHz. After a
manual segmentation of audio files to extract speech
parts only, the average length of the resulting files was
3.5 seconds.
Given the nature of the work, we needed to re-
arrange the data based on the speakers. 18 speak-
ers with the highest amount of speech data from both
sets were used as the evaluation set. An equal num-
ber of speakers were extracted from Ester and Etape
and only single-speaker segments were preserved for
testing. The rest of the data was used for training af-
ter excluding those segments containing any speaker
from the test set. In total 145 hours of speech were
used for training and 18 hours for test. The test part
contains data from both Ester and Etape: 8 hours from
Ester only, 8 hours from Etape only and 2 hours com-
mon to the two sets. The two hours of shared data come from the two speakers appearing in both data sets; the results for these two speakers are presented separately as Ester-Etape.
Test sets from Ester, Etape and Ester-Etape include
46155, 60372 and 16329 words respectively.
5 EVALUATION METRICS
The criterion used to select the language model is perplexity, which measures how uncertain the language model is when predicting the next word. It is formulated as follows:
$$\text{Perplexity} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{\Pr(W_i \mid H)}} \qquad (5)$$

where $N$ is the number of words in the text and $H$ denotes the history of word $W_i$.
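In practice, equation (5) is evaluated in log space to avoid numeric underflow on long texts; a minimal sketch:

```python
import math

def perplexity(word_log_probs):
    """Perplexity from per-word log-probabilities log Pr(W_i | H).

    Computed in log space: exp(-(1/N) * sum(log p_i)), which equals
    the N-th root of the product of 1/p_i in equation (5).
    """
    n = len(word_log_probs)
    return math.exp(-sum(word_log_probs) / n)

# Example: a 4-word sentence whose words the model predicts with
# probabilities 0.1, 0.2, 0.05 and 0.1 (made-up numbers).
print(perplexity([math.log(p) for p in (0.1, 0.2, 0.05, 0.1)]))  # ~17.8
```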
To evaluate the performance of the ASR system,
the word error rate (WER) is used. It is defined based
on the Levenshtein distance and is calculated as fol-
lows:
$$\mathrm{WER} = \frac{I + D + S}{T} \qquad (6)$$

with $I$ the number of inserted words, $D$ the number of deleted words, $S$ the number of substituted words, and $T$ the total number of words in the reference.
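The counts $I$, $D$ and $S$ come from a word-level Levenshtein alignment between the reference and the hypothesis; a minimal dynamic-programming sketch:

```python
def wer(reference, hypothesis):
    """Word error rate from a word-level Levenshtein alignment.

    The DP table d[i][j] holds the minimum number of insertions,
    deletions and substitutions needed to turn the first i reference
    words into the first j hypothesis words; WER = d[-1][-1] / T.
    """
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                           # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                           # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match/substitution
    return d[-1][-1] / len(ref)

print(wer("le chat dort", "le chien dort bien"))  # 2 errors / 3 words
```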
The margin of error is at most 1% at 95% confidence, given the 45000-word (Ester), 60000-word (Etape) and 16000-word (Ester-Etape) test sets. It is obtained using the following formula:
$$I_c = 1.96 \times \sqrt{\frac{\bar{x}(1-\bar{x})}{n}} \qquad (7)$$

with $\bar{x}$ the error rate and $n$ the sample size. The value 1.96 is the 95%-coverage quantile of the standard normal distribution.
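As a quick check, plugging representative error rates (those of the basic model in Table 2 below) and the test-set word counts of Section 4 into equation (7):

```python
import math

def margin(x_bar, n):
    """95% margin of error for a proportion (equation 7)."""
    return 1.96 * math.sqrt(x_bar * (1 - x_bar) / n)

# Error rates from the basic triphone model, word counts from Section 4.
for name, x_bar, n in [("Ester", 0.282, 46155),
                       ("Etape", 0.537, 60372),
                       ("Ester-Etape", 0.523, 16329)]:
    print(name, round(100 * margin(x_bar, n), 2), "%")
# Roughly 0.41%, 0.40% and 0.77% -- all below the 1% bound quoted above.
```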
6 EXPERIMENTAL RESULTS
This section first describes the experimental setup; the evaluation results are then presented in the following parts.
6.1 Set Up
The feature set is constructed using Mel-Frequency
Cepstral Coefficients (MFCCs) (Davis and Mermel-
stein, 1980). The MFCCs are extracted over 25 ms frames with a frame shift of 10 ms. The
first 13 coefficients form the basic feature vector. In
the unadapted model system, these coefficients and
their first and second derivatives are adjoined to build
the 39-element feature vector. The feature set for
the adapted model system is obtained by using Het-
eroscedastic Linear Discriminant Analysis (HLDA)
(Kumar and Andreou, 1998). HLDA is implemented
on Gaussians using only means and the new classes
are assumed to have fixed variances, except for the general model, which has unit variance. With a context window of ±3 frames (7 consecutive feature vectors appended), a 91-dimensional vector is formed and then reduced to 40 dimensions, one of which serves as the general model for all the rejected dimensions. Cepstral Mean and Variance Normalization (CMVN) (Prasad and Umesh, 2013) is applied segment-wise during both training and testing to cancel channel effects.
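For illustration, here is a minimal sketch of the basic 39-dimensional feature pipeline using the librosa library; the features in this work were extracted within the Kaldi pipeline, so the filterbank details and the file path below are assumptions:

```python
import librosa
import numpy as np

# Load a 16 kHz speech segment (path is a placeholder).
y, sr = librosa.load("segment.wav", sr=16000)

# 13 MFCCs over 25 ms frames with a 10 ms shift.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr),
                            hop_length=int(0.010 * sr))

# Append first and second derivatives -> 39-dimensional vectors.
feats = np.vstack([mfcc,
                   librosa.feature.delta(mfcc),
                   librosa.feature.delta(mfcc, order=2)])

# Segment-wise CMVN: zero mean, unit variance per dimension.
feats = (feats - feats.mean(axis=1, keepdims=True)) \
        / (feats.std(axis=1, keepdims=True) + 1e-10)
print(feats.shape)  # (39, n_frames)
```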
To build the language models a lexicon of 54k
words was used with 39 phonemes. Two 3-gram lan-
guage models were built. One was built on the training parts of the Ester and Etape data sets, with Kneser-Ney smoothing (Chen and Goodman, 1999) applied to the model. The other language model was produced from the Google n-gram counts for French made available in 2009; in the latter, n-grams with probability below $10^{-7}$ were pruned out. The
size and perplexity (tested on the test corpus) of these
two language models are presented in Table 1. We
used the language model trained on Ester and Etape
data sets to perform the experiments because of its
smaller size and lower perplexity.
Table 1: Language models and their perplexities.
Data set Perplexity Size
Ester-Etape training set 150 4.5M
Google n-gram counts 289 104M
The monophone model was built with 132 states
and 1000 Gaussians in total. The triphone model was
built with approximately 3000 states and a total num-
ber of 56000 Gaussians (18 Gaussians per state). All
the settings during training and decoding were left at their default values: 35 estimation iterations and an acoustic likelihood scaling factor of 0.083333, which was also used during decoding. The maximum number of active states per frame was 7000, and a beam of 13 was used for pruning during graph search and when determining the lattices after decoding. SAT was
then implemented to build the final speaker indepen-
dent model with normalized GMMs by applying cM-
LLR.
6.2 Results
Unsupervised adaptation was implemented. During
adaptation and decoding, the data was fed to the sys-
tem in batch mode. The first decoding pass produced initial lattices, from which the transformation matrices were estimated; a second pass then used the readjusted parameters to produce the final lattices.
Table 2 shows the results for the basic model and
unsupervised adapted models using lVTLN and cM-
LLR. Compared to Etape, tests on Ester data set re-
vealed better results in general since this set includes
mostly planned speech.
Both adaptation methods improved the perfor-
mance of the basic system (Triphone Model in Ta-
ble 2) but cMLLR proved to be more effective than
lVTLN in all test sets. With Ester test set, cMLLR
improved the performance by 11.3% while the gain
obtained by lVTLN with the same set was 7.4%. cM-
LLR provided a relative gain of 8.2% for the Etape
test data. The improvement for the same data by
lVTLN was 5.6%. All relative gains are calculated with respect to the results of the basic model.
Table 2: WER%; a gain between 6-12 percent is obtained
by adaptation.
Ester Etape Ester-Etape
Triphone Model 28.2% 53.7% 52.3%
SAT+lVTLN 26.1% 50.7% 47.7%
SAT+cMLLR 25.0% 49.3% 46.2%
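The relative gains quoted above follow directly from Table 2; a short arithmetic check:

```python
# Relative gain = (baseline WER - adapted WER) / baseline WER.
for name, base, adapted in [("Ester/cMLLR", 28.2, 25.0),
                            ("Ester/lVTLN", 28.2, 26.1),
                            ("Etape/cMLLR", 53.7, 49.3),
                            ("Etape/lVTLN", 53.7, 50.7)]:
    print(name, f"{100 * (base - adapted) / base:.1f}%")
# -> 11.3%, 7.4%, 8.2% and 5.6%, matching the figures in the text.
```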
In the second experiment, in which only cMLLR
was implemented for adaptation, we increased the
number of adaptation utterances to investigate the im-
provement achieved corresponding to each amount.
The number of utterances was increased from 1 to 10.
The results in this part are compared with the perfor-
mance of Google ASR and the basic model in Fig-
ure 1, with test on Ester data set only. The confidence
interval with this set is 0.4%.
The horizontal axis in Figure 1 shows the num-
ber of adaptation utterances (for the adapted model)
and the vertical axis shows the WER. The two hori-
zontal lines in the figure display the WER for the ba-
sic model (dashed line) and the Google ASR outputs
(solid line). The WER by Google ASR and the basic
model were 26.1% and 28.2% respectively.
Figure 1: A comparison between the Basic Model, the Google ASR and the incrementally Adapted Model.
We observe how the performance changes as the number of adaptation utterances increases: using only one utterance as adaptation data degrades the performance, resulting in a higher WER; from two to six utterances, the performance improves gradually; beyond six utterances, no further gain is obtained. The curve would be expected to reach 25% (the WER in Table 2) if the amount of adaptation data were increased. We also observe that the adapted model reaches the Google ASR performance with two utterances and outperforms it with more adaptation utterances.
7 CONCLUSIONS
Here we presented a large vocabulary continuous speech recognition system based on GMM-HMM acoustic models, and implemented adaptation methods to improve it. Two methods, lVTLN and cMLLR, were used for unsupervised acoustic model adaptation. Their performance was compared with the speaker independent system by testing on the Ester and Etape data sets. The basic model, a triphone model, was improved by applying SAT together with lVTLN or cMLLR; cMLLR yielded a relative WER reduction of 9.44 percent. Finally, the basic model and the cMLLR-adapted model were compared with the Google ASR, and it was shown that cMLLR adaptation improves the basic system enough to surpass the Google ASR.
We also observed that, in general, the system worked better on the data set containing more planned speech, which underlines the importance of a good language model. We therefore believe further gains could be obtained by improving the language model, e.g. by interpolating the Google n-gram counts with the n-gram language model built from the training set.
REFERENCES
Ahadi, S. and Woodland, P. C. (1997). Combined bayesian
and predictive techniques for rapid speaker adaptation
of continuous density hidden markov models. Com-
puter speech & language, 11(3):187–206.
Allauzen, C., Riley, M., Schalkwyk, J., Skut, W., and
Mohri, M. (2007). Openfst: A general and efficient
weighted finite-state transducer library. In Imple-
mentation and Application of Automata, pages 11–23.
Springer.
Anastasakos, T., McDonough, J., Schwartz, R., and
Makhoul, J. (1996). A compact model for speaker-
adaptive training. In Spoken Language, 1996. ICSLP
96. Proceedings., Fourth International Conference on,
volume 2, pages 1137–1140. IEEE.
Chen, S. F. and Goodman, J. (1999). An empirical study of
smoothing techniques for language modeling. Com-
puter Speech & Language, 13(4):359–393.
Davis, S. B. and Mermelstein, P. (1980). Comparison
of parametric representations for monosyllabic word
recognition in continuously spoken sentences. Acous-
tics, Speech and Signal Processing, IEEE Transac-
tions on, 28(4):357–366.
Eide, E. and Gish, H. (1996). A parametric approach to vo-
cal tract length normalization. In Acoustics, Speech,
and Signal Processing, 1996. ICASSP-96. Conference
Proceedings., 1996 IEEE International Conference
on, volume 1, pages 346–348. IEEE.
Galliano, S., Geoffrois, E., Gravier, G., Bonastre, J.-F.,
Mostefa, D., and Choukri, K. (2006). Corpus descrip-
tion of the ester evaluation campaign for the rich tran-
scription of french broadcast news. In Proceedings of
LREC, volume 6, pages 315–320.
Galliano, S., Gravier, G., and Chaubard, L. (2009). The
ester 2 evaluation campaign for the rich transcription
of french radio broadcasts. In Interspeech, volume 9,
pages 2583–2586.
Gauvain, J.-L. and Lee, C.-H. (1994). Maximum a poste-
riori estimation for multivariate gaussian mixture ob-
servations of markov chains. Speech and audio pro-
cessing, ieee transactions on, 2(2):291–298.
Gravier, G., Adda, G., Paulson, N., Carré, M., Giraudel,
A., and Galibert, O. (2012). The etape corpus for
the evaluation of speech-based tv content processing
in the french language. In LREC - Eighth International Conference on Language Resources and Evaluation.
Kumar, N. and Andreou, A. G. (1998). Heteroscedastic
discriminant analysis and reduced rank hmms for im-
proved speech recognition. Speech communication,
26(4):283–297.
Leggetter, C. J. and Woodland, P. C. (1995). Maximum
likelihood linear regression for speaker adaptation of
continuous density hidden markov models. Computer
Speech & Language, 9(2):171–185.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glem-
bek, O., Goel, N., Hannemann, M., Motlíček, P., Qian,
Y., Schwarz, P., et al. (2011). The kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE.
Prasad, N. V. and Umesh, S. (2013). Improved cep-
stral mean and variance normalization using bayesian
framework. In Automatic Speech Recognition and Un-
derstanding (ASRU), 2013 IEEE Workshop on, pages
156–161. IEEE.
Shinoda, K. and Lee, C.-H. (1997). Structural map speaker
adaptation using hierarchical priors. In Automatic
Speech Recognition and Understanding, 1997. Pro-
ceedings., 1997 IEEE Workshop on, pages 381–388.
IEEE.
Stolcke, A. et al. (2002). Srilm-an extensible language mod-
eling toolkit. In INTERSPEECH.