An Efﬁcient Method for Making Un-supervised

Adaptation of HMM-based Speech Recognition Systems

Robust Against Out-of-Domain Data

Thomas Plötz and Gernot A. Fink

University of Dortmund, Germany

Robotics Research Institute

Intelligent Systems Group

Abstract. Major aspects of cognitive science are based on natural language pro-

cessing utilizing automatic speech recognition (ASR) systems in scenarios of

human-computer interaction. In order to improve the accuracy of related HMM-

based ASR systems efﬁcient approaches for un-supervised adaptation represent

the methodology of choice.

The recognition accuracy of speaker-speciﬁc recognition systems derived by on-

line acoustic adaptation directly depends on the quality of the adaptation data

actually used. It drops signiﬁcantly if sample data out-of-scope (lexicon, acous-

tic conditions) of the original recognizer generating the necessary annotation is

exploited without further analysis.

In this paper we present an approach for fast and robust MLLR adaptation based

on a rejection model which rapidly evaluates an alternative to existing conﬁ-

dence measures, so-called log-odd scores. These measures are computed as ratio

of scores obtained from acoustic model evaluation to those produced by some

reasonable background model. By means of log-odd scores threshold based de-

tection and rejection of improper adaptation samples, i.e. out-of-domain data, is

realized.

By means of experimental evaluations on two challenging tasks we demonstrate

the effectiveness of the proposed approach.

1 Introduction

Automatic speech recognition (ASR) represents one important pre-requisite for ad-

vanced natural language processing methods. In order to get more detailed insights into

human cognition processes the application of sophisticated automatic speech recogni-

tion in real-world interaction scenarios has become a standard approach. Consequently,

intelligent and most natural human-computer interaction substantially relies on robust

ASR systems.

⋆

The work described in this paper was partially supported by the German Research Founda-

tion (DFG) within the Collaborative Research Centre “Situated Artiﬁcial Communicators”

(SFB 360) at Thomas Plötz’ former afﬁliation (University of Bielefeld, Germany, Faculty of

Technology, Applied Computer Science Group).

Plötz T. and A. Fink G. (2007).

An Efﬁcient Method for Making Un-supervised Adaptation of HMM-based Speech Recognition Systems Robust Against Out-of-Domain Data.

In Proceedings of the 4th International Workshop on Natural Language Processing and Cognitive Science, pages 109-118

DOI: 10.5220/0002416701090118

 SciTePress

Humans are able to understand spoken language almost independently on the partic-

ular acoustic environment where it has been uttered. The key to this remarkable ability

lies in adaptation to either different speakers, unfamiliar dialects or additional noise.

Trying to imitate this behavior, in the last decade robust adaptation techniques of ASR

systems based on stochastic models – most notably Hidden Markov Models (HMMs)

[1] – have been developed. These techniques allow for the successful application of

ASR systems to environments where the acoustic conditions are different from those

met during training of the recognizer.

Since the acoustic environment of applications within the context of human-

computer interaction, usually, changes almost continuously, rapid and automatic adap-

tation of ASR systems is of major interest. In these premises, in the last few years es-

pecially speaker adaptation based on Maximum Likelihood Linear Regression (MLLR)

[2] has been applied successfully. Basically, this procedure offers two advantages:

1. Large amounts of general, i.e. speaker-independent speech data can be used for

robustly estimating a base system.

2. Speaker-dependent training data can be exploited as effectively as possible for

deriving high-quality speaker-speciﬁc recognizers from the original speaker-

independent system.

For most interactive applications utilizing natural language processing un-

supervised adaptation procedures represent the methodology of choice since labeled

sample data uttered by speciﬁc speakers is, usually, rarely accessible. In these cases

the annotation of adaptation data is obtained on-line from evaluating the base system

appropriately. However, if the speaker-independent system provides poor recognition

results the quality of the speaker-speciﬁc recognizer decreases, too.

Especially within “dynamic” speech-recognition domains, like e.g. in car environ-

ments or in the context of human-robot interaction, ASR adaptation is, according to our

experiences, likely to fail for various reasons. Most prominently speaker-independent

base systems are trained with respect to certain lexica but, naturally, naive ASR users

often do not restrict themselves to such limited inventories. Either they do not know

the particular lexicon or certain alternative conversation, e.g. to an instructor or another

speaker, takes place which is (erroneously) used as adaptation samples. The resulting

(false) hypotheses generated for this out-of-domain data by the base-system are then

used for adaptation which is counterproductive for ASR improvement.

In order to circumvent this kind of mis-adaptation the hypotheses provided by the

speaker-independent base-system need to be judged in some way with respect to their

usefulness for adaptation. Basically, this judgment can be obtained by evaluating con-

ﬁdence measures. Those hypotheses which, according to the conﬁdence measure used,

are classiﬁed as not suitable for proper adaptation should be rejected for speaker depen-

dent specialization.

Within the domain of human-computer interaction computational facilities avail-

able for ASR are limited. Prominent examples are recognizers on embedded systems

as e.g. in car environments or mobile robots. In addition to this the particular recog-

nition system might need to fulﬁll certain more or less strict constraints with respect

to processing-time. For multi-modal architectures where the speech recognizer is only

110

one part of the overall system the problem further increases. This implies that complex

models can hardly be used for monitoring the actual speaker-adaptation process.

Surprisingly, there is only little literature on hypotheses judgment available focusing

on rapid un-supervised adaptation approaches in dynamic environments including com-

putational constraints as outlined above. In [3] conﬁdence measures in terms of poste-

rior word probabilities are exploited in order to “supervise” ASR adaptation. However,

related systems, usually, consist of rather complex techniques which seems problematic

for restricted computation facilities.

Contrary, in our work we concentrated on the development of a rejection scheme

which can be used very efﬁciently for increasing the robustness of MLLR-based speaker

adaptation of ASR-systems. Due to the limited resources available in a typical human-

computer interaction scenario our approach proposes the application of a rather simple

but effective technique principally known from detection applications utilizing discrete

HMMs. The emission probabilities of semi-continuous HMMs are normalized prior to

the model evaluation step. Given the transformed parameters the standard single-pass

model evaluation results in so-called log-odd scores which can be compared directly to

absolute thresholds. Consequently, poorly scoring hypotheses are rejected for adapta-

tion which increases the overall quality of the speaker-dependent recognizer derived. In

an experimental evaluation on two challenging recognition tasks simulating the typical

dynamic environment as addressed by this paper we demonstrate the effectiveness of

our new approach.

2 Related Work

In typical applications where only little sample data is available speaker-related special-

ization of ASR systems is, usually, limited to the adaptation of the acoustic model, i.e.

the mixture parameters of the Hidden Markov Models used. For this purpose, certainly,

one of the most promising approaches is the Maximum Likelihood Linear Regression

(MLLR) technique.

Based on one or more regression classes the maximum likelihood optimization for

small amounts of adaptation data covering only parts of the acoustic model is general-

ized to the complete set of parameters. For each regression class an afﬁne transforma-

tion representing rotations and translations is applied to the appropriate means of the

mixture densities. Using MLLR signiﬁcant improvements of recognition accuracy are

achievable with only small amounts of adaptation data. Since the computational effort

required is not substantial MLLR has also been used successfully for online applica-

tions (cf. e.g. [4,5]).

In order to judge hypotheses obtained from ASR systems conﬁdence measures,

usually computed using some additional general-purpose or higher level recognizer,

are widely applied. Such judgments can be used for out-of-vocabulary rejection, word

spotting etc. [1].

For the ﬁrst general kind of conﬁdence measures Bayes’ rule is directly exploited

for the calculation of posterior probabilities P (w|x) for word hypotheses w given the

acoustic input x, the acoustic model and the language model probability (P (x|w) and

111

P (w), respectively):

P (w|x) =

P (w)P (x|w)

P (x)

P (w)P (x|w)

(1)

Based on this principle e.g. in [6] posterior probabilities are used as conﬁdence mea-

sures for large vocabulary speech recognition. The general-purpose recognizer actually

used for computing P (x) is often referred to as ﬁller model and posterior probabilities

(cf. equation 1) give reasonable hints for hypotheses’ conﬁdences.

Alternatively recognition hypotheses can be judged by conﬁdence measures which

are computed separately by (more or less heuristically) combining the results of various

so-called predictor models. As an example in [7] several numerical values, obtained e.g.

from parallel N -best evaluation or alternative, less complex recognizers (e.g. “phone-

only” decoding), have been exploited.

The majority of applications utilizing conﬁdence measures is directed to the general

improvement of the recognition accuracy of ASR systems aiming at e.g. more robust

dialogue control. As one example in [8] conﬁdence scores are applied in order to im-

prove hands-free speech based navigation in continuous dictation systems. However,

there is hardly any literature addressing the conﬁdence measure based improvement

of un-supervised speaker-adaptation. In [3] a two-pass adaptation strategy based on

word posterior probabilities used as conﬁdence measures is described. Given conﬁ-

dence scores for the hypotheses obtained from the ﬁrst recognition pass a word graph is

generated which is the base for calculating word posterior probabilities. All frames for

a word with low conﬁdence are rejected for adaptation.

The authors report improvements for the recognition accuracy of speaker-dependent

ASR systems derived by adaptation automatically supervised by evaluating conﬁdence

measures. However, the rather complex computation of conﬁdences seems problematic

for applications as addressed by this paper which aims at a simpler rejection model.

3 Log-odd Scores based Rejection

When addressing automatic speech recognition in interactive applications where com-

putational facilities are rather limited certain constraints need to be respected. This in-

cludes the best possible avoidance of multi-pass recognition procedures as well as the

restriction to approaches which are computationally simple but still provide reasonable

recognition results. The situation gets even more problematic when ASR represents

only a single component of a multi-modal processing framework.

Nevertheless, especially for dynamic application domains with changing speakers

and additional conversation out of the recognizer’s scope, i.e. lexicon and acoustic con-

ditions, additional robust un-supervised ASR adaptation is a very important but chal-

lenging task.

In these promises we apply MLLR-based adaptation to speaker-independent recog-

nizers within an online adaptation framework (cf. [4] for details). Recognition results

obtained by evaluating the base system are used for acoustic adaptation towards speaker

speciﬁc models. Comparable to the procedure presented in [3] adaptation is monitored

by conﬁdence measures, i.e. based on threshold comparison non-conﬁdent hypotheses

112

are rejected for MLLR adaptation. In our approach, however, conﬁdence measures are

computed using a normalization technique applied to the scores obtained from evaluat-

ing the acoustic model. Since this normalization can be applied in advance (see below)

its computational effort is almost neglectable.

Basically, raw scores P (x, s|λ

) obtained from Viterbi alignments of speech data

x to a particular acoustic model λ

along the most probable state path s could already

be used as conﬁdence measures. However, these scores are dependent on the lengths of

the particular utterances which prevents direct exploitation.

Inspired by their successful use in alternative applications, e.g. within the bioin-

formatics domain (cf. e.g. [9]), aiming at the detection of certain patterns modeled by

HMMs we use so-called log-odd scores as conﬁdence measures. Originally developed

for discrete HMMs the basic idea is to normalize the particular emission probabilities

) of an HMM state j for generated symbols o

to some reasonable background

distribution P (o

′

) =

)

P (o

)

(2)

When using semi-continuous (SC)HMMs [10] mixtures N (x|µ

, C

) are shared be-

tween all HMM states j and individually weighted by c

resulting in modiﬁed emis-

sion probabilities b

(x):

(x) =

k=1

N (x|µ

, C

) (3)

Technically, SCHMMs can be interpreted as discrete HMMs containing an integrated

“soft” vector quantizer where c

represent the emission probabilities of the discrete

model weighted by means of the density values N . In order to obtain conﬁdence mea-

sures these coefﬁcients c

, and thus implicitly the actual emission probabilities b

(x),

are now (cf. equation 2) normalized with respect to certain background distribution.

Furthermore, the scores are converted into the negative log-domain resulting in log-odd

scores for SCHMMs:

′

(x) = −

k=1

P (c

)

+ ln N (x|µ

, C

) (4)

Normalizing the emission probabilities implies a length normalization of the acoustic

scores P (x, s|λ

). Thus, the resulting scores can be compared directly to an absolute

threshold. Those hypotheses scoring worse than this threshold are rejected for adapta-

tion.

Basically, the background distribution P (c

) used for computing log-odd scores

corresponds to some sort of “random” model. We evaluated two different random mod-

els with respect to their usefulness for computing conﬁdence measures. First, all emis-

sion probabilities are normalized with respect to a uniform state-speciﬁc background

distribution of the mixture weights c

– referred to as Flat background model. The

second type of background model consists of the prior probabilities of the mixtures the

113

semi-continuous HMMs are based on. This corresponds to reasonably biasing the ran-

dom model to the actual statistics of the underlying mixture model and is referred to as

Prior model.

Technically the computation of conﬁdence measures based on log-odd scores can

be performed very efﬁciently. The usual calculations are just to be enhanced by the

additional but non-complex normalization step. Furthermore, this normalization can be

performed implicitly prior to any actual calculations, i.e. without increasing the com-

putational complexity for model evaluation. Therefore, the mixture weights c

of all

states of the acoustic models are converted using the normalization term of equation 4

and standard HMM evaluation is performed using the converted weights.

4 Experimental Evaluation

In order to demonstrate the effectiveness of our new log-odd scores based rejection

model aiming at improved MLLR adaptation of speaker-independent ASR systems

within dynamic environments we performed various experiments. For better repro-

ducibility and generalization we report here the results of systematic evaluations us-

ing combinations of known datasets to realistically cover the dynamic environments

as addressed by this paper. This includes sets of speaker dependent adaptation sam-

ples mixed with additional utterances out of the original recognizers’ scope (language,

lexicon etc.).

In fact this corresponds to a typical scenario for human-computer interaction with

naive users in the loop. Often certain utterances containing e.g. out of vocabulary words

are mixed with utterances which can actually be used for acoustic adaptation. The ﬁrst

kind of utterances might originate from conversation with an instructor or from the

“learning” phase of the human user not being aware of the actual lexicon of the ASR

system. According to our practical experiences with the interaction of humans and a

mobile robot [11] the quality of speaker adaptation drops substantially without detection

and proper treatment of utterances actually not suitable for adaptation.

4.1 Datasets

The ﬁrst set of experiments is directed to speaker adaptation within car environments.

Therefore, the SLACC (Spoken LAnguage Car Control) corpus [12] has been exploited.

It consists of read speech containing instructions (ca. 9 hours for training and about 100

minutes for test) for the control of non safety-relevant functions in car environments,

e.g. mobile phone or air-condition. They were recorded in different cars and environ-

ments (highway or city trafﬁc) by several speakers (lexicon size: 856 words). MLLR

adaptation is performed separately for three speakers where the particular adaptation

sets consist of SLACC utterances as well as of speech data originating from a com-

pletely different corpus, namely from the Wall-Street-Journal (WSJ0) task [13]. Non-

SLACC utterances have been selected randomly, and the ratio of WSJ0- to SLACC-

adaptation data is approximately 1:3 per speaker. Adapted recognizers are evaluated on

speaker-speciﬁc test data from SLACC (about 300 utterances per speaker).

114

Additionally, we tested our new approach directly on the WSJ0-corpus. The 5k

closed vocabulary speaker independent recognition base-system used was trained on

about 15 hours of speech and (in summary) tested on 330 utterances with approximately

40 minutes of speech. For each of the 8 speakers included the ofﬁcial adaptation sets

have been mixed with utterances randomly chosen from the SLACC-corpus (see above)

as well as from the VERBMOBIL corpus [14]. Again, the ratio of out-of-domain and

in-domain sample data is approximately 1:3.

4.2 Results

For all experiments performed the particular training data is used in order to set up

a speaker-independent recognition system consisting of semi-continuous HMMs and

mixture densities with diagonal covariances. Furthermore, for the WSJ0-system a Bi-

gram model is incorporated. The annotation necessary for MLLR adaptation is obtained

automatically using the particular base system. Given the conﬁdence measures com-

puted as log-odd scores based on the acoustic model evaluation, and the comparison

to an absolute threshold the resulting hypotheses are either respected for adaptation or

explicitly rejected. Note that, actually, hypotheses are rejected which not necessarily

corresponds to the rejection of complete utterances. All experiments have been per-

formed using our own HMM toolkit ESMERALDA [15].

In tables 1 and 2, respectively, the results for the experimental evaluation of the

speaker-dependent systems obtained using MLLR adaptation and our new rejection

model based on log-odd scores are summarized.

The ﬁgures reported have been av-

eraged over speaker-wise evaluations.

Table 1. Results for SLACC-based evaluation (WER for “clean” adaptation sets without rejec-

tion: 20.9%).

Threshold for Rejection

WER ∆ WER Rejection

[− ln(P (O|λ))]

[%] [%] [%]

No Rejection 27.7 – –

(Base)

Flat Background Model

-100 22.6 -18.4 87.0

-50

24.7 -10.8 63.3

27.7 0.0 7.00

Prior Background Model

-100 22.6 -18.4 87.7

-50

25.1 -9.4 62.7

27.8 +0.4 5.6

Note that baseline experiments without any adaptation showed no signiﬁcant differences be-

tween results obtained when using log-odds scores or their un-normalized counterparts.

115

Table 2. Results for WSJ0-based evaluation using the rejection threshold determined in SLACC-

based experiments.

Background Model

WER ∆ WER Rejection

– threshold: -100 –

[%] [%] [%]

None 15.1 – –

(Base without Rejection)

Flat 12.7 -15.9 70.4

Prior

12.6 -16.6 71.4

The application of our rejection model based on the simple log-odd scores based

conﬁdence measures substantially improves the recognition accuracy of speaker-

dependent recognizers derived on poor adaptation sets. When rejecting all hypotheses

for adaptation with log-odd scores larger than −100 the word-error rate (WER) can be

reduced by more then 18% relative for the SLACC-task (compared to standard adapta-

tion without rejection). When using this absolute threshold for the WSJ0-task the WER

decreases by more than 16% relative.

The differences between the particular background models are almost negligible.

For both Prior- and Flat-background model the abovementioned substantial reductions

in WER can be reached.

Comparing the results obtained when processing poor adaptation sets using our

rejection model to those related to model specialization exploiting “clean” adapta-

tion samples, the effectiveness of our newly developed approach becomes manifest.

As demonstrated for SLACC-experiments when applying our rejection model on poor

adaptation data the results for optimal conditions, i.e. using “clean” adaptation data,

can almost be reached. Thus, even in dynamic environments as addressed by this paper,

speaker-dependent recognition systems can be obtained very robustly and efﬁciently by

applying MLLR adaptation and the new rejection model based on log-odd scores to a

speaker-independent base system.

5 Summary

The existence of robust automatic speech recognition systems is of major importance

for probably all kinds of natural language processing research within cognitive science.

As great ideal human listeners are able to understand speech even in noisy environments

or from unknown speakers.

MLLR adaptation has been established as state-of-the-art for the improvement of

the recognition accuracy of automatic speech recognition systems based on HMMs.

However, according to our experiences within the domain of human-computer interac-

tion, especially in dynamic application domains with naive speakers in the loop adapta-

tion may fail due to improper adaptation data. This data often originates from conver-

sation out of the particular recognizer’s scope, i.e. containing words beyond the lexicon

of the original recognition system or utterances of poor acoustic quality. Unfortunately,

116

when ASR represents only one part of a complex (e.g. multi-modal) interaction system

computational facilities for adaptation are rather limited.

Addressing such interactive applications with computational restrictions we pre-

sented a rejection model based on the evaluation of conﬁdence scores applying log-odd

scores for semi-continuous Hidden Markov Models. For this simple normalization tech-

nique, which can be applied in advance thus maximally limiting the additional compu-

tational effort, the ratio of actual acoustic scores as obtained by HMM evaluation to a

reasonable background model is computed. Based on these values hypotheses’ scores

can directly be compared to an absolute threshold and rejected for adaptation if neces-

sary. Two variants of the background model based either on a uniform distribution of

mixture coefﬁcients involved or their prior probabilities have been investigated.

The effectiveness of our approach has been demonstrated by means of experimental

evaluations on two challenging tasks. Therefore, MLLR adaptation has been applied

to speaker-independent base systems processing data sets containing both in-domain

and out-of domain utterances. This corresponds to a very common scenario in e.g.

human-robot interaction where often adaptation data out-of-the-scope of the recognizer

(lexicon etc.) need to be processed. When applying the rejection model the adaptation

process for interactive speech recognition applications can be improved substantially.

References

1. Huang, X., Acero, A., Hon, H.: Spoken Language Processing – A Guide to Theory, Algo-

rithm, and System Development. Prentice Hall PTR (2001)

2. Leggetter, C.J., Woodland, P.C.: Maximum likelihood linear regression for speaker adapta-

tion of continuous density Hidden Markov Models. Computer Speech & Language (1995)

171–185

3. Pitz, M., et al.: Improved MLLR speaker adaptation using conﬁdence measures for conver-

sational speech recognition. In: Int. Conf. Spoken Lang. Proc. (2000)

4. Plötz, T., Fink, G.A.: Robust time-synchronous environmental adaptation for continuous

speech recognition systems. In: Int. Conf. Spoken Lang. Proc. Volume 2. (2002) 1409–1412

5. Zhang, Z., Furui, S., Ohtsuki, K.: On-line incremental speaker adaptation with automatic

speaker change detection. In: Proc. Int. Conf. on Acoustics, Speech, and Signal Processing.

(2000)

6. Wessel, F., Schlüter, R., Macherey, K., Ney, H.: Conﬁdence measures for large vocabulary

continuous speech recognition. IEEE Trans. on Speech and Audio Processing 91 (2001)

7. Chase, L.: Word and acoustic conﬁdence annotation for large vocabulary speech recognition.

In: Proc. European Conf. on Speech Communication and Technology. (1997)

8. Feng, J., Sears, A.: Using conﬁdence scores to improve hands-free speech-based navigation

in continuous dictation systems. ACM Transactions on Computer-Human Interaction 11

(2004) 329–356

9. Durbin, R., Eddy, S., Krogh, A., Mitchison, G.: Biological sequence analysis: Probabilistic

models of proteins and nucleic acids. Cambridge University Press (1998)

10. Huang, X.D., Jack, M.A.: Semi-continuous Hidden Markov Models for speech signals. Com-

puter Speech & Language 3 (1989) 239–251

11. Haasch, A., et al.: BIRON – The Bielefeld Robot Companion. In: Proc. Int. Workshop on

Advances in Service Robotics, Fraunhofer IRB Verlag (2004) 27–32

12. Schillo, C.: Der SLACC Korpus. Technical report, Faculty of Technology, Bielefeld Univer-

sity (2001)

117

13. Paul, D.B., Baker, J.M.: The design for the Wall Street Journal-based CSR corpus. In:

Speech and Natural Language Workshop. (1992)

14. Kohler, K., et al.: Handbuch zur Datenaufnahme und Transliteration in TP 14 von VERB-

MOBIL – 3.0. Technical Report 11, Institut für Phonetik und digitale Sprachverarbeitung,

Universität Kiel (1994)

15. Fink, G.A.: Developing HMM-based recognizers with ESMERALDA. In: Text, Speech

and Dialogue. Volume 1692 of Lecture Notes in Artiﬁcial Intelligence. Springer, Berlin

Heidelberg (1999) 229–234

118