MODEL-MAPPING BASED VOICE CONVERSION SYSTEM
A Novel Approach to Improve Voice Similarity and Naturalness using Model-based
Speech Synthesis Techniques
Baojie Li, Dalei Wu and Hui Jiang
Department of Computer Science and Engineering, York University, 4700 Keele Street, Toronto, Ontario M3J 1P3, Canada
Keywords:
Voice conversion, HMM-based speech synthesis, GMM, Model mapping.
Abstract:
In this paper we present a novel voice conversion application in which no knowledge of source speakers is
available; only sufficient utterances from a target speaker and from a number of other speakers are at hand. Our
approach consists of two separate stages. At the training stage, we estimate a speaker dependent (SD) Gaussian
mixture model (GMM) for the target speaker and additionally, we also estimate a speaker independent (SI)
GMM by using the data from a number of speakers other than the source speaker. A mapping correlation
between the SD and the SI model is maintained during the training process in terms of each phone label. At
the conversion stage, we use the SI GMM to recognize each input frame and find the closest Gaussian mixture
for it. Next, according to a mapping list, the counterpart Gaussian of the SD GMM is obtained and then
used to generate a parameter vector for each frame vector. Finally all the generated vectors are concatenated
to synthesize speech of the target speaker. By using the proposed model-mapping approach, we can not
only avoid the over-fitting problem by keeping the number of mixtures of the SI GMM to a fixed value, but
also simultaneously improve voice quality in terms of similarity and naturalness by increasing the number of
mixtures of the SD GMM. Experiments showed the effectiveness of this method.
1 INTRODUCTION
Voice conversion (VC) is a technique that con-
verts voice of a source speaker to that of a tar-
get speaker. Generally speaking, text-dependent and
text-independent voice conversion represent two main
streams of research directions. In text-dependent
voice conversion, the target voice can be produced with
high correctness and acceptable smoothness
based on the transcription provided for the input
speech waveform, e.g. (Yoshimura, 2002). By con-
trast, text-independent systems have no knowledge
about the transcription of input waveform, therefore
more mismatches between source and target speakers
are present and the quality of the generated speech
then degrades. For this reason, text-independent voice
conversion attracts a wider range of studies. The tech-
niques presented in this paper are also focused on
text-independent voice conversion.
In the field of text-independent voice conversion,
usually some forms of transforms are estimated from
training data of both source and target speakers, such
as K-means clustering in VTLN-based voice conver-
sion (Sundermann and Ney, 2003), codebook based map-
ping (Arslan and Talkin, 1999) and GMM based cluster-
ing (Ye and Young, 2006). In some applications, how-
ever, no knowledge about source speakers is avail-
able beforehand. Therefore it is impossible to esti-
mate the transforms between source and target speak-
ers using the conventional techniques. In our previ-
ous work, we built a GMM-based VC system using
hidden Markov model (HMM) based speech synthe-
sis to address such particular requirements. At the
training stage, a SD GMM is trained for the target
speaker using his/her pre-recorded training data. In
the conversion stage, for each utterance from a source
speaker, the best matched Gaussian mixture is cho-
sen from the GMM. Next, the mean vectors of the se-
lected mixtures are concatenated, smoothed and then
sent as inputs to the sound synthesizer, which is pro-
vided by the HTS engine (Tokuda et al., 2000; Yoshimura
et al., 2002). Through experiments, we found that this ap-
proach was quite capable of conducting voice conver-
sion with acceptable quality. However, we also found
recognizable discontinuity and flatness in the synthe-
sized voices. Through investigation, we found that the
discontinuity was attributed to frame mismatches,
while the flatness was caused by the use of averages of train-
ing samples. To cope with these two problems, we
propose in this article to improve our VC system by
introducing additional SI models and model-mapping
technique. We confirmed by experiments that the new
method was quite effective in increasing the quality of
the generated speech.
The rest of this paper is organized as follows: we
begin with an introduction to our previous GMM-
based VC system in Section 2. To deal with the dis-
continuity and flatness problems, a model-mapping
strategy is introduced in Section 3, where its effec-
tiveness is evaluated by experiments. We draw the
conclusions in Section 4.
2 GMM-BASED VOICE
CONVERSION SYSTEM
2.1 HMM-based Speech Synthesis
Technique
Our voice conversion system is based on HMM-
based speech synthesis (HTS) techniques (Tokuda
et al., 2000; Yoshimura et al., 2002). The HMM-
based speech synthesis system assumes that suffi-
cient training data and their corresponding transcrip-
tions are available. The system models phonetic and
prosodic parameters simultaneously. At the train-
ing stage, Mel-frequency cepstral coefficients (MFCCs), fun-
damental frequency (F0) and duration are modeled
by multiple-stream HMMs. At the synthesis stage,
a given string of words is firstly decomposed into
a string of phonemes. The system then searches in
the HMM pools the corresponding model for each
phoneme. The mean vector of each model is taken
as a frame to represent that phoneme. Later, all these
frames are then concatenated according to their time
order and passed to a smoother, which improves the
quality of the synthesized speech by smoothing over
the whole sequence of frames using dynamic param-
eters (e.g., delta and delta delta) of MFCCs and F0.
The smoothed frame sequence is then input to the
synthesizer, known as the MLSA filter (Tokuda et al., 2000), to
produce speech waves.
For text-to-speech tasks, HTS has been demon-
strated to generate high-quality voices, especially
in terms of continuity and naturalness. However, in
our current task, neither the transcriptions for the in-
put waveforms nor the knowledge about the source
speaker is available. Many conventional transform-
based VC approaches are inapplicable in such a situation. Hence we were motivated to combine speech recognition and synthesis techniques to conduct voice conversion.

Figure 1: GMM-based voice conversion system. (Diagram: at the training stage, parameters (MFCC, F0) are extracted from the target speaker's speech database and Gaussian mixtures are estimated; at the conversion stage, parameters extracted from the source speaker drive model selection, dynamic feature smoothing and the MLSA filter to generate waveforms.)
2.2 Overview of Our VC System
The diagram of our previously proposed VC system
is shown in Figure 1. Our VC system consists of two
stages. At the first stage, parameter vectors are ex-
tracted from training data of the target speaker. Then
Gaussian mixture models are estimated. The mixtures
carry no phoneme labels, only sequential numbers. At
the conversion stage, a mixture is chosen for each in-
put frame from the GMM set according to distance
or probability metrics. Then the mean vectors of the
selected mixtures are concatenated to form an output
sequence of frames. After being smoothed using
the dynamic features, the sequence of frames is
input to the MLSA filter to generate the speech wave.
2.3 Feature Extraction
The feature vectors are extracted from speech wave-
forms using HTK 3.4 (Young et al., 2006) and
SPTK (Imai et al., 2008). Each vector includes 25-
dimensional MFCCs and 1-dimensional pitch information
(F0), along with their dynamic features. The vector is
divided into four streams: MFCCs and their dynam-
ics, F0, delta F0 and delta delta F0. All streams take
uniform weights.
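For illustration only, the following minimal NumPy sketch shows one way the four-stream frame layout described above could be represented; the dimension ordering and the helper name split_streams are our assumptions, not the exact HTK/SPTK configuration used here.

```python
import numpy as np

# Hypothetical layout: 25 static MFCCs plus their deltas and delta-deltas
# (75 dims), followed by F0, delta F0 and delta-delta F0 (1 dim each).
MFCC_DIM = 25

def split_streams(frame: np.ndarray) -> dict:
    """Split one feature frame into the four streams, all uniformly weighted."""
    assert frame.shape[0] == 3 * MFCC_DIM + 3   # 75 spectral dims + 3 F0 dims
    return {
        "mfcc":      (frame[:3 * MFCC_DIM], 1.0),  # statics + dynamics
        "f0":        (frame[75:76],         1.0),
        "delta_f0":  (frame[76:77],         1.0),
        "delta2_f0": (frame[77:78],         1.0),
    }
```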
2.4 Model Estimation
The GMM for the target speaker is estimated using the
standard EM algorithm. Since correlations
between F0 and MFCCs introduce inaccuracy
into the probability calculations when a diagonal covari-
ance matrix is used in the recognition process, only the
MFCCs are used in the mixture clustering process. Af-
ter the mixtures are obtained, all the training data are
aligned into the mixtures. The average of F0 over all
frames in a mixture is set as the mean of this mixture.
More specifically, assume we have $N$ frames $\{x_n\}$ and $K$ mixtures. Each mixture has a probability distribution $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ with a weight of $w_k$. Then the contribution of frame $n$ to mixture $k$ is

$$\gamma_{nk} = \frac{w_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} w_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} \quad (1)$$

and the total number of samples accumulated in this mixture is $N_k = \sum_{n=1}^{N} \gamma_{nk}$. Note that only the MFCC part of vector $x_n$ is used in the above calculation. We then obtain the F0 parameters for this mixture as

$$\mu_k^{F0} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} \, x_n^{F0} \quad (2)$$

where $\mu_k^{F0}$ is the F0 part of $\mu_k$ and $x_n^{F0}$ is the F0 part of $x_n$.
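For illustration, a minimal NumPy/SciPy sketch of Eqs. (1)-(2) follows, assuming diagonal covariances; the function name f0_means and all variable names are ours, not the authors'.

```python
import numpy as np
from scipy.stats import multivariate_normal

def f0_means(x_mfcc, x_f0, weights, means, variances):
    """x_mfcc: (N, D) MFCC frames; x_f0: (N,) F0 values.
    weights: (K,); means, variances: (K, D) diagonal Gaussians."""
    N, K = x_mfcc.shape[0], weights.shape[0]
    gamma = np.zeros((N, K))
    for k in range(K):
        # Weighted likelihood of every frame under mixture k (MFCC part only)
        gamma[:, k] = weights[k] * multivariate_normal.pdf(
            x_mfcc, mean=means[k], cov=np.diag(variances[k]))
    gamma /= gamma.sum(axis=1, keepdims=True)   # Eq. (1): responsibilities
    n_k = gamma.sum(axis=0)                     # N_k, soft counts per mixture
    return gamma.T @ x_f0 / n_k                 # Eq. (2): per-mixture F0 mean
```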
2.5 Model Selection
In the conversion stage, each input frame from a source speaker is evaluated on the GMM of the target speaker (excluding the F0 parameters). The mixture with the highest probability is selected as the output:

$$l_t = \arg\max_k \, w_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) \quad (3)$$

where $\mu_k$ and $\Sigma_k$ are the mean and covariance of the $k$-th Gaussian component, respectively.
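A minimal sketch of Eq. (3), again with assumed names and diagonal covariances, selecting the best mixture in the log domain for numerical stability:

```python
import numpy as np
from scipy.stats import multivariate_normal

def select_mixture(x, weights, means, variances):
    """Return the index l_t of the mixture with the highest weighted
    likelihood for one input MFCC frame x (F0 dimensions excluded)."""
    log_p = [np.log(w) + multivariate_normal.logpdf(x, mean=m, cov=np.diag(v))
             for w, m, v in zip(weights, means, variances)]
    return int(np.argmax(log_p))
```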
2.6 Dynamic Features Smoothing
All the mean vectors of the selected Gaussian com-
ponents are concatenated to form a sequence. The
sequence is smoothed according to its delta and delta
delta information in the same way as in HMM-based
speech synthesis (Tokuda et al., 2000; Yoshimura et al., 2002). Then a new sequence consisting of only static MFCC and static F0 features is obtained. We denote the static, delta and delta-delta features at time $t$ as $(c_t, \Delta c_t, \Delta^2 c_t)$, where the dynamic features are defined as

$$\Delta c_t = \sum_{\tau=-L_-^{(1)}}^{+L_+^{(1)}} w^{(1)}(\tau)\, c_{t+\tau} \quad (4)$$

$$\Delta^2 c_t = \sum_{\tau=-L_-^{(2)}}^{+L_+^{(2)}} w^{(2)}(\tau)\, c_{t+\tau} \quad (5)$$

where $w^{(j)}(\tau)$, $j \in \{1, 2\}$, are weight coefficients and $\tau$ is a time offset. The new static feature sequence $C = [c_1^\top, c_2^\top, \cdots, c_T^\top]^\top$ can be estimated by solving a system of linear equations as follows:

$$C = (W^\top U^{-1} W)^{-1} W^\top U^{-1} S \quad (6)$$

where

$$S = [\mu_{l_1}^\top, \mu_{l_2}^\top, \cdots, \mu_{l_T}^\top]^\top \quad (7)$$

$$U^{-1} = \mathrm{diag}[\Sigma_{l_1}^{-1}, \Sigma_{l_2}^{-1}, \cdots, \Sigma_{l_T}^{-1}] \quad (8)$$

$$W = [w_1, w_2, \cdots, w_T]^\top \quad (9)$$

$$w_t = [w_t^{(0)}, w_t^{(1)}, w_t^{(2)}] \quad (10)$$

$$w_t^{(n)} = [0_{M\times M}, \cdots, 0_{M\times M}, w^{(n)}(-L_-^{(n)}) I_{M\times M}, \cdots, w^{(n)}(0) I_{M\times M}, \cdots, w^{(n)}(L_+^{(n)}) I_{M\times M}, 0_{M\times M}, \cdots, 0_{M\times M}]^\top, \quad n = 0, 1, 2 \quad (11)$$

where $M$ is the number of dimensions of the static MFCC features (without dynamic features) and $l_t$ denotes the index of the best-matched Gaussian component, selected as in Eq. (3).
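For illustration, a dense NumPy sketch of solving Eq. (6) for a single feature dimension ($M = 1$) follows; the window coefficients are the common HTS defaults and are an assumption here, as are all variable names.

```python
import numpy as np

def mlpg(mu, var):
    """mu, var: (T, 3) stacked static/delta/delta-delta means and variances
    of the selected Gaussians (Eqs. (7)-(8)). Returns smoothed statics c_1..c_T."""
    T = mu.shape[0]
    delta  = np.array([-0.5, 0.0, 0.5])   # assumed w^(1) window
    delta2 = np.array([1.0, -2.0, 1.0])   # assumed w^(2) window
    W = np.zeros((3 * T, T))              # Eqs. (9)-(11), M = 1
    for t in range(T):
        W[3 * t, t] = 1.0                 # static row: w^(0) is the identity
        for j, tau in enumerate((-1, 0, 1)):
            if 0 <= t + tau < T:          # truncate windows at the edges
                W[3 * t + 1, t + tau] = delta[j]
                W[3 * t + 2, t + tau] = delta2[j]
    u_inv = 1.0 / var.reshape(-1)         # Eq. (8): diagonal of U^-1
    s = mu.reshape(-1)                    # Eq. (7): stacked means
    A = W.T @ (u_inv[:, None] * W)        # W^T U^-1 W
    b = W.T @ (u_inv * s)                 # W^T U^-1 S
    return np.linalg.solve(A, b)          # Eq. (6)
```

In practice this system is banded, so a banded or Cholesky solver would be used instead of the dense solve shown here.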
2.7 Experiments
Experiments were conducted on The Continuous
Speech Recognition Wall Street Journal Phase I
(CSR-WSJ0) Corpus (Linguistic Data Consortium).
Speaker 001 (male, 598 utterances) and 002 (female,
600 utterances) were used alternately as source and
target speakers. 1720 Gaussian mixtures were trained
for each target speaker. Subjective listening tests on
the converted utterances were conducted: five listeners
were each given ten generated utterances per target
speaker and asked to assign each utterance a score
ranging from 1 to 5. The average score of each listener
across the ten sentences is listed in Table 1.
Table 1: Subjective evaluation scores for GMM VC.
Listener id 1 2 3 4 5 Ave
naturalness 2.9 3.4 2.6 3.8 3.2 3.2
similarity 2.9 3.2 2.3 3.2 2.0 2.7
We found that the quality of the generated voices was
acceptable: most of the literal content was identifiable.
However, there was also noticeable discontinuity
and flatness in the voices.
3 IMPROVE VOICE SIMILARITY
AND NATURALNESS BY
MODEL MAPPING
By investigating the output frames in the previous ex-
periments, we noticed that mis-selection of mod-
els resulted in remarkable discontinuity, and that the use
of mean vectors of Gaussian mixtures made the gen-
erated speech sound flat. More specifically,
each mean vector of a Gaussian mixture used for
speech synthesis was an average over a set of training
frames, and this averaging effaced the detailed char-
acteristics of the target speaker. We tried to increase
the number of Gaussian mixtures to capture more de-
tails of the target speaker. However, when a certain
number was exceeded, the recognition performance
started to decrease, thus degrading the quality of the
synthesized voice. This was because, as the number
of mixtures went up, the trained models fitted
the training data perfectly but represented the
test data poorly, resulting in bad recognition
performance. This behaviour is known as over-fitting.
To solve this problem, we introduced an additional
model set for recognition, which was a SI model set
trained from a number of speakers.
In this method, both the SI models and the tar-
get SD models were trained for all phonemes. Each
phoneme model consists of a number of Gaussian
mixtures. Therefore, between the SI and SD model
sets, we could construct a mapping list by the model
labels. The modified VC system is depicted in Fig-
ure 2. In the conversion stage, we first searched over
the SI GMMs to find the closest model to the input
frame. Then according to the mapping list, its coun-
terpart model in the target speaker model set was se-
lected as the output model. This scheme enabled us
to capture more detailed characteristics of the target
speaker by increasing the number of mixtures of tar-
get speaker models, without losing high recognition
performance for the input frames by keeping the num-
ber of mixtures of SI models fixed.
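For illustration, a minimal sketch of the label-based mapping list; the PhonemeGMM class and its field names are hypothetical, not the authors' data structures.

```python
from dataclasses import dataclass

@dataclass
class PhonemeGMM:
    label: str      # phoneme label, e.g. "ah"
    mixtures: list  # per-mixture (weight, mean, variance) tuples

def build_mapping(si_models: list, sd_models: list) -> dict:
    """Map each SI phoneme model to its SD counterpart by shared label.
    The SI set keeps a fixed mixture count for robust recognition, while
    the SD set may use many more mixtures per phoneme."""
    sd_by_label = {m.label: m for m in sd_models}
    return {m.label: (m, sd_by_label[m.label]) for m in si_models
            if m.label in sd_by_label}
```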
3.1 Reduce Mismatch in Recognition
To recognize frames from an unknown speaker, the
SI model set performs much better than the target
speaker model set. Given an input frame $x$ from the source speaker, its likelihood is first evaluated on the SI model set:

$$p(x) = \sum_{j=1}^{K} w_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j) \quad (12)$$
and the frame is assigned to the phoneme model with the highest probability, e.g., model ah.

Figure 2: The modified voice conversion system. (Diagram: parameter vectors from the source speaker are matched against the SI phoneme models, e.g., "a" and "b"; the mapping list links each SI phoneme model to the corresponding SD model of the target speaker, whose Gaussian mixtures generate the output parameter vectors.)

Then its counterpart in the target model set can be found according to the mapping list. Next we try to find the best matched mixture among all the mixtures belonging to the target SD model ah. To alleviate the model mismatch, we shift the input frame according to
$$y = \mu^{SD} + \sqrt{\frac{\Sigma^{SD}}{\Sigma^{SI}}}\,\left(x - \mu^{SI}\right) \quad (13)$$
Assuming each model in the target model set consists of $K$ mixtures, the nearest mixture to vector $y$ is selected by minimizing the normalized Euclidean distance:

$$l^* = \arg\min_k \sum_{m=1}^{M} \frac{(y_m - \mu_{k,m})^2}{\Sigma_{k,m}} \quad (14)$$

where $M$ is the dimensionality of the MFCCs and $\mu_{k,m}$ is the $m$-th dimension of the mean vector of the $k$-th mixture, with $\Sigma_{k,m}$ the corresponding variance.
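A minimal sketch of Eqs. (13)-(14), assuming phoneme-level means and variances for the shift, diagonal covariances, and illustrative names throughout:

```python
import numpy as np

def select_sd_mixture(x, mu_si, var_si, mu_sd, var_sd, sd_means, sd_vars):
    """x: input MFCC frame; (mu_si, var_si) / (mu_sd, var_sd): phoneme-level
    statistics of the matched SI model and its mapped SD model;
    sd_means, sd_vars: (K, M) mixtures of the SD phoneme model."""
    # Eq. (13): shift the frame from the SI model's space toward the SD model
    y = mu_sd + np.sqrt(var_sd / var_si) * (x - mu_si)
    # Eq. (14): normalized Euclidean distance to every SD mixture
    dists = np.sum((y - sd_means) ** 2 / sd_vars, axis=1)
    return int(np.argmin(dists))   # index of the nearest SD mixture
```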
Experiments were conducted using the same cor-
pus as in Section 2.7. 7138 utterances from SI-84
speakers were used to train the SI GMMs. Each tri-
phone model consisted of 10 mixtures (after mixture
sharing, the total number of mixtures was 1720, the
same as in Section 2.7). The same scoring crite-
rion was applied. The results are listed in Table 2.
Table 2: Subjective evaluation scores for modified VC.
Listener id 1 2 3 4 5 Ave
naturalness 3.2 4.1 3.4 4.3 3.7 3.7
similarity 3.0 3.1 2.8 3.3 2.5 2.9
The SI model did improve the naturalness of
the generated speech. The speech sounded smoother.
3.2 Improve Similarity by Increasing
Number of Mixtures of SD Models
In the following experiments, we kept the number of
mixtures of the SI models fixed to maintain good
recognition performance, while increasing the num-
ber of mixtures of the target models to capture the
detailed characteristics of the target speaker. This is
based on the assumption that even if a target SD mixture
lacks training data, it is still capable of represent-
ing the voice of the target speaker, since it was estimated
from real samples of that speaker. Experiments were
conducted by fixing the SI models to 1720 mixtures
while training the target models with different num-
bers of mixtures. The evaluation scores are listed
in Table 3 and Table 4.
Table 3: Naturalness for different number of mixtures.
Listener Id 1 2 3 4 5 Ave
1720 mix 3.2 4.1 3.4 4.3 3.7 3.7
3440 mix 4.1 4.3 3.6 3.9 4.1 4.0
6880 mix 3.5 4.3 4.0 4.2 3.7 3.9
13760 mix 2.8 3.8 3.4 3.7 3.6 3.5
Table 4: Similarity for different number of mixtures.
Listener Id 1 2 3 4 5 Ave
1720 mix 3.0 3.1 2.8 3.3 2.5 2.9
3440 mix 3.2 3.8 3.2 4.4 3.7 3.6
6880 mix 3.8 3.7 4.2 4.0 3.9 3.9
13760 mix 3.3 4.1 3.1 4.3 3.3 3.6
As shown in the tables, increasing the number of
mixtures not only brought improvement in natural-
ness, but also improved similarity remarkably. As
long as the number of mixtures was not pushed to an extreme,
the quality of the generated voice improved greatly.
4 CONCLUSIONS
In this paper, we have presented a new approach for
voice conversion. By introducing a SI model set into
the GMM VC system, naturalness of the converted
voice was improved due to the improved recogni-
tion performance. Moreover the new system could
use more mixtures in target models for parameter
generation, therefore more detailed features of input
speeches could be captured to improve similarity. In
comparison with the original GMM VC system, both
naturalness and similarity were improved.
REFERENCES
T. Yoshimura, "Simultaneous Modeling of Phonetic and Prosodic Parameters and Characteristic Conversion for HMM-based Text-to-Speech Systems", Ph.D. dissertation, Nagoya Institute of Technology, 2002.
D. Sundermann and H. Ney, "VTLN-based Voice Conversion", Proc. of Signal Processing and Information Technology (ISSPIT), pp. 14-17, Dec. 2003.
K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi and T. Kitamura, "Speech Parameter Generation Algorithms for HMM-based Speech Synthesis", Proc. of ICASSP, pp. 1315-1318, June 2000.
L. Arslan and D. Talkin, "Speaker Transformation Algorithm using Segmental Codebooks (STASC)", Speech Communication, pp. 211-226, 1999.
H. Ye and S. Young, "Quality-enhanced Voice Morphing Using Maximum Likelihood Transformation", IEEE Trans. on Audio, Speech and Language Processing, Vol. 14, No. 4, pp. 1301-1312, 2006.
S. Young et al., "The HTK Book (Version 3.4)", Cambridge University Engineering Department, 2006.
S. Imai et al., "Speech Signal Processing Toolkit (SPTK) Ver. 3.2", http://sp-tk.sourceforge.net, 2008.
Linguistic Data Consortium, "The LDC Corpus Catalog", http://www.ldc.upenn.edu/Catalog/