Simultaneous Flexible Keyword Detection and Text-dependent Speaker
Recognition for Low-resource Devices
Hiroshi Fujimura, Ning Ding, Daichi Hayakawa and Takehiko Kagoshima
Toshiba Research and Development Center, Komukai-Toshiba-cho 1, Saiwai-ku, Kawasaki, 212-8582 Japan
Keywords:
Keyword Detection, Speaker Identification, Low Resource Device, Bayesian.
Abstract:
This paper proposes a new method for simultaneous flexible keyword detection and text-dependent speaker
identification using a recognized keyword. The purpose is to identify a speaker from among a set of pre-
registered speakers on the basis of a short-command utterance in an office or home on low-resource chip
devices. The first contribution is to construct the process that includes a neural network (NN) and a customized
Viterbi-based algorithm for flexible keyword detection, and Gaussian mixture models (GMMs) for speaker
identification. Outputs of a middle layer in the NN and alignment information for keyword detection are also
used for creating feature vectors for speaker GMMs. The second contribution is to apply DropConnect, in the manner of Bayesian-NN uncertainty modeling, to the speaker-modeling step of speaker recognition. It results in robust
speaker models when enrollment utterances are few. Evaluation was conducted using 39 Japanese keywords
by 100 speakers. Recognition performance was measured on the basis of false acceptances and false rejects
using keyword utterances. Speaker identification for 100 pre-registered speakers for recognized keywords was
simultaneously evaluated. The identification rate when using a conventional i-vector method was 71.22%. By
contrast, the identification rate of the proposed method was 89.29% while using low-cost resources.
1 INTRODUCTION
In recent years, there has been an increasing de-
mand for keyword detection and speaker identifica-
tion working together as part of a user interface for
small devices (Wu et al., 2018; Chen et al., 2014; Var-
iani et al., 2014; Chen et al., 2015b). They are being
developed for smart speakers, home appliances, and
other devices in the home or the office. Some are im-
plemented on an embedded chip. Using wake-up words tends to make a system resistant to false-acceptance errors.
However, direct control of certain specific devices by
voice command without the use of wake up words is
more convenient. Moreover, many convenient appli-
cations can be created if a device identifies a speaker.
For example, a user only has to say "Turn on A/C" to start an air conditioner at that user's preferred comfortable temperature. In this kind of scenario, speaker identification only has to select one speaker from among a set of registered speakers using the recognized keyword utterance.
Many studies treat keyword detection and speaker identification independently. However, the two should be considered simultaneously when implementing them on low-resource devices in the real world, and doing so is important from both theoretical and application perspectives.
An i-vector representation is widely used for
speaker identification (Dehak et al., 2011). However,
an i-vector is generally designed to be used on speech
lasting for at least 2 s (Kanagasundaram et al., 2011;
Poddar et al., 2015; Tsujikawa et al., 2017). On the
other hand, the duration of keywords for controlling
devices is usually less than about 1 s. Thus, it is difficult to achieve sufficient speaker-identification accuracy on such keywords with an i-vector method.
Some methods for applying i-vectors to short
keywords using a training data-set based on target
keywords have been proposed (Variani et al., 2014;
Larcher et al., 2012). Another approach for speaker
identification using a short keyword is to introduce
neural network (NN)-based features (Variani et al.,
2014; Chen et al., 2015b; Chen et al., 2015a; Sny-
der et al., 2018). In (Variani et al., 2014; Chen et al.,
2015b), a NN for speaker identification is first con-
structed for a specific keyword using substantial data
on the keyword. Then, the NN outputs probabilities
for all speaker IDs. During enrollment and evalua-
tion, the method calculates output activations of the
last hidden layer that includes speaker information for
the purpose of extracting feature vectors. These conventional methods require a training data-set based on the target keywords. Therefore, the keyword set must be fixed. This restriction prevents the methods from being applied to applications that require keywords customized by users or changed dynamically by service providers. The computational cost of running additional NNs for speaker identification (Variani et al., 2014; Chen et al., 2015b; Chen et al., 2015a; Snyder et al., 2018) is also a critical issue for low-resource devices.
As a different approach to speaker identification, the hierarchical multi-layer acoustic model (HiLAM) (Larcher et al., 2014) has been proposed. It is based on a text-dependent hidden Markov model (HMM) with Gaussian mixture models (GMMs) that exploits keyword alignment information. Speaker-specific HMMs are trained using only enrollment data, so the method does not need a training data-set based on the target keywords. However, in addition to the calculation cost of keyword detection, the cost of evaluating HMMs for every speaker is too high for low-resource devices. Therefore, we focus on simple GMM methods (Hebert and Heck, 2003; Sturim et al., 2002) in this paper.
As described above, many studies focus on speaker identification based on recognition of a specific keyword for which a sufficient training data-set is available, and they consider keyword detection and speaker identification separately. In this paper, we construct a consistent method of simultaneous flexible keyword detection and text-dependent speaker identification while maintaining a low computational cost. Furthermore, the method creates robust speaker models within the proposed algorithm.
The first contribution is to propose a consistent
and efficient algorithm for simultaneous flexible key-
word detection and text-dependent speaker identifica-
tion for low-resource devices. A low-cost calculation
method for flexible keyword detection is introduced
using a customized Viterbi algorithm (Nasu, 2016)
with NN scores. By adopting an algorithm that focuses only on keyword detection, the calculation cost can be reduced compared with the conventional Viterbi algorithm. To avoid increasing the computational cost of
speaker identification, the proposed method employs
middle-layer outputs in the flexible keyword detec-
tion NN as feature vectors. This method does not
need the data-set based on the target keywords pre-
pared in advance. GMMs are constructed using the
feature vectors for registered speakers. Finally, likeli-
hoods of keyword utterances output by the GMMs are
compared.
The second contributionis to create robust speaker
models on the proposed algorithm. Speaker identi-
fication using only registered utterances by users is
a kind of few-shot learning, in which it is important
to reduce the amount of training data without perfor-
mance degradation.
During the enrollment stage for creating speaker
models, a user is required to utter keywords multiple
times. To reduce the user's burden, the number of ut-
terances during the enrollment stage should be mini-
mized. Even so, the speaker models must sufficiently
cover variation in the environment and user speech.
To overcome the problem, the proposed method uses
the DropConnect method (Wan et al., 2013) when cre-
ating speaker GMMs. Normally, DropConnect is ap-
plied to NN training for creating robust NN models.
In this paper, it is used for estimating uncertainties of
GMMs in speaker modeling. The proposed method
augments feature vectors of middle-layer outputs us-
ing the DropConnect method to cover variation. It is
based on Bayesian theory for uncertainty (Gal, 2016).
Finally, speaker GMMs are created using the aug-
mented feature vectors to estimate the uncertainty.
This method does not affect the calculation cost of the speaker-identification process, so it maintains low-resource computation.
2 OVERVIEW OF THE
PROPOSED METHOD
Figure 1 shows an overview of the enrollment pro-
cess. First, each frame feature vector is input to a NN
for keyword detection. This NN outputs probabili-
ties for phoneme states and feature vectors for speaker
identification on the basis of a middle layer. Recogni-
tion is based on a method introduced in Section 3 us-
ing probabilities that have been output from the NN.
A feature vector from the middle layer is transformed
into many feature vectors in accordance with multiple
DropConnect patterns to estimate the uncertainty. Fi-
nally, speaker models are constructed using the trans-
formed feature vectors and recognition alignment in-
formation.
In the speaker-identification step, averaged vec-
tors for phoneme-state segments of a recognized key-
word are calculated on the basis of a speaker utter-
ance. The averaged vectors are input to each of the
speaker models to estimate a likelihood score. The
speaker with the highest score is selected from among
the set of registered speakers as the output. A detailed
diagram for speaker modeling to register speakers is
shown in Fig. 2. A detailed diagram for speaker iden-
tification is shown in Fig. 3. The keyword detection
Figure 1: Overview of the proposed method.
Figure 2: A detailed block diagram of the proposed method
and the conventional i-vector method for speaker modeling.
Figure 3: A detailed block diagram of the proposed method
and the conventional i-vector method for speaker identifica-
tion.
process is common for speaker modeling and identi-
fication. The explanation of the detection is given in
Section 3, and the explanation of speaker modeling
and identification is given in Section 4.
3 KEYWORD DETECTION
The process of keyword detection is shown in
”Process-Detection” in Fig. 2. First, phoneme-state
probability is calculated for each frame using a NN.
Next, a customized Viterbi-based algorithm is applied
as a flexible keyword detector.
A flexible keyword-detector with low calculation
cost using phoneme-state probability was proposed in
(Zhu, 2017). A score for a target keyword to be rec-
ognized is calculated by summation of phoneme-state
probabilities in a sliding window. This avoids the heavy calculation cost of the conventional Viterbi algorithm
that is usually used for keyword detection (Junkaw-
itsch et al., 1996). However, this method needs to
control the size of the window, and the score is ap-
proximated.
To avoid degradation by approximation, we adopt
the algorithm proposed in (Nasu, 2016) for recogni-
tion of flexible keywords. The method is based on dy-
namic programming using a left-to-right HMM. Key-
word HMMs consist of phoneme states. $T$ and $\tau$ denote the total number of input frames and the frame index, respectively. $\{x_\tau\}_{\tau=1}^{T}$ denotes a time sequence of speech feature vectors. $N$ denotes the total number of keyword phoneme-states, $\{j\}_{j=1}^{N}$ denotes the state index, and $\mathrm{score}(x_\tau, q_\tau)$ denotes the local score for an assigned state $q_\tau$ in states $(1,2,\dots,N)$ at frame $\tau$. $Q$ denotes a set $\{q_s, q_{s+1}, \dots, q_e\}$. A keyword score $S(s,e)$ using HMMs for a segment from a start frame $s$ to an end frame $e$ is calculated using the Viterbi algorithm as follows:
$$S(s,e) = \frac{1}{e-s+1}\,\max_{Q}\,\sum_{\tau=s}^{e} \mathrm{score}(x_\tau, q_\tau). \qquad (1)$$
A keyword is detected and recognized using a predefined threshold $\theta$ for the score $S(s,e)$ when a speech segment with $S(s,e) > \theta$ exists. To find such a segment by frame $t$ ($t \in [1,T]$), the calculation is defined as follows:
$$\max_{s<t}\,\frac{1}{t-s+1}\,\max_{Q}\,\sum_{\tau=s}^{t} \mathrm{score}(x_\tau, q_\tau) > \theta. \qquad (2)$$
The computational cost is normally $O(NT^{3})$ because many possible paths must be evaluated for every combination of $s$ and $e$. Because the value of the score $S(s,e)$ itself does not need to be known for detection, the formula can be rearranged as follows:
$$\max_{s<t}\,\frac{1}{t-s+1}\,\max_{Q}\,\sum_{\tau=s}^{t} \mathrm{score}(x_\tau, q_\tau) > \theta,$$
$$\max_{s<t}\,\max_{Q}\,\sum_{\tau=s}^{t} \bigl(\mathrm{score}(x_\tau, q_\tau) - \theta\bigr) > 0,$$
$$L(s,t) > 0,$$
where
$$L(s,t) = \max_{s<t}\,\max_{Q}\,\sum_{\tau=s}^{t} \bigl(\mathrm{score}(x_\tau, q_\tau) - \theta\bigr). \qquad (3)$$
This equation implies that normalization by $t-s+1$ is not needed merely to detect a segment with $S(s,t) > \theta$. Algorithm 1 shows pseudo-code for our method. $L(s,t)$ is easily calculated by the algorithm: each lattice point only keeps the path with the maximum summation score among all possible paths with different starting points. By contrast, the conventional Viterbi algorithm requires managing many hypotheses with different starting points. In this method, the existence of a duration with $S(s,t) > \theta$ is guaranteed by finding a duration with $L(s,t) > 0$. Therefore, the detection performance is not degraded compared with that of the conventional Viterbi algorithm, while the calculation cost is drastically reduced to $O(NT)$. In this paper, a simple and small feed-forward NN is used to calculate the phoneme-state scores.
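As a concrete illustration, the following is a minimal Python sketch of the detector described above and in Algorithm 1. The per-frame state scores, the array-based bookkeeping, and the function name are assumptions for illustration, and the alignment information that the paper propagates for later speaker modeling is omitted for brevity.

```python
import numpy as np

def detect_keyword(frame_scores, theta):
    """Streaming keyword detector following Eq. (3) / Algorithm 1 (sketch).

    frame_scores: (T, N) per-frame scores l_{t,j} for the N left-to-right
                  keyword phoneme states (assumed negative log-probabilities).
    theta:        per-frame detection threshold.
    Returns the frame index at which the keyword is detected, or None.
    """
    T, N = frame_scores.shape
    # L[j] holds the best accumulated (score - theta) of a path ending in state j.
    L = np.full(N + 1, -np.inf)
    L[0] = 0.0                      # dummy entry state: a path may start at any frame
    for t in range(T):
        for j in range(N, 0, -1):   # update right-to-left so L[j-1] still holds frame t-1
            best_prev = max(L[j - 1], L[j])
            L[j] = best_prev + frame_scores[t, j - 1] - theta
        if L[N] > 0.0:              # a segment with average score above theta exists
            return t
    return None
```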
4 SPEAKER IDENTIFICATION
4.1 Speaker Modeling
Three speaker modeling processes are shown in
Fig. 2. One is the conventional method using i-vector,
and the process is shown in ”Conventional”. An-
other is the simple GMMs method shown in ”Pro-
posal1”. The other is the GMMs method using feature
augmentation shown in "Proposal2". This section mainly describes "Proposal1".
4.1.1 Feature
In the proposed method, features for speaker models
are derived from a predefined middle layer in the de-
tector NN.
One reason for using keyword-independent features is that, in our flexible-keyword scenario, a training set containing the keywords to be recognized cannot be collected before the user defines the keywords, even though such data would be beneficial for constructing robust speaker-identification features as in (Variani et al., 2014; Larcher et al., 2012; Chen et al., 2015a).
Algorithm 1: Detector Algorithm Procedure.
The score of the best path that has reached state $j$ at frame $t$ is $L_{t,j}$; the best path also holds alignment information. The state score for state $j$ at frame $t$ is $l_{t,j}$ ($l_{t,j} < 0$).
Initialize the max-path scores $L_{0,j} = -\infty$ ($j = 1,\dots,N$) and $L_{0,0} = 0$. $L(s,t) = L_{t,N}$ ($L(s,t)$ is the same as in Eq. 3).
For each frame $t$ ($t = 1,\dots,T$): set $L_{t,0} = 0$ and calculate the acoustic score $l_{t,j}$ for each state with the NN, then
  for $j = 1$ to $N$ do
    if $L_{t-1,j-1} > L_{t-1,j}$ then
      $L_{t,j} = L_{t-1,j-1} + l_{t,j}(x_t) - \theta$; the recorded alignment information, including the starting point $s$ of $L_{t-1,j-1}$, is propagated.
    else
      $L_{t,j} = L_{t-1,j} + l_{t,j}(x_t) - \theta$; the recorded alignment information, including the starting point $s$ of $L_{t-1,j}$, is propagated.
    end if
  end for
  if $L(s,t) = L_{t,N} > 0$ then
    the keyword is recognized; go to the speaker registration or identification step.
  else
    go to the $(t+1)$-th iteration.
  end if
Another reason is to avoid additional computation
cost of other NNs for feature extraction.
However, because the NN is trained to discriminate among phoneme states, each layer in the recognition NN normalizes away not only channel and noise characteristics but also speaker characteristics. Channel and noise influences should be eliminated, but speaker-identification information should be retained. Therefore, the best features are found in the outputs of the middle layers, as described in Section 6.
4.1.2 Modeling
A Gaussian model (GM) is constructed using fea-
tures for each of the combinations of a speaker and a
phoneme-state existing in a keyword. Phoneme align-
ment is determined as a result of keyword detection,
as described in Section 3. To construct speaker GMs,
mean and variance are calculated for phoneme-state
segments. The $l$th-layer output vector is used as a feature vector. $y_{l\tau}$ denotes the output vector of the $l$th layer at the $\tau$th frame. $W$ denotes the number of registered speakers. A speaker $w$ ($0 < w \le W$) utters a keyword that has $N$ HMM-states, and the recognized state sequence is $Q$.
Figure 4: Overview of the DropConnect in Modeling Un-
certainty of Bayesian Deep Networks method.
When the number of frames in a state $m$ ($0 < m \le N$) of the state sequence $Q$ is $M_m$, the mean $\mu_{wlm}$ and the variance $\sigma^{2}_{wlm}$ for the state $m$ of a speaker $w$ using the $l$th-layer output are calculated in accordance with the following equations:
$$\mu_{wlm} = \frac{1}{M_m} \sum_{\tau,\, q_\tau \in Q,\, q_\tau = m} y_{l\tau}, \qquad (4)$$
$$\sigma^{2}_{wlm} = \frac{1}{M_m} \sum_{\tau,\, q_\tau \in Q,\, q_\tau = m} (y_{l\tau} - \mu_{wlm})^{2}. \qquad (5)$$
For example, when the number of enrollment utterances is three, $Q_1$, $Q_2$, and $Q_3$ denote the recognized sequences, and the mean and the variance are calculated over $Q = Q_1 \cup Q_2 \cup Q_3$. To cover the small variance, "GMMs1" shown in Fig. 2 is composed of the GMs created from speaker features and a global GM calculated using speaker- and keyword-independent feature vectors.
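To make the modeling step concrete, the following is a minimal sketch, under assumed array shapes and names, of how the per-state speaker Gaussians of Eqs. (4) and (5) could be computed from middle-layer features and the detector's state alignment; the combination with the global GM ("GMMs1") is not shown.

```python
import numpy as np

def build_speaker_gms(features, alignment, num_states):
    """Per-state diagonal Gaussian speaker models following Eqs. (4) and (5).

    features:  (T, D) middle-layer output vectors pooled over the enrollment
               utterances of one speaker.
    alignment: length-T array of recognized phoneme-state indices (0..num_states-1)
               obtained from the keyword detector.
    Returns per-state means and variances, each of shape (num_states, D).
    """
    T, D = features.shape
    means = np.zeros((num_states, D))
    variances = np.ones((num_states, D))   # fallback for unobserved states (assumption)
    for m in range(num_states):
        seg = features[alignment == m]     # frames aligned to state m
        if len(seg) == 0:
            continue
        means[m] = seg.mean(axis=0)        # Eq. (4)
        variances[m] = seg.var(axis=0)     # Eq. (5)
    return means, variances
```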
4.2 Modeling Uncertainty
This section explains the GMM method with feature augmentation shown in "Proposal2" in Fig. 2. In this process, the middle-layer output feature vectors from the keyword-detection NN are augmented by applying a DropConnect method during forward passes over the enrollment speech, in order to estimate the modeling uncertainties.
In (Gal, 2016), the relation between model uncertainty and stochastic regularization techniques such as dropout, multiplicative Gaussian noise (Srivastava et al., 2014), and DropConnect (Wan et al., 2013) is described. In our method, this kind of technique is applied to speaker-model uncertainty. The DropConnect method is chosen because it was reported to be better than dropout for estimating model uncertainty (Mobiny et al., 2019). A random layer output is acquired through a stochastic forward pass to which DropConnect is applied. By repeating the process $R$ times, sampled outputs $\{\hat{y}_{l\tau 1},\dots,\hat{y}_{l\tau R}\}$ are acquired. Figure 4 shows an overview of the proposed method. In (Gal, 2016; Mobiny et al., 2019), the sampled outputs are empirical samples from an approximate predictive distribution. The predictive mean $\mu_{wlm}$ and variance $\sigma^{2}_{wlm}$ are approximated using Eqs. 6 and 7:
$$\mu_{wlm} = \frac{1}{R M_m} \sum_{\tau,\, q_\tau \in Q,\, q_\tau = m} \sum_{r=1}^{R} \hat{y}_{l\tau r}, \qquad (6)$$
$$\sigma^{2}_{wlm} = \frac{1}{R M_m} \sum_{\tau,\, q_\tau \in Q,\, q_\tau = m} \sum_{r=1}^{R} (\hat{y}_{l\tau r} - \mu_{wlm})^{2}. \qquad (7)$$
The proposed method treats this modeling uncertainty as generative speaker-model uncertainty, which supports the small amount of segment data. "GMMs2" shown in Fig. 2 is composed of the GMs built with augmentation together with the GMs of the previous section.
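The sketch below illustrates, under assumed layer weights and a single frame's input, how DropConnect sampling could generate the $R$ stochastic middle-layer outputs $\hat{y}_{l\tau r}$ that are pooled in Eqs. (6) and (7). The pre-activation output (matching the feature extraction described in Section 6.2.1), the absence of weight re-scaling, and the function name are assumptions.

```python
import numpy as np

def dropconnect_samples(x, weights, bias, drop_rate=0.3, R=10, rng=None):
    """Augment one middle-layer feature vector by R stochastic forward passes.

    Each pass randomly zeroes a fraction `drop_rate` of the layer's weights,
    giving R sampled (pre-activation) outputs for one frame.
    x: (D_in,) layer input, weights: (D_out, D_in), bias: (D_out,).
    Returns an (R, D_out) array of sampled outputs.
    """
    rng = rng or np.random.default_rng()
    samples = []
    for _ in range(R):
        mask = rng.random(weights.shape) >= drop_rate   # keep each weight with prob 1 - drop_rate
        samples.append((weights * mask) @ x + bias)     # stochastic middle-layer output
    return np.stack(samples)
```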
4.3 Speaker Identification
The speaker-identification process is shown in Fig. 3. The conventional i-vector process is shown in "Conventional": the cosine similarity between the extracted i-vector and each registered i-vector is calculated during identification. The process of the proposed method is shown in "Proposal" in Fig. 3. During testing, an averaged vector for each segment is calculated and input to all speaker GMMs that have been created for the keyword. The likelihoods calculated from all speaker GMMs are compared, and the speaker with the best likelihood is selected as the result of speaker identification.
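A minimal sketch of this identification step follows. The diagonal-Gaussian log-likelihood and the equal-weight blending of the speaker variance with a global variance are one plausible reading of the compensation described in Section 6.2.1, not the exact implementation; all names are illustrative.

```python
import numpy as np

def identify_speaker(features, alignment, speaker_models, global_var, eps=1e-6):
    """Pick the registered speaker whose per-state Gaussians best explain the
    averaged segment vectors of a recognized keyword (Section 4.3 sketch).

    speaker_models: dict speaker_id -> (means, variances) from enrollment.
    global_var:     (D,) global variance used to compensate small enrollment variance.
    """
    best_id, best_ll = None, -np.inf
    for spk, (mu, var) in speaker_models.items():
        var = 0.5 * var + 0.5 * global_var + eps        # assumed equal-weight compensation
        ll = 0.0
        for m in range(len(mu)):
            seg = features[alignment == m]
            if len(seg) == 0:
                continue
            x = seg.mean(axis=0)                        # averaged vector for the segment
            ll += -0.5 * np.sum((x - mu[m]) ** 2 / var[m] + np.log(2 * np.pi * var[m]))
        if ll > best_ll:
            best_id, best_ll = spk, ll
    return best_id
```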
5 EXPERIMENTAL SETUP
The proposed method recognizes a keyword, and the
speaker who has uttered the recognized keyword is
identified from among a set of registered speakers. To
this end, we have conducted two experiments on key-
word detection and speaker identification.
5.1 Evaluation Database
We evaluated the proposed method using a Japanese
command dataset recorded by a close-talking micro-
phone. The sampling rate was 16000 [Hz]. For the
evaluation, 39 keywords uttered 5 times by 100 speak-
ers (female:51, male:49) were used. Ten keywords
uttered 5 times by 50 speakers (female:25, male:25)
were used for development. The evaluation set did not include any speakers from the development set. Tables 2 and 4 list all keywords and the average duration [sec] of each, for the development and evaluation sets, respectively.
Simultaneous Flexible Keyword Detection and Text-dependent Speaker Recognition for Low-resource Devices
301
5.2 Evaluation Metrics
5.2.1 Recognition
Recognition performance was measured by the false
reject (FR) rate [%] and the number of false accepts
(FA). $N_c$ and $N_{\mathrm{all}}$ denote the number of correctly recognized keywords and the total number of utterances, respectively. Therefore, $\mathrm{FR} = 100 \times (N_{\mathrm{all}} - N_c)/N_{\mathrm{all}}$ [%]. The value of FA was calculated using noises from the third CHiME challenge (Barker et al., 2015). All noise files were concatenated together into a single file. The detector for all keywords was executed on the concatenated file, and the number of FAs was counted. Finally, the FA count was normalized by the number of hours and the number of keywords. Therefore, the metric was FA [times/(hour · #keyword)].
5.2.2 Speaker Identification
For speaker identification, the evaluation metric for all experiments was the identification rate (IR) [%]. Each speaker model output a score, and the scores were compared over the registered speakers only. IR was calculated as $\mathrm{IR} = 100 \times C/N_c$, where $C$ is the number of utterances for which the selected speaker matched the reference speaker. There were 3 enrollment utterances for each keyword, and the remaining 2 utterances were evaluated after enrollment. The utterances were cross-validated so that all possible splits of the 5 utterances were covered. In many real use cases of the proposed method, the number of registered speakers is less than 100; evaluating with 100 registered speakers is intended to measure the identification performance properly.
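For clarity, the metric and the cross-validation bookkeeping can be written as follows; the per-speaker trial count is inferred from the development-set calculation in Section 6.1 and is an assumption here.

```python
from math import comb

def identification_rate(num_matched, num_recognized):
    # IR [%] = 100 * C / N_c (Section 5.2.2)
    return 100.0 * num_matched / num_recognized

# With 5 utterances per keyword and speaker, each choice of 2 held-out test
# utterances (the other 3 are enrolled) contributes 2 trials, so every
# keyword-speaker pair yields C(5,2) * 2 identification trials
# (inferred from the development-set count in Section 6.1).
trials_per_keyword_speaker = comb(5, 2) * 2
print(trials_per_keyword_speaker)  # 20
```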
5.3 Model Preparation
5.3.1 Recognition
A detector was constructed using the method de-
scribed in Section 3. The algorithm needed an acous-
tic model to calculate the acoustic score for lattice
points at each frame. This system was designed to
work on small devices. Therefore, the size of acous-
tic model was very small , and contained fewer differ-
ences in comparison with a model designed for large-
vocabulary speech recognition.
The training set was the Japanese ATR
database (Kurematsu et al., 1990), which in-
cluded 270 hours of speech. Babble noise was
added to the training data to ensure robustness.
The base acoustic features were 32 Mel-filter-bank features compressed to 16 dimensions using an affine transformation.
Figure 5: Detection performance FA-FR.
Mel-filter-bank features were normalized using a moving average. Input feature vectors for the DNN were created by concatenating the 10 preceding and 10 following frames, for a total of 21 frames; therefore, the number of input dimensions was $16 \times 21 = 336$. The hidden layers were structured as 128 units × 4 layers, the activation function was a sigmoid, and the final outputs comprised 4500 clustered phoneme-states.
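As an illustration of this input pipeline, the sketch below stacks ±10 context frames of the 16-dimensional compressed filter-bank features into 336-dimensional NN inputs. The moving-average normalization and affine compression are assumed to have been applied already, and padding the edges by repeating the first and last frames is an assumption not specified in the paper.

```python
import numpy as np

def stack_context(feats, left=10, right=10):
    """Concatenate +/-10 context frames into NN input vectors (Section 5.3.1 sketch).

    feats: (T, 16) array of normalized, affine-compressed Mel-filter-bank features.
    Returns a (T, 16 * 21) = (T, 336) array; edges are padded by repeating the
    first/last frame (an assumption).
    """
    T, D = feats.shape
    padded = np.concatenate([np.repeat(feats[:1], left, axis=0),
                             feats,
                             np.repeat(feats[-1:], right, axis=0)], axis=0)
    window = left + 1 + right
    return np.stack([padded[t:t + window].reshape(-1) for t in range(T)])
```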
5.3.2 i-vector Setup for Comparison
An i-vector-based method was constructed for comparison. The speaker-modeling process is shown in "Conventional" in Fig. 2. The features comprised 19 Mel-frequency cepstral coefficients (MFCCs) plus energy, together with their ∆ and ∆∆, for 60 dimensions in total. The i-vector was extracted using the ALIZE (Larcher et al., 2013) speaker-identification library after feature normalization. A 2048-mixture universal background model and the total-variability matrix were trained using the ATR dataset, which included 3768 speakers. The number of i-vector dimensions was 200. The i-vector parameters were similar to those in (Tsujikawa et al., 2017). The speaker-identification process is shown in "Conventional" in Fig. 3. The similarity between a speaker's reference i-vector and a non-reference i-vector was measured using cosine similarity. Keywords were extracted by the detector, and the extracted portions were used for enrollment and evaluation.
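The cosine scoring used for this i-vector baseline can be sketched as follows; this is a straightforward textbook implementation, and the small epsilon guarding against zero vectors is an assumption.

```python
import numpy as np

def cosine_similarity(enrolled_ivector, test_ivector):
    # Cosine scoring between a registered i-vector and a test i-vector (Section 5.3.2)
    num = float(enrolled_ivector @ test_ivector)
    den = np.linalg.norm(enrolled_ivector) * np.linalg.norm(test_ivector) + 1e-12
    return num / den
```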
6 EXPERIMENT USING
DEVELOPMENT SET
6.1 Recognition
The relation between FA and FR using the develop-
ment set is shown in Fig. 5. The score for a threshold
in Equation 2 was calculated by accumulating the logarithm of the output probabilities for the lattice points in a detected path. The threshold in Fig. 5 is the $\theta$ in Equation 2 for all keywords. In all experiments, a threshold of $-540$ was selected to avoid FA in a noisy environment. Therefore, the total number of speaker-identification targets for the development set was $10\,(\mathrm{keywords}) \times 50\,(\mathrm{speakers}) \times {}_5C_2 \times 2 \times (1 - 0.0612) = 9388$. The threshold was decreased by 10 until the keyword was recognized during enrollment.
6.2 Speaker Identification
6.2.1 Euclidean Distance and Simple GMMs
We first conducted experiments in which we mea-
sured the Euclidean distances between vectors of
phoneme-state features, as in (Dutta, 2007). The de-
tector output phoneme-state alignment, and averaged
vectors were calculated for each phoneme-state using
Eq. 4. For enrollment, vectors for each phoneme-state
segment of three utterances were averaged to create a
speaker vector. The speaker vectors were compared
to evaluation vectors from test utterances in accor-
dance with their Euclidean distances. The speaker
with the speaker vector closest to an evaluation vec-
tor was selected as an output speaker. A feature vec-
tor was extracted from each middle layer of the de-
tector NN before application of the sigmoid activa-
tion function. The structure of the hidden layers was
128[unit] × 4[layer] for the detector. Therefore, the
number of dimensions was 128 for a feature vector,
and 4 patterns from 4 layer-outputs were obtained.
"Euclid" in Tab. 1 shows this performance. "Layer" indicates which layer's output vector (1st, 2nd, 3rd, or 4th) was used. Performance of the i-vector method
was 76.10%. Performance of the simple Euclidean
distance method was better than that of the i-vector
method. Therefore, alignment information has been
demonstrated to be important for speaker identifica-
tion using short keywords.
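A minimal sketch of this Euclidean baseline is shown below, under assumed dictionary layouts; summing the distances over the phoneme-state segments of the recognized keyword is an assumption about how the per-state comparisons are combined.

```python
import numpy as np

def nearest_speaker_euclid(test_means, enrolled_means):
    """Euclidean baseline of Section 6.2.1 (sketch).

    test_means:     dict state_index -> averaged middle-layer vector of the test utterance.
    enrolled_means: dict speaker_id -> dict state_index -> averaged enrollment vector.
    Returns the speaker whose summed per-state Euclidean distance is smallest.
    """
    best_id, best_dist = None, np.inf
    for spk, state_means in enrolled_means.items():
        dist = sum(np.linalg.norm(test_means[m] - state_means[m])
                   for m in test_means if m in state_means)
        if dist < best_dist:
            best_id, best_dist = spk, dist
    return best_id
```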
Next, the averaged vectors were replaced by
GMMs. The process is shown in ”Proposal1” in
Fig. 2. An averaged vector for each phoneme-state
was input to the speaker-GMMs that had been con-
structed from phoneme-state GMMs, and each of the
speaker-GMMs output a likelihood value for an ut-
terance to be recognized. To prevent an extremely
low score, GMMs were composed of a GM created
by speaker features and a global one. A global GM
was created in advance using training data. Weights
were equal for the two GMs. The output likelihoods
were compared to identify a speaker. For compensation, a global variance $\sigma^{2}_{lg}$ and $\sigma^{2}_{wlm}$ were summed after weighting. The global variance $\sigma^{2}_{lg}$ was calculated using the training dataset.
Table 1 shows the simple GMM-based method
(GMM(0)). Clearly, the performance of the GMMs
was better than that of Euclid. Therefore, variance
was effective for the identification, even though there
were few enrollment data values.
6.2.2 GMMs using Augmented Feature Vectors
In this work, GMs created by augmented vectors
were added to the original GMMs described in Sec-
tion 6.2.1. The process is shown in ”Proposal2” in
Fig. 2. In this manner, three GMMs having equal
weights were constructed when augmented vectors
were applied. Feature vectors for speaker identifica-
tion were augmented using DropConnect. DropCon-
nect has a parameter ”drop rate” that can control the
rate of dropping connections of a NN. As a controlled
parameter, the number of augmentation iterations R
was also considered. The same drop rate was used for all $R$ passes, and for each setting of $R$ the best drop rate was selected from among (10, 20, 30, 40, 50, 60)%. A single parameter set was used for all keywords and was not tuned for individual keywords.
Table 1 shows the results when these parameters were controlled; the numbers in parentheses indicate the number of iterations $R$, and each layer's row shows the best performance after tuning the drop rate. Augmentation was effective for all layers. The best single-layer performance was 91.65%, obtained with the 4th-layer output. Without augmentation, the first-layer output performed best: it contains much speaker information but also other nuisance information, whereas deeper layers tend to eliminate information that is not needed for phoneme-state discrimination. Augmentation was therefore able to compensate for intra-speaker variation and maintain robustness even in the presence of such other factors. Next, the outputs of both the first and the fourth layer were used to exploit both kinds of features; the performance is shown in the "1st+4th" row of Tab. 1 and reached 92.66% when the number of iterations was 10.
speaker identification performance for each keyword
using the development set.
7 EXPERIMENT USING
EVALUATION SET
We conducted the experiment using the evaluation set. The parameters tuned on the development set
Table 1: Performance using the development set by IR [%].
i-vector 76.10
Layer  Euclid  GMM(0)  GMM(3)  GMM(5)  GMM(10)  GMM(12)
1st 80.39 88.45 89.49 90.53 90.51 90.48
2nd 80.13 87.44 88.76 89.29 90.03 90.18
3rd 81.76 86.56 88.82 89.74 90.32 90.41
4th 84.42 87.87 90.29 91.07 91.65 91.63
1st+4th * 88.70 91.33 92.22 92.66 92.72
Table 2: Performance using the development set by IR [%].
Keyword  Duration [sec]  i-vector  GMM(0) 1st+4th  GMM(10) 1st+4th
u shi ro 0.29 39.45 80.22 88.40
ja N pu 0.32 72.55 78.18 85.07
mo do ru 0.34 68.24 79.68 86.63
i chi ba N 0.40 87.84 85.88 90.1
kya N se ru 0.46 70.77 95.50 96.79
to ri ke shi 0.47 91.42 95.29 96.13
ri se Q to 0.48 82.57 81.29 88.65
ko ma N do 0.54 74.95 94.81 97.96
tu gi no pe e ji 0.74 86.61 97.98 97.98
ma e no pe e ji 0.75 85.08 96.95 98.21
Table 3: Performance using the evaluation set by IR [%].
i-vector 71.22
Layer 1st 2nd 3rd 4th 1st+4th
GMM(0) 86.39 85.21 83.88 84.51 85.87
GMM(10) 88.03 87.10 86.75 88.18 89.29
Figure 6: Speaker identification performance IR [%] using the evaluation set; the value in parentheses after each keyword shows its average duration.
were used for the evaluation. FR was 8.21%, and FA was 0.6 [times/(hour · #keyword)]. Speaker identification was applied to the recognized keywords. Table 3 shows the speaker-identification performance. The i-vector performance for speaker identification was 71.22%.
Table 4: Speaker identification performance using the evaluation set by Identification Rate [%].
Modeling:               i-vector | GMM(0)        | GMM(10)       | GMM(0)        | GMM(10)
Feature:                MFCC     | 1st+4th layer | 1st+4th layer | 1st+4th layer | 1st+4th layer
#enrollment utterances: 3        | 3             | 3             | 2             | 2
Keyword  Dur. [sec]  (followed by the five columns above)
i i e 0.19 35.10 52.40 51.29 44.46 44.26
go ba N 0.29 39.11 73.10 77.44 67.30 72.58
o Q ke e 0.35 56.60 73.27 75.54 66.05 68.89
hy o o ji 0.36 82.75 91.40 91.40 86.40 87.10
me nyu u 0.36 42.06 83.24 89.65 76.59 85.24
hi da ri 0.36 50.35 80.58 86.05 72.52 80.07
he ru pu 0.37 50.62 64.77 75.43 56.92 67.57
su su mu 0.40 67.56 66.70 77.34 58.54 69.71
kyu u ba N 0.40 72.75 89.63 91.05 84.04 85.51
i chi ra N 0.40 79.31 89.14 90.98 83.03 85.37
o ha yo o 0.40 68.66 85.48 86.04 78.78 80.88
yo N ba N 0.41 92.30 93.31 94.01 88.33 89.24
ju u ba N 0.41 76.48 89.38 91.30 84.51 87.41
te e ka ku 0.41 76.48 89.38 91.30 64.68 72.17
ke Q te e 0.42 47.51 78.22 80.59 69.54 73.24
ro ku ba N 0.43 80.92 90.54 92.92 84.54 88.68
na na ba N 0.43 53.66 92.17 94.46 86.01 90.03
ka ku da i 0.43 71.57 87.82 92.26 83.24 88.74
ka ta shi ki 0.44 65.40 89.02 93.78 81.23 89.23
sa N ba N 0.45 87.57 95.32 96.07 90.84 92.05
de ba i su 0.45 70.35 83.86 89.64 75.18 82.33
syu ku syo o 0.47 71.74 75.77 83.34 70.17 78.70
ko N ba N wa 0.48 59.82 88.77 91.69 84.74 86.76
ho N su u 0.48 62.79 90.38 95.16 84.50 92.57
se Q te e 0.50 54.69 76.75 80.71 70.70 74.92
ha chi ba N 0.51 79.04 92.73 93.54 88.23 89.65
e N ji N o N 0.54 55.00 77.56 89.04 72.44 84.94
de N ge N o N 0.55 61.71 87.88 94.06 84.85 91.38
e a ko N tsu ke te 0.61 89.01 87.29 96.56 81.88 93.39
ko N ni chi wa 0.62 95.27 96.93 97.48 95.07 95.27
de N ki tsu ke te 0.65 86.23 86.23 93.35 83.02 88.19
e a ko N ke shi te 0.66 82.26 87.07 87.74 95.19 96.48
e N ji N o fu 0.67 79.32 88.58 93.03 86.43 90.44
de N ki ke shi te 0.68 83.02 87.46 92.63 84.33 88.14
de N ge N o fu 0.69 73.12 93.53 96.14 90.12 91.76
se tsu zo ku sa ki 0.79 90.51 95.18 96.30 92.12 93.57
me nyu u i chi ra N 0.82 91.26 97.84 98.53 93.95 95.74
ko ma N do ra i N 0.91 98.78 98.58 99.14 97.16 97.67
o ha yo go za i ma su 0.99 91.76 99.39 99.34 97.78 97.37
Average 0.51 71.22 85.87 89.29 80.96 85.11
In contrast, the performance of the proposed method was 89.29% when the combination of the 1st- and 4th-layer outputs (1st+4th) and 10 augmentation iterations (GMM(10)) was applied.
Figure 6 and Tab. 4 show speaker identification
performance for each keyword using the evaluation
set.
In Fig. 6, the value in parentheses after each keyword shows the average duration of the keyword, and keywords are sorted by duration. The average duration over all keywords was 0.51 s. "i-vector (3 utterances)" shows the i-vector results using three enrollment utterances. "GMM(0) 1st+4th (3 utterances)" and "GMM(10) 1st+4th (3 utterances)" show the results of GMM(0) and GMM(10) using three enrollment utterances, respectively. "GMM(0) 1st+4th (2 utterances)" and "GMM(10) 1st+4th (2 utterances)" show the results of GMM(0) and GMM(10) using two enrollment utterances, respectively.
Table 4 gives the corresponding values. In general, performance increased with keyword duration. The performance of GMM(10) using either two or three enrollment utterances was mostly above 90% for keywords longer than 0.5 s, and it was higher than that of the i-vector method using three enrollment utterances even when GMM(10) used only two enrollment utterances.
Figure 6 and Tab. 4 also show that the augmentation was effective for most of the evaluation keywords. The performance of the proposed method improved over the i-vector method, and the augmentation method increased performance further. These results show that the proposed method is effective for speaker identification using short keywords.
8 CONCLUSION
This paper proposed a speaker-identification method that works even for very short keywords recognized by an NN-based detector. Because the speaker-identification features are keyword independent, the proposed method can be used for various applications with flexible keywords. Moreover, the computation cost for speaker identification is very small because the features are derived from the detector NN without any additional NNs. The identification rate when using a conventional i-vector method was 71.22%. In contrast, the performance of the proposed method was 89.29% while maintaining low-resource computation. The proposed method was thus clearly better than the conventional i-vector method for speaker identification using short keywords and few enrollment utterances, at a low computational cost.
REFERENCES
Barker, J., Marxer, R., Vincent, E., and Watanabe, S. (2015).
The third 'CHiME' speech separation and recognition
challenge: Dataset, task and baselines. In 2015 IEEE
Workshop on Automatic Speech Recognition and Un-
derstanding (ASRU), pages 504–511. IEEE.
Chen, G., Parada, C., and Heigold, G. (2014). Small-
footprint keyword spotting using deep neural net-
works. In 2014 IEEE International Conference on
Acoustics, Speech and Signal Processing, ICASSP
2014, pages 4087–4091. IEEE.
Chen, N., Qian, Y., and Yu, K. (2015a). Multi-task learn-
ing for text-dependent speaker verification. In Pro-
ceedings of the 16th Annual Conference of the Inter-
national Speech Communication Association, pages
185–189. International Speech Communication Asso-
ciation (ISCA).
Chen, Y.-h., Lopez-Moreno, I., Sainath, T. N., Visontai,
M., Alvarez, R., and Parada, C. (2015b). Locally-
connected and convolutional neural networks for
small footprint speaker recognition. In Proceed-
ings of the 16th Annual Conference of the Inter-
national Speech Communication Association, pages
1136–1140. International Speech Communication As-
sociation (ISCA).
Dehak, N., Kenny, P. J., Dehak, R., Dumouchel, P., and
Ouellet, P. (2011). Front-end factor analysis for
speaker verification. IEEE Transactions on Audio,
Speech, and Language Processing, 19(4):788–798.
Dutta, T. (2007). Text dependent speaker identification
based on spectrograms. Proceedings of Image and vi-
sion computing, pages 238–243.
Gal, Y. (2016). Uncertainty in deep learning. PhD thesis,
University of Cambridge.
Hebert, M. and Heck, L. (2003). Phonetic class-based
speaker verification. In Eurospeech 2003 - Interspeech
2003. ISCA.
Junkawitsch, J., Neubauer, L., Hoge, H., and Ruske, G.
(1996). A new keyword spotting algorithm with
pre-calculated optimal thresholds. In Proceeding of
Fourth International Conference on Spoken Language
Processing. ICSLP’96, volume 4, pages 2067–2070.
IEEE.
Kanagasundaram, A., Vogt, R., Dean, D. B., Sridharan, S.,
and Mason, M. W. (2011). I-vector based speaker
recognition on short utterances. In Proceedings of the
12th Annual Conference of the International Speech
Communication Association, pages 2341–2344. Inter-
national Speech Communication Association (ISCA).
Kurematsu, A., Takeda, K., Sagisaka, Y., Katagiri, S.,
Kuwabara, H., and Shikano, K. (1990). Atr japanese
speech database as a tool of speech recognition and
synthesis. Speech communication, 9(4):357–363.
Larcher, A., Bonastre, J.-F., Fauve, B. G., Lee, K.-A., Lévy,
C., Li, H., Mason, J. S., and Parfait, J.-Y. (2013).
ALIZE 3.0 - open source toolkit for state-of-the-art
speaker recognition. In Proceedings of the 14th An-
nual Conference of the International Speech Commu-
nication Association, pages 2768–2772.
Larcher, A., Bousquet, P.-M., Lee, K. A., Matrouf, D., Li,
H., and Bonastre, J.-F. (2012). I-vectors in the con-
text of phonetically-constrained short utterances for
speaker verification. In 2012 IEEE International Con-
ference on Acoustics, Speech and Signal Processing,
ICASSP 2012, pages 4773–4776. IEEE.
Larcher, A., Lee, K. A., Ma, B., and Li, H. (2014). Text-
dependent speaker verification: Classifiers, databases
and rsr2015. Speech Communication, 60:56–77.
Mobiny, A., Nguyen, H. V., Moulik, S., Garg, N., and Wu,
C. C. (2019). Dropconnect is effective in modeling
uncertainty of bayesian deep networks. arXiv preprint
arXiv:1906.04569.
Nasu, Y. (2016). Detection apparatus, detection method,
and computer program product. US Patent App.
15/071,669.
Poddar, A., Sahidullah, M., and Saha, G. (2015). Perfor-
mance comparison of speaker recognition systems in
presence of duration variability. In India Conference
(INDICON), 2015 Annual IEEE, pages 1–6. IEEE.
Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., and
Khudanpur, S. (2018). X-vectors: Robust DNN em-
beddings for speaker recognition. In 2018 IEEE Inter-
national Conference on Acoustics, Speech and Signal
Processing, ICASSP 2018, pages 5329–5333.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I.,
and Salakhutdinov, R. (2014). Dropout: A simple way
to prevent neural networks from overfitting. The Jour-
nal of Machine Learning Research, 15(1):1929–1958.
Sturim, D. E., Reynolds, D. A., Dunn, R. B., and
Quatieri, T. F. (2002). Speaker verification using text-
constrained gaussian mixture models. In 2002 IEEE
International Conference on Acoustics, Speech, and
Signal Processing, volume 1, pages I–677. IEEE.
Tsujikawa, M., Nishikawa, T., and Matsui, T. (2017).
I-vector-based speaker identification with extremely
short utterances for both training and testing. In 2017
IEEE 6th Global Conference on Consumer Electron-
ics (GCCE), pages 1–4. IEEE.
Variani, E., Lei, X., McDermott, E., Moreno, I. L., and
Gonzalez-Dominguez, J. (2014). Deep neural net-
works for small footprint text-dependent speaker ver-
ification. In 2014 IEEE International Conference
on Acoustics, Speech and Signal Processing, ICASSP
2014, pages 4052–4056. IEEE.
Wan, L., Zeiler, M., Zhang, S., Le Cun, Y., and Fergus,
R. (2013). Regularization of neural networks using
dropconnect. In Proceedings of the 30th International
Conference on Machine Learning, pages 1058–1066.
Wu, M., Panchapagesan, S., Sun, M., Gu, J., Thomas, R.,
Vitaladevuni, S. N. P., Hoffmeister, B., and Mandal,
A. (2018). Monophone-based background modeling
for two-stage on-device wake word detection. In 2018
IEEE International Conference on Acoustics, Speech
and Signal Processing, ICASSP 2018, pages 5494–
5498. IEEE.
Zhu, L. (2017). Neural network-based small-footprint flex-
ible keyword spotting. Master's thesis, Karlsruhe Institute of Technology.