Speech Source Tracking Based on Particle Filter under non-Gaussian
Noise and Reverberant Environments
Ruifang Wang^{1,a}, Xiaoyu Lan^{1,b}
^1 School of Electronic and Information Engineering, Shenyang Aerospace University, Shenyang 110136, China
Keywords: Speech Source Tracking, non-Gaussian Noise, Particle Filter, Generalized Correntropy Function.
Abstract: Tracking a moving speech source in non-Gaussian noise environments is a challenging problem. This
paper proposes a speech source tracking method based on the particle filter (PF) and the generalized
correntropy function (GCTF) for non-Gaussian noise and reverberant environments. Multiple time differences
of arrival (TDOAs) are estimated by the GCTF, and the multiple-hypothesis likelihood is calculated as the
weights for the PF. Next, the particles are predicted from the Langevin model. Finally, the global position
of the moving speech source is estimated as the weighted sum of the particles. Simulation results
demonstrate the validity of the proposed method.
1 INTRODUCTION
Tracking a speech source accurately in reverberant
environments is desirable for teleconferencing
systems (B. Kapralos, M. R. M. Jenkin and M.
Evangelos, 2003), robots (K. Nakadai, et al, 2006),
and human-machine interaction (T.P. Spexard, M.
Hanheide, and G. Sagerer, 2007). Acquiring the
position of the speech source plays an important role
in speech signal processing. Environmental noise and
reverberation of the speech signal are two challenging
problems for speech source tracking. Conventional
speech source localization and tracking approaches
(E. T. Roig, F. Jacobsen and E. F. Grande, 2010),
(M. F. Fallon, and S. J. Godsill, 2012) depend only on
the current observations to estimate the position of
the speech source. To improve tracking performance,
Bayesian filtering algorithms, which employ not only
current observations but also previous observations,
are used to track the moving speech source.
The particle filter (PF) is an approximation of the
optimal sequential Bayesian estimation via Monte
Carlo simulations for non-linear and non-Gaussian
systems. A PF incorporating a multiple-hypothesis
model was applied to the speaker tracking problem
based upon TDOA observations (abbreviated to PF)
(D. B. Ward, E. A. Lehmann and R. C. Williamson,
2003). A novel PF framework based on
information theory was discussed for speaker
tracking (F. Talantzis, 2010). A non-concurrent
multiple-talker tracking method based on extended
Kalman particle filtering (EKPF) was proposed (X. Zhong,
and J. R. Hopgood, 2014). In (X. Zhong, A.
Mohammadi, et al, 2013), a distributed particle filter
(DPF) was proposed for speaker tracking in a
distributed microphone network, in which each node
runs a local PF and the local posteriors are fused to
obtain a global posterior probability (abbreviated to
DPF-EKF). In (Q. Zhang, Z. Chen, and F. Yin, 2016), a
distributed marginalized auxiliary particle filter was
proposed for speaker tracking.
For the above-mentioned speech source tracking
methods, the background noise is assumed to be
Gaussian. However, practical background noise may be
non-Gaussian, such as a knock on the door, a sudden
phone ring or a fit of coughing, which is impulsive in
essence and leads to poor tracking performance for
these speech source tracking methods. To remedy the
impact of non-Gaussian background noise on tracking
performance, a PF-based speaker tracking method for
non-Gaussian noise environments is proposed. First,
symmetric alpha-stable (SαS) distributions (M. Shao
and C. L. Nikias, 1993) are employed to model the
non-Gaussian noise, and the TDOA observations of the
speech signals received by the microphone pair at each
node are estimated via a generalized correntropy
function (GCTF) (W. Liu, P.P. Pokharel, et al,
2007). Next, the Langevin model (D. B. Ward, E. A.
Lehmann and R. C. Williamson, 2003) is used to
model the time-varying states of a moving speech
source to predict the particles, and a multiple-hypothesis
model is introduced to calculate the likelihood
function as the weights corresponding to the
particles of the PF. Finally, the global time-varying
position estimates at each time step are obtained in
terms of the weighted particles.

Wang, R. and Lan, X.
Speech Source Tracking based on Particle Filter under non-Gaussian Noise and Reverberant Environments.
DOI: 10.5220/0008874204610466
In Proceedings of 5th International Conference on Vehicle, Mechanical and Electrical Engineering (ICVMEE 2019), pages 461-466
ISBN: 978-989-758-412-1
Copyright © 2020 by SCITEPRESS Science and Technology Publications, Lda. All rights reserved
2 FUNDAMENTAL ALGORITHM
2.1 Particle Filter for Tracking Problem
Consider a time-dependent state vector $\mathbf{x}_k$ in a
distributed sensor network, where k is a discrete
time index. The state-space model of the system and
the observation model at node j are given as (D. B.
Ward, et al, 2003)
$$\mathbf{x}_k = f_k(\mathbf{x}_{k-1}) + \mathbf{u}_k, \qquad \mathbf{z}_k = h_k(\mathbf{x}_k) + \mathbf{v}_k \tag{1}$$
where $\mathbf{z}_k$ is the observation vector of $\mathbf{x}_k$,
$f_k$ and $h_k$ are the system dynamics function and the
observation function, respectively, and $\mathbf{u}_k$ and
$\mathbf{v}_k$ are the process noise and the observation noise
with known probability density functions, respectively.
The Bayesian filter for the tracking problem
calculates the posterior probability density
$p(\mathbf{x}_k \mid \mathbf{z}_{1:k})$.
The particle filter approximates the Bayesian recursion via
Monte Carlo simulation and works on the
principle of the sequential importance resampling (SIR)
algorithm. In the prediction step, N particles
are drawn from a suitably chosen proposal function
$q(\mathbf{x}_k \mid \mathbf{X}_{1:k-1}^n, \mathbf{z}_{1:k})$
at time k. In the update step, the weight $w_k^n$
corresponding to the nth particle $\mathbf{X}_k^n$ is
calculated with the prior transition density as the
proposal function, i.e.,
$q(\mathbf{x}_k \mid \mathbf{X}_{1:k-1}^n, \mathbf{z}_{1:k}) = p(\mathbf{x}_k \mid \mathbf{X}_{k-1}^n)$,
written as
$$w_k^n \propto \frac{p(\mathbf{z}_k \mid \mathbf{X}_k^n)\, p(\mathbf{x}_k \mid \mathbf{X}_{k-1}^n)}{p(\mathbf{x}_k \mid \mathbf{X}_{k-1}^n)} = p(\mathbf{z}_k \mid \mathbf{X}_k^n) \tag{2}$$
where $p(\mathbf{z}_k \mid \mathbf{X}_k^n)$ is the likelihood function.
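The PF represents the posterior probability
$p(\mathbf{x}_k \mid \mathbf{z}_{1:k})$ by a set of weighted particles
$\{\mathbf{X}_k^n, w_k^n\}_{n=1}^{N}$, given as
$$p(\mathbf{x}_k \mid \mathbf{z}_{1:k}) \approx \sum_{n=1}^{N} w_k^n\, \delta(\mathbf{x}_k - \mathbf{X}_k^n) \tag{3}$$
where $\delta(\cdot)$ denotes the multi-dimensional Dirac
delta function. Finally, the minimum mean square error
(MMSE) estimate of the state $\mathbf{x}_k$ is obtained as
$$\hat{\mathbf{x}}_k = \sum_{n=1}^{N} w_k^n\, \mathbf{X}_k^n \tag{4}$$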
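The prediction, weighting and estimation steps of Eqs. (2)-(4) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `sir_step`, the multinomial resampling scheme, and the toy one-dimensional random-walk model with a Gaussian likelihood are all hypothetical choices for the example and are not taken from the paper.

```python
import math
import random

def sir_step(particles, transition, likelihood, z, rng):
    """One SIR iteration: predict, weight (Eq. 2), estimate (Eq. 4), resample."""
    # Prediction: draw each particle from the prior transition density.
    predicted = [transition(p, rng) for p in particles]
    # Update: with the prior as proposal, the weight is the likelihood p(z | X).
    weights = [likelihood(z, p) for p in predicted]
    total = sum(weights)
    if total <= 0.0:
        # Degenerate case (assumed handling): fall back to uniform weights.
        weights = [1.0 / len(predicted)] * len(predicted)
    else:
        weights = [w / total for w in weights]
    # MMSE estimate: weighted mean of the particles, dimension by dimension.
    dim = len(predicted[0])
    estimate = [sum(w * p[d] for w, p in zip(weights, predicted)) for d in range(dim)]
    # Multinomial resampling so that all particles carry equal weight again.
    resampled = rng.choices(predicted, weights=weights, k=len(predicted))
    return resampled, weights, estimate

# Toy model (assumption): 1-D random walk observed directly in Gaussian noise.
def random_walk(p, rng):
    return [p[0] + rng.gauss(0.0, 0.1)]

def gaussian_likelihood(z, p, std=0.2):
    return math.exp(-0.5 * ((z - p[0]) / std) ** 2)

rng = random.Random(0)
particles = [[rng.uniform(-1.0, 1.0)] for _ in range(500)]
for z in [0.5, 0.5, 0.5]:
    particles, weights, estimate = sir_step(
        particles, random_walk, gaussian_likelihood, z, rng)
print(estimate)
```

Resampling at every step is what allows the previous weight to be dropped from Eq. (2); a filter that resamples only occasionally would have to carry the previous weights forward.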
2.2 TDOA Estimation under non-Gaussian Environments
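For non-Gaussian noise, symmetric alpha-stable
(SαS) processes can model impulsive noise
better than other processes (M. Shao and C. L.
Nikias, 1993), (W. Liu, P.P. Pokharel, et al, 2007).
Unfortunately, in general they have neither finite
second-order statistics nor a closed-form probability
density function. Alpha-stable processes are therefore
normally described by their characteristic function, written as
$$\varphi(t) = \exp\{\, jbt - \gamma |t|^{\alpha} [1 + j\beta\, \mathrm{sign}(t)\, \omega(t, \alpha)] \,\} \tag{5}$$
$$\omega(t, \alpha) = \begin{cases} \tan(\alpha\pi/2), & \text{for } \alpha \neq 1 \\ (2/\pi)\log|t|, & \text{for } \alpha = 1 \end{cases} \tag{6}$$
where $\alpha \in (0, 2]$ is the characteristic exponent, $\gamma > 0$ is
the dispersion, $\beta$ is the symmetry parameter ($\beta = 0$ in the
SαS case) and $b$ is the location parameter.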
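As a small numerical illustration of Eqs. (5)-(6), the characteristic function can be evaluated directly. The sketch below assumes the usual parameter names of the alpha-stable family (`gamma` for the dispersion, `beta` for the symmetry parameter, `b` for the location); the function name `stable_cf` is hypothetical.

```python
import cmath
import math

def stable_cf(t, alpha, gamma=1.0, beta=0.0, b=0.0):
    """Characteristic function of an alpha-stable law, following Eqs. (5)-(6)."""
    if not 0.0 < alpha <= 2.0:
        raise ValueError("alpha must lie in (0, 2]")
    if t == 0.0:
        return complex(1.0, 0.0)  # every characteristic function equals 1 at t = 0
    sign = 1.0 if t > 0 else -1.0
    if alpha == 1.0:
        omega = (2.0 / math.pi) * math.log(abs(t))
    else:
        omega = math.tan(alpha * math.pi / 2.0)
    exponent = 1j * b * t - gamma * abs(t) ** alpha * (1.0 + 1j * beta * sign * omega)
    return cmath.exp(exponent)

# Symmetric cases (beta = 0, b = 0): alpha = 2 is Gaussian-shaped, alpha = 1 is Cauchy.
print(stable_cf(1.0, alpha=2.0))
print(stable_cf(2.0, alpha=1.0))
```

With `beta = 0` and `b = 0` the expression reduces to `exp(-gamma * |t|**alpha)`, which is the SαS case used for the noise model in this paper.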
When the speech signals received by a pair of
microphones are corrupted by non-Gaussian noise,
accurate TDOA estimates are difficult to obtain
via typical TDOA estimation methods such as the
generalized cross-correlation (GCC) (C.
Knapp, and G. C. Carter, 1976). To solve this
problem, a TDOA estimation method based on the
generalized correntropy function (GCTF)
(W. Liu, P.P. Pokharel, et al, 2007) is presented for
speech source tracking in non-Gaussian noise environments.
The GCTF $D_k^j(\tau_k^j)$ at node j is defined as
$$D_k^j(\tau_k^j) = E\big(\kappa_\sigma(s_{j,1}(k) - s_{j,2}(k + \tau_k^j))\big) \tag{7}$$
$$\kappa_\sigma(\cdot) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(\cdot)^2}{2\sigma^2}\right) \tag{8}$$
where $s_{j,1}(k)$ and $s_{j,2}(k)$ denote the two signals
received at the two microphones of node j, $E(\cdot)$
represents the mathematical expectation,
$\kappa_\sigma(\cdot)$ is the Gaussian kernel and $\sigma$ ($\sigma > 0$)
is the kernel size.
The TDOA observation at node j can be
estimated by the GCTF estimator
$$\hat{\tau}_k^j = \arg\max_{\tau_k^j \in [-\tau_j^{\max},\, \tau_j^{\max}]} \big(D_k^j(\tau_k^j)\big) \tag{9}$$
where $\tau_j^{\max}$ denotes the maximum possible
value of the TDOA at node j.
Considering the noise and reverberation,
generally, $N_m$ TDOAs selected from the first $N_m$ local
maxima of $D_k^j(\cdot)$ constitute the TDOA observation
vector $\mathbf{z}_k^j = [\hat{\tau}_{k,1}^j, \hat{\tau}_{k,2}^j, \ldots, \hat{\tau}_{k,N_m}^j]^T$
at node j (D. B.
Ward, E. A. Lehmann and R. C. Williamson, 2003).
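The estimator of Eqs. (7)-(9) can be sketched by replacing the expectation with a sample average over one frame. The following is a hedged illustration, not the authors' implementation: the lag sign convention (second signal evaluated at `k + lag`), the choice of the largest local maxima as candidates, the function names, and the synthetic test signal with periodic outliers are all assumptions made for the example. Lags are in samples; dividing by the sampling frequency gives seconds.

```python
import math
import random

def gaussian_kernel(x, sigma):
    """Gaussian kernel of Eq. (8)."""
    return math.exp(-x * x / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)

def gctf(s1, s2, max_lag, sigma):
    """Sample estimate of E[kappa(s1(k) - s2(k + lag))] for |lag| <= max_lag."""
    n = len(s1)
    values = {}
    for lag in range(-max_lag, max_lag + 1):
        lo, hi = max(0, -lag), min(n, n - lag)  # keep k + lag inside the frame
        acc = sum(gaussian_kernel(s1[k] - s2[k + lag], sigma) for k in range(lo, hi))
        values[lag] = acc / (hi - lo)
    return values

def tdoa_candidates(values, n_m):
    """Return up to n_m lags at the largest local maxima of the GCTF, best first."""
    lags = sorted(values)
    peaks = []
    for i, lag in enumerate(lags):
        left = values[lags[i - 1]] if i > 0 else -math.inf
        right = values[lags[i + 1]] if i < len(lags) - 1 else -math.inf
        if values[lag] >= left and values[lag] >= right:
            peaks.append(lag)
    peaks.sort(key=lambda lag: values[lag], reverse=True)
    return peaks[:n_m]

# Synthetic check (assumption): s2 is s1 delayed by 5 samples plus sparse outliers.
rng = random.Random(1)
clean = [rng.gauss(0.0, 1.0) for _ in range(2000)]
delay = 5
s1 = clean[delay:]
s2 = [x + (50.0 if k % 50 == 0 else 0.0) for k, x in enumerate(clean[:-delay])]
candidates = tdoa_candidates(gctf(s1, s2, max_lag=20, sigma=0.5), n_m=4)
print(candidates)
```

Because the Gaussian kernel is bounded, a large outlier contributes almost nothing to the average instead of dominating it, which is the property that makes the correntropy peak usable under impulsive noise.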
3 SPEECH SOURCE TRACKING UNDER NON-GAUSSIAN ENVIRONMENTS
3.1 Speech Source Dynamic Model
The Langevin model is simple and has worked well
in practice to represent the time-varying locations
along the trajectory of a moving speech source. It is
written as (D. B. Ward, E. A. Lehmann and R. C.
Williamson, 2003), (Q. Zhang, Z. Chen, and F. Yin,
2016)
$$\mathbf{x}_k = \begin{bmatrix} 1 & 0 & a\Delta T & 0 \\ 0 & 1 & 0 & a\Delta T \\ 0 & 0 & a & 0 \\ 0 & 0 & 0 & a \end{bmatrix} \mathbf{x}_{k-1} + \begin{bmatrix} b\Delta T & 0 & 0 & 0 \\ 0 & b\Delta T & 0 & 0 \\ 0 & 0 & b & 0 \\ 0 & 0 & 0 & b \end{bmatrix} \mathbf{u}_k \tag{10}$$
where $\mathbf{x}_k = [x_k, y_k, \dot{x}_k, \dot{y}_k]^T$ denotes the speech
source's state at time step k in Cartesian
coordinates, $\Delta T$ is the discrete time interval,
$\mathbf{u}_k$ is the time-uncorrelated Gaussian white noise vector,
and the parameters a and b are defined as
$$a = \exp(-\beta \Delta T), \qquad b = \bar{v}\sqrt{1 - a^2} \tag{11}$$
where β is the rate constant and $\bar{v}$ is the
root-mean-square velocity.
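A single prediction step of the Langevin model in Eqs. (10)-(11) can be sketched as follows. The sketch assumes unit-variance Gaussian noise components and, for the usage line, a time step equal to the 32 ms frame length; neither assumption is stated explicitly in the paper, and the function name `langevin_predict` is hypothetical.

```python
import math
import random

def langevin_predict(state, dt, beta, v_bar, rng):
    """Propagate [x, y, vx, vy] one step through the Langevin model, Eqs. (10)-(11)."""
    a = math.exp(-beta * dt)              # velocity decay over one step
    b = v_bar * math.sqrt(1.0 - a * a)    # noise gain that keeps the RMS velocity at v_bar
    x, y, vx, vy = state
    # Assumption: four independent zero-mean, unit-variance Gaussian components.
    u = [rng.gauss(0.0, 1.0) for _ in range(4)]
    return [
        x + a * dt * vx + b * dt * u[0],
        y + a * dt * vy + b * dt * u[1],
        a * vx + b * u[2],
        a * vy + b * u[3],
    ]

rng = random.Random(0)
state = [1.0, 2.0, 0.5, -0.5]
# beta = 10 1/s and v_bar = 1 m/s as in the simulation setup; dt = 0.032 s is assumed.
print(langevin_predict(state, dt=0.032, beta=10.0, v_bar=1.0, rng=rng))
```

In a particle filter this function would be applied to every particle independently, which is the prediction step of the SIR recursion.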
3.2 Multiple-Hypothesis Likelihood Model
Consider the local likelihood function $p(\mathbf{z}_k^j \mid \mathbf{x}_k)$ at
node j based on the $N_m$ TDOA observations in $\mathbf{z}_k^j$. Due
to noise and reverberation, at most one of the $N_m$ TDOAs
is associated with the true speech source,
whereas the others correspond to spurious
sources (X. Zhong, and J. R. Hopgood, 2014). Thus,
the multiple-hypothesis likelihood model is
employed as the local likelihood function, i.e., the local
weight of the particles at node j, written as (X. Zhong,
and J. R. Hopgood, 2014)
$$p(\mathbf{z}_k^j \mid \mathbf{x}_k) = \frac{q_0}{(2\tau_j^{\max})^{N_m}} + \sum_{i=1}^{N_m} \frac{q_i}{(2\tau_j^{\max})^{N_m - 1}}\, \mathcal{N}\big(\hat{\tau}_{k,i}^j;\, \tau_k^j(\mathbf{x}_k),\, \sigma^2\big) \tag{12}$$
where $q_0$ is the prior probability that none of the $N_m$
TDOA observations corresponds to the true
speech source, $q_i$ ($i = 1, \ldots, N_m$) is the prior
probability that only the ith TDOA corresponds to
the true source, $\sum_{i=0}^{N_m} q_i = 1$, and $\mathcal{N}(\cdot)$ denotes the
Gaussian distribution with mean $\tau_k^j(\mathbf{x}_k)$, the TDOA at node j
implied by the state $\mathbf{x}_k$, and variance $\sigma^2$.
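Eq. (12) can be evaluated for one node as in the sketch below. Several details are assumptions made only for illustration: the speed of sound of 343 m/s, the sign convention of the geometric TDOA (distance to the first microphone minus distance to the second), equal prior probabilities for the candidates, and all function names.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value

def expected_tdoa(position, mic1, mic2, c=SPEED_OF_SOUND):
    """TDOA in seconds implied by a source position for one microphone pair."""
    return (math.dist(position, mic1) - math.dist(position, mic2)) / c

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (math.sqrt(2.0 * math.pi) * std)

def local_likelihood(tdoas, position, mic1, mic2, q0, std, tau_max):
    """Multiple-hypothesis likelihood of Eq. (12) for one node."""
    n_m = len(tdoas)
    q_i = (1.0 - q0) / n_m  # assumption: every candidate is equally likely a priori
    tau = expected_tdoa(position, mic1, mic2)
    clutter = q0 / (2.0 * tau_max) ** n_m
    hits = sum(q_i * gaussian_pdf(t, tau, std) for t in tdoas)
    return clutter + hits / (2.0 * tau_max) ** (n_m - 1)

mic1, mic2 = (-0.3, 0.0), (0.3, 0.0)        # 0.6 m spacing as in the setup
tau_max = 0.6 / SPEED_OF_SOUND
tdoas = [0.0, 1e-3, -1e-3, 5e-4]            # hypothetical candidates in seconds
lik_true = local_likelihood(tdoas, (0.0, 1.0), mic1, mic2, 0.25, 5e-5, tau_max)
lik_off = local_likelihood(tdoas, (2.0, 0.5), mic1, mic2, 0.25, 5e-5, tau_max)
print(lik_true > lik_off)
```

The clutter term keeps the likelihood strictly positive even when no candidate matches the particle, so a frame with only spurious TDOAs does not zero out every weight.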
3.3 Speech Source Tracking Method Based on PF and GCTF
Under non-Gaussian noise environments, the PF is
employed for speech source tracking. Assume that the
observation vectors $\mathbf{z}_k^j$ ($j = 1, 2, \ldots, J$) in the
distributed microphone network with J nodes are
conditionally independent given a particle $\mathbf{X}_k^n$. Then
the global likelihood function $p(\mathbf{z}_k \mid \mathbf{X}_k^n)$ in Eq. (2)
can be factorized into the local likelihood functions
$p(\mathbf{z}_k^j \mid \mathbf{X}_k^n)$ in Eq. (12), written as
$$p(\mathbf{z}_k \mid \mathbf{X}_k^n) = \prod_{j=1}^{J} p(\mathbf{z}_k^j \mid \mathbf{X}_k^n) \tag{13}$$
Then a global MMSE estimate $\hat{\mathbf{x}}_k$ of the speech
source state $\mathbf{x}_k$ can be obtained from Eq. (4).
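The product over nodes in Eq. (13) can be formed in the log domain, which is a common numerical precaution and not something the paper prescribes. The sketch below assumes the local likelihoods have already been computed and stored per particle and per node; the function name `global_weights` and the floor value `1e-300` are hypothetical.

```python
import math

def global_weights(local_likelihoods):
    """Normalized particle weights from local likelihoods.

    local_likelihoods[n][j] holds p(z_k^j | X_k^n) for particle n and node j.
    """
    # Sum of logs equals the log of the product in Eq. (13).
    log_w = [sum(math.log(max(l, 1e-300)) for l in row) for row in local_likelihoods]
    peak = max(log_w)
    # Subtracting the peak before exponentiating avoids overflow and underflow.
    unnorm = [math.exp(lw - peak) for lw in log_w]
    total = sum(unnorm)
    return [w / total for w in unnorm]

# Two particles, two nodes: the products are 6 and 1, so the weights are 6/7 and 1/7.
print(global_weights([[2.0, 3.0], [1.0, 1.0]]))
```

With J = 12 nodes and density values that can be very large or very small, the raw product can leave the floating-point range, whereas the normalized weights are unaffected by the common scale factor.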
The speech source tracking method based on PF
and GCTF under non-Gaussian noise environments
is described as follows (abbreviated as PF-GCTF).
First, the TDOA observations of the speech signals
with non-Gaussian noise received by each microphone
pair are estimated from the GCTF according to Eqs.
(7)-(9). Taking the reverberation into account, multiple
TDOA candidates are selected as the observation vector
at each node, and the multiple-hypothesis likelihood
model is applied to them to calculate the local
likelihood function. Next, the particles are predicted
according to the dynamic model in Eq. (10), and the
global likelihood functions, i.e., the weights
corresponding to the particles, are computed from Eq.
(13). Finally, a global position estimate of the
speech source is obtained from the weighted particles
via Eq. (4).
4 SIMULATIONS AND DISCUSSIONS
4.1 Simulation Setup
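To evaluate the performance of a speech source
tracking method, the root mean square error
(RMSE) is defined as
$$\mathrm{RMSE} = \sqrt{\frac{1}{M} \sum_{m=1}^{M} \big\| \hat{\mathbf{l}}_{\mathbf{x}_k} - \mathbf{l}_{\mathbf{x}_k} \big\|^2} \tag{14}$$
where $\hat{\mathbf{l}}_{\mathbf{x}_k}$ and $\mathbf{l}_{\mathbf{x}_k}$ represent the position estimate
and the ground-truth position at time k, respectively, and M
denotes the number of Monte Carlo simulations.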
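For one time step, Eq. (14) amounts to the following computation. The sketch assumes that the estimates from the M Monte Carlo runs are available as a list of (x, y) tuples; the function name `rmse` and the example numbers are hypothetical.

```python
import math

def rmse(estimates, truth):
    """RMSE over Monte Carlo runs for one time step, following Eq. (14).

    estimates: one (x, y) position estimate per run; truth: the true (x, y).
    """
    m = len(estimates)
    squared_errors = sum(math.dist(e, truth) ** 2 for e in estimates)
    return math.sqrt(squared_errors / m)

# Two runs with errors of 5 m and 0 m give sqrt((25 + 0) / 2).
print(rmse([(3.0, 4.0), (0.0, 0.0)], (0.0, 0.0)))
```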
In the SαS noise environment, the generalized
signal-to-noise ratio (GSNR) is used to describe the
different non-Gaussian environments
$$\mathrm{GSNR} = 10 \log_{10}\!\left(\frac{\sigma_s^2}{\gamma}\right) \ \text{(dB)} \tag{15}$$
where $\sigma_s^2$ is the signal variance and $\gamma$ is the
dispersion parameter of the SαS noise.
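To generate noise at a target GSNR, Eq. (15) can be inverted for the dispersion and SαS samples can then be drawn. The sampler below uses the Chambers-Mallows-Stuck construction for the symmetric case, which is a standard technique and not something the paper specifies; the function names and the unit signal variance in the usage lines are assumptions.

```python
import math
import random

def dispersion_from_gsnr(signal_variance, gsnr_db):
    """Invert Eq. (15): gamma = sigma_s^2 / 10^(GSNR / 10)."""
    return signal_variance / (10.0 ** (gsnr_db / 10.0))

def sample_sas(alpha, gamma, rng):
    """Draw one SaS sample whose characteristic function is exp(-gamma * |t|**alpha)."""
    u = rng.uniform(-math.pi / 2.0, math.pi / 2.0)
    w = rng.expovariate(1.0)
    scale = gamma ** (1.0 / alpha)
    if alpha == 1.0:
        return scale * math.tan(u)  # Cauchy case
    x = (math.sin(alpha * u) / math.cos(u) ** (1.0 / alpha)
         * (math.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))
    return scale * x

rng = random.Random(0)
gamma = dispersion_from_gsnr(signal_variance=1.0, gsnr_db=6.0)
noise = [sample_sas(0.8, gamma, rng) for _ in range(1000)]
print(gamma, max(abs(v) for v in noise))
```

For α = 2 the samples are Gaussian with variance 2γ, which gives a convenient sanity check; for smaller α occasional very large values appear, matching the impulsive behaviour described above.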
In the simulation experiments, a female speech
source with a length of about 4 s and a sampling
frequency of 16 kHz moves along a semicircular
trajectory in a room of size 5 m × 5 m × 3 m, and a
microphone network is constructed in advance with J = 12
pairs of omnidirectional microphones, as shown in Fig. 1.
The heights of the microphones and the speech source are
set to 1.5 m. The spacing between the two microphones
in each node is 0.6 m. The speech signal is split into
120 frames and each frame is 32 ms long. The
signal received by each microphone is simulated by the
image method (E. A. Lehmann, A. M. Johansson, et
al., 2007), given the size of the room, the reverberation
time T60 and the microphone coordinates. Impulsive
noise is then added to each microphone signal to
generate signals with different GSNRs and
reverberation times.
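The frame and lag sizes implied by this setup follow from simple arithmetic. The sketch below assumes non-overlapping frames and a speed of sound of 343 m/s, neither of which is stated in the paper; the variable names are hypothetical.

```python
fs = 16000          # sampling frequency in Hz
frame_ms = 32       # frame length in ms
n_frames = 120      # number of frames
spacing = 0.6       # microphone spacing in m
c = 343.0           # assumed speed of sound in m/s

frame_len = fs * frame_ms // 1000       # samples per frame
duration = n_frames * frame_ms / 1000   # seconds covered by non-overlapping frames
tau_max = spacing / c                   # largest physically possible TDOA in s
max_lag = int(tau_max * fs)             # the same bound in whole samples

print(frame_len, duration, tau_max, max_lag)
```

Under these assumptions each frame holds 512 samples, the 120 frames cover 3.84 s (consistent with a signal of about 4 s), and the TDOA search in Eq. (9) spans at most 27 samples on either side of zero.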
The simulation parameters are set as follows. For
the Langevin model, $\beta = 10\ \mathrm{s}^{-1}$ and
$\bar{v} = 1\ \mathrm{m\,s}^{-1}$; for the
PF, the number of particles is N = 500 and the initial
state of the speech source is chosen
randomly; for the TDOAs, the number of TDOA
candidates is $N_m = 4$; for the GCTF, the kernel size
in Eq. (8) is set to 0.5; for the multiple-hypothesis
model, $q_0 = 0.25$ and the observation standard
deviation is $5 \times 10^{-5}$.
Figure 1. Speech source trajectory and layout of the 12
microphone pairs in X-Y plane.
4.2 Result Discussions
To evaluate the proposed method (PF-GCTF),
comparative experiments are conducted with existing speech
source tracking methods, i.e., the PF
(D. B. Ward, E. A. Lehmann and R. C. Williamson,
2003) and the DPF-EKF (X. Zhong, A. Mohammadi,
et al, 2013). These methods are evaluated by the
RMSE in Eq. (14), and the tracking results
are averaged over 50 Monte Carlo simulations based
on the same speech signal and simulation setup.
4.2.1 Speech Source Tracking Results with Different Reverberation Time T60
Table 1 shows the RMSE results of all methods
for reverberation times T60 from 100 ms
to 300 ms, with GSNR = 6 dB and α = 0.8. It can be
observed that the RMSE values of all methods
generally become larger as the reverberation gets heavier.
The DPF-EKF can hardly track the
moving speech source under any of the reverberation
conditions, with RMSE values above 1.4 m,
and the PF method achieves good tracking performance
only when T60 < 200 ms. It can be seen from Table
1 that the PF-GCTF method tracks better than the PF
and the DPF-EKF, with smaller RMSE values at every
reverberation time. This illustrates that the proposed
method is robust to environmental reverberation.
Table 1. Average RMSE results versus different reverberation times T60.
T60 (ms)    PF-GCTF (m)    PF (m)    DPF-EKF (m)
100         0.1062         0.1614    1.405
150         0.0921         0.1792    1.4544
200         0.1101         0.3987    1.7067
250         0.3385         0.63      1.8741
300         0.3974         0.9043    1.9738
4.2.2 Speech Source Tracking Results with Different GSNR
Fig. 2 illustrates the RMSE results of all methods
for GSNRs from -4 dB to 8 dB, with the
reverberation time T60 = 100 ms and α = 0.8. It can
be seen from Fig. 2 that the RMSE values of all
methods become smaller as the GSNR rises.
The tracking performance of the
DPF-EKF is the worst, with the largest RMSE values, and
the PF method achieves good tracking accuracy only
when GSNR > 6 dB. However, the proposed method
successfully tracks the moving speech source
under all GSNR conditions with smaller
RMSE values. This implies that the PF-GCTF is a valid
speech source tracking method for non-Gaussian
background noise at different GSNRs.
4.2.3 Speech Source Tracking Results with Different Characteristic Exponents α
The RMSE results of all methods for
characteristic exponents from 0.6 to 1.6 are
given in Table 2, with T60 = 100 ms and
GSNR = 0 dB. The PF and the DPF-EKF
achieve good tracking accuracy only when α > 1.
In contrast, the RMSE values of the proposed
method in Table 2 stay below 0.16 m over the whole
range 0.6 ≤ α ≤ 1.6, which implies that the PF-GCTF
is an effective speech source tracking method under
non-Gaussian noise environments.
Figure 2. Average RMSE results versus different GSNRs.
Table 2. Average RMSE results versus different characteristic exponents α.
α      PF-GCTF (m)    PF (m)    DPF-EKF (m)
0.6    0.1565         2.2503    2.1875
0.8    0.1301         1.4484    1.9428
1      0.0967         0.1552    1.3202
1.2    0.0892         0.0853    0.2833
1.4    0.0846         0.0672    0.1517
1.6    0.09           0.0641    0.1302
5 CONCLUSIONS
In this paper, a tracking method based on the PF and
the GCTF is proposed to estimate the positions of a
moving speech source in non-Gaussian noise and
reverberant environments. Since the generalized
correntropy function is employed to estimate the
TDOAs, the proposed PF-based method can track
a moving speech source successfully in
non-Gaussian noise environments. Simulation results
illustrate that the PF-GCTF outperforms the other
compared methods and is robust against
non-Gaussian background noise and room
reverberation.
ACKNOWLEDGMENTS
This work was supported by National Science
Foundation for Young Scientists of China (Grant
No.61801308).
REFERENCES
B. Kapralos, M. R. M. Jenkin and M. Evangelos,
Audiovisual localization of multiple speakers in a
video teleconferencing setting, Int. J. Imaging Syst.
Technology, pp. 13 (1): 95-105 (2003).
C. Knapp, and G. C. Carter, The generalized correlation
method for estimation of time delay, IEEE Trans.
Acoust., Speech, Signal Process., pp. 24(4): 320-327
(1976).
D. B. Ward, E. A. Lehmann and R. C. Williamson,
Particle filtering algorithms for tracking an acoustic
source in a reverberant environment, IEEE Trans.
Speech and Audio Process., pp. 11 (6):826-836 (2003).
E. T. Roig, F. Jacobsen and E. F. Grande, Beamforming
with a circular microphone array for localization of
environmental noise sources, J. Acoust. Soc. Am., pp.
128(6):3535-3542 (2010).
E. A. Lehmann, A. M. Johansson, et al., Reverberation-
time prediction method for room impulse responses
simulated with the image-source model, in: IEEE
Workshop on Applications of Signal Processing to
Audio and Acoustics, pp. 159-162 (2007).
F. Talantzis, An acoustic source localization and tracking
framework using particle filtering and information
theory, IEEE Trans. Audio Speech Lang. Process., pp.
18 (7):1806-1817 (2010).
K. Nakadai, et al., Robust tracking of multiple sound
sources by spatial integration of room and robot
microphone arrays, in: IEEE Int. Conf. Acoust.,
Speech, Signal Process., pp. IV- 929- IV-932 (2006).
M. F. Fallon, and S. J. Godsill, Acoustic source
localization and tracking of a time-varying number of
speakers, IEEE Trans. Audio Speech Lang. Process.,
pp. 20(4):1409-1415 (2012).
M. Shao and C. L. Nikias, Signal processing with
fractional lower order moments: Stable processes and
their applications, Proceedings of the IEEE, pp. 81 (7):
986-1010 (1993).
Q. Zhang, Z. Chen, and F. Yin, Distributed marginalized
auxiliary particle filter for speaker tracking in
distributed microphone networks, IEEE Trans. Audio
Speech Lang. Process., pp. 24(11): 1921-1934 (2016).
T.P. Spexard, M. Hanheide, and G. Sagerer, Human-
oriented interaction with an anthropomorphic robot,
IEEE Trans. Robotics, pp.23 (5):852- 862 (2007).
W. Liu, P.P. Pokharel, et al., Correntropy: properties and
applications in non-Gaussian signal processing, IEEE
Trans. Signal Process., pp. 55 (11): 5286-5298 (2007).
X. Zhong, and J. R. Hopgood, Particle filtering for TDOA
based acoustic source tracking: Non-concurrent
multiple talkers, Signal Process., pp. 96(5):382-394
(2014).
X. Zhong, A. Mohammadi, et al., Acoustic source tracking
in a reverberant environment using a pairwise
synchronous microphone network, in: 16th Int. Conf.
Information Fusion, pp. 953-960 (2013).