
 
symmetric. Over the last several years, various 
measures to symmetrize the KL divergence have 
been introduced in the literature. Among these 
measures, we choose simply to sum the two directed divergences to define the KL distance:
d_{KL}(f, g) = D_{KL}(f \,\|\, g) + D_{KL}(g \,\|\, f)    (4)
Although Jeffreys (Jeffreys, 1946) did not develop Eq. (4) in order to symmetrize the KL divergence, the so-called J-divergence equals the sum of the two possible KL divergences between a pair of probability distributions. Because using a full covariance matrix causes the number of parameters to grow with the square of the feature dimensionality, a diagonal covariance matrix is generally adopted, in which the off-diagonal elements are set to zero. In this case, the dimensions of the Gaussian distributions are independent and uncorrelated, so Eq. (4) can be written in the following closed form:
d_{KL}(f, g) = \frac{1}{2} \sum_{d=1}^{D} \left[ \frac{\sigma_{f,d}^2}{\sigma_{g,d}^2} + \frac{\sigma_{g,d}^2}{\sigma_{f,d}^2} + \left(\mu_{f,d} - \mu_{g,d}\right)^2 \left( \frac{1}{\sigma_{f,d}^2} + \frac{1}{\sigma_{g,d}^2} \right) - 2 \right]    (5)
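As a concrete illustration, Eq. (5) can be evaluated directly once the per-dimension means and variances of the two Gaussians are available. The NumPy sketch below is our own helper (the function name and array layout are illustrative, not part of the cited work):

```python
import numpy as np

def symmetric_kl_diag_gauss(mu_f, var_f, mu_g, var_g):
    """Symmetric KL distance of Eq. (5) between two diagonal-covariance
    Gaussians, given per-dimension means and variances as 1-D arrays."""
    mu_f, var_f = np.asarray(mu_f, float), np.asarray(var_f, float)
    mu_g, var_g = np.asarray(mu_g, float), np.asarray(var_g, float)
    diff_sq = (mu_f - mu_g) ** 2
    per_dim = (var_f / var_g + var_g / var_f
               + diff_sq * (1.0 / var_f + 1.0 / var_g) - 2.0)
    # Eq. (5): sum the per-dimension terms and halve.
    return 0.5 * per_dim.sum()
```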
3.2  Approximation by the Nearest Pair 
In speech recognition, the KL distance must be calculated for GMMs. However, it is not easy to determine the KL distance between two GMMs analytically: for GMMs, the KL distance has no closed-form expression such as the one shown in Eq. (5). For this reason, approximation methods have been introduced for GMMs. The simple method adopted here is to use the nearest pair of mixture components (Hershey and Olsen, 2007):
d_{KL}(f, g) \approx \min_{i,j} \, d_{KL}(f_i, g_j)    (6)
where i and j index the components of the two mixture models. As shown in Eqs. (5) and (6), the mixture weights are not considered at this stage, so this closed-form approximation is still based on single Gaussian distributions. In our experiments, the average (d_KL2ave) and the maximum (d_KL2max) over the component pairs are also evaluated.
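As an illustration of Eq. (6), the sketch below (again our own, reusing the symmetric_kl_diag_gauss helper above and ignoring the mixture weights, as stated in the text) computes the nearest-pair distance together with the d_KL2ave and d_KL2max variants:

```python
import numpy as np

def gmm_kl_distance(means_f, vars_f, means_g, vars_g, mode="min"):
    """Approximate KL distance between two diagonal-covariance GMMs from the
    pairwise component distances (Eq. 6).  means_* / vars_* have shape
    (num_components, dim); mixture weights are not used at this stage."""
    pairwise = np.array([[symmetric_kl_diag_gauss(mf, vf, mg, vg)
                          for mg, vg in zip(means_g, vars_g)]
                         for mf, vf in zip(means_f, vars_f)])
    if mode == "min":          # nearest pair, Eq. (6)
        return pairwise.min()
    if mode == "ave":          # d_KL2ave
        return pairwise.mean()
    return pairwise.max()      # d_KL2max
```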
3.3  Approximation by Monte Carlo Method
In addition to the approximation based on the closed-form expression, the KL distance can be approximated from pseudo-samples using the Monte Carlo method. Monte Carlo simulation is the most suitable method for estimating the KL distance between high-dimensional GMMs. The expectation of a function over a mixture distribution, f(x) = \sum_m \pi_m \mathcal{N}(x; \mu_m, \sigma_m^2), can be approximated by drawing samples from f(x) and averaging the values of the function at those samples. In this case, by drawing samples x_1, \ldots, x_N \sim f(x), we obtain the following approximation (Bishop, 2006):
D_{MC}(f \,\|\, g) \equiv \frac{1}{N} \sum_{n=1}^{N} \log \frac{f(x_n)}{g(x_n)}    (7)
In this approximation, D_MC(f||g) in Eq. (7) converges to D(f||g) as N → ∞. To draw x from the GMM f(x), the number of samples drawn from each component is first determined on the basis of its prior probability π_m, and then samples are generated from each single Gaussian distribution.
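The Monte Carlo estimate of Eq. (7) can be sketched as follows. The code below is a minimal illustration under our own naming and parameterization (weights, means, and variances of a diagonal-covariance GMM as arrays): it draws per-component sample counts from the priors π_m, samples each single Gaussian, and averages log f(x_n) − log g(x_n):

```python
import numpy as np

def gmm_logpdf(x, weights, means, variances):
    """Log-density of a diagonal-covariance GMM at samples x of shape (N, D)."""
    x = np.atleast_2d(x)
    log_comp = [np.log(w)
                - 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var, axis=1)
                for w, mu, var in zip(weights, means, variances)]
    return np.logaddexp.reduce(np.stack(log_comp), axis=0)

def gmm_sample(rng, n, weights, means, variances):
    """Draw n samples: choose per-component counts from the priors,
    then sample each single Gaussian distribution."""
    counts = rng.multinomial(n, weights)
    parts = [rng.normal(mu, np.sqrt(var), size=(c, len(mu)))
             for c, mu, var in zip(counts, means, variances) if c > 0]
    return np.vstack(parts)

def kl_monte_carlo(rng, n, f_params, g_params):
    """Monte Carlo estimate D_MC(f||g) of Eq. (7)."""
    x = gmm_sample(rng, n, *f_params)
    return np.mean(gmm_logpdf(x, *f_params) - gmm_logpdf(x, *g_params))
```

For example, with f_params = (weights, means, variances) for each GMM, kl_monte_carlo(np.random.default_rng(0), 10_000, f_params, g_params) corresponds to a 10K-sample estimate as described above.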
3.4  Approximation by Gibbs Sampler 
Furthermore, for sampling from multivariate probability distributions, the Markov chain Monte Carlo (MCMC) method has been widely applied to simulate the desired distribution. In Gibbs sampling, each sample is drawn so that it depends only on the previously drawn variable. The conditional distribution of the current variable x_f given the previous variable x_g is the following normal distribution:
p(x_f \mid x_g) = \mathcal{N}\!\left(x_f;\ \mu_f + \rho \frac{\sigma_f}{\sigma_g}(x_g - \mu_g),\ (1 - \rho^2)\,\sigma_f^2\right)    (8)
where ρ is the correlation coefficient. Herein, the full covariance matrix cannot be estimated because of the insufficient training data in our experiments; therefore, we adopt the correlation coefficients estimated from the full training data. The first 10,000 (10K) samples of the chain, the so-called burn-in period, are removed. In our experiments, we generate samples of size 10K and 100K for the MC and MCMC methods. To obtain a symmetric measure, we calculate the arithmetic mean (AM), geometric mean (GM), and harmonic mean (HM) of the resulting KL divergences obtained with MC and MCMC sampling (Johnson and Sinanović, 2001). The maximum and minimum of the two divergences, D(f||g) and D(g||f), are also calculated for comparison.
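The exact dimension-conditioning scheme of the Gibbs sampler is only outlined above, so the following sketch is one plausible reading under our own assumptions: a single correlation coefficient ρ shared by all dimension pairs, each dimension drawn from the conditional of Eq. (8) given the previously drawn one, the burn-in samples discarded, and the two directed divergences combined by AM, GM, and HM. Function names and the initialisation are illustrative only:

```python
import numpy as np

def gibbs_chain(rng, n, mu, var, rho, burn_in=10_000):
    """Gibbs-style chain for one Gaussian: each dimension is drawn from the
    conditional of Eq. (8) given the previously drawn dimension, with a single
    correlation coefficient rho; the burn-in period is removed."""
    sigma = np.sqrt(var)
    dim = len(mu)
    samples = np.empty((n + burn_in, dim))
    x_prev = rng.normal(mu[-1], sigma[-1])          # arbitrary initialisation
    p = dim - 1                                     # index of the previous draw
    for t in range(n + burn_in):
        x = np.empty(dim)
        for d in range(dim):
            cond_mean = mu[d] + rho * sigma[d] / sigma[p] * (x_prev - mu[p])
            cond_var = (1.0 - rho ** 2) * var[d]
            x[d] = rng.normal(cond_mean, np.sqrt(cond_var))
            x_prev, p = x[d], d
        samples[t] = x
    return samples[burn_in:]                        # drop the burn-in period

def symmetrize(d_fg, d_gf):
    """AM, GM, and HM of the two directed divergences D(f||g) and D(g||f)."""
    return {"AM": 0.5 * (d_fg + d_gf),
            "GM": float(np.sqrt(d_fg * d_gf)),
            "HM": 2.0 * d_fg * d_gf / (d_fg + d_gf)}
```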
3.5  Bhattacharyya Distance and Others 
The Bhattacharyya distance, which is another  