Assessing the Number of Clusters in a Mixture Model with
Side-information
Edith Grall-Maës and Duc Tung Dao
ICD - LM2S - UMR 6281 CNRS - Troyes University of Technology, Troyes, France
Keywords:
Clustering, Model Selection, Mixture Model, Side-information, Criteria.
Abstract:
This paper deals with the selection of the number of clusters in a clustering problem, taking into account the side-
information that the points of a chunklet arise from the same cluster. An Expectation-Maximization algorithm
is used to estimate the parameters of a mixture model and to determine the data partition. To select the number of
clusters, usual criteria are not suitable because they do not consider the side-information in the data. Thus we
propose suitable criteria, which are modified versions of three usual criteria: the Bayesian information criterion
(BIC), the Akaike information criterion (AIC), and the entropy criterion (NEC). The proposed criteria are
used to select the number of clusters in the case of two simulated problems and one real problem. Their
performances are compared and the influence of the chunklet size is discussed.
1 INTRODUCTION
Clustering is used in many fields with an increasing
interest. It aims to determine a partition rule of data
such that observations in the same cluster are similar
to each other. The estimation of mixture models has
been proposed for quite some time as an approach for
clustering. It assumes that data are from a mixture
of clusters in some proportions and that the probabil-
ity density function is a weighted sum of parameter-
ized probability density functions. When the number
of clusters is known, the problem consists in deter-
mining the parameters of the density functions and
the proportions of each cluster. However the number
of clusters is generally unknown and it has to be as-
sessed.
In this paper, we consider the problem of data
with side-information, which gives the constraint that
some data originate from the same source. In partic-
ular, when different measures are realizations of the
same random variable, these points belong to the same
chunklet. This is the case when some spatiotemporal
measures are available and it is known, for example,
that the random variable does not depend on time;
this means that all the measures originating from the
same position in space belong to the same cluster. An
example is the temperature in a given month in different
towns and in different years: the temperature is a random
variable, and the values for a same town over all the
years make a chunklet. The clustering problem consists
in grouping similar towns, considering the values of
the different years as a chunklet. It has to be noticed
that the number of samples in a chunklet is not fixed.
Another application is provided by time series: in a
series, all points arise from the same system, so they
have to belong to the same cluster.
Mixture models using side-information have
already been studied. For a given number of clus-
ters, an algorithm for determining the parameters of
the probability density functions and the proportions
has been introduced in (Shental et al., 2003). A modi-
fied version has been proposed in (Grall-Maës, 2014)
for the case where partitioning the data is the main
concern. However, as in classical clustering problems,
the number of clusters is generally unknown.
This paper addresses the problem of assessing the
number of clusters in a mixture model for data with
the constraint that some points arise from the same
source. To select the number of clusters, usual crite-
ria are not suitable because they do not consider the
side-information in the data. Thus we propose suit-
able criteria which are based on usual criteria.
This paper is organized as follows. Section 2 de-
scribes the method for determining jointly the param-
eters of the mixture model and the cluster labels, with
the constraint that some points arise from the same
source, in the case of a known number of clusters.
At the same time it introduces notations. In section
3, three criteria based on usual criteria are proposed.
The Bayesian information criterion (BIC), the entropy
criterion (NEC), and the Akaike information criterion
(AIC) are modified in order to be adapted to the problem
of clustering with constraints. The results using
two examples on simulated data and one example on
real data are reported in section 4. The criteria are
compared and the influence of the chunklet size is dis-
cussed. We conclude the paper in section 5.
2 CLUSTERING WITH
SIDE-INFORMATION
The data we consider is a set of N observations
$X = \{s_n\}_{n=1..N}$. Each observation $s_n$ is assumed to be a
chunklet, which is a set of $|s_n|$ independent points that
originate from the same source: $s_n = \{x_i^n\}_{i=1..|s_n|}$.
The observation set is assumed to be a sample
composed of K sub-populations which are all mod-
els of the same family. Each model corresponds to a
statistical law parameterized by $\theta_k$. The latent clus-
ter labels, directly related to the parameters $\theta_k$, are de-
scribed by $Z = \{z_n\}_{n=1..N}$, where $z_n = k$ means that
the $n$-th realization originates from the $k$-th cluster. Due
to the side-information, this means that all points $x_i^n$ of
chunklet n are within the $k$-th cluster.
Then, in the case of data with side-information, the
observation set and the cluster label set are respec-
tively given by:

$$X = \{s_n\}_{n=1..N} \quad \text{with} \quad s_n = \{x_i^n\}_{i=1..|s_n|} \qquad (1)$$

and

$$Z = \{z_n\}_{n=1..N}. \qquad (2)$$
In order to compare this problem to an equivalent
case without side-information, we define the obser-
vation set $X^0$, which is composed of $N^0$ points with
$N^0 = \sum_{n=1}^{N} |s_n|$, and the cluster label set $Z^0$, respectively
by:

$$X^0 = \{x_i^n\}_{i=1..|s_n|,\, n=1..N} \qquad (3)$$

and

$$Z^0 = \{z_i^n\}_{i=1..|s_n|,\, n=1..N}. \qquad (4)$$
The mixture model approach to clustering
(McLachlan and Basford, 1988) assumes that data
are from a mixture of a number K of clusters in
some proportions. The model is parameterized by
$\boldsymbol{\theta}_K = \{\theta_k, \alpha_k\}_{k=1..K}$, where $\alpha_k$ is the probability that
a sample belongs to class k, $\alpha_k = P(Z = k)$, and $\theta_k$ is
the model parameter value for the class k. Then the
density function of a sample s given $\boldsymbol{\theta}_K$ writes as:

$$f(s|\boldsymbol{\theta}_K) = \sum_{k=1}^{K} \alpha_k f_k(s|\theta_k)$$

where $f_k(s|\theta_k)$ is the density function of the compo-
nent k.
The maximum likelihood approach to the mixture
problem for a data set X and a given value K consists
of determining the $\boldsymbol{\theta}_K$ that maximizes the log-likelihood.
The log-likelihood is given by:

$$L_X(\boldsymbol{\theta}_K) = \sum_{n=1}^{N} \log f(s_n|\boldsymbol{\theta}_K).$$
Due to the constraint given by the side-
information, and the independence of points within
a chunklet, we get

$$f(s_n|\boldsymbol{\theta}_K) = \sum_{k=1}^{K} \alpha_k f_k(s_n|\theta_k) = \sum_{k=1}^{K} \alpha_k \prod_{i=1}^{|s_n|} f_k(x_i^n|\theta_k). \qquad (5)$$

Then

$$L_X(\boldsymbol{\theta}_K) = \sum_{n=1}^{N} \log\left( \sum_{k=1}^{K} \alpha_k \prod_{i=1}^{|s_n|} f_k(x_i^n|\theta_k) \right). \qquad (6)$$
The log-likelihood in the equivalent case without
side-information, for a data set $X^0$ and a parameter set
$\boldsymbol{\theta}_K^0 = \{\theta_k^0, \alpha_k^0\}_{k=1..K}$, is:

$$L_{X^0}(\boldsymbol{\theta}_K^0) = \sum_{n=1}^{N} \sum_{i=1}^{|s_n|} \log f(x_i^n|\boldsymbol{\theta}_K^0) = \sum_{n=1}^{N} \sum_{i=1}^{|s_n|} \log \sum_{k=1}^{K} \alpha_k^0 f_k(x_i^n|\theta_k^0). \qquad (7)$$
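To make relations (5)-(7) concrete, here is a minimal sketch (not the authors' code; the helper names `chunklet_loglik` and `pooled_loglik` are ours) that evaluates both log-likelihoods for a Gaussian mixture, assuming the chunklets are given as a list of $|s_n| \times d$ NumPy arrays:

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def chunklet_loglik(chunklets, alphas, means, covs):
    """Log-likelihood (6): all points of a chunklet share one component."""
    total = 0.0
    for s in chunklets:  # s has shape (|s_n|, d)
        # log(alpha_k) + sum_i log f_k(x_i^n | theta_k) for each component k
        log_terms = [np.log(a) + multivariate_normal.logpdf(s, m, c).sum()
                     for a, m, c in zip(alphas, means, covs)]
        total += logsumexp(log_terms)  # stable log of the sum over k
    return total

def pooled_loglik(chunklets, alphas, means, covs):
    """Log-likelihood (7): the side-information is ignored."""
    X0 = np.vstack(chunklets)  # pool the N^0 points
    log_pk = np.stack([np.log(a) + multivariate_normal.logpdf(X0, m, c)
                       for a, m, c in zip(alphas, means, covs)])
    return logsumexp(log_pk, axis=0).sum()
```

In (6) the product over the points of a chunklet sits inside the logarithm of the sum over components, which is why the computation is done chunklet by chunklet rather than point by point.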
A common approach for optimizing the pa-
rameters of mixture models is the expectation-
maximization (EM) algorithm (Celeux and Govaert,
1992). This is an iterative method that produces
a set of parameters that locally maximizes the log-
likelihood of a given sample, starting from an arbi-
trary set of parameters.
A modified EM algorithm has been proposed in
(Grall-Maës, 2014) for taking into account the side-
information, as in (Shental et al., 2003), and with the
aim of getting a hard partition, as in (Celeux and Go-
vaert, 1995). It repeats an estimation step (E step), a
classification step, and a maximization step (M step).
The E step at iteration m requires computing the pos-
terior probability $c_{nk}^{(m)}$ that the $n$-th chunklet originates
from the $k$-th cluster. It is given by:
$$c_{nk}^{(m)} = p(Z_n = k \,|\, s_n, \boldsymbol{\theta}_K^{(m-1)}) = \frac{\alpha_k^{(m-1)} \prod_{i=1}^{|s_n|} f_k(x_i^n|\theta_k^{(m-1)})}{\sum_{r=1}^{K} \alpha_r^{(m-1)} \prod_{i=1}^{|s_n|} f_r(x_i^n|\theta_r^{(m-1)})} \qquad (8)$$
The M step consists in estimating the parameters
that maximize the expected log-likelihood computed
in the E step.
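A minimal sketch of this E step, under the same Gaussian assumption and naming conventions as the sketch above (again ours, not the authors' implementation):

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def e_step(chunklets, alphas, means, covs):
    """Posterior c_nk of relation (8): chunklet n versus cluster k."""
    K = len(alphas)
    C = np.empty((len(chunklets), K))
    for n, s in enumerate(chunklets):
        # numerator of (8) in the log domain, for each component k
        log_num = np.array([np.log(alphas[k])
                            + multivariate_normal.logpdf(s, means[k], covs[k]).sum()
                            for k in range(K)])
        C[n] = np.exp(log_num - logsumexp(log_num))  # normalize over k
    return C  # each row sums to 1
```

The classification and M steps of the modified algorithm then operate on these chunklet-level posteriors.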
Let $L^*(K)$ denote the maximized log-likelihood
for a given number K:

$$L^*(K) = \max_{\boldsymbol{\theta}_K} L_X(\boldsymbol{\theta}_K) = L_X(\boldsymbol{\theta}_K^*) \qquad (9)$$
where $\boldsymbol{\theta}_K^*$ is the optimal parameter set. One can com-
pare $L^*(K)$ with the usual maximized log-likelihood
without side-information, $L^{0*}(K)$, which is defined
similarly for the set $X^0$.
The model complexity increases with K, and con-
sequently the maximized log-likelihood $L^{0*}(K)$ is
generally an increasing function of K. Hence, in the
classical case (without side-information), the maxi-
mized log-likelihood cannot be used as a selection
criterion for choosing the number K. The same holds
with side-information: $L^*(K)$ cannot be used as a
selection criterion for choosing the number K.
3 CRITERIA
In order to choose the number of clusters K, a crite-
rion for measuring the model’s suitability which bal-
ances the model fit and the model complexity has to
be used. Various criteria have been previously pro-
posed for data without side-information. In this paper
we modify three criteria to adapt them to data with
side-information.
Generally a criterion allows one to select a model from
a set. The complexity of a model depends on the
parameter dimension r of the model. For instance,
for a d-dimensional K-component Gaussian mixture,
$r = K - 1 + dK + K\frac{d(d+1)}{2}$.
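As a quick sanity check of this dimension formula, a hypothetical helper (for K = 3 components in dimension d = 2 it returns 17):

```python
def gaussian_mixture_dim(K, d):
    # K-1 free proportions, K*d mean coordinates,
    # and K symmetric d x d covariance matrices
    return (K - 1) + d * K + K * d * (d + 1) // 2
```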
Let $\{M_m\}_{m=1,...,M}$ denote the set of candidate
models, where $M_m$ corresponds to a model of dimen-
sion $r_m$ parameterized by $\phi_m$ in the space $\Phi_m$.
3.1 Criterion BIC
One of the most widely used information criteria is BIC
(Schwarz, 1978), which is a likelihood criterion penal-
ized by the number of parameters in the model. The
idea of BIC is to select a model from a set of candidate
models by maximizing the posterior probability:

$$P(M_m|X) = \frac{P(X|M_m)\, P(M_m)}{P(X)}$$
Under the hypothesis that $P(M_1) = \dots = P(M_M)$,
the maximization of $P(M_m|X)$ is equiva-
lent to the maximization of $P(X|M_m)$. It can be ob-
tained from the integration of the joint distribution:

$$P(X|M_m) = \int_{\Phi_m} P(X, \phi_m|M_m)\, d\phi_m = \int_{\Phi_m} P(X|\phi_m, M_m)\, P(\phi_m|M_m)\, d\phi_m$$
The exact calculation of this integral can be ap-
proached by using the Laplace approximation (Lebar-
bier and Mary-Huard, 2006). The maximization
of $P(X|M_m)$ is equivalent to the maximization of
$\log P(X|M_m)$. Neglecting error terms, it is shown that

$$\log P(X|M_m) \approx \log P(X|\hat{\phi}_m, M_m) - \frac{r_m}{2} \log(N)$$
When the model depends directly on the number
of clusters, the criterion BIC for the set $X^0$ defined by
relation (3) is given by:

$$BIC(K) = -2\, L^{0*}(K) + r \log(N^0)$$

This is the value of the criterion in the case of no side-
information, for a model with r free parameters.
In order to take the side-information into account,
the criterion has to be adapted. For the set X given
by relation (1), the number of observations is N and the
maximum log-likelihood $L^*(K)$, which takes into ac-
count the positive constraints, is computed differently.
Then the BIC criterion is given by:

$$BIC(K) = -2\, L^*(K) + r \log(N) \qquad (10)$$

The criterion thus does not depend directly on the
total number of points $\sum_{n=1}^{N} |s_n|$; it depends on the
number of chunklets N.
3.2 Criterion AIC
The criterion AIC proposed in (Akaike, 1974) is an-
other widely used information criterion for selecting a
model from a set. The chosen model is the one that
minimizes the Kullback-Leibler distance between the
model and the truth $M_0$:
$$d_{KL}(M_0, M_i) = \int_{-\infty}^{+\infty} P(X|M_0) \log P(X|M_0)\, dX - \int_{-\infty}^{+\infty} P(X|M_0) \log P(X|M_i)\, dX$$

It is equivalent to select the model giving the max-
imized value of

$$\int_{-\infty}^{+\infty} P(X|M_0) \log P(X|M_i)\, dX$$
The criterion AIC without side-information takes
the form:

$$AIC(K) = -2\, L^{0*}(K) + 2r$$
For taking into account that some points arise
from the same source, we propose to replace $L^{0*}(K)$
by $L^*(K)$. Then the modified AIC is given by:

$$AIC(K) = -2\, L^*(K) + 2r = -2 \sum_{n=1}^{N} \log\left( \sum_{k=1}^{K} \alpha_k \prod_{i=1}^{|s_n|} f_k(x_i^n|\theta_k) \right) + 2r \qquad (11)$$
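Once the constrained EM has returned $L^*(K)$, both modified criteria reduce to one-line computations; a sketch with illustrative names (`L_star` stands for the value of relation (9)):

```python
import math

def bic(L_star, r, N):
    """Modified BIC, relation (10); N is the number of chunklets."""
    return -2.0 * L_star + r * math.log(N)

def aic(L_star, r):
    """Modified AIC, relation (11)."""
    return -2.0 * L_star + 2.0 * r
```

In both cases the retained K is the one minimizing the criterion.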
3.3 Criterion NEC
The normalized entropy criterion (NEC) proposed in
(Celeux and Soromenho, 1996) is derived from a re-
lation underscoring the differences between the max-
imum likelihood approach and the classification max-
imum likelihood approach to the mixture problem.
In the case of a data set without side-information
$X^0$, the criterion is defined as:

$$NEC(K) = \frac{E^{0*}(K)}{L^{0*}(K) - L^{0*}(1)} \qquad (12)$$
where $E^{0*}(K)$ denotes the entropy term, which mea-
sures the overlap of the mixture components.
This criterion has to be minimized in order to assess
the number of clusters of the mixture. Because NEC(1)
leads to an indeterminate form, a procedure has been
proposed in (Biernacki et al., 1999) to retain the number
K. This procedure is equivalent to setting NEC(1) = 1
and retaining the value K leading to the minimal NEC
value.
Considering a data set with side-information X,
we need to modify the computation of the entropy and
log-likelihood terms. Since $\sum_{k=1}^{K} c_{nk} = 1$, we
can rewrite $L_X(\boldsymbol{\theta}_K)$ given by relation (6) as:

$$L_X(\boldsymbol{\theta}_K) = \sum_{n=1}^{N} \sum_{k=1}^{K} c_{nk} \log \sum_{r=1}^{K} \alpha_r \prod_{i=1}^{|s_n|} f_r(x_i^n|\theta_r).$$
Using the value of $c_{nk}$ adapted to data with side-
information,

$$c_{nk} = \frac{\alpha_k \prod_{i=1}^{|s_n|} f_k(x_i^n|\theta_k)}{\sum_{r=1}^{K} \alpha_r \prod_{i=1}^{|s_n|} f_r(x_i^n|\theta_r)},$$
we obtain

$$L_X(\boldsymbol{\theta}_K) = \sum_{k=1}^{K} \sum_{n=1}^{N} c_{nk} \log \frac{\alpha_k \prod_{i=1}^{|s_n|} f_k(x_i^n|\theta_k)}{c_{nk}}$$

which can be rewritten as

$$L_X(\boldsymbol{\theta}_K) = C_X(\boldsymbol{\theta}_K) + E_X(\boldsymbol{\theta}_K)$$
with

$$C_X(\boldsymbol{\theta}_K) = \sum_{k=1}^{K} \sum_{n=1}^{N} c_{nk} \log\left( \alpha_k \prod_{i=1}^{|s_n|} f_k(x_i^n|\theta_k) \right)$$
and

$$E_X(\boldsymbol{\theta}_K) = -\sum_{k=1}^{K} \sum_{n=1}^{N} c_{nk} \log c_{nk}.$$
Thus we propose to use the criterion given by:

$$NEC(K) = \frac{E^*(K)}{L^*(K) - L^*(1)} \qquad (13)$$

in which $E^*(K) = E_X(\boldsymbol{\theta}_K^*)$ and $L^*(K)$ is given by (9).
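The following sketch assembles the entropy term, the modified NEC (13), and the selection procedure with the convention NEC(1) = 1; `fit` is a placeholder assumed to return $(L^*(K), E^*(K))$ from the constrained EM, not an actual API:

```python
import numpy as np

def entropy_term(C):
    """E* = -sum_{n,k} c_nk log c_nk, with the convention 0 log 0 = 0."""
    C = np.asarray(C)
    return -np.sum(np.where(C > 0, C * np.log(np.where(C > 0, C, 1.0)), 0.0))

def nec(L_star_K, E_star_K, L_star_1):
    """Modified NEC, relation (13)."""
    return E_star_K / (L_star_K - L_star_1)

def select_K(fit, K_max):
    """Retain the K minimizing NEC, with NEC(1) set to 1 as in
    (Biernacki et al., 1999); fit(K) is a placeholder returning
    (L*(K), E*(K)) from the constrained EM."""
    L1, _ = fit(1)
    best_K, best_val = 1, 1.0
    for K in range(2, K_max + 1):
        LK, EK = fit(K)
        v = nec(LK, EK, L1)
        if v < best_val:
            best_K, best_val = K, v
    return best_K
```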
Figure 1: An example of a mixture of three Gaussian components.
4 RESULTS
The performances of the three criteria have been as-
sessed using two simulated problems, a Gaussian
mixture and a Gamma process, and one real problem
using climatic data.
4.1 Gaussian Mixture
We considered a mixture of three two-dimensional
Gaussian components. The observation data have
been generated with the following parameter values:

$m_1 = [0, 0]$, $m_2 = [2, 2]$, $m_3 = [-2, -2]$, $\Sigma_1 = \Sigma_2 = \Sigma_3 = I$,

N = 150 (50 chunklets per cluster) and $|s_n| = 5$ for all n.
Consequently the total number of points is equal to
750. An example of data is given on figure 1. Note
that the chunklets cannot be shown on this figure, so
only the points are visible. The number of clusters
was determined using each of the three criteria. This
experiment has been repeated 200 times in order to
estimate the percent frequency of choosing a
K-component mixture for each criterion.
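For illustration, a sketch of how such chunklet data can be generated (our own helper, not the authors' code; the third mean is reconstructed as [-2, -2] from the symmetric setup):

```python
import numpy as np

def make_chunklets(means, chunklets_per_cluster=50, chunklet_size=5, rng=None):
    """Draw chunklets: the |s_n| points of a chunklet share one component."""
    rng = rng or np.random.default_rng()
    chunklets, labels = [], []
    for k, m in enumerate(means):
        for _ in range(chunklets_per_cluster):
            chunklets.append(rng.multivariate_normal(m, np.eye(len(m)),
                                                     size=chunklet_size))
            labels.append(k)
    return chunklets, labels

# reference case: N = 150 chunklets of 5 points each, 750 points in total
X, Z = make_chunklets([[0, 0], [2, 2], [-2, -2]])
```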
Three other cases have been tested by changing the
number of chunklets and the parameter values of the
Gaussian components. First, the values of N and $|s_n|$
have been changed so as to keep the same total number
of points but with a larger number of points in each
chunklet: N = 15 (5 chunklets per cluster) and $|s_n| = 50$.
The side-information is thus stronger than in the
reference case. Then the values of $m_2$ and $m_3$ have
been modified to obtain less separated clusters:
$m_2 = [1, 1]$ and $m_3 = [-1, -1]$.
The results for the four cases and for each of the
criteria BIC, AIC and NEC are given in table 1.
Table 1: Percent frequencies of choosing K clusters.

Parameters                              K    BIC   AIC   NEC
m1 = [0,0], m2 = [2,2], m3 = [-2,-2]    1      0     0     0
N = 150, |s_n| = 5                      2      0     0    40
                                        3     97    79    59
                                        4      3    17     1
                                        5      0     4     0
m1 = [0,0], m2 = [2,2], m3 = [-2,-2]    1      0     0     0
N = 15, |s_n| = 50                      2      0     0    32
                                        3     99    95    68
                                        4      1     5     0
                                        5      0     0     0
m1 = [0,0], m2 = [1,1], m3 = [-1,-1]    1      0     0     0
N = 150, |s_n| = 5                      2      0     0    78
                                        3     96    78    17
                                        4      4    15     0
                                        5      0     7     5
m1 = [0,0], m2 = [1,1], m3 = [-1,-1]    1      0     0     0
N = 15, |s_n| = 50                      2      0     0    33
                                        3     97    94    67
                                        4      3     6     0
                                        5      0     0     0
The best criterion for assessing the number of clus-
ters for data with side-information is BIC: over
95% success is obtained in all cases. AIC has a
slight tendency to overestimate the number of clus-
ters, while NEC has a tendency to underestimate it. This
conclusion is in accordance with the results given in
(Fonseca and Cardoso, 2007), obtained with the
classical criteria for mixture-model clustering without
side-information, where it is also mentioned that BIC
performs very well for mixtures of normal distributions.
The NEC criterion shows the worst behavior, with a
success rate under 80%. In (Celeux and Soromenho,
1996), it is mentioned that this criterion is efficient
when some cluster structure exists. Since the cluster
structure is not obvious in this experiment, this crite-
rion performs poorly.
Comparing the four cases, the results are not sur-
prising: when the side-information increases, or
when the cluster overlap decreases, it is easier to
determine the right number of clusters.
4.2 Gamma Process
The Gamma process is widely used for the
modeling of monotonic and gradual deterioration
(Van Noortwijk, 2009). This process is defined by
parameters that describe the deterioration evolution
in time. The parameters depend on the properties or
operational conditions of the observed system. Their
values are usually estimated using data obtained on
the observed systems. In the case of data coming from
different systems, within an unknown finite number
of operational conditions, it is required to group sim-
ilar systems to estimate the parameters of the Gamma
process model. Then it is necessary to determine the
number of components (i.e. process models), the pa-
rameters of each component, and the component used
to model each system.

Figure 2: An example of simulated data with three Gamma
processes, 10 realizations for each process, and 10 points
for each realization (realization value versus time).
A homogeneous Gamma process is parameter-
ized by a shape parameter a and a scale parameter b.
Each observed increment of deterioration $\Delta x$,
observed for a time increment $\Delta t$, is a
random variable which follows a Gamma distribution
$\Gamma(a\Delta t, b)$. Instead of defining the Gamma process by
the shape and the scale parameters, it is possible to
define it using the mean m and variance v of the dis-
tribution $\Gamma(a, b)$. This is more convenient since m (resp.
v) corresponds to the mean value (resp. the variance
value) of the degradation per unit of time. m and v are
given by:

$$m = \frac{a}{b}, \qquad v = \frac{a}{b^2}.$$
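A sketch of the correspondence between (m, v) and (a, b), and of the simulation of one deterioration path, assuming NumPy's shape/scale parameterization (the helper names are ours):

```python
import numpy as np

def shape_rate_from_mean_var(m, v):
    """Gamma process with mean m = a/b and variance v = a/b^2 per unit time."""
    b = m / v   # rate parameter
    a = m * b   # shape parameter
    return a, b

def simulate_path(m, v, dt, n_steps, rng=None):
    """Cumulative deterioration: independent increments ~ Gamma(a*dt, b)."""
    rng = rng or np.random.default_rng()
    a, b = shape_rate_from_mean_var(m, v)
    # NumPy's gamma uses shape/scale, so the scale is 1/b
    increments = rng.gamma(shape=a * dt, scale=1.0 / b, size=n_steps)
    return np.cumsum(increments)
```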
We used Monte-Carlo simulation to generate data
with mean $m_k = 2k$ and variance $\sigma_k^2 = 2$ for k = 1, ..., K,
where K is the number of clusters, i.e. the number
of processes. An example of simulated data is given
on figure 2, in the case of three Gamma processes
(clusters), a number of systems (chunklets) equal to
10 for each process, and a number of points per chun-
klet equal to 10, i.e. there are 10 measures for each
system. It means that K = 3, $|s_n| = 10$ for all n, N = 30,
and the total sample size is equal to 300.
Two experiments have been done: one with con-
stant values of N and $|s_n|$ and a varying value of
K, and one with a constant value of K and varying
values of N and $|s_n|$.
In the first experiment, we have used 20 realiza-
tions for each process and $|s_n| = 10$. We considered
4 cases for the value of the cluster number: K = 1, 2,
3, and 4. Then the sample size was equal to 200K
points. A Gamma mixture model was used in the
clustering algorithm. The number of clusters was se-
lected using each of the three proposed criteria (mod-
ified BIC, AIC, and NEC).
Table 2: Percent frequencies of choosing K clusters in the
case of N = 20K, |s_n| = 10, and different cluster numbers.

theoretical K   chosen K   BIC   AIC   NEC
1               1           96    96    44
                2            4     4    42
                3            0     0     8
                4            0     0     6
2               1            0     0     0
                2          100    98    78
                3            0     2    22
                4            0     0     0
3               1            0     0     0
                2            0     0    46
                3           96    94    52
                4            4     6     2
4               1            0     0     0
                2            0     0    68
                3            0     0     8
                4           98    94    24
                5            2     4     0
                6            0     2     0
The experiment has been repeated 200 times for each
value of K in order to estimate the percent frequency
of choosing K clusters. The results are reported in
table 2.
As in the case of the Gaussian mixture experiment,
the best results are obtained with the criterion BIC,
whatever the value of K. The results with the criterion
AIC are close to those with BIC. The results obtained
with the criterion NEC are poor; this is due to the
substantial overlap between the clusters.
In the second experiment, we have used K = 3.
We considered 6 cases for the couple $(N, |s_n|)$. The
experiment has been repeated 200 times for each case
in order to estimate the percent frequency of choosing
3 clusters. The results are given in table 3.
As expected, for a given value of N, the efficiency
of the clustering algorithm and the efficiency of the
criteria increase with the value of $|s_n|$. For a
given value of $|s_n|$, they increase with the value
of N. However, the influence of $|s_n|$ is stronger
than that of N. This is due to the fact that $|s_n|$
is related to the amount of side-information, while N
only modifies the size of the learning data base.
4.3 Climatic Data
The public website donneespubliques.meteofrance.fr
provides climatic data in France. We have used the
average temperature and the average rainfall for the
months of January and July in the 109 available towns;
the dimension of a point is thus equal to 4.
Table 3: Percent frequencies of choosing 3 clusters in the
case of 3 clusters and different couples (N, |s_n|).

N     |s_n|   BIC   AIC   NEC
20K   3        68    62    20
20K   5        92    78    20
20K   20       98    92    70
6K    10       96    94    24
10K   10       98    96    52
40K   10       98    96    78
The measure for each town is a random variable of
dimension 4. We have used the data for the years 2012
to 2014, and it is assumed that the random variable for
a given town has the same distribution over all years.
We can then consider that 3 realizations are available
for each random variable; equivalently, the number of
points for each town (chunklet) is equal to 3. We
assume that random variables following the same
distribution can be grouped. A Gaussian mixture
model was used in the clustering algorithm.
The number of clusters selected with the crite-
rion BIC is 5. The results are reported on figure 3.
The clusters correspond to towns with climates that are
rather Mediterranean, maritime, mountainous, conti-
nental, and semi-continental. In a future work these
results will be compared with results obtained by in-
creasing the data dimension (adding data from other
months and from other parameters such as the wind),
and by increasing the number of points for each town
(adding data from other years).

The proposed approach for clustering with
side-information can deal with a variable num-
ber of points within each chunklet (i.e. town). Thus
it can deal with missing data for some years in some
towns, which is very convenient since sensors are
sometimes out of service; see the sketch below.
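Because the likelihood (6) sums over chunklets of arbitrary sizes, a town with a missing year is handled by simply passing a shorter chunklet; a minimal sketch with synthetic stand-in values:

```python
import numpy as np

rng = np.random.default_rng(0)
# one chunklet per town, 4-dimensional points (January/July temperature
# and rainfall); a missing year simply shortens the chunklet
town_a = rng.normal(size=(3, 4))   # measures for 2012, 2013, 2014
town_b = rng.normal(size=(2, 4))   # one year missing: |s_n| = 2
chunklets = [town_a, town_b]       # the likelihood (6) handles both sizes
```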
Figure 3: Classification of climates in France (latitude versus longitude).
5 CONCLUSION
In this paper, criteria are proposed for assessing the
number of clusters in a mixture-model clustering ap-
proach with side-information. The side-information
defines constraints, grouping some points in the same
cluster.
Three criteria used for assessing the number of
clusters in a mixture-model clustering approach with-
out side-information have been modified. These cri-
teria are the Bayesian information criterion (BIC), the
Akaike information criterion (AIC), and the entropy
criterion (NEC). For adapting the criteria, the com-
putation of the log-likelihood has been modified so as
to take into account that some points arise from the
same source. In addition, the criteria depend on the
number of chunklets but not on the total number of
points.
Experiments have been done with simulated prob-
lems (Gaussian mixtures and Gamma processes) and
with a real problem. The simulations made it possible
to compare the efficiency of the criteria in determining
the right number of clusters, and the climatic data
problem provided an application example.
The side-information helps to determine the clus-
ters mainly when the clusters overlap; the criteria
fitted to such situations are thus the most useful.
The experiments have shown that BIC behaves best
among the three criteria. AIC presents a slight
tendency to overestimate the correct number of
clusters, while NEC has an underestimating tendency.
Because NEC is efficient mainly when the mixture
components are well separated, its performance is
quite poor in the considered experimental cases.
The influence of the number of points per chunklet
on the performance of the proposed criteria has also
been studied: the larger the chunklet size, the better
the clustering algorithm performs, and the more
reliably the number of clusters is estimated.
REFERENCES
Akaike, H. (1974). A new look at the statistical model iden-
tification. IEEE Transactions on Automatic Control,
19(6):716–723.
Biernacki, C., Celeux, G., and Govaert, G. (1999). An im-
provement of the NEC criterion for assessing the num-
ber of clusters in a mixture model. Pattern Recogni-
tion Letters, 20(3):267–272.
Celeux, G. and Govaert, G. (1992). A classification EM
algorithm for clustering and two stochastic versions.
Computational Statistics & Data Analysis, 14(3):315–
332.
Celeux, G. and Govaert, G. (1995). Gaussian parsimonious
clustering models. Pattern Recognition, 28:781–793.
Celeux, G. and Soromenho, G. (1996). An entropy crite-
rion for assessing the number of clusters in a mixture
model. Journal of Classification, 13(2):195–212.
Fonseca, J. R. and Cardoso, M. G. (2007). Mixture-model
cluster analysis using information theoretical criteria.
Intelligent Data Analysis, 11(1):155–173.
Grall-Maës, E. (2014). Spatial stochastic process cluster-
ing using a local a posteriori probability. In Proceed-
ings of the IEEE International Workshop on Machine
Learning for Signal Processing (MLSP 2014), Reims,
France.
Lebarbier, E. and Mary-Huard, T. (2006). Le critère BIC :
fondements théoriques et interprétation. Research re-
port, INRIA.
McLachlan, G. and Basford, K. (1988). Mixture Models:
Inference and Applications to Clustering. Statistics:
Textbooks and Monographs, New York: Dekker.
Schwarz, G. (1978). Estimating the dimension of a model.
Annals of Statistics, pages 461–464.
Shental, N., Bar-Hillel, A., Hertz, T., and Weinshall, D.
(2003). Computing Gaussian mixture models with EM
using side-information. In Proc. of the 20th Interna-
tional Conference on Machine Learning. Citeseer.
Van Noortwijk, J. (2009). A survey of the application of
gamma processes in maintenance. Reliability Engi-
neering & System Safety, 94(1):2–21.