A Probabilistic Approach based on a Finite Mixture Model of

Multivariate Beta Distributions

Narges Manouchehri and Nizar Bouguila

Concordia Institute for Information System Engineering (CIISE), Concordia University, Montréal, Canada

Keywords:

Mixture Models, Multivariate Beta Distribution, Maximum Likelihood, Clustering.

Abstract:

Model-based approaches speciﬁcally ﬁnite mixture models are widely applied as an inference engine in ma-

chine learning, data mining and related disciplines. They proved to be an effective and advanced tool in

discovery, extraction and analysis of critical knowledge from data by providing better insight into the nature

of data and uncovering hidden patterns that we are looking for. In recent researches, some distributions such

as Beta distribution have demonstrated more ﬂexibility in modeling asymmetric and non-Gaussian data. In

this paper, we introduce an unsupervised learning algorithm for a ﬁnite mixture model based on multivariate

Beta distribution which could be applied in various real-world challenging problems such as texture analysis,

spam detection and software modules defect prediction. Parameter estimation is one of the crucial and critical

challenges when deploying mixture models. To tackle this issue, deterministic and efﬁcient techniques such as

Maximum likelihood (ML), Expectation maximization (EM) and Newton Raphson methods are applied. The

feasibility and effectiveness of the proposed model are assessed by experimental results involving real datasets.

The performance of our framework is compared with the widely used Gaussian Mixture Model (GMM).

1 INTRODUCTION

Over the past couple of decades, machine learning ex-

perienced tremendous growth and advancement. Ac-

curate data analysis, extraction and retrieval of infor-

mation have been largely studied in the various ﬁelds

of technology (Han and Pei, 2012). Technological im-

provement led to the generation of huge amount of

complex data of different types (Diaz-Rozo and Lar-

ranaga, 2018). Various statistical approaches have

been suggested in data mining, however data clus-

tering received considerable attention and still is a

challenging and open problem (Giordan, 2015). Fi-

nite mixture models have been proven to be one of

the most strong and ﬂexible tools in data clustering

and have seen a real boost in popularity. Multimodal

and mixed generated data consists of different compo-

nents and categories and mixture models proved to be

an enhanced statistical approach to discover the latent

pattern of data (McCabe, 2015). One of the crucial

challenges of modeling and clustering is applying the

most appropriate distribution. Most of the literatures

on ﬁnite mixtures concern Gaussian mixture model

(GMM) (Zhou, 2017), (Guha and Shim, 2001), (Gev-

ers, 1999), (Hastie and Tibshirani, 1996), (Luo et al.,

2017). However, GMM is not a proper tool to express

the latent structure of non-Gaussian data. Recently,

other distributions such as Dirichlet and Beta distri-

butions which are more ﬂexible have been considered

as a powerful alternative (Giordan, 2015), (Olkin and

Trikalinos, 2015), (Fan and Bouguila, 2013), (Cock-

riel and McDonald, 2018), (Elguebaly and Bouguila,

2013), (Fan et al., 2014), (Klauschies et al., 2018),

(Wentao et al., 2013).

In this work, we introduce multivariate Beta mix-

ture model to cluster k dimensional vectors with

features deﬁned between zero and one. For learn-

ing the parameters, we applied the Expectation-

Maximization (EM) algorithm. Our model will be

evaluated on two real world applications. The ﬁrst

one is software defect detection. Nowadays, complex

software systems are increasingly applied and the rate

of software defects is growing correspondingly. Er-

rors, failures and defects may cause serious and costly

complications in systems and projects by providing

unexpected or unintended results. Hence, prediction

of defective modules by statistical methods has be-

come one of the attention-grabbing subjects of many

studies using machine learning methods to differenti-

ate fault prone or non-fault prone softwares (Malhotra

and Jain, 2012). Spam ﬁltering is our second topic of

interest. Evolutionary automated communication by

Manouchehri, N. and Bouguila, N.

A Probabilistic Approach based on a Finite Mixture Model of Multivariate Beta Distributions.

DOI: 10.5220/0007707003730380

In Proceedings of the 21st International Conference on Enterprise Information Systems (ICEIS 2019), pages 373-380

ISBN: 978-989-758-372-8

373

Internet improved the style of everyday communica-

tion. Electronic mail is a dominant medium for dig-

ital communications as it is convenient, economical

and fast. However, unwanted emails take advantage

of the Internet. Spam emails are sent to widely and

economically advertise a speciﬁc product or service,

serve online frauds (Malhotra and Jain, 2012) or are

carrying a piece of malicious code that might damage

the end user machines.

The rest of this paper is organized as follows; In

sections 2 and 3, we present our proposed mixture

model and model learning, respectively. Model as-

sessment is performed by devoting section 4 to exper-

imental results and the accuracy of our model is esti-

mated by comparing it with Gaussian mixture models.

Finally, we conclude this paper in section 5.

2 THE MIXTURE MODEL

In this section, we propose a new mixture model

based on a multivariate Beta distribution.

2.1 The Finite Multivariate Beta

Distribution

Bivariate and multivariate Beta distributions have

been introduced by Olkin and Liu (Olkin and Liu,

2003), (Olkin and Trikalinos, 2015). This article is

devoted to our proposed mixture model based on mul-

tivariate Beta distribution. In this section, we will

brieﬂy introduce the bivariate distribution with three

shape parameters and then describe the multivariate

case in detail.

Let us consider two correlated random variables X

and Y deﬁned by Beta distribution and described as

follows:

X =

(U +W )

(1)

Y =

(V +W )

(2)

U, V and W are three independent random vari-

ables arisen from standard Gamma distribution and

parametrized by their shape parameters a, b and c,

respectively. Both variables X and Y have positive

real values and are less than one. The joint density

function of this bivariate distribution is expressed by

Equation 3.

f (X,Y ) =

a−1

b−1

(1 − X)

b+c−1

(1 −Y )

a+c−1

B(a,b, c)(1 −XY )

(a+b+c)

(3)

where

B(a,b, c) =

Γ(a)Γ(b)Γ(c)

Γ(a + b + c)

(4)

The multivariate Beta distribution is constructed

by generalization of above bivariate distributions to k

variate distribution. Let U

,....,U

and W be indepen-

dent random variables each having a Gamma distri-

bution and variable X is deﬁned by Equation 5 where

i = 1,...k.

+W )

(5)

The joint density function of X

,...., X

after integra-

tion over W is expressed by:

f (x

,..., x

) = c

∏

i=1

−1

∏

i=1

(1 −x

)

+1)

1 +

∑

i=1

(1 −x

)

−a

(6)

where x

is between zero and one and:

c = B

−1

,..., a

) =

Γ(a

+ ... + a

)

Γ(a

)......Γ(a

)

Γ(a)

∏

i=1

Γ(a

)

(7)

is the shape parameter of each variable X

and:

a =

∑

i=1

(8)

2.2 Mixture model

Let us consider X = {

,...,

} be a set of N

k-dimensional vectors such that each vector

,..., X

) is generated from a ﬁnite but unknown

multivariate Beta mixture model p



X|Θ



. We as-

sume that X is composed of M different ﬁnite clusters

and can be approximated by a ﬁnite mixture model as

below (Bishop, 2006), (Figueiredo and Jain, 2002),

(McLachlan and Peel, 2000):



X|Θ



∑

j=1

) (9)

where

= (a

,...., a

). The weight of component j

is denoted by p

. All of mixing proportions are posi-

tive and sum to one.

∑

j=1

= 1 (10)

The complete model parameters are denoted by

,..., p

,...,

} and Θ = (p

) represents

the set of weights and shape parameters of component

ICEIS 2019 - 21st International Conference on Enterprise Information Systems

374

3 MODEL LEARNING

In this section, we ﬁrst estimate the initial values of

the parameters. Then, the optimal parameters are es-

timated by developing maximum likelihood estima-

tion within EM algorithm. The initialization phase

is based on k-means framework and method of mo-

ments.

3.1 Method of Moments for the Finite

Multivariate Beta Distribution

Method of moments (MM) is a statistical technique

to estimate model’s parameter. Considering Equation

11 and Equation 12 as the ﬁrst two moments, sample

mean and variance, the shape and scale parameters of

Beta distribution can be estimated using the method of

moments by Equation 13 and Equation 14 as follow:

E(X) = ¯x =

∑

i=1

(11)

Var(X) = ¯v =

N − 1

∑

i=1

− ¯x)

(12)

α = E(X)



E(X)

Var(X)



1 − E(X)



− 1



(13)

β =



1 − E(X)



E(X)

Var(X)



1 − E(X)



− 1



(14)

By the help of the mean and variance of compo-

nents obtained from k-means phase, the initial param-

eters are approximated.

3.2 Maximum Likelihood and EM

Algorithm

As one of the suggested methods to tackle the prob-

lem of ﬁnding the parameters of our model, we apply

maximum likelihood estimate (ML) approach (Gane-

salingam, 1989) and expectation maximization (EM)

framework (McCabe, 2015), (McLachlan and Krish-

nan, 2008) on the complete likelihood. In this tech-

nique, the parameters which maximize the probability

density function of data are estimated as follow:

∗

= argmax

L (X , Θ) (15)

L (Θ, X ) = log



p(X |Θ)



∑

n=1

log



∑

j=1

)



(16)

Each

is supposed to be arisen from one of the

components. Hence, a set of membership vectors is

introduced as

= (

,. .. ,

) where:

n j



1 if X belongs to a component j

0 otherwise,

(17)

∑

j=1

n j

= 1 (18)

This gives the following complete log-likelihood:

L (Θ, Z, X ) =

∑

j=1

∑

n=1

n j



log p

+ log p(

)



(19)

In the EM algorithm, as the ﬁrst step in Expectation

phase, we assign each vector

to one of the clusters

by its posterior probability given by:

n j

= p( j|

) =

)

∑

j=1

)

(20)

The complete log-likelihood is computed as:

L (Θ, Z, X ) =

∑

j=1

∑

n=1

n j



log p

+ log p(

)



∑

j=1

∑

n=1

n j

log p

+ log



∏

i=1

−1)

∏

i=1

(1 − X

)

+1)

1 +

∑

i=1

(1 − X

)

−a

Γ(

∑

i=1

)

∏

i=1

Γ(a

)



∑

j=1

∑

n=1

n j

log p

+ log



∏

i=1

−1)



−log



∏

i=1

(1 − X

)

+1)



+ log



Γ(a

)



−log

∏

i=1

Γ(a

) + log

1 +

∑

i=1

(1 − X

)

−a

∑

j=1

∑

n=1

n j

log p

∑

i=1

− 1)



log(X

)



−

∑

i=1

+ 1)



log(1 − X

)



+ log



Γ(a

)



−

∑

i=1

log



Γ(a

)



− a

log

1 +

∑

i=1

(1 − X

)

(21)

The value of a

is computed by Equation 13 for

each component of mixture model.

A Probabilistic Approach based on a Finite Mixture Model of Multivariate Beta Distributions

375

To reach our ultimate goal and in maximization

step, the gradient of the log-likelihood is calculated

with respect to parameters. To solve optimization

problem, we need to ﬁnd a solution for the following

equation:

∂L(Θ,Z, X )

∂Θ

= 0 (22)

The ﬁrst derivatives of Equation 21 with respect to

where i = 1,..., k are given by:

∂L(Θ,Z, X )

∂a

∑

j=1

∑

n=1

n j

log(X

) − log(1 − X

)

+Ψ(a

) − Ψ(a

) − log

1 +

∑

i=1

(1 − X

)

(23)

where Ψ(.) and Ψ

(.) are digamma and trigamma

functions respectively deﬁned as follow:

Ψ(X) =

(X)

Γ(X)

(24)

(X) =

(X)

Γ(X)

−

(X)

Γ(X)

(25)

As this equation doesn’t have a closed form so-

lution, we use an iterative approach named Newton-

Raphson method expressed as follow:

new

old

− H

−1

(26)

where G

is the ﬁrst derivatives vector described in

Equation 23 and H

is Hessian matrix.



1 j

,..., G

k j



(27)

The Hessian matrix is calculated by computing the

second and mixed derivatives of L(Θ,Z,X ).







∂G

∂a

·· ·

∂G

∂a

∂G

∂a

·· ·

∂G

∂a







∑

i=1

n j

× (28)







(|~a

|) − Ψ

) ·· · Ψ

(|~a

|) ·· · Ψ

(|~a

|) − Ψ

)







where

|~a

| = a

+ ... + a

(29)

The estimated values of mixing proportions are

expressed by Equation 30 as it has a closed-form so-

lution:

∑

n=1

p( j|

)

(30)

3.3 Estimation Algorithm

The initialization and estimation framework is de-

scribed as follows:

1. INPUT: k-dimensional data

and M.

2. Apply the k-means to obtain initial M clusters.

3. Apply the MOM to obtain

4. E- step: Compute

n j

using Equation 20.

5. M-step: Update the

using Equation 26 and

using Equation 30.

6. If p

< ε, discard component j and go to 4.

7. If the convergence criterion passes terminate,

else go to 4.

4 EXPERIMENTAL RESULTS

In this section, we estimate the accuracy of our algo-

rithm by testing on two real world applications. As

the ﬁrst step, we normalize our datasets by Equation

31 as one of the assumptions of our distribution is that

the values of all observations are positive and less than

one.

X −X

min

max

− X

min

(31)

To assess the accuracy of the algorithm, the ob-

servations are assigned to different clusters based on

Bayesian decision rule. Afterward, the accuracy is

inferred by confusion matrix. At the next step, multi-

variate Beta mixture and Gaussian mixture model will

be compared.

4.1 Software Defect Prediction

Software quality assurance and detection of a fault or

a defect in a software program have become one of the

topics that have received lots of attention in research

and technology. Any failure in software may result in

high costs for the system (Bertolino, 2007), (Briand

and Hetmanski, 1993), (El Emam and Rai, 2001). The

evaluation of the quality of complex software systems

is costly and complicated. Consequently, prediction

ICEIS 2019 - 21st International Conference on Enterprise Information Systems

376

Figure 1: Two-component mixture of bivariate Beta distri-

bution.

Figure 2: Three-component mixture of bivariate Beta distri-

bution.

of software failures and improving reliability is one

of the attractive applications for scientists (Boucher

and Badri, 2017), (Kawashima and Mizuno, 2015),

(Lyu, 1996), (Koru and Liu, 2005). To tackle this

problem, it is critical to deﬁne the appropriate met-

rics to express the attributes of the software modules.

There are some metrics (Aleem et al., 2015) for as-

sessing software complexity such as the code size,

McCabes cyclomatic and Halsteads complexity (Mc-

Cabe, 1976). The McCabes metric includes essential,

cyclomatic and design complexity and the number of

lines of code. While the Halsteads metric consists of

base and derived measures and line of code (LOC)

(McCabe, 1976), (Shihab, 2014). Prediction mod-

els are applied to improve and optimize the quality

which is translated to customer satisfaction as a sig-

niﬁcant achievement for the companies. Finite mix-

ture models as ﬂexible statistical solutions and clus-

tering techniques are considered as powerful tools in

this area (Oboh and Bouguila, 2017), (Bouguila and

Hamza, 2010), (Kawashima and Mizuno, 2015). Our

experiment is performed on three datasets from the

PROMISE data repository obtained from NASA soft-

ware projects and its public MDP (Modular toolkit for

Data Processing) which are currently used as bench-

mark datasets in this area of research (NASA, 2004).

Figure 3: Four-component mixture of bivariate Beta distri-

bution.

Figure 4: Five-component mixture of bivariate Beta distri-

bution.

The metrics or features of each dataset are ﬁve dif-

ferent lines of code measure, three McCabe metrics,

four base Halstead measures, eight derived Halstead

measures and a branch-count. The datasets are clas-

siﬁed by a binary variable to indicate if the module is

defective or not. CM1 as the ﬁrst dataset is a NASA

spacecraft instrument software written in "C". KC1

as the second one, is a "C++" dataset raised from sys-

tem implementing storage management for receiving

and processing ground data. The last case, PC1 is

developed using "C" considering functions ﬂight soft-

ware for earth orbiting satellite. To highlight the basic

properties of the datasets, Table 1 is created. As it is

shown in Table 2 and Table 3, multivariate Beta mix-

ture model (MBMM) has better performance in all

three datasets in comparison with Gaussian mixture

model (GMM). For CM1, the accuracy of our model

is 98.79% while this value for GMM is 85.94% . KC1

has a more accurate result (94.12%) with MBMM

than GMM (88.66%). The performance of models for

PC1 is similar by 94.13% and 91.79% of accuracy

for MBMM and GMM, respectively. The precision

and recall follow the same behavior as accuracy. The

multivariate Beta mixture model is capable to reach

97.44% precision and 99.55% of recall for PC1 and

KC1, respectively. While GMM has the best preci-

A Probabilistic Approach based on a Finite Mixture Model of Multivariate Beta Distributions

377

Table 1: Software modules defect properties.

Dataset Language Instances Defects

CM1 C 498 49

KC1 C++ 2109 326

PC1 C 1109 77

Table 2: Software modules defect results inferred from the

confusion matrix of multivariate Beta mixture model.

Dataset Accuracy Precision Recall

CM1 98.79 99.15 99.55

KC1 94.12 94.69 98.31

PC1 94.13 97.44 95.97

sion and recall in PC1 with 96.06% and 95.23%.

4.2 Spam Detection

Spam ﬁltering as our second real application is one

of the major research ﬁelds in information systems

security. Spams or unsolicited bulk emails pose se-

rious threats. As it was mentioned is some literature

up to 75–80% of email messages are spam (Blanzieri

and Bryl, 2008) which resulted in heavy ﬁnancial

losses of 50 and 130 billion dollars in 2005, respec-

tively (Galati, 2018), (Lugaresi, 2004), (Wang et al.,

2018). Considering serious risks and costly conse-

quences, classiﬁcation and categorization of email

(Ozgur and Gungor, 2012), (Amayri and Bouguila,

2009a), (Amayri and Bouguila, 2009b), (Amayri and

Bouguila, 2012) have received a lot of attention. Ap-

plying machine learning and pattern recognition tech-

niques capability was enhanced compared to hand-

made rules (Amayri and Bouguila, 2010), (Bouguila

and Amayri, 2009), (Fan and Bouguila, 2013), (Cor-

mack and Lynam, 2007), (Chang and Meek, 2008),

(Hershkop and Stolfo, 2005), (Drake, 2004) .

Our experiment was carried out on a challenging

spam data set obtained from UCI machine learning

repository, created by Hewlett-Packard Labs (UCI,

1999). This dataset contains 4601 instances and 58

attributes (57 continuous input attributes and 1 nomi-

nal class label target attribute). 39.4% of email (1813

instances) are spam and 60.6% (2788) are legitimate.

The attributes are extracted from a commonly used

technique called Bag of Words (BoW) as one of the

main information retrieval methods in natural lan-

guage processing. In this method, each email is pre-

sented by its words disregarding grammar. Most of

the attributes in spam base dataset indicate whether a

particular word or character was frequently occurring

in the e-mail. 48 features include the percentage of

words in the e-mail that match the word. 6 attributes

Table 3: Software modules defect results according to the

confusion matrix of Gaussian mixture model.

Dataset Accuracy Precision Recall

CM1 85.94 92.21 90.88

KC1 88.66 93.99 92.69

PC1 91.79 96.06 95.23

Table 4: Spam ﬁltering results to compare the performance

of MBMM and GMM.

Mixture model Accuracy Precision Recall

MBMM 79.92 80.6 82.74

GMM 67.81 78.99 68.29

are extracted from the percentage of characters in the

e-mail that match characters. The rest of the features

are the average length of uninterrupted sequences of

capital letters, the length of the longest uninterrupted

sequence of capital letters and the total number of

capital letters in the e-mail. The dataset class denotes

whether the e-mail was considered spam or not. To

evaluate our framework, ﬁrst the dataset has been re-

duced to 3626 instances to have a balanced case. Then

it was normalized by Equation 31 as our assumption is

that all observation values are between zero and one.

Table 4 shows the results of our model performance in

comparison with Gaussian mixture model considering

their confusion matrix. As we can realize from table

4, multivariate Beta mixture model is more accurate

(79.92%) and has higher value in terms of precision

and recall, 80.6% and 82.74%, respectively.

5 CONCLUSION

In this paper, we have developed a clustering tech-

nique and a mixture model in order to propose a new

approach to model data and improve clustering accu-

racy. We have mainly proposed our model using a

multivariate Beta distribution which has more ﬂex-

ibility. The work addresses the parameters estima-

tion within a deterministic and efﬁcient method us-

ing maximum likelihood estimation. After the pre-

sentation of our algorithm for parameters estimation,

we evaluated the capability of the proposed statisti-

cal mixture model in two real attractive domains and

applied confusion matrix as a typical evaluation ap-

proach to estimate the accuracy, precision and recall

and effectiveness of our solution. As the ﬁrst real

world experiment, we considered a popular and crit-

ical application in information security engineering

about predicting defects in software modules in the

ICEIS 2019 - 21st International Conference on Enterprise Information Systems

378

context of three NASA datasets. Our clustering algo-

rithm was developed to discover two groupings based

on some software complexity metrics. The proposed

methodology has been shown to outperform Gaussian

mixture model as a classical approach and our offered

solution achieved better results in terms of data mod-

eling capabilities and clustering accuracy. The second

application was spam detection using the spam base

dataset from the UCI repository. The ultimate goal of

our extensive study is developing a powerful classi-

ﬁer as a devoted ﬁlter to accurately distinguish spam

emails from legitimate emails in order to improve the

blocking rate of spam emails and decrease the mis-

classiﬁcation rate of legitimate emails. Spam ﬁltering

solutions presented in this paper generates acceptable,

accurate results in comparison with Gaussian mixture

model as the results of our algorithm has higher pre-

cision and recall. From the outcomes, we can infer

that the multivariate Beta mixture model could be a

competitive modeling approach for the software de-

fect and spam prediction problems. In other words,

we can say that our model produces enhanced clus-

tering results largely due to its model ﬂexibility.

REFERENCES

Aleem, S., Capretz, L. F., and Ahmed, F. (2015). Bench-

marking machine learning technologies for software

defect detection. 6(3):11–23.

Amayri, O. and Bouguila, N. (2009a). A discrete mixture-

based kernel for svms: Application to spam and im-

age categorization. Artiﬁcial Intelligence Review,

34(1):73–108.

Amayri, O. and Bouguila, N. (2009b). Online spam ﬁltering

using support vector machines. IEEE Symposium on

Computers and Communications, pages 337–340.

Amayri, O. and Bouguila, N. (2010). A study of spam ﬁl-

tering using support vector machines. Artiﬁcial Intel-

ligence Review, 34(1):173–108.

Amayri, O. and Bouguila, N. (2012). Unsupervised feature

selection for spherical data modeling: Application to

image-based spam ﬁltering. International Conference

on Multimedia Communications, Services and Secu-

rity, pages 13–23.

Bertolino, A. (2007). Software testing research: Achieve-

ments, challenges, dream. Future of Software Engi-

neering, page 85–103.

Bishop, C. (2006). Pattern recognition and machine learn-

ing. Springer, New York.

Blanzieri, E. and Bryl, A. (2008). A survey of learning-

based techniques of email spam ﬁltering. Artiﬁcial

Intelligence Review, 29:63–92.

Boucher, A. and Badri, M. (2017). Predicting fault-prone

classes in objectoriented software. page 306–317.

Bouguila, N. and Amayri (2009). A discrete mixturebased

kernel for svms: Application to spam and image cate-

gorization. Information Processing and Management,

45:631–642.

Bouguila, N., W. J. and Hamza, A. (2010). Software mod-

ules categorization through likelihood and bayesian

analysis of ﬁnite dirichlet mixtures. Applied Statistics,

37(2):235–252.

Briand, L. C., B. V. and Hetmanski, C. J. (1993). Develop-

ing interpretable models with optimized set reduction

for identifying high-risk software components. vol-

ume 19, page 1028–1044. IEEE Transactions on Soft-

ware Engineering.

Chang, M., Y. W. and Meek, C. (2008). Partitioned logistic

regression for spam ﬁltering. page 97–105. 14th ACM

SIGKDD international conference on knowledge dis-

covery and data mining.

Cockriel, W. M. and McDonald, J. B. (2018). Two multi-

variate generalized beta families, communications in

statistics. Theory and Methods, 47(23):5688–5701.

Cormack, G. and Lynam, T. (2007). Online supervised

spam ﬁlter evaluation. ACMTransactions on Informa-

tion Systems, 25(3):1–31.

Diaz-Rozo, J., B. C. and Larranaga, P. (2018). Clustering

of data streams with dynamic gaussian mixture mod-

els: An iot application in industrial processes. IEEE

Internet of Things Journal, 5:3533.

Drake, C., O. J. K. E. (2004). Anatomy of a phishing email.

First conference on email and anti- Spam (CEAS),

25(3):1–31.

El Emam, K. Benlarbi, S. G. N. and Rai, S. N. (2001). Com-

paring casebased reasoning classiﬁers for predicting

high risk software components. Journal of Systems

and Software, 55(3):301–320.

Elguebaly, T. and Bouguila, N. (2013). Finite asymmetric

generalized gaussian mixture models learning for in-

frared object detection. Computer Vision and Image

Understanding, 117:1659–1671.

Fan, W. and Bouguila, N. (2013). Variational learning of a

dirichlet process of generalized dirichlet distributions

for simultaneous clustering and feature selection. Pat-

tern Recognition, 46:2754–2769.

Fan, W., Bouguila, N., and Ziou, D. (2014). Variational

learning of ﬁnite dirichlet mixture models using com-

ponent splitting. Neurocomputing, 129:3–16.

Figueiredo, M. and Jain, A. K. (2002). Unsupervised

learning of ﬁnite mixture models. IEEE Transac-

tions on Pattern Analysis and Machine Intelligence,

24(3):381–396.

Galati, L. (2018). The anatomy of a phishing attack. Fair-

ﬁeld County Business Journal, 54:5.

Ganesalingam, S. (1989). Classiﬁcation and mixture ap-

proaches to clustering via maximum likelihood. Jour-

nal of the Royal Statistical Society: Series C (Applied

Statistics, 38(3):455–466.

Gevers, T., S. A. (1999). Color-based object recognition.

Journal of the Royal Statistical Society: Series C (Ap-

plied Statistics, 32(3):453–464.

Giordan, M., W. R. (2015). A comparison of compu-

tational approaches for maximum likelihood estima-

tion of the dirichlet parameters on high-dimensional

data. SORT-Statistics and Operations Research Trans-

actions, 39(1):109–126.

A Probabilistic Approach based on a Finite Mixture Model of Multivariate Beta Distributions

379

Guha, S., R. R. and Shim, K. (2001). Cure: an efﬁcient

clustering algorithm for large databases. Information

Systems, 26:35–58.

Han, J., K. M. and Pei, J. (2012). Data mining: concepts

and techniques. Elsevier, Morgan Kaufmann, Amster-

dam; Boston.

Hastie, T. and Tibshirani, R. (1996). Discriminant analysis

by gaussian mixtures. Journal of the Royal Statistical

Society Series B (Methodological), 58(1):155–176.

Hershkop, S. and Stolfo, S. (2005). Combining email mod-

els for false positive reduction. The eleventh ACM

SIGKDD international conference on Knowledge dis-

covery in data mining, pages 98–107.

Kawashima, N. and Mizuno, O. (2015). Predicting fault-

prone modules by word occurrence in identiﬁers. Soft-

ware Engineering Research, Management and Appli-

cations, page 87–98.

Klauschies, T., Coutinho, R. M., and Gaedke, U. (2018). A

beta distribution-based moment closure enhances the

reliability of trait-based aggregate models for natural

populations and communities. Ecological Modelling,

381:46–77.

Koru, A. G. and Liu, H. (2005). Building effective de-

fect prediction models in practice. IEEE software,

22(6):23–29.

Lugaresi, N. (2004). European union vs. spam: a legal

response. First conference on email and anti-Spam

(CEAS).

Luo, Z., He, W., Liwang, M., Huang, L., Zhao, Y., and

Geng, J. (2017). Real-time detection algorithm of ab-

normal behavior in crowds based on gaussian mixture

model. 12th International Conference on Computer

Science and Education (ICCSE), 20:183.

Lyu, M. R. e. a. (1996). Handbook of software reliabil-

ity engineering. IEEE computer society press CA,

222:183.

Malhotra, R. and Jain, A. (2012). Fault prediction using sta-

tistical and machine learning methods for improving

software quality. Journal of Information Processing

Systems, 8(2):1241–262.

McCabe, T. J. (1976). A complexity measure. IEEE Trans-

actions on software Engineering, 8(4):308–320.

McCabe, T. J. (2015). Mixture Models in Statistics. Elsevier

Ltd.

McLachlan, G. and Krishnan, T. (2008). The EM algorithm

and extensions. Wiley-Interscience.

McLachlan, G. and Peel, D. (2000). Finite Mixture Models.

New York: Wiley.

NASA (2004). PROMISE Software Engineering Repos-

itory data set. http://promise.site.uottawa.ca/

SERepository/datasetspage.html.

Oboh, S. and Bouguila, N. (2017). Unsupervised learning

of ﬁnite mixtures using scaled dirichlet distribution

and its application to software modules categorization.

volume 8, page 1085–1090.

Olkin, I. and Liu, R. (2003). A bivariate beta distribution,

statistics and probability letters. volume 62, pages

407–412.

Olkin, I. and Trikalinos, T. A. (2015). Constructions for a

bivariate beta distribution. volume 96, pages 54–60.

Ozgur, L. and Gungor, T. (2012). Optimization of depen-

dency and pruning usage in text classiﬁcation. vol-

ume 15, pages 45–58.

Shihab, E. (2014). Practical software quality prediction,

in software maintenance and evolution (icsme). page

639–644.

UCI, R. (1999 (accessed 2 August 1999)).

Spambase UCI Repository data set.

https://archive.ics.uci.edu/ml/machine-

learningdatabases/spambase/spambase.data.

Wang, S., Zhang, X., Cheng, Y., Jiang, F., Yu, W., and Peng,

J. (2018). A fast content-based spam ﬁltering algo-

rithm with fuzzy-svm and k-means. pages 301–307.

Wentao, F., Bouguila, N., and Ziou, D. (2013). Unsu-

pervised hybrid feature extraction selection for high-

dimensional non-gaussian data clustering with varia-

tional inference. IEEE Transactions on Knowledge

and Data Engineering, 25(7):1670–1685.

Zhou, J. H., P. C. K. Y. W. (2017). Gaussian mixture model

for new fault categories diagnosis. pages 1–6.

ICEIS 2019 - 21st International Conference on Enterprise Information Systems

380