A CONNECTIONIST APPROACH
IN BAYESIAN CLASSIFICATION
Luminita State
Department of Computer Science, University of Pitesti, Pitesti, Romania
Catalina Cocianu
Department of Computer Science, Academy of Economic Studies, Bucharest, Romania
Panayiotis Vlamos
Department of Computer Science, Ionian University, Corfu, Greece
Viorica Stefanescu
Department of Mathematics, Academy of Economic Studies, Bucharest, Romania
Keywords: Hidden Markov Models, learning by examples, Bayesian classification, training algorithm, neural
computation.
Abstract: The research reported in this paper aims at the development of a suitable neural architecture for implementing
the Bayesian procedure in solving pattern recognition problems. The proposed neural system is based on an
inhibitive competition installed among the hidden neurons of the computation layer. The local memories of
the hidden neurons are computed adaptively according to an estimation model of the parameters of the
Bayesian classifier. The paper also reports a series of qualitative attempts at analyzing the behavior of a
new learning procedure for the parameters of an HMM, obtained by modeling different types of stochastic
dependencies on the space of states of the underlying finite automaton. The approach aims at the development
of new methods for processing image and speech signals in solving pattern recognition problems.
Basically, the attempts are stated in terms of weighting processes and deterministic/non-deterministic
Bayesian procedures.
1 PRELIMINARIES
Stochastic models represent a very promising
approach to temporal pattern recognition. An
important class of stochastic models is based on
Markovian state transitions, two typical examples
being the Markov model (MM) and the Hidden
Markov Model (HMM). In a Markov model, the
transitions between states are governed by the
transition probabilities, that is, the state sequence is
a Markov process and each state is directly observed
as the output feature. Usually, however, there are
two sorts of variables to be taken into consideration,
namely the manifest variables, which can be directly
observed, and the latent variables, which are hidden
to the observer. The HMM is based on a doubly
stochastic process, one producing an (unobservable)
state sequence and another producing an observable
feature sequence.

The doubly stochastic process is useful in coping
with unpredictable variation of the observed patterns.
Its design requires a learning phase, in which the
parameters of both the state transition and the
emission distributions have to be estimated from the
observed data. The trained HMM can then be used in
the retrieving (recognition) phase, in which test
sequences of (complete or incomplete) observations
have to be recognized.
The latent structure of the observable phenomenon is
modeled in terms of a finite automaton Q, the
observable variable being thought of as the output
produced by the states of Q. Both evolutions, in the
space of non-observable as well as in the space of
observable variables, are assumed to be governed by
probabilistic laws.

State L., Cocianu C., Vlamos P. and Stefanescu V. (2007).
A CONNECTIONIST APPROACH IN BAYESIAN CLASSIFICATION.
In Proceedings of the Ninth International Conference on Enterprise Information Systems - AIDSS, pages 185-190.
DOI: 10.5220/0002346401850190. Copyright © SciTePress.
In the sequel, we denote by $(\Lambda_n)_{n \geq 0}$ the
stochastic process describing the hidden evolution
and by $(X_n)_{n \geq 0}$ the stochastic process corresponding
to the observable evolution.
Let $Q$ be the set of states of the underlying finite
automaton, $|Q| = m$. We denote by $\tau_n$ the
probability distribution on $Q$ at the moment $n$. Let
$(\Omega, \mathcal{F}, P)$ be a probability space and
$(\mathcal{X}, \mathcal{C}, \sigma)$ a measure space, where
$\sigma$ is a $\sigma$-finite measure. We assume that
$\rho: Q \to \mathcal{C}^*$ is a $\sigma$-experiment,
that is, for any $q \in Q$, $\rho(q) \ll \sigma$, where $\mathcal{C}^*$ is the
set of all probability measures defined on the $\sigma$-algebra
$\mathcal{C}$. Let $f_q(\cdot)$ be a measurable version of the
Radon-Nikodym derivative $f_q = \frac{d\rho(q)}{d\sigma}$. The output
of each state $q \in Q$ is represented by the random
element $X: \Omega \to \mathcal{X}$ of density function $f_q(\cdot)$. Let
$\xi$ be the a priori probability distribution on $Q$; for
any $q \in Q$, $\xi(q)$ is the subjective credibility that $q$
is the true emitting state at any moment. We assume
that $\xi(q) > 0$ for all $q \in Q$. The conclusions on the
hidden evolution are derived using the Bayesian
procedure when the a priori probability distribution
$\xi$ and the set of density functions $f_{n,q}$, $q \in Q$, are
known.
If $L: Q \times Q \to [0, \infty)$ is a risk function, then, for
any $q, q^* \in Q$, $L(q, q^*)$ represents the cost implied
by taking the output emitted by $q$ as being emitted
by $q^*$. The outputs of the automaton are represented
by the sequence of random elements $(X_n)_{n \geq 0}$, where
the output at the moment $n$, $X_n$, is distributed
$\rho(q_n)$ if it was emitted by the state $q_n$.
A random decision procedure is an element of
$R = \{ t \mid t: \mathcal{X} \to [0,1]^Q \}$,
where, for any $t \in R$, $q \in Q$ and $x \in \mathcal{X}$, $t(x)(q)$ is the
probability of deciding that the output $x$ is produced
by the state $q$.
For any $t \in R$ we denote the expected risk by
$$R(t, \xi, f) = \sum_{q \in Q} \sum_{q' \in Q} \xi(q)\, L(q, q') \int t(x)(q')\, f_q(x)\, d\sigma(x).$$
The Bayesian decision procedure $\tilde{t} \in R$ assures
the minimum risk, that is,
$$R(\tilde{t}, \xi, f) = \inf_{t \in R} R(t, \xi, f) = \Phi(\xi, f),$$
and it is given by
$$\tilde{t}(x)(q^*) = \begin{cases} 1, & T(q^*, x) < \min_{q \in Q \setminus \{q^*\}} T(q, x) \\ 0, & T(q^*, x) > \min_{q \in Q \setminus \{q^*\}} T(q, x) \\ \alpha_{q^*}, & T(q^*, x) = \min_{q \in Q \setminus \{q^*\}} T(q, x) \end{cases} \quad (1)$$

where

$$T(q, x) = \sum_{q' \in Q} \xi(q')\, L(q', q)\, f_{q'}(x), \quad (2)$$

$\sum_{q \in A} \alpha_q = 1$, $\alpha_q \geq 0$ for $q \in A$, and
$A = \{ q \in Q \mid T(q, x) = \min_{q' \in Q} T(q', x) \}$.
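As a concrete illustration, the deterministic part of the Bayes procedure $(1)$ reduces to choosing the state that minimizes the criterion $T(q, x)$ of $(2)$. The following minimal sketch assumes a two-state automaton with 0/1 loss and unit-variance Gaussian output densities; all parameter values are illustrative and not taken from the paper.

```python
import numpy as np
from math import exp, pi, sqrt

def bayes_decision(x, states, xi, L, densities):
    """Deterministic version of the Bayes procedure (1): choose the state
    q* minimizing T(q, x) = sum_{q'} xi(q') L(q', q) f_{q'}(x)."""
    f = {q: densities[q](x) for q in states}                    # f_{q'}(x)
    T = [sum(xi[qp] * L[(qp, q)] * f[qp] for qp in states) for q in states]
    return states[int(np.argmin(T))]

# Hypothetical two-state example: 0/1 loss, Gaussian output densities.
gauss = lambda m: (lambda x: exp(-(x - m) ** 2 / 2) / sqrt(2 * pi))
states = ["q1", "q2"]
xi = {"q1": 0.5, "q2": 0.5}
L = {(a, b): 0.0 if a == b else 1.0 for a in states for b in states}
densities = {"q1": gauss(-1.0), "q2": gauss(1.0)}
print(bayes_decision(-0.8, states, xi, L, densities))  # prints q1
```

With 0/1 loss, minimizing $T(q, x)$ coincides with the familiar maximum a posteriori decision, since then $T(q, x) = \sum_{q' \neq q} \xi(q') f_{q'}(x)$.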
The true evolution in the space $Q$ of non-observable
variables is governed by probabilistic laws
$(\tau_n)_{n \geq 0}$, where $\tau_n$ represents the probability
distribution on $Q$ at the moment $n$.
Let $(u_n)_{n \geq 0}$ be a sequence of subjective utilities
assigned to the states of the automaton,
$u_n: Q \to [0, \infty)$ for $n \geq 0$. We assume that, for any
$n \geq 1$, $\sum_{q \in Q} u_n(q) > 0$. For any $n \geq 0$ and $q \in Q$,
$u_n(q)$ stands for the subjective utility assigned to
the state $q$ at the moment $n$. Typically, $u_n(q)$ can be
taken as the relative emitting frequency of the state $q$
during the time interval $[0, n]$.
In case the HMM evolution is directly
observable on a certain time interval $[1, N]$, that is, a
sequence of $N$ realizations of both processes
$(\Lambda_n)_{n \geq 0}$ and $(X_n)_{n \geq 0}$ is available to the experimenter, we
get a learning sequence of length $N$ which can be
used to estimate the hidden evolution on $Q$ as well
as to derive estimations of the conditional density
functions $f_{n,q}$, $q \in Q$. Let $(g_n)_{n \geq 1}$ be a sequence of
measurable functions, $g_n: \mathcal{X} \times \mathcal{X} \to [0, \infty)$, $n \geq 1$,
such that the following regularity conditions hold:

$(A_1)$ for any $n \geq 1$, $\int g_n(x, y)\, d\sigma(y) = 1$, $\sigma$-a.s.;

$(A_2)$ for any $n \geq 1$ and $x, y \in \mathcal{X}$, $0 \leq g_n(x, y) \leq 1$;
$(A_3)$ for any $q \in Q$,
$$\lim_{n \to \infty} E_q[g_n(x, X)] = \lim_{n \to \infty} \int g_n(x, y)\, f_q(y)\, d\sigma(y) = f_q(x), \quad \sigma\text{-a.s.}$$
Our method is a supervised technique based on
the learning sequence $S = ((\Lambda_n, X_n) \mid n \geq 1)$, where
the true probability distribution $\tau_n$ is approximated
by a weighting process $(\xi_n(q))_{n \geq 0}$, $q \in Q$, defined by
$$\xi_n(q) = \frac{\xi(q)\, u_n(q)}{\sum_{q' \in Q} \xi(q')\, u_n(q')},$$
representing the guess that $q$ is the emitting state at
the moment $n$. The decision procedure $\tilde{t}^*_n$ is defined
by $(1)$ in terms of $\xi_n(q)$ and
$$f_{n,q}(x) = \frac{1}{n\, \xi_n(q)} \sum_{j=1}^{n} \delta(q, \Lambda_j)\, g_j(x, X_j),$$
where $\delta(q, q') = 1$ if $q = q'$ and $\delta(q, q') = 0$ if
$q \neq q'$. The criterion function $T(q, x)$ given by $(2)$
is replaced by
$$T_n(q, x) = \sum_{q' \in Q} \xi_n(q')\, L(q', q)\, f_{n,q'}(x).$$
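The weighting process and the density estimate above can be sketched numerically. In the snippet below, $u_n(q)$ is taken as the relative emitting frequency (the "typical" choice mentioned earlier), and $g$ is a fixed-bandwidth Gaussian kernel; both the kernel and the bandwidth `h` are illustrative choices of ours, not prescribed by the paper.

```python
import numpy as np

def estimate(labels, samples, x, states, xi, h=0.5):
    """Sketch of the supervised estimates xi_n(q) and f_{n,q}(x) from a
    learning sequence (labels = Lambda_j, samples = X_j)."""
    n = len(labels)
    g = lambda x, y: np.exp(-(x - y) ** 2 / (2 * h ** 2)) / (h * np.sqrt(2 * np.pi))
    u = {q: sum(l == q for l in labels) / n for q in states}   # u_n(q)
    Z = sum(xi[q] * u[q] for q in states)
    xi_n = {q: xi[q] * u[q] / Z for q in states}               # weighting process
    # f_{n,q}(x) = (1 / (n * xi_n(q))) * sum_j delta(q, Lambda_j) g(x, X_j)
    f_n = {q: (sum(g(x, y) for l, y in zip(labels, samples) if l == q)
               / (n * xi_n[q]) if xi_n[q] > 0 else 0.0)
           for q in states}
    return xi_n, f_n

xi_n, f_n = estimate(["a", "a", "b"], [0.1, -0.2, 2.0], x=0.0,
                     states=["a", "b"], xi={"a": 0.5, "b": 0.5})
print(xi_n)   # the weights sum to 1; "a" dominates near x = 0
```

The criterion $T_n(q, x)$ is then obtained from these estimates exactly as $T(q, x)$ was obtained from $\xi$ and $f_q$.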
2 QUALITATIVE ANALYSIS OF
THE LEARNING SCHEME
Let $R^*_n(\xi) = E[R(\tilde{t}^*_n, \xi, f)]$ be the expected risk
corresponding to the random decision procedure $\tilde{t}^*_n$
when $\xi$ is the true probability distribution on $Q$ and
$f = (f_q, q \in Q)$ is the set of output density
functions.
Theorem 1. Let $(g_n)_{n \geq 0}$ be a sequence of
measurable functions such that the assumptions
$A_1$, $A_2$, $A_3$, $A_4$ hold, where

$(A_4)$ for any $k \geq 1$, $q \in Q$ and $x \in \mathcal{X}$,
$E_q[g_k(x, X)] = f_q(x)$.

If $S = ((\Lambda_n, X_n) \mid n \geq 1)$ is a learning sequence
such that the random elements $(\Lambda_n, X_n)$, $n \geq 1$, are
independent, $\Lambda_n$ is distributed $\xi$ and $X_n$ is
distributed $f_q$ if $\Lambda_n = q$, then
$$\lim_{n \to \infty} R^*_n(\xi) = \Phi(\xi, f).$$

Proof: The conclusion can be established using
straightforward computations and invoking the
strong law of large numbers and the dominated
convergence theorem.
Theorem 2. Let $S = ((\Lambda_n, X_n) \mid n \geq 1)$ be a
learning sequence such that the random elements
$(\Lambda_n, X_n)$, $n \geq 1$, are independent, $\Lambda_n$ is distributed
$\tau_n$ and $X_n$ is distributed $f_q$ if $\Lambda_n = q$. If for the
sequence $(g_n)_{n \geq 0}$ the assumptions $A_1$, $A_2$, $A_3$, $A_4$
hold and, for any $q \in Q$,
$$\lim_{n \to \infty} \frac{1}{n} \sum_{j=1}^{n} \tau_j(q) = \tau(q),$$
then
$$\lim_{n \to \infty} E[R(\tilde{t}^*_n, \tau, f)] = \Phi(\tau, f).$$
Proof: The following series of inequalities can be
derived:
$$0 \leq E[R(\tilde{t}^*_n, \tau, f)] - \Phi(\tau, f) \leq \sum_{q \in Q} L_q \int E\left| \xi_n(q) f_{n,q}(x) - \tau(q) f_q(x) \right| d\sigma(x)\; +$$
$$+ \sum_{q \in Q} \sum_{q' \in Q} L(q, q') \left| \frac{1}{n} \sum_{j=1}^{n} \tau_j(q) - \tau(q) \right| \int \tilde{t}(x)(q')\, f_q(x)\, d\sigma(x).$$
Obviously, the second term converges to $0$ when
$n \to \infty$.
Also, using the strong law of large numbers, we
obtain
$$\lim_{n \to \infty} \left| \frac{1}{n} \sum_{j=1}^{n} \delta(q, \Lambda_j)\, g_j(x, X_j) - \tau(q) f_q(x) \right| = 0 \quad P\text{-a.s.}$$
for any $x \in \mathcal{X}$, $q \in Q$.
Using the dominated convergence theorem, we
get
$$\lim_{n \to \infty} E\left| \xi_n(q) f_{n,q}(x) - \tau(q) f_q(x) \right| = 0$$
for any $x \in \mathcal{X}$, $q \in Q$, which finally implies that, for
any $q \in Q$,
$$\lim_{n \to \infty} \int E\left| \xi_n(q) f_{n,q}(x) - \tau(q) f_q(x) \right| d\sigma(x) = 0,$$
which implies the conclusion of Theorem 2.
Theorem 3. Assume that the conditions
mentioned in Theorem 2 hold. If, for any $q \in Q$,
$\lim_{n \to \infty} \tau_n(q) = \tau(q)$, then
$$\lim_{n \to \infty} E[R(\tilde{t}^*_n, \tau_n, f)] = \Phi(\tau, f).$$
Proof: Since
$$E[R(\tilde{t}^*_n, \tau_n, f)] - \Phi(\tau, f) = \left[ E[R(\tilde{t}^*_n, \tau_n, f)] - \Phi(\tau_n, f) \right] + \left[ \Phi(\tau_n, f) - \Phi(\tau, f) \right],$$
we obtain, with $\tilde{t}_{\tau}$ and $\tilde{t}_{\tau_n}$ denoting the Bayesian
procedures $(1)$ corresponding to $(\tau, f)$ and $(\tau_n, f)$,
respectively,
$$\left| \Phi(\tau_n, f) - \Phi(\tau, f) \right| \leq \sum_{q \in Q} \sum_{q' \in Q} L(q, q')\, \tau(q) \int f_q(x) \left| \tilde{t}_{\tau}(x)(q') - \tilde{t}_{\tau_n}(x)(q') \right| d\sigma(x) + \sum_{q \in Q} L_q \left| \tau_n(q) - \tau(q) \right|.$$
Using the dominated convergence theorem, we get
$$\lim_{n \to \infty} \sum_{q \in Q} \sum_{q' \in Q} L(q, q')\, \tau(q) \int f_q(x) \left| \tilde{t}_{\tau}(x)(q') - \tilde{t}_{\tau_n}(x)(q') \right| d\sigma(x) = 0$$
and, consequently,
$$\lim_{n \to \infty} \left[ \Phi(\tau_n, f) - \Phi(\tau, f) \right] = 0.$$
Using Theorem 2, the definition of the procedures
$\tilde{t}_{\tau}$, $\tilde{t}_{\tau_n}$ and the dominated convergence
theorem, we get
$$\lim_{n \to \infty} \sum_{q \in Q} \sum_{q' \in Q} E\left[ L(q, q') \left| \tau_n(q) - \tau(q) \right| \int f_q(x)\, \tilde{t}^*_n(x)(q')\, d\sigma(x) \right] = 0.$$
Since
$$\lim_{n \to \infty} \sum_{q \in Q} \sum_{q' \in Q} E\left[ L(q, q') \int \left| \xi_n(q) f_{n,q}(x) - \tau_n(q) f_q(x) \right| \tilde{t}^*_n(x)(q')\, d\sigma(x) \right] = 0,$$
we finally get
$$\lim_{n \to \infty} E\left[ R(\tilde{t}^*_n, \tau_n, f) - \Phi(\tau_n, f) \right] = 0,$$
which, together with the previous limit, implies
$$\lim_{n \to \infty} E[R(\tilde{t}^*_n, \tau_n, f)] = \Phi(\tau, f).$$
Let us assume that $\mathcal{X}$ is a denumerable set and
$\sigma(\{x\}) = 1$ for any $x \in \mathcal{X}$. Obviously, taking $(g_n)_{n \geq 0}$ such
that for any $n \geq 0$ and any $x, y \in \mathcal{X}$,
$g_n(x, y) = \delta(x, y)$, the conditions $A_1$, $A_2$, $A_3$ hold.
Since for any $q \in Q$ and $k \geq 1$,
$$E_q[g_k(x, X)] = \sum_{y \in \mathcal{X}} g_k(x, y)\, f_q(y) = f_q(x),$$
we get that $A_4$ also holds.
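In this denumerable setting with $g_n = \delta$, the product $\xi_n(q) f_{n,q}(x) = \frac{1}{n}\sum_{j=1}^n \delta(q, \Lambda_j)\,\delta(x, X_j)$ is simply the joint relative frequency of the pair $(q, x)$ in the learning sequence, which a short sketch makes explicit (the sample values are illustrative):

```python
from collections import Counter

def empirical_joint(labels, samples):
    """With g_n(x, y) = delta(x, y) and sigma the counting measure,
    xi_n(q) f_{n,q}(x) reduces to the relative frequency of (q, x)."""
    n = len(labels)
    return {qx: c / n for qx, c in Counter(zip(labels, samples)).items()}

freq = empirical_joint(["a", "a", "b", "a"], [0, 1, 0, 0])
print(freq[("a", 0)])  # 2 of the 4 pairs are ("a", 0) -> prints 0.5
```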
Theorem 4. Let $S = ((\Lambda_n, X_n) \mid n \geq 1)$ be a
learning sequence such that $(\Lambda_n, n \geq 1)$ is a Markov
chain of stationary transition probabilities having a
unique recurrent class $Q'$. If $(X_n, n \geq 1)$ are
independent and $X_n$ is distributed $f_q$ if $\Lambda_n = q$,
then
$$\lim_{n \to \infty} E[R(\tilde{t}^*_n, \tau, f)] = \Phi(\tau, f),$$
where $\tau$ is the probability distribution of $\Lambda_1$.
Proof: For $q \in Q$ and $x \in \mathcal{X}$, we define
$$f(\Lambda, X) = \delta(q, \Lambda)\, \delta(x, X). \quad (3)$$
Obviously, $f$ is $(\Lambda_1, X_1)$-measurable. Since
$$E\{ f(\Lambda_1, X_1) \} = \sum_{q' \in Q} \sum_{x' \in \mathcal{X}} \delta(q, q')\, \delta(x, x')\, f_{q'}(x')\, \tau(q') < \infty,$$
we obtain
$$E\{ f(\Lambda_1, X_1) \} = \tau(q)\, f_q(x). \quad (4)$$
Also, the series $\sum_{(q,x) \in Q \times \mathcal{X}} r(q^*, x^*; q, x)$ converge
uniformly in $(q^*, x^*)$. We get that, for $f$ defined by
$(3)$, the conditions of Theorem 1 hold.
Using $(4)$, Theorem 1 and the dominated
convergence theorem, we get
$$\lim_{n \to \infty} E \int \left| \xi_n(q) f_{n,q}(x) - \tau(q) f_q(x) \right| d\sigma(x) = 0,$$
which implies
$$\lim_{n \to \infty} \sum_{q \in Q} L_q\, E \int \left| \xi_n(q) f_{n,q}(x) - \tau(q) f_q(x) \right| d\sigma(x) = 0.$$
Finally, since
$$0 \leq E[R(\tilde{t}^*_n, \tau, f)] - \Phi(\tau, f) \leq \sum_{q \in Q} L_q\, E \int \left| \xi_n(q) f_{n,q}(x) - \tau(q) f_q(x) \right| d\sigma(x),$$
we get
$$\lim_{n \to \infty} E[R(\tilde{t}^*_n, \tau, f)] = \Phi(\tau, f).$$
3 NEURAL IMPLEMENTATION
We assume that $\mathcal{X} = \mathbb{R}^d$. Then the neural
architecture consists of the layers $F_X$ and $F_H$ of $d$ and
$|Q|$ neurons, respectively. The neurons of the input
layer $F_X$ have no local memory; they distribute the
corresponding inputs toward the neurons of the
hidden layer $F_H$. Each neuron of $F_H$ is assigned to
one of the pattern classes from $Q$. For simplicity's
sake, we refer to each neuron of $F_H$ by its
corresponding pattern class.

The local memory of each neuron $q \in F_H$ consists of
$\xi_n(q)$ and the parameters needed
to compute $f_{n,q}$. The activation function of the
neuron $q \in F_H$ at the moment $n$ is
$h_n(q, \mathbf{x}) = \xi_n(q)\, f_{n,q}(\mathbf{x})$. The layer $F_H$ is fully
connected, the connection from $q$ to $q'$ being weighted
by $L(q, q')$. Consequently, the input
$\mathbf{x} = (x_1, \ldots, x_d)$ applied to $F_X$ induces the neural
activations
$$net(q, 0) = -T_n(q, \mathbf{x}) = -\sum_{q' \in Q} \xi_n(q')\, L(q', q)\, f_{n,q'}(\mathbf{x}), \quad q \in F_H.$$
The recognition task corresponds to the
identification of the states $q$ for which $T(q, \mathbf{x})$ is
minimum. This task is solved by installing a discrete-time
competitive process among the neurons of $F_H$.
Let $S_q(t) = f(net(q, t))$ be the output of the neuron
$q \in F_H$ at the moment $t$, where the competition
process starts at the moment $0$ and the activation
function $f$ is given by
$$f(u) = \begin{cases} 0, & u \geq 0 \\ u, & u < 0. \end{cases}$$
We denote by $S(t) = (S_q(t), q \in F_H)$ the state at the
moment $t$. The initial state is
$S(0) = (f(net(q, 0)), q \in F_H)$.

The synaptic weights of the connections during
the competition are
$$w_{q,q'} = \begin{cases} 1, & q' = q \\ -\varepsilon, & q' \neq q, \end{cases}$$
where $\varepsilon > 0$ is a vigilance parameter.

The update of the state is performed
synchronously, that is, for any $q \in F_H$,
$$net(q, t+1) = S_q(t) - \varepsilon \sum_{q' \neq q} S_{q'}(t) = (1 + \varepsilon)\, S_q(t) - \varepsilon \sum_{q' \in F_H} S_{q'}(t),$$
$$S_q(t+1) = f(net(q, t+1)).$$
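The synchronous update above can be simulated directly. The sketch below assumes $net(q, 0) = -T_n(q, \mathbf{x})$, so that every initial output is non-positive; the criterion values and the vigilance parameter are illustrative.

```python
import numpy as np

def f(u):
    # activation: identity on negative arguments, zero otherwise
    return np.where(u < 0, u, 0.0)

def compete(T, eps=0.1, steps=200):
    """Discrete-time inhibitive competition among the neurons of F_H,
    started from S_q(0) = f(-T_n(q, x))."""
    S = f(-np.asarray(T, dtype=float))
    for _ in range(steps):
        # net(q, t+1) = S_q(t) - eps * sum_{q' != q} S_{q'}(t)
        S = f(S - eps * (S.sum() - S))
    return S

S = compete([1.0, 2.0])   # two illustrative criterion values T_n(q, x)
print(S)  # one output is absorbed at 0, the other remains strictly negative
```

The simulation exhibits the behavior analyzed below: outputs stay non-positive, an output that reaches $0$ remains $0$ (the state $0$ is absorbing), and the gaps between distinct non-zero outputs grow by the factor $(1 + \varepsilon)$ at each step, so the competition settles in a finite number of stages.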
The conclusions concerning the behavior of the
competition in the space of states stem from the
following arguments. Note that $S_q(t) \leq 0$ for any
$t \geq 0$ and $q \in F_H$.

1. If $S_q(t) = 0$, then $net(q, t+1) \geq 0$, hence
$S_q(t+1) = 0$. Moreover, for any $t' \geq t$, $S_q(t') = 0$.

2. Assume that for some $q \in F_H$ and $t \geq 0$,
$S_q(t) < 0$. If $net(q, t+1) < 0$, then
$$0 \geq S_q(t+1) = S_q(t) - \varepsilon \sum_{q' \neq q} S_{q'}(t) \geq S_q(t),$$
and $S_q(t+1) = S_q(t)$ if and only if $S_{q'}(t) = 0$ for all
$q' \neq q$.
3. Assume that for some $q, q' \in F_H$ and $t \geq 0$,
$S_{q'}(t) = S_q(t) < 0$. Then
$$net(q, t+1) = (1 + \varepsilon)\, S_q(t) - \varepsilon \sum_{q'' \in F_H} S_{q''}(t),$$
$$net(q', t+1) = (1 + \varepsilon)\, S_{q'}(t) - \varepsilon \sum_{q'' \in F_H} S_{q''}(t),$$
that is $S_{q'}(t+1) = S_q(t+1) \leq 0$ and, for any $t' \geq t$,
$S_{q'}(t') = S_q(t')$.
4. Assume that for some $q, q' \in F_H$ and $t \geq 0$,
$S_{q'}(t) < S_q(t) < 0$. Then
$$net(q, t+1) = (1 + \varepsilon)\, S_q(t) - \varepsilon \sum_{q'' \in F_H} S_{q''}(t),$$
$$net(q', t+1) = (1 + \varepsilon)\, S_{q'}(t) - \varepsilon \sum_{q'' \in F_H} S_{q''}(t),$$
that is $net(q, t+1) > net(q', t+1)$.

Obviously, if $net(q', t+1) \geq 0$ then
$S_{q'}(t+1) = S_q(t+1) = 0$. Also, if
$net(q, t+1) \geq 0 > net(q', t+1)$ then
$S_{q'}(t+1) = net(q', t+1) < 0 = S_q(t+1)$,
so we get $S_{q'}(t+1) \leq S_q(t+1)$.

Finally, if $net(q, t+1) < 0$ then
$$net(q, t+1) = (1 + \varepsilon)\, S_q(t) - \varepsilon \sum_{q'' \in F_H} S_{q''}(t) > (1 + \varepsilon)\, S_{q'}(t) - \varepsilon \sum_{q'' \in F_H} S_{q''}(t) = net(q', t+1),$$
that is $S_{q'}(t+1) < S_q(t+1)$.

Consequently, if $S_{q'}(t) < S_q(t)$, then
$S_{q'}(t+1) \leq S_q(t+1)$. Moreover, for any $t' \geq t$,
$S_{q'}(t') \leq S_q(t')$.

From
$$S_q(t+1) - S_{q'}(t+1) = (1 + \varepsilon) \left( S_q(t) - S_{q'}(t) \right)$$
we get
$$S_q(t) - S_{q'}(t) = (1 + \varepsilon)^t \left( S_q(0) - S_{q'}(0) \right),$$
that is, if both components of the state vector were
different from $0$ for any $t \geq 0$, then
$\lim_{t \to \infty} (S_q(t) - S_{q'}(t)) = \infty$, hence
$\lim_{t \to \infty} S_{q'}(t) = -\infty$,
which obviously contradicts the conclusion
established by 2.

We arrive at the conclusion that there exists
$t(q') \geq 0$ such that $S_{q'}(t) = 0$ for any $t \geq t(q')$.
5. Assume that for $q, q' \in F_H$,
$0 < T(q, \mathbf{x}) < T(q', \mathbf{x})$, hence
$S_{q'}(0) < S_q(0) < 0$.

Using the previously obtained conclusions, we
get that, for any $t \geq 0$, $S_{q'}(t) \leq S_q(t) \leq 0$, and there
exists $t(q') \geq 0$ such that $S_{q'}(t) = 0$ for any
$t \geq t(q')$. Therefore, the competition installed by the
above mentioned process among the neurons of $F_H$
determines that the outputs of all neurons $q'$ that
received values $T(q', \mathbf{x}) > \min_{q \in F_H} T(q, \mathbf{x})$ are inhibited
in a finite number of stages; that is, there exists $t_{fin}$
such that $S_{q'}(t_{fin}) \neq 0$ if and only if
$T(q', \mathbf{x}) = \min_{q \in F_H} T(q, \mathbf{x})$.

Moreover, using remark 3, we get that, for
any $q', q'' \in F_H$ such that
$T(q', \mathbf{x}) = T(q'', \mathbf{x}) = \min_{q \in F_H} T(q, \mathbf{x})$,
$S_{q'}(t_{fin}) = S_{q''}(t_{fin}) \neq 0$ and, for any $t \geq 0$,
$S_{q'}(t) = S_{q''}(t)$.
The local memories of the hidden neurons are
determined in a supervised way by adaptive learning
algorithms using a learning sequence
$S = ((\Lambda_n, X_n) \mid n \geq 1)$. The recurrent relations for
$f_{n,q}$ and $\xi_n(q)$, $n \geq 1$, $q \in F_H$, are derived in terms of the
particular expressions of $u_n(q)$ and $g_n(\mathbf{x}, \mathbf{y})$.
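For the discrete case with $g_n = \delta$ and $u_n(q)$ the relative emitting frequency, the local memory of each hidden neuron reduces to simple counts that admit an obvious recurrent update; the following class is our illustrative instantiation (the paper leaves $u_n$ and $g_n$ generic).

```python
class HiddenNeuron:
    """Illustrative local memory of a hidden neuron q in F_H, updated
    recursively from the learning sequence (Lambda_n, X_n)."""
    def __init__(self, q):
        self.q = q
        self.n = 0        # length of the learning sequence seen so far
        self.count = 0    # number of emissions of q (defines u_n)
        self.A = {}       # x -> #{j <= n : Lambda_j = q, X_j = x}

    def observe(self, label, x):
        # one step of the recurrent update driven by (Lambda_n, X_n)
        self.n += 1
        if label == self.q:
            self.count += 1
            self.A[x] = self.A.get(x, 0) + 1

    def u(self):
        # relative emitting frequency u_n(q) over the sequence seen so far
        return self.count / self.n if self.n else 0.0

neuron = HiddenNeuron("a")
for lab, x in [("a", 0), ("b", 1), ("a", 0)]:
    neuron.observe(lab, x)
print(neuron.u(), neuron.A)  # 2/3 of the emissions came from "a", both at x = 0
```

From these counts, $\xi_n(q)$ is obtained by normalizing $\xi(q)\, u_n(q)$ across the neurons, and $f_{n,q}(x)$ by dividing the accumulated statistic by $n\, \xi_n(q)$, exactly as in Section 1.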
4 CONCLUSIONS
The supervised estimation techniques for the
Bayesian decision procedure in pattern recognition
presented in this paper were tested against data in
automated speech recognition.