On Equivalence between Linear-chain Conditional Random Fields and Hidden Markov Chains

Elie Azeraf^{1,2,a}, Emmanuel Monfrini^{2,b} and Wojciech Pieczynski^{2,c}

^1 Watson Department, IBM GBS, avenue de l’Europe, Bois-Colombes, France
^2 SAMOVAR, CNRS, Telecom SudParis, Institut Polytechnique de Paris, Evry, France
^a https://orcid.org/0000-0003-3595-0826
^b https://orcid.org/0000-0002-7648-2515
^c https://orcid.org/0000-0002-1371-2627

Keywords: Linear-chain CRF, Hidden Markov Chain, Bayesian Segmentation, Natural Language Processing.
Abstract: Practitioners have successfully used hidden Markov chains (HMCs) in various problems for about sixty years. HMCs belong to the family of generative models, and they are often compared to discriminative models, like conditional random fields (CRFs). Authors usually consider CRFs as quite different from HMCs, and CRFs are often presented as interesting alternatives to HMCs. In some areas, like natural language processing (NLP), discriminative models have completely supplanted generative models. However, some recent results show that both families of models are not so different, and that both of them can lead to identical processing power. In this paper, we compare simple linear-chain CRFs to basic HMCs. We show that HMCs are identical to CRFs in that for each CRF we explicitly construct an HMC having the same posterior distribution. Therefore, HMCs and linear-chain CRFs are not different, but only differently parametrized, models.
1 INTRODUCTION
Let $Z_{1:N} = (Z_1, \dots, Z_N)$ be a stochastic sequence, with $Z_n = (X_n, Y_n)$. $X_1, \dots, X_N$ take their values in a finite set $\Omega$, while $Y_1, \dots, Y_N$ take their values in a finite set $\Lambda$. Realizations of $X_{1:N} = (X_1, \dots, X_N)$ are hidden while realizations of $Y_{1:N} = (Y_1, \dots, Y_N)$ are observed, and the problem we deal with is to estimate $X_{1:N} = x_{1:N}$ from $Y_{1:N} = y_{1:N}$.
The simplest model allowing us to deal with this problem is the well-known hidden Markov chain (HMC). In spite of their simplicity, HMCs are very robust and provide quite satisfactory results in many applications. We only cite the pioneering papers (Baum et al., 1970; Rabiner, 1989) and some books (Cappé et al., 2005; Koski, 2001), among a great deal of publications. However, they can turn out to be too simple in complex cases, and thus authors have extended them in numerous directions. In particular, conditional random fields (CRFs) (Lafferty et al., 2001; Sutton and McCallum, 2006) are considered as interesting alternatives to HMCs, especially in the Natural Language Processing (NLP) area. They have also been used in different areas such as diagnosis (Fang et al., 2018; Fang et al., 2019), natural language processing (Jurafsky and Martin, 2009; Jurafsky and Martin, 2021), entity recognition (Song et al., 2019), or relational learning (Sutton and McCallum, 2006). In general, authors consider CRFs as quite different from HMCs, and often prefer the former to the latter. In this paper, we show that CRFs and HMCs may be not so different. More precisely, we show that basic linear-chain CRFs are equivalent to HMCs.
Let us specify what “equivalence” in the paper’s title means. One can notice that HMCs and CRFs cannot be compared directly, as they are of a different nature. Assuming that a “model” is a distribution $p(x_{1:N}, y_{1:N})$, we may say that an HMC is a model, while a CRF is a family of models, in which all models have the same $p(x_{1:N} \mid y_{1:N})$ but can have any $p(y_{1:N})$. We will say that a CRF $p(x_{1:N} \mid y_{1:N})$ is equivalent to an HMC $q(x_{1:N}, y_{1:N})$ if and only if $p(x_{1:N} \mid y_{1:N}) = q(x_{1:N} \mid y_{1:N})$. To show that linear-chain CRFs are equivalent to HMCs, it is thus sufficient to show that for each linear-chain CRF $p(x_{1:N} \mid y_{1:N})$ it is possible to find an HMC $q(x_{1:N}, y_{1:N})$ such that $p(x_{1:N} \mid y_{1:N}) = q(x_{1:N} \mid y_{1:N})$. This is precisely the contribution of the paper.
More generally, let us note that certain criticisms of HMCs, put forward to justify the preference for CRFs, currently appear to be not always entirely justified. For example, in monitoring
problems, two independence conditions inherent to HMCs were put forward to justify this preference. However, these conditions are sufficient conditions for Bayesian processing, not necessary ones. Indeed, it is possible to remove them by considering pairwise Markov chains (PMCs), which extend HMCs and allow the same Bayesian processing (Pieczynski, 2003; Gorynin et al., 2018). Another example is related to NLP. HMCs are considered as generative models and, as such, improper for NLP because the distributions $p(x_{1:N} \mid y_{1:N})$ are difficult to handle (Jurafsky and Martin, 2021; Brants, 2000; McCallum et al., 2000). However, as recently shown in (Azeraf et al., 2020a), while defining Bayesian processing methods, HMCs can also be used in a discriminative way, without calling on $p(y_{1:N} \mid x_{1:N})$. The same is true in the case of other generative models like Naïve Bayes (Azeraf et al., 2021a).
2 LINEAR-CHAIN CRF AND HMC
2.1 Bayesian Classifiers
In the Bayesian framework we consider, there is a loss function $L(x'_{1:N}, x_{1:N})$, where $x_{1:N}$ is the true value and $x'_{1:N}$ is the estimated one. The Bayesian classifier $y_{1:N} \mapsto \hat{x}_{1:N} = \hat{s}_L^B(y_{1:N})$ is optimal in that it minimizes the mean loss $E[L(\hat{s}_L^B(Y_{1:N}), X_{1:N})]$. It is defined with

$\hat{x}_{1:N} = \hat{s}_L^B(y_{1:N}) = \arg\inf_{x'_{1:N}} E[L(x'_{1:N}, X_{1:N}) \mid Y_{1:N} = y_{1:N}]$,  (1)

where $E[L(x'_{1:N}, X_{1:N}) \mid Y_{1:N} = y_{1:N}]$ denotes the conditional expectation. In this paper, we consider the Bayesian classifier $\hat{s}_L^B$ corresponding to the loss function

$L(x'_{1:N}, x_{1:N}) = 1_{[x'_1 \neq x_1]} + \dots + 1_{[x'_N \neq x_N]}$,  (2)

which simply means that the loss is the number of wrongly classified data. Called “maximum posterior mode” (MPM), the related Bayesian classifier is defined with

$\hat{x}_{1:N} = (\hat{x}_1, \dots, \hat{x}_N) = \hat{s}_L^B(y_{1:N}) \iff [\forall n = 1, \dots, N, \; p(\hat{x}_n \mid y_{1:N}) = \sup_{x_n} p(x_n \mid y_{1:N})]$.  (3)
Let us remark that the Bayesian classifiers $\hat{s}_L^B$ only depend on $p(x_{1:N} \mid y_{1:N})$, and are independent from $p(y_{1:N})$. In other words, for any distribution $q(y_{1:N})$, every other law of $(X_{1:N}, Y_{1:N})$ of the form $q(x_{1:N}, y_{1:N}) = p(x_{1:N} \mid y_{1:N}) \, q(y_{1:N})$ gives the same Bayesian classifier $\hat{s}_L^B$. This shows that dividing classifiers into the two categories “generative” and “discriminative”, as usually done, is somewhat misleading, as they all are discriminative. Such a distinction is thus related to the way classifiers are defined, not to their intrinsic structure.
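To make the MPM rule (3) concrete, here is a minimal illustrative sketch in Python; the function name and the array layout are our own choices, not part of the paper. Given the matrix of posterior marginals $p(x_n \mid y_{1:N})$, the MPM estimate simply takes, for each position $n$, the state with the largest marginal.

```python
import numpy as np

def mpm_classifier(posterior_marginals: np.ndarray) -> np.ndarray:
    """MPM estimate of eq. (3): for each position n, return the state
    maximizing the posterior marginal p(x_n | y_{1:N}).

    posterior_marginals: array of shape (N, K), where row n holds
    p(x_n = k | y_{1:N}) for the K possible hidden states.
    """
    return posterior_marginals.argmax(axis=1)

# Toy usage: N = 3 positions, K = 2 hidden states.
marginals = np.array([[0.9, 0.1],
                      [0.4, 0.6],
                      [0.7, 0.3]])
print(mpm_classifier(marginals))  # -> [0 1 0]
```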
2.2 Equivalence between Linear-chain
CRF and a Family of HMCs
We show in this section that for each linear-chain CRF one can find an equivalent HMC, with parameters specified from the considered CRF.
The following general Lemma will be useful in the sequel:
Lemma
Let $W_{1:N} = (W_1, \dots, W_N)$ be a random sequence taking its values in a finite set $\Omega$. Then

(i) $W_{1:N}$ is a Markov chain iff there exist $N-1$ functions $\varphi_1, \dots, \varphi_{N-1}$ from $\Omega^2$ to $\mathbb{R}_+$ such that

$p(w_1, \dots, w_N) \propto \varphi_1(w_1, w_2) \cdots \varphi_{N-1}(w_{N-1}, w_N)$,  (4)

where “$\propto$” means “proportional to”;

(ii) for the Markov chain defined with $\varphi_1, \dots, \varphi_{N-1}$ verifying (4), $p(w_1)$ and $p(w_{n+1} \mid w_n)$ are given with

$p(w_1) = \frac{\beta_1(w_1)}{\sum_{w_1} \beta_1(w_1)}; \quad p(w_{n+1} \mid w_n) = \frac{\varphi_n(w_n, w_{n+1}) \, \beta_{n+1}(w_{n+1})}{\beta_n(w_n)}$,  (5)

where $\beta_1(w_1), \dots, \beta_N(w_N)$ are defined with the following backward recursion:

$\beta_N(w_N) = 1, \quad \beta_n(w_n) = \sum_{w_{n+1}} \varphi_n(w_n, w_{n+1}) \, \beta_{n+1}(w_{n+1})$.  (6)

For the proof see (Lanchantin et al., 2011), Lemma 2.1, page 6.
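As an illustration of the Lemma, the following sketch (ours, not from the cited reference; it assumes the $\varphi_n$ are stored as $K \times K$ NumPy matrices) implements the backward recursion (6) and recovers the initial and transition probabilities (5).

```python
import numpy as np

def markov_chain_from_phi(phi):
    """Given the N-1 nonnegative (K, K) matrices phi_1, ..., phi_{N-1} of eq. (4)
    (phi[n][i, j] is the value taken on states i, j), return the initial law p(w_1)
    and the transition matrices p(w_{n+1} | w_n) of eq. (5), computed with the
    backward quantities beta of eq. (6)."""
    N = len(phi) + 1
    K = phi[0].shape[0]
    beta = [None] * N
    beta[N - 1] = np.ones(K)                       # beta_N(w_N) = 1
    for n in range(N - 2, -1, -1):                 # backward recursion (6)
        beta[n] = phi[n] @ beta[n + 1]
    p_w1 = beta[0] / beta[0].sum()                 # p(w_1) of eq. (5)
    transitions = [phi[n] * beta[n + 1][None, :] / beta[n][:, None]
                   for n in range(N - 1)]          # p(w_{n+1} | w_n) of eq. (5)
    return p_w1, transitions                       # each transition matrix is row-stochastic
```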
We can state the following Proposition.
Proposition
Let $Z_{1:N} = (Z_1, \dots, Z_N)$ be a stochastic sequence, with $Z_n = (X_n, Y_n)$. Each $(X_n, Y_n)$ takes its values in $\Omega \times \Lambda$, with $\Omega$ and $\Lambda$ finite. If $Z_{1:N}$ is a linear-chain conditional random field (CRF) with the distribution $p(x_{1:N} \mid y_{1:N})$ defined with

$p(x_{1:N} \mid y_{1:N}) = \frac{1}{\kappa(y_{1:N})} \exp\left[\sum_{n=1}^{N-1} V_n(x_n, x_{n+1}) + \sum_{n=1}^{N} U_n(x_n, y_n)\right]$,  (7)

where $U_n$ and $V_n$ are arbitrary “potential functions”, then (7) is the posterior distribution of the HMC

$q(x_{1:N}, y_{1:N}) = q_1(x_1) \, q_1(y_1 \mid x_1) \prod_{n=2}^{N} q_n(x_n \mid x_{n-1}) \, q_n(y_n \mid x_n)$,  (8)
defined as follows.
Let

$\psi_n(x_n) = \sum_{y_n} \exp(U_n(x_n, y_n))$,  (9)

$\varphi_1(x_1, x_2) = \exp(V_1(x_1, x_2)) \, \psi_1(x_1) \, \psi_2(x_2)$,  (10)

and, for $n = 2, \dots, N-1$:

$\varphi_n(x_n, x_{n+1}) = \exp(V_n(x_n, x_{n+1})) \, \psi_{n+1}(x_{n+1})$.  (11)

Besides, let $\beta_N(x_N) = 1$, and

$\beta_n(x_n) = \sum_{x_{n+1}} \varphi_n(x_n, x_{n+1}) \, \beta_{n+1}(x_{n+1})$  (12)

for $n = N-1, \dots, 1$.

Then $q(x_{1:N}, y_{1:N})$ is given with

$q(x_1) = \frac{\beta_1(x_1)}{\sum_{x_1} \beta_1(x_1)}$;  (13)

$q(x_{n+1} \mid x_n) = \frac{\varphi_n(x_n, x_{n+1}) \, \beta_{n+1}(x_{n+1})}{\beta_n(x_n)}$;  (14)

$q(y_n \mid x_n) = \frac{\exp(U_n(x_n, y_n))}{\psi_n(x_n)}$.  (15)
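Before the proof, note that the construction (9)-(15) is directly computable. The following Python sketch is our own illustration (the array layouts and the name hmc_from_crf are assumptions, not part of the paper): it takes the CRF potentials $V_n$ and $U_n$ as matrices and returns the initial law (13), the transitions (14) and the emissions (15) of the equivalent HMC.

```python
import numpy as np

def hmc_from_crf(V, U):
    """Build the HMC (8) equivalent to the linear-chain CRF (7).

    V: list of N-1 arrays of shape (K, K); V[n] holds the potential V_{n+1}.
    U: list of N arrays of shape (K, M); U[n] holds the potential U_{n+1}.
    Returns the initial law (13), the transition matrices (14) and the
    emission matrices (15)."""
    N, K = len(U), U[0].shape[0]
    psi = [np.exp(u).sum(axis=1) for u in U]                              # eq. (9)
    phi = [np.exp(V[0]) * psi[0][:, None] * psi[1][None, :]]              # eq. (10)
    phi += [np.exp(V[n]) * psi[n + 1][None, :] for n in range(1, N - 1)]  # eq. (11)
    beta = [None] * N
    beta[N - 1] = np.ones(K)
    for n in range(N - 2, -1, -1):                                        # eq. (12)
        beta[n] = phi[n] @ beta[n + 1]
    q_x1 = beta[0] / beta[0].sum()                                        # eq. (13)
    transitions = [phi[n] * beta[n + 1][None, :] / beta[n][:, None]
                   for n in range(N - 1)]                                 # eq. (14)
    emissions = [np.exp(U[n]) / psi[n][:, None] for n in range(N)]        # eq. (15)
    return q_x1, transitions, emissions
```

Note that, in this parametrization, the transition and emission laws are not homogeneous in general: they depend on $n$ through $\varphi_n$, $\beta_n$ and $U_n$.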
Proof
According to (9)-(15), the distribution (7) can be written

$p(x_{1:N} \mid y_{1:N}) = \frac{1}{\kappa(y_{1:N})} \prod_{n=1}^{N-1} \varphi_n(x_n, x_{n+1}) \prod_{n=1}^{N} q(y_n \mid x_n)$.

According to the Lemma, $\prod_{n=1}^{N-1} \varphi_n(x_n, x_{n+1})$ is proportional to the distribution of the Markov chain $q(x_{1:N})$ defined by (13) and (14), with $\beta_n(x_n)$ defined by (12). Therefore $p(x_{1:N} \mid y_{1:N}) \propto q(x_{1:N}) \prod_{n=1}^{N} q(y_n \mid x_n) = q(x_{1:N}, y_{1:N})$, so that $p(x_{1:N} \mid y_{1:N}) = q(x_{1:N} \mid y_{1:N})$, which ends the proof.
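As a sanity check of the Proposition, one can compare, on a small example, the CRF posterior (7) computed by brute-force enumeration with the posterior of the HMC built by the hmc_from_crf sketch above; the two coincide up to numerical precision. The helper names and the toy dimensions below are again our own illustrative choices.

```python
import itertools

import numpy as np

def crf_posterior(V, U, y):
    """Brute-force p(x_{1:N} | y_{1:N}) of eq. (7) over a small state space."""
    N, K = len(U), U[0].shape[0]
    scores = {}
    for x in itertools.product(range(K), repeat=N):
        s = sum(V[n][x[n], x[n + 1]] for n in range(N - 1))
        s += sum(U[n][x[n], y[n]] for n in range(N))
        scores[x] = np.exp(s)
    Z = sum(scores.values())
    return {x: v / Z for x, v in scores.items()}

def hmc_posterior(q_x1, transitions, emissions, y):
    """Brute-force q(x_{1:N} | y_{1:N}) of the HMC (8)."""
    N, K = len(emissions), len(q_x1)
    joint = {}
    for x in itertools.product(range(K), repeat=N):
        p = q_x1[x[0]] * emissions[0][x[0], y[0]]
        for n in range(1, N):
            p *= transitions[n - 1][x[n - 1], x[n]] * emissions[n][x[n], y[n]]
        joint[x] = p
    Z = sum(joint.values())
    return {x: p / Z for x, p in joint.items()}

# Tiny example: N = 3 positions, K = 2 hidden states, M = 2 observed symbols.
rng = np.random.default_rng(0)
V = [rng.normal(size=(2, 2)) for _ in range(2)]   # V_1, V_2
U = [rng.normal(size=(2, 2)) for _ in range(3)]   # U_1, U_2, U_3
y = [0, 1, 1]
q_x1, A, B = hmc_from_crf(V, U)                   # construction (9)-(15), sketch above
p_crf = crf_posterior(V, U, y)
p_hmc = hmc_posterior(q_x1, A, B, y)
assert all(abs(p_crf[x] - p_hmc[x]) < 1e-10 for x in p_crf)
```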
2.3 HMCs in Natural Language
Processing
Let us notice that the relationship between linear-chain CRFs and HMCs has been pointed out and discussed by some authors in the frame of natural language processing (NLP). For example, in (Sutton and McCallum, 2006) the authors remark that in linear-chain CRFs it is possible to compute the posterior marginals $p(x_n \mid y_{1:N})$ using the same forward-backward method as in HMCs. However, they keep on saying that CRFs are more general and better suited for applications in NLP. In particular, they consider that CRFs are able to model any kind of features while HMCs cannot. Similarly, in (Jurafsky and Martin, 2021), paragraph 8.5, the authors recall that in general it is hard for generative models like HMCs to add arbitrary features directly into the model in a clean way.
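For completeness, here is a minimal sketch of the classic forward-backward computation of the posterior marginals $p(x_n \mid y_{1:N})$ for an HMC with a homogeneous transition matrix and emission matrix; the same recursions underlie the linear-chain CRF computation mentioned above. The homogeneity assumption and the names are ours, for illustration only.

```python
import numpy as np

def forward_backward(pi, A, B, y):
    """Posterior marginals p(x_n | y_{1:N}) of a homogeneous HMC.

    pi: (K,) initial law, A: (K, K) transition matrix,
    B: (K, M) emission matrix, y: list of N observed symbol indices."""
    N, K = len(y), len(pi)
    alpha = np.zeros((N, K))
    beta = np.zeros((N, K))
    alpha[0] = pi * B[:, y[0]]
    alpha[0] /= alpha[0].sum()                     # normalize to avoid underflow
    for n in range(1, N):                          # forward pass
        alpha[n] = (alpha[n - 1] @ A) * B[:, y[n]]
        alpha[n] /= alpha[n].sum()
    beta[N - 1] = 1.0
    for n in range(N - 2, -1, -1):                 # backward pass
        beta[n] = A @ (B[:, y[n + 1]] * beta[n + 1])
        beta[n] /= beta[n].sum()
    marginals = alpha * beta                       # proportional to p(x_n | y_{1:N})
    return marginals / marginals.sum(axis=1, keepdims=True)
```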
These arguments are no longer valid since the results presented in (Azeraf et al., 2020a). Indeed, according to these results, the inability to take certain features into account is not due to the structure of HMCs, but to the way of calculating the a posteriori laws. More precisely, replacing the classic forward-backward computation by an “entropic” one allows HMCs to take into account the same features as discriminative linear-chain CRFs do. A similar result concerning Naïve Bayes is given in (Azeraf et al., 2021a).
Let us notice that HMCs defined with (8) are even more general than linear-chain CRFs defined with (7). Indeed, in the latter we have $p(x_{1:N} \mid y_{1:N}) > 0$, while in the former $q(x_{1:N} \mid y_{1:N}) \geq 0$. However, this is not a very serious advantage, as one could extend (7) by removing the function $\exp$ and by considering

$p(x_{1:N} \mid y_{1:N}) = \frac{1}{\kappa(y_{1:N})} \prod_{n=1}^{N-1} V_n(x_n, x_{n+1}) \prod_{n=1}^{N} U_n(x_n, y_n)$,

with all $V_n(x_n, x_{n+1})$ and $U_n(x_n, y_n)$ positive or null.
3 CONCLUSION AND
PERSPECTIVES
We discussed relationships between simple linear-chain CRFs and HMCs. We showed that for each linear-chain CRF, which is a family of models, one can find an HMC giving the same posterior distribution. In addition, the related HMC's parameters can be computed from those of the CRF. In particular, combined with the results in (Azeraf et al., 2020a), this shows that HMCs can be used in NLP with the same efficiency as CRFs.

Let us mention some perspectives for further work. One recurrent argument in favour of CRFs with respect to HMCs is related to some independence properties assumed in HMCs and considered as binding. More precisely, in HMCs we have $p(y_n \mid x_{1:N}) = p(y_n \mid x_n)$ and $p(x_{n+1} \mid x_n, y_n) = p(x_{n+1} \mid x_n)$. These constraints can be removed by extending HMCs to pairwise Markov chains (PMCs) (Pieczynski, 2003; Gorynin et al., 2018; Azeraf et al., 2020b). More general than HMCs, PMCs allow strictly the same Bayesian processing. Furthermore, PMCs can be extended to triplet Markov chains (TMCs) (Boudaren et al., 2014; Gorynin et al., 2018), still allowing the same Bayesian processing.

Extending the HMCs considered in this paper to PMCs and TMCs should lead to extensions of the recent hidden neural Markov chain (Azeraf et al., 2021b), which is a first perspective for further work. Of
course, there exist many CRFs much more sophisticated than the linear-chain CRF considered in the paper. Let us cite some recent papers (Siddiqi, 2021; Song et al., 2019; Quattoni et al., 2007; Kumar et al., 2003; Saa and Çetin, 2012), among others. Comparing different sophisticated CRFs to different PMCs and TMCs will undoubtedly be an interesting second perspective.
REFERENCES
Azeraf, E., Monfrini, E., and Pieczynski, W. (2021a). Us-
ing the Naive Bayes as a Discriminative Model. In
Proceedings of the 13th International Conference on
Machine Learning and Computing, pages 106–110.
Azeraf, E., Monfrini, E., Vignon, E., and Pieczynski, W.
(2020a). Hidden Markov Chains, Entropic Forward-
Backward, and Part-Of-Speech Tagging. arXiv
preprint arXiv:2005.10629.
Azeraf, E., Monfrini, E., Vignon, E., and Pieczynski, W.
(2020b). Highly Fast Text Segmentation With Pair-
wise Markov Chains. In Proceedings of the 6th IEEE
Congress on Information Science and Technology,
pages 361–366.
Azeraf, E., Monfrini, E., Vignon, E., and Pieczynski, W.
(2021b). Introducing the Hidden Neural Markov
Chain Framework. In Proceedings of the 13th Inter-
national Conference on Agents and Artificial Intelli-
gence - Volume 2, pages 1013–1020.
Baum, L. E., Petrie, T., Soules, G., and Weiss, N. (1970). A
Maximization Technique Occurring in the Statistical
Analysis of Probabilistic Functions of Markov Chains.
The Annals of Mathematical Statistics, 41(1):164–
171.
Boudaren, M. E. Y., Monfrini, E., Pieczynski, W., and Ais-
sani, A. (2014). Phasic Triplet Markov Chains. IEEE
Transactions on Pattern Analysis and Machine Intel-
ligence, 36(11):2310–2316.
Brants, T. (2000). TnT: a Statistical Part-of-Speech Tag-
ger. In Proceedings of the 6th Conference on Applied
Natural Language Processing, pages 224–231.
Cappé, O., Moulines, E., and Ryden, T. (2005). Inference in Hidden Markov Models. Springer Series in Statistics, Berlin, Heidelberg.
Fang, M., Kodamana, H., and Huang, B. (2019). Real-Time
Mode Diagnosis for Processes with Multiple Operat-
ing Conditions using Switching Conditional Random
Fields. IEEE Transactions on Industrial Electronics,
67(6):5060–5070.
Fang, M., Kodamana, H., Huang, B., and Sammaknejad,
N. (2018). A Novel Approach to Process Operating
Mode Diagnosis using Conditional Random Fields in
the Presence of Missing Data. Computers & Chemical
Engineering, 111:149–163.
Gorynin, I., Gangloff, H., Monfrini, E., and Pieczynski, W.
(2018). Assessing the Segmentation Performance of
Pairwise and Triplet Markov Models. Signal Process-
ing, 145:183–192.
Jurafsky, D. and Martin, J. H. (2009). Speech and Lan-
guage Processing: An Introduction to Natural Lan-
guage Processing, Speech Recognition, and Compu-
tational Linguistics, 2nd Edition. Prentice-Hall.
Jurafsky, D. and Martin, J. H. (2021). Speech and Language Processing, 3rd Edition, Draft of December 2021.
Koski, T. (2001). Hidden Markov Models for Bioinformat-
ics. Springer Netherlands.
Kumar, S. et al. (2003). Discriminative Random Fields:
A Discriminative Framework for Contextual Interac-
tion in Classification. In Proceedings of the 9th IEEE
International Conference on Computer Vision, pages
1150–1157.
Lafferty, J. D., McCallum, A., and Pereira, F. C. (2001).
Conditional Random Fields: Probabilistic Models for
Segmenting and Labeling Sequence Data. In Proceed-
ings of the 18th International Conference on Machine
Learning, pages 282–289.
Lanchantin, P., Lapuyade-Lahorgue, J., and Pieczynski, W.
(2011). Unsupervised Segmentation of Randomly
Switching Data Hidden with non-Gaussian Correlated
Noise. Signal Processing, 91(2):163–175.
McCallum, A., Freitag, D., and Pereira, F. C. N. (2000).
Maximum Entropy Markov Models for Information
Extraction and Segmentation. In Proceedings of the
17th International Conference on Machine Learning,
page 591–598.
Pieczynski, W. (2003). Pairwise Markov Chains. IEEE
Transactions on Pattern Analysis and Machine Intel-
ligence, 25(5):634–639.
Quattoni, A., Wang, S., Morency, L.-P., Collins, M., and
Darrell, T. (2007). Hidden Conditional Random
Fields. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 29(10):1848–1852.
Rabiner, L. R. (1989). A Tutorial on Hidden Markov Mod-
els and Selected Applications in Speech Recognition.
Proceedings of the IEEE, 77(2):257–286.
Saa, J. F. D. and Çetin, M. (2012). A Latent Discriminative
Model-Based Approach for Classification of Imagi-
nary Motor tasks from EEG Data. Journal of Neural
Engineering, 9(2).
Siddiqi, M. H. (2021). An Improved Gaussian Mixture
Hidden Conditional Random Fields Model for Audio-
Based Emotions Classification. Egyptian Informatics
Journal, 22(1):45–51.
Song, S., Zhang, N., and Huang, H. (2019). Named Entity
Recognition Based on Conditional Random Fields.
Cluster Computing, 22(3):5195–5206.
Sutton, C. and McCallum, A. (2006). An Introduction to
Conditional Random Fields for Relational Learning.
Introduction to Statistical Relational Learning, 2:93–
128.