A Statistic Criterion for Reducing Indeterminacy
in Linear Causal Modeling
Gianluca Bontempi
Machine Learning Group, Computer Science Department, Faculty of Sciences, Université Libre de Bruxelles,
Brussels, Belgium
Keywords:
Graphical Models, Causal Inference, Feature Selection.
Abstract:
Inferring causal relationships from observational data is still an open challenge in machine learning. State-of-the-art
approaches often rely on constraint-based algorithms which detect v-structures in triplets of nodes in order to orient
arcs. These algorithms are destined to fail when confronted with completely connected triplets. This paper proposes
a criterion to deal with arc orientation also in the presence of completely linearly connected triplets. This criterion
is then used in a Relevance-Causal (RC) algorithm, which combines the original causal criterion with a relevance
measure, to infer causal dependencies from observational data. A set of simulated experiments on the inference of
the causal structure of linear networks shows the effectiveness of the proposed approach.
1 INTRODUCTION
One of the most difficult aspects of causal inference
from observational data is the indeterminacy of causal
structures, due to the existence of dependency structures
implying different causal directions but which
are indistinguishable in terms of statistical likelihood
or fit indexes. For instance, it is well known that the
detection of causal directionality requires strong
assumptions (e.g. nonlinearity, high dimensional
observations) in a bivariate (i.e., single cause single effect)
context (Janzing et al., 2010; Janzing et al., 2011).
This is the reason why existing techniques address
triplet configurations to reconstruct the directional-
ity of the causal relationships. Well-known exam-
ples are the algorithms which infer causal structures
in Bayesian networks by searching for unshielded col-
liders (Spirtes et al., 2000), i.e. patterns where two
variables are both direct causes of a third one, with-
out being each a direct cause of the other. Under as-
sumptions of Causal Markov Condition and Faithful-
ness, this structure is statistically distinguishable and
so-called constraint based algorithms (notably the PC
and the SGS algorithms) rely on conditional indepen-
dence tests to orient at least partially a graph (Koller
and Friedman, 2009).
Other research works take advantage of condi-
tional independence and propose information theo-
retic methods for network inference and feature selec-
tion (Brown, 2009; Watkinson et al., 2009; Bontempi
and Meyer, 2010; Bontempi et al., 2011). In particular,
these works use the notion of feature interaction,
a three-way mutual information that differs from zero
when groups of attributes are complementary, and
which makes it possible to prioritize causes over
irrelevant and effect variables.
However, trivariate settings may present strong
problems of indeterminacy too. Think for instance
of a fully connected triplet made of two causes and
one common effect. In this case the lack of independence
makes conditional independence tests or interaction
measures ineffective for inferring the direction of the
arrows. As stressed in (Guyon et al., 2007), when
there are no independencies, the direction of the arrows
can be anything. Though a possible remedy to
indeterminacy comes from the use of additional
instrumental variables (IV) (Bowden and Turkington, 1984),
this strategy is not always feasible in real settings
lacking a priori knowledge.
This paper focuses on the definition of a data-
dependent measure able to reduce the statistical indis-
tinguishability of completely and linearly connected
triplets. In particular, we propose a modification
of the covariance formula of a structural equation
model (Bollen, 1989; Mulaik, 2009) which results in a
statistic taking opposite signs for different causal pat-
terns when the unexplained variations of the variables
are of the same magnitude. Though this assumption
could appear as too limiting, our rationale is that
assumptions of comparable strength (e.g. the existence
of unshielded colliders) have been commonly used so
far in causal inference. We expect that this alternative
approach could shed additional light on the issue
of causality, in the perspective of extending it to more
general configurations.

[Bontempi G. (2013). A Statistic Criterion for Reducing Indeterminacy in Linear Causal Modeling. In Proceedings of the 2nd International Conference on Pattern Recognition Applications and Methods, pages 159-166. DOI: 10.5220/0004254301590166. Copyright © SciTePress.]
For this reason the paper also proposes a Relevance
Causal (RC) inference algorithm which integrates
the proposed causal measure with a relevance
measure to prioritize direct causes for a given target
variable. In order to assess the effectiveness of the
algorithm with respect to state-of-the-art algorithms, a
set of experiments aiming to infer linear and nonlinear
networks from observed data is carried out. The
experimental comparison with state-of-the-art techniques
shows that such an approach is promising for reducing
indeterminacy in causal inference.
2 COVARIANCE OF A LINEARLY
CONNECTED TRIPLET
The use of directed acyclic graphs (DAG) to encode
causal dependencies and independencies is common
to the two best-known formalisms for causal modeling
(Anderson and Vastage, 2004): Bayesian networks
and structural equation models (SEM). These
formalisms can accommodate both nonlinear and linear
causal relationships. Here we will restrict our attention
to the linear causal structure represented in
Figure 1, where the variables $x_1$ and $x_2$ are causes of
the random variable $x_3$. Since a DAG can always be
translated into a set of recursive structural equations,
this linear dependency can be written as

$$x_1 = w_1, \qquad x_2 = b_1 x_1 + w_2, \qquad x_3 = b_3 x_1 + b_2 x_2 + w_3 \qquad (1)$$
where it is assumed that each variable has mean 0,
the $b_i \neq 0$ are also known as structural coefficients,
and the disturbances, supposed to be independent, are
designated by $w$. This set of equations can be put in
the matrix form

$$x = Ax + w \qquad (2)$$

where $x = [x_1, x_2, x_3]^T$,

$$A = \begin{bmatrix} 0 & 0 & 0 \\ b_1 & 0 & 0 \\ b_3 & b_2 & 0 \end{bmatrix}$$

and $w = [w_1, w_2, w_3]^T$.
. The multivariate variance-
covariance matrix (Mulaik, 2009) has no zero entries
Figure 1: Collider pattern: completely connected triplet where the variable $x_3$ is a common effect of $x_1$ and $x_2$.
and is given by

$$\Sigma = (I - A)^{-1} G \left((I - A)^{T}\right)^{-1} = \qquad (3)$$

$$= \begin{bmatrix}
\sigma_1^2 & b_1\sigma_1^2 & b_3\sigma_1^2 + b_1 b_2 \sigma_1^2 \\
b_1\sigma_1^2 & b_1^2\sigma_1^2 + \sigma_2^2 & b_1 b_3 \sigma_1^2 + b_2(b_1^2\sigma_1^2 + \sigma_2^2) \\
b_3\sigma_1^2 + b_1 b_2 \sigma_1^2 & b_1 b_3 \sigma_1^2 + b_2(b_1^2\sigma_1^2 + \sigma_2^2) & (b_1^2\sigma_1^2 + \sigma_2^2) b_2^2 + 2 b_1 b_2 b_3 \sigma_1^2 + b_3^2\sigma_1^2 + \sigma_3^2
\end{bmatrix} \qquad (4)$$
where $I$ is the identity matrix and

$$G = \begin{bmatrix} \sigma_1^2 & 0 & 0 \\ 0 & \sigma_2^2 & 0 \\ 0 & 0 & \sigma_3^2 \end{bmatrix}$$

is the diagonal covariance matrix of the disturbances.
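As a quick numerical sketch of equation (3), the implied covariance can be computed with NumPy; the coefficient values below are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical structural coefficients for the collider of Figure 1.
b1, b2, b3 = 0.6, 0.7, 0.5
A = np.array([[0.0, 0.0, 0.0],
              [b1,  0.0, 0.0],
              [b3,  b2,  0.0]])
G = np.eye(3)  # disturbance covariance diag(sigma_1^2, ...), here unit variances

M = np.linalg.inv(np.eye(3) - A)
Sigma = M @ G @ M.T  # implied covariance matrix, equation (3)

# Every entry is nonzero: the fully connected triplet exhibits no independencies.
assert np.all(np.abs(Sigma) > 1e-12)
```

With unit disturbance variances, `Sigma[0, 1]` equals $b_1$ and `Sigma[1, 1]` equals $b_1^2 + 1$, matching the corresponding entries of (4).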
It is worth noting here that the lack of zero entries
in the covariance matrix (as well as in its inverse)
illustrates the lack of conditional or unconditional
independencies in the data. Constraint-based
approaches (Spirtes et al., 2000), which rely on
independence tests to retrieve the v-structures, are
consequently useless in this context. In the following
section we will discuss whether SEM techniques can
tackle such a case.
3 INDETERMINACY IN A
CONNECTED TRIPLET
Structural equation modeling techniques for causal
inference proceed by 1) making some assumptions
on the structure underlying the data, 2) performing the
related parameter estimation, usually based on maximum
likelihood, and 3) assessing by significance testing
the discrepancy between the sample covariance
matrix and the covariance matrix implied by the
hypothesis.
ICPRAM2013-InternationalConferenceonPatternRecognitionApplicationsandMethods
160
This section shows that, as known in the
literature (Stelzl, 1986; Hershberger, 2006; Mulaik, 2009),
conventional SEM is not able to reconstruct the right
directionality of the connections in a completely
connected triplet. Let us observe a set of data generated
according to the dependency illustrated in Figure 1
and described algebraically by the set of structural
equations (1). Suppose we want to test two alternative
hypotheses, represented by the two directed graphs in
Figure 2a and Figure 2b, respectively. Note that the
hypothesis of Figure 2a is correct while the hypothesis
illustrated by Figure 2b inverts the directionality
of the link between $x_2$ and $x_3$ and consequently misses
the causal role of the variable $x_2$. Let us consider
the following question: is it possible to discriminate
between structures 1 and 2 by simply relying on
parameter estimation (in this case regression fitting)
according to the hypothesized dependencies? The answer
is unfortunately negative. Suppose we assess
hypothesis 1 by performing the two linear fittings implied
by the hypothesis itself
$$x_2 = \hat{b}_1 x_1 + w_2, \qquad x_3 = \hat{b}_3 x_1 + \hat{b}_2 x_2 + w_3$$
where (Graybill, 1976)

$$\hat{b}_1 = \Sigma_{12}/\Sigma_{11} = b_1, \qquad
\begin{bmatrix} \hat{b}_3 \\ \hat{b}_2 \end{bmatrix} =
\begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}^{-1}
\begin{bmatrix} \Sigma_{13} \\ \Sigma_{23} \end{bmatrix} =
\begin{bmatrix} b_3 \\ b_2 \end{bmatrix}$$
Since the above estimators are unbiased, if we compute
the triplet covariance matrix by plugging the
above estimates into formula (3) we obtain
again the covariance matrix (4).
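This recovery of the structural coefficients can be checked on simulated data; a minimal sketch with hypothetical coefficient values:

```python
import numpy as np

rng = np.random.default_rng(0)
b1, b2, b3 = 0.6, 0.7, 0.5        # hypothetical structural coefficients
n = 200_000                        # large sample so estimates are near-exact

# Generate data from the collider structure of Figure 1.
w1, w2, w3 = rng.standard_normal((3, n))
x1 = w1
x2 = b1 * x1 + w2
x3 = b3 * x1 + b2 * x2 + w3

# Least-squares fittings implied by hypothesis 1: x2 ~ x1 and x3 ~ x1 + x2.
b1_hat = (x1 @ x2) / (x1 @ x1)
b3_hat, b2_hat = np.linalg.lstsq(np.column_stack([x1, x2]), x3, rcond=None)[0]
```

The estimates converge to the true $b_1$, $b_3$, $b_2$, as stated by the unbiasedness argument above.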
Let us consider now the second hypothesis (Figure 2b)
and perform the two least-squares fittings

$$x_2 = \hat{b}_1 x_1 + \hat{b}_2 x_3 + w_2, \qquad x_3 = \hat{b}_3 x_1 + w_3$$
where the estimates are returned by

$$\hat{b}_3 = \Sigma_{13}/\Sigma_{11} = b_3 + b_1 b_2, \qquad
\begin{bmatrix} \hat{b}_1 \\ \hat{b}_2 \end{bmatrix} =
\begin{bmatrix} \Sigma_{11} & \Sigma_{13} \\ \Sigma_{13} & \Sigma_{33} \end{bmatrix}^{-1}
\begin{bmatrix} \Sigma_{12} \\ \Sigma_{23} \end{bmatrix} =
\begin{bmatrix} \dfrac{b_1\sigma_3^2 - b_2 b_3 \sigma_2^2}{b_2^2\sigma_2^2 + \sigma_3^2} \\[2mm] \dfrac{b_2\sigma_2^2}{b_2^2\sigma_2^2 + \sigma_3^2} \end{bmatrix}$$
Standard results give also the variance of the residuals.
For instance the variance of $w_2$ is returned by

$$\hat{\sigma}_2^2 = \Sigma_{22} - \begin{bmatrix} \Sigma_{12} & \Sigma_{23} \end{bmatrix}
\begin{bmatrix} \Sigma_{11} & \Sigma_{13} \\ \Sigma_{13} & \Sigma_{33} \end{bmatrix}^{-1}
\begin{bmatrix} \Sigma_{12} \\ \Sigma_{23} \end{bmatrix}$$
We remark that, though the estimation of the parameters
differs from the real structural coefficients, if we
Figure 2: a) Hypothesis 1. b) Hypothesis 2.
compute the complete covariance matrix by using (3)
where
$$\hat{A} = \begin{bmatrix} 0 & 0 & 0 \\ \hat{b}_1 & 0 & \hat{b}_2 \\ \hat{b}_3 & 0 & 0 \end{bmatrix}, \qquad
\hat{G} = \begin{bmatrix} \hat{\sigma}_1^2 & 0 & 0 \\ 0 & \hat{\sigma}_2^2 & 0 \\ 0 & 0 & \hat{\sigma}_3^2 \end{bmatrix}$$
we obtain again the expression (4). In other words,
fitting different causal structures to the connected triplet
does not make it possible to distinguish between the
configuration in Figure 2a and the one in Figure 2b.
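This indistinguishability can be verified numerically: plugging the population estimates of hypothesis 2 into (3) reproduces the covariance implied by the true structure. A sketch, with hypothetical coefficient values:

```python
import numpy as np

b1, b2, b3 = 0.6, 0.7, 0.5   # hypothetical coefficients of the collider in Figure 1

def implied_cov(A, G):
    M = np.linalg.inv(np.eye(3) - A)
    return M @ G @ M.T       # equation (3)

A_true = np.array([[0, 0, 0], [b1, 0, 0], [b3, b2, 0]], float)
Sigma = implied_cov(A_true, np.eye(3))   # population covariance of the data

# Hypothesis 2 estimates, computed from Sigma only:
# regress x3 on x1, and x2 on (x1, x3).
b3_h = Sigma[0, 2] / Sigma[0, 0]
sub = Sigma[np.ix_([0, 2], [0, 2])]
b1_h, b2_h = np.linalg.solve(sub, Sigma[[0, 2], 1])
s2_h = Sigma[1, 1] - Sigma[1, [0, 2]] @ np.linalg.solve(sub, Sigma[[0, 2], 1])
s3_h = Sigma[2, 2] - b3_h * Sigma[0, 2]

A2 = np.array([[0, 0, 0], [b1_h, 0, b2_h], [b3_h, 0, 0]], float)
Sigma2 = implied_cov(A2, np.diag([Sigma[0, 0], s2_h, s3_h]))

# Both causal structures imply exactly the same covariance matrix.
assert np.allclose(Sigma, Sigma2)
```

Note that `b3_h` equals $b_3 + b_1 b_2$, as derived above, yet the implied covariance is unchanged.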
4 A CRITERION TO DETECT
CAUSAL ASYMMETRY
A main characteristic of a causal relationship is its
asymmetry. For this reason, if we wish to infer causal
directionality from observational data we need to
define discriminant criteria able to distinguish causes
from effects. Let us suppose we want to discriminate
between the causal pattern in Figure 1, where
both $x_1$ and $x_2$ are direct causes of $x_3$, and alternative
patterns like the ones in Figure 3a and Figure 3b. As
we have seen in the previous section, the conventional
SEM procedure does not distinguish between
different patterns. What we propose here is an
alternative criterion able to perform such a distinction.
The computation of our criterion requires the fitting
of the two hypothetical structures in Figures 2a
and 2b to the data, as done in the previous section.
What is different is that, instead of computing the
term (3), we consider the term

$$S = (I - A)^{-1}\left((I - A)^{T}\right)^{-1}. \qquad (5)$$

Let us see the impact of such a modification on the
detection of causality by analyzing in the following
sections three different causal patterns. In all cases we
will make the assumption that $\sigma_1 = \sigma_2 = \sigma_3 = \sigma$, i.e.
that the unexplained variations of the variables are of
comparable magnitude. Though we are aware that
this assumption is quite specific, some considerations
are worth making. So far, most approaches to
causal inference from data have relied on similar,
if not stronger, assumptions, such as postulating the
existence of unshielded colliders. At the same time the
AStatisticCriterionforReducingIndeterminacyinLinearCausalModeling
161
Figure 3: a) Chain pattern: completely connected triplet where the variable $x_2$ is the common effect of $x_1$ and $x_3$. b) Fork pattern: completely connected triplet where the variables $x_2$ and $x_1$ have the common cause $x_3$.
following derivation is expected to shed new light
on the issue of causality, with the aim of applying it to
more general configurations.
4.1 Collider Pattern
Let us suppose that data are generated according to
the structure in Figure 1, where the node $x_3$ is a
collider.
If we fit hypothesis 1 to the data and we compute
the term (5) we obtain

$$\hat{S}_1 = (I - A_1)^{-1}\left((I - A_1)^{T}\right)^{-1} =
\begin{bmatrix}
1 & b_1 & b_3 + b_1 b_2 \\
b_1 & b_1^2 + 1 & b_1(b_3 + b_1 b_2) + b_2 \\
b_3 + b_1 b_2 & b_1(b_3 + b_1 b_2) + b_2 & (b_3 + b_1 b_2)^2 + 1 + b_2^2
\end{bmatrix}$$
If we fit hypothesis 2 to the data and we compute
the term (5) we obtain

$$\hat{S}_2 = (I - A_2)^{-1}\left((I - A_2)^{T}\right)^{-1} =
\begin{bmatrix}
1 & b_1 & b_3 + b_1 b_2 \\
b_1 & \frac{b_2^2}{(b_2^2+1)^2} + b_1^2 + 1 & \frac{b_2}{b_2^2+1} + b_1(b_3 + b_1 b_2) \\
b_3 + b_1 b_2 & \frac{b_2}{b_2^2+1} + b_1(b_3 + b_1 b_2) & (b_3 + b_1 b_2)^2 + 1
\end{bmatrix}$$
Let us denote by $S[i,j]$ the $ij$th element of a matrix
$S$. Since $\forall i,\ b_i \neq 0$, it follows that the quantity

$$C(x_1, x_2, x_3) = \hat{S}_1[3,3] - \hat{S}_2[3,3] + \hat{S}_1[2,2] - \hat{S}_2[2,2] = \frac{b_2^4 (b_2^2 + 2)}{(b_2^2 + 1)^2} \qquad (6)$$

is greater than zero for any sign of the structural
coefficients. Interestingly enough, the sign is preserved
for $\sigma_1 = \sigma_2 = \sigma_3$ also when the direction of the link
between $x_1$ and $x_2$ is inverted (see (18) in the Appendix).
4.2 Chain Pattern
We suppose here that the data have been generated by
the triplet in Figure 3a, where $x_3$ is part of the chain
pattern $x_1 \to x_3 \to x_2$. This configuration is represented
by the matrix

$$A = \begin{bmatrix} 0 & 0 & 0 \\ b_1 & 0 & b_2 \\ b_3 & 0 & 0 \end{bmatrix}$$
Let us proceed by computing the quantity $C$ in (6)
for such a generative model under the assumption $\sigma_1 =
\sigma_2 = \sigma_3$. For the sake of space we will report here
only the components of the submatrices $\hat{S}_1[2{:}3, 2{:}3]$
and $\hat{S}_2[2{:}3, 2{:}3]$. If data have been generated
according to the structure in Figure 3a and we fit
hypothesis 1 we obtain
$$\hat{S}_1[2{:}3,2{:}3] = \begin{bmatrix}
(b_1 + b_2 b_3)^2 + 1 & \frac{b_2}{b_2^2+1} + b_3(b_1 + b_2 b_3) \\
\frac{b_2}{b_2^2+1} + b_3(b_1 + b_2 b_3) & \frac{b_2^2}{(b_2^2+1)^2} + b_3^2 + 1
\end{bmatrix} \qquad (7)$$
If data have been generated according to the structure
in Figure 3a and we fit hypothesis 2 we obtain

$$\hat{S}_2[2{:}3,2{:}3] = \begin{bmatrix}
(b_1 + b_2 b_3)^2 + b_2^2 + 1 & b_2 + b_3(b_1 + b_2 b_3) \\
b_2 + b_3(b_1 + b_2 b_3) & b_3^2 + 1
\end{bmatrix} \qquad (8)$$
It follows that

$$C(x_1, x_2, x_3) = \hat{S}_1[3,3] - \hat{S}_2[3,3] + \hat{S}_1[2,2] - \hat{S}_2[2,2] = -\frac{b_2^4 (b_2^2 + 2)}{(b_2^2 + 1)^2} \qquad (9)$$

This term is less than zero whatever the sign of the
structural coefficients $b_i$ in Figure 3a.
Note that we do not discuss here the configuration
with the edge pointing from $x_2$ to $x_1$, since this one
is cyclic.
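The negative sign in the chain case can also be verified numerically: fitting both hypotheses from the population covariance of chain-generated data (hypothetical coefficients, unit disturbance variances) yields a value consistent with (9):

```python
import numpy as np

b1, b2, b3 = 0.6, 0.7, 0.5   # hypothetical coefficients, unit disturbance variances

def S_term(A):
    M = np.linalg.inv(np.eye(3) - A)
    return M @ M.T            # the term (5)

# chain generative model: x3 = b3*x1 + w3,  x2 = b1*x1 + b2*x3 + w2
A_chain = np.array([[0, 0, 0], [b1, 0, b2], [b3, 0, 0]], float)
Sig = S_term(A_chain)         # population covariance (G = I)

# fit hypothesis 1 (collider) from Sig
c1 = Sig[0, 1] / Sig[0, 0]
c3, c2 = np.linalg.solve(Sig[:2, :2], Sig[:2, 2])
S1 = S_term(np.array([[0, 0, 0], [c1, 0, 0], [c3, c2, 0]]))

# fit hypothesis 2 from Sig
d3 = Sig[0, 2] / Sig[0, 0]
d1, d2 = np.linalg.solve(Sig[np.ix_([0, 2], [0, 2])], Sig[[0, 2], 1])
S2 = S_term(np.array([[0, 0, 0], [d1, 0, d2], [d3, 0, 0]]))

C = S1[2, 2] - S2[2, 2] + S1[1, 1] - S2[1, 1]
assert C < 0   # chain pattern: the criterion is negative, as in (9)
```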
ICPRAM2013-InternationalConferenceonPatternRecognitionApplicationsandMethods
162
4.3 Fork Pattern
Suppose now that observations are generated by the
triplet in Figure 3b, corresponding to the matrix

$$A = \begin{bmatrix} 0 & 0 & b_3 \\ b_1 & 0 & b_2 \\ 0 & 0 & 0 \end{bmatrix}$$
Like in the previous section we report the components
of the submatrices $\hat{S}_1[2{:}3,2{:}3]$ and $\hat{S}_2[2{:}3,2{:}3]$:

$$\hat{S}_1[2{:}3,2{:}3] = \begin{bmatrix}
\frac{(b_1 b_3^2 + b_2 b_3 + b_1)^2}{(b_3^2+1)^2} + 1 & \frac{b_2}{b_2^2+b_3^2+1} + \frac{b_3(b_1 b_3^2 + b_2 b_3 + b_1)}{(b_3^2+1)^2} \\
\frac{b_2}{b_2^2+b_3^2+1} + \frac{b_3(b_1 b_3^2 + b_2 b_3 + b_1)}{(b_3^2+1)^2} & \frac{b_2^2}{(b_2^2+b_3^2+1)^2} + \frac{b_3^2}{(b_3^2+1)^2} + 1
\end{bmatrix} \qquad (10)$$
$$\hat{S}_2[2{:}3,2{:}3] = \begin{bmatrix}
(b_1 + b_2 b_3)^2 + b_2^2 + 1 & b_2 + b_3(b_1 + b_2 b_3) \\
b_2 + b_3(b_1 + b_2 b_3) & b_3^2 + 1
\end{bmatrix} \qquad (11)$$
It follows that

$$C(x_1, x_2, x_3) = \hat{S}_1[3,3] - \hat{S}_2[3,3] + \hat{S}_1[2,2] - \hat{S}_2[2,2] = b_2^2\left(\frac{1}{(b_2^2 + b_3^2 + 1)^2} - 1\right) \qquad (12)$$
This term is less than zero whatever the sign of the
structural coefficients $b_i \neq 0$ in Figure 3b. In the
Appendix we compute the value of $C$ when
the direction of the link between $x_1$ and $x_2$ is reversed
(matrix (19)). From (22) we obtain
that this value remains negative when $(b_2^2 + b_3^2) > b_1^2$,
for instance when the absolute value of one of the
coefficients associated with the edges leaving $x_3$ is
bigger than $|b_1|$. In plain words, if the cause-effect
relationship between $x_3$ and the other variables is strong
enough, the statistic $C$ takes a negative value.
The equations (6), (9) and (12) show that the
computation of the quantity $C$ on the basis of
observational data only can help in discriminating between
the collider configuration in Figure 1, where the nodes
$x_1$ and $x_2$ are direct causes of $x_3$ ($C > 0$), and
non-collider configurations (i.e. fork or chain, $C < 0$) in
Figures 3a and 3b.

In other terms, given a completely connected
triplet of variables, the quantity $C(x_1, x_2, x_3)$ returns
useful information about the causal role of $x_1$ and $x_2$
with respect to $x_3$, whatever the strength or the
direction of the link between $x_1$ and $x_2$.
5 A RELEVANCE CAUSAL
ALGORITHM TO INFER
DIRECTIONALITY
The properties of the quantity $C$ encourage its use
in an algorithm to infer directionality from
observational data. We propose then an RC (Relevance
Causal) algorithm for linear causal modeling inspired
by the mIMR causal filter selection algorithm (Bontempi
and Meyer, 2010). The mIMR algorithm is
characterized by two terms: a relevance term, assessing
the relevance of each input variable with respect
to a target variable, and a causation term, aiming to
prioritize causal variables by minimizing the interaction
of triplets of variables. The causation term is
designed to reward variables which belong to
a collider pattern and to penalize variables within a fork
pattern. Let us suppose that we want to identify the
set of causes of a target variable $y$ among a set $X$ of
inputs. The mIMR is a forward selection algorithm
which, given a set $X_S$ of $d$ already selected variables,
updates this set by adding the $(d+1)$th variable which
satisfies
$$x_{d+1} = \arg\max_{x_k \in X \setminus X_S} \Big[(1-\lambda)\, I(x_k; y) - \frac{\lambda}{d} \sum_{x_i \in X_S} I(x_i; x_k; y)\Big] \qquad (13)$$
where $I(x_k; y)$ denotes the mutual information between
$x_k$ and $y$, $I(x_i; x_k; y)$ denotes the interaction
information, and the coefficient $\lambda \in [0,1]$ is used to
weight the mutual information and the interaction
term.
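For jointly Gaussian variables, the quantities in (13) are available in closed form. The sketch below is ours and assumes McGill's convention $I(x_i;x_k;y) = I(x_i;x_k) - I(x_i;x_k \mid y)$, under which the interaction is negative for an unshielded collider:

```python
import numpy as np

def gauss_mi(Sig, a, b):
    """Mutual information I(a;b) of jointly Gaussian variables from covariance Sig."""
    d = lambda idx: np.linalg.det(Sig[np.ix_(idx, idx)])
    return 0.5 * np.log(d(a) * d(b) / d(a + b))

def gauss_cmi(Sig, a, b, c):
    """Conditional mutual information I(a;b|c) for Gaussians."""
    d = lambda idx: np.linalg.det(Sig[np.ix_(idx, idx)])
    return 0.5 * np.log(d(a + c) * d(b + c) / (d(c) * d(a + b + c)))

def interaction(Sig, i, k, t):
    # McGill convention: I(x_i; x_k; y) = I(x_i; x_k) - I(x_i; x_k | y)
    return gauss_mi(Sig, [i], [k]) - gauss_cmi(Sig, [i], [k], [t])
```

For the collider $y = x_i + x_k + w$ with independent unit-variance causes, the joint covariance is `[[1,0,1],[0,1,1],[1,1,3]]` and the interaction evaluates to $-\frac{1}{2}\log(4/3) < 0$, so subtracting it in (13) rewards collider members.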
As discussed previously, this algorithm might suffer
from poor performance when common causes are
directly connected, since the interaction term $I(x_i; x_k; y)$
could take positive values for a v-structure $x_i \to y \leftarrow
x_k$. For that reason we propose to replace the
interaction term (to be minimized) with the criterion $C$ (to
be maximized) in order to infer causal dependency from
observed data also in the presence of completely connected
triplets. The resulting algorithm is a reformulation of
the mIMR where the update formula is now
$$x_{d+1} = \arg\max_{x_k \in X \setminus X_S} \Big[(1-\lambda)\, R(\{X_S, x_k\}; y) + \frac{\lambda}{d} \sum_{x_i \in X_S} C(x_i; x_k; y)\Big] \qquad (14)$$
AStatisticCriterionforReducingIndeterminacyinLinearCausalModeling
163
where $\lambda \in [0,1]$ weights the $R$ and the $C$ contributions,
the $R$ term quantifies the relevance of the subset
$\{X_S, x_k\}$ and the $C$ term quantifies the causal role of
an input $x_k$ with respect to the set of selected variables
$x_i \in X_S$.
The proposed RC algorithm is then a forward selection
algorithm which sequentially adds variables
according to the update rule (14). Note that for $\lambda = 0$
the algorithm boils down to a conventional forward
selection wrapper which assesses the subsets according
to the measure $R$. The RC algorithm is initialized
by selecting the couple of variables $\{x_i, x_j\}$ maximizing
the quantity

$$(1-\lambda)\, R(\{x_i, x_j\}; y) + \lambda\, C(x_i; x_j; y)$$
In the implementation used in the experimental
section, we adopt a linear leave-one-out measure to
quantify the relevance of a subset, i.e. $R(X, y)$ is set
equal to the negative of the linear leave-one-out mean
squared error of the regression with input $X$ and target
$y$. Also, in order to have comparable values for the
$R$ and the $C$ terms, at each step these quantities are
normalized over the interval $[0,1]$ before performing
their weighted sum.
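A compact sketch of this forward selection, under the stated choices (negative linear leave-one-out MSE as $R$, sample-covariance fits for $C$, per-step min-max normalization); the function names and the greedy single-variable initialization are simplifications of ours, not the paper's implementation:

```python
import numpy as np

def neg_loo_mse(X, y):
    """Relevance R: negative leave-one-out MSE of the linear fit (PRESS residuals)."""
    H = X @ np.linalg.pinv(X)                  # hat matrix of the linear regression
    r = (y - H @ y) / (1.0 - np.diag(H))       # leave-one-out residuals
    return -np.mean(r ** 2)

def C_stat(Sig):
    """Criterion C of Section 4 from the covariance of the triplet (x_i, x_k, y)."""
    def S_diag(A):
        M = np.linalg.inv(np.eye(3) - A)
        S = M @ M.T
        return S[1, 1] + S[2, 2]
    c1 = Sig[0, 1] / Sig[0, 0]                 # hypothesis 1 fit
    c3, c2 = np.linalg.solve(Sig[:2, :2], Sig[:2, 2])
    d3 = Sig[0, 2] / Sig[0, 0]                 # hypothesis 2 fit
    d1, d2 = np.linalg.solve(Sig[np.ix_([0, 2], [0, 2])], Sig[[0, 2], 1])
    return (S_diag(np.array([[0, 0, 0], [c1, 0, 0], [c3, c2, 0]]))
            - S_diag(np.array([[0, 0, 0], [d1, 0, d2], [d3, 0, 0]])))

def rc_select(X, y, n_sel, lam=0.5):
    """Greedy forward selection according to the update rule (14)."""
    def norm01(v):
        span = v.max() - v.min()
        return (v - v.min()) / span if span > 0 else np.zeros_like(v)
    selected = []
    while len(selected) < n_sel:
        cand = [k for k in range(X.shape[1]) if k not in selected]
        R = np.array([neg_loo_mse(X[:, selected + [k]], y) for k in cand])
        if selected:
            Cs = np.array([np.mean([C_stat(np.cov(
                np.column_stack([X[:, i], X[:, k], y]), rowvar=False))
                for i in selected]) for k in cand])
        else:
            Cs = np.zeros(len(cand))           # simplification: no pairwise init
        score = (1 - lam) * norm01(R) + lam * norm01(Cs)
        selected.append(cand[int(np.argmax(score))])
    return selected
```

On data where the target is a collider of two inputs, the combined score tends to rank the two causes first, ahead of irrelevant variables.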
6 EXPERIMENTS
In this section we assess the efficacy of the RC algo-
rithm by performing a set of causal network inference
experiments. The aim of the experiment is to reverse
engineer both linear and nonlinear scale-free causal
networks, i.e. networks where the distribution of the
degree follows a power law, from a limited amount
of observational data. We consider a set of networks
with a large number ($n = 5000$) of nodes, where
the exponent $\alpha$ of the power law ranges between 2.1
and 3. The inference is done on the basis of a small
amount of $N = 200$ observations. The structural
coefficients of the linear dependencies have an absolute
value distributed uniformly between 0.5 and 0.8, and
the measurement error follows a standard normal
distribution. Nonlinear networks are obtained by
transforming the linear dependencies between nodes with
a sigmoid function.
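A sketch of this setup at reduced size; the wiring scheme (Zipf-drawn in-degrees, parents chosen among earlier nodes) and the use of `tanh` as the sigmoid-type transform are our own assumptions, since the paper does not fully specify them:

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, alpha, N = 200, 2.5, 200   # reduced size; the paper uses n = 5000 nodes

# Power-law in-degrees; each node is wired to earlier nodes, so the graph is acyclic.
deg = np.minimum(rng.zipf(alpha, size=n_nodes), 10)
parents = {j: rng.choice(j, size=min(int(deg[j]), j), replace=False)
           for j in range(1, n_nodes)}

# Structural coefficients with |b| uniform in [0.5, 0.8] and random sign.
coef = {j: rng.uniform(0.5, 0.8, len(p)) * rng.choice([-1, 1], len(p))
        for j, p in parents.items()}

X = np.zeros((N, n_nodes))
X[:, 0] = rng.standard_normal(N)
for j in range(1, n_nodes):
    lin = X[:, parents[j]] @ coef[j] + rng.standard_normal(N)
    X[:, j] = np.tanh(lin)          # nonlinear variant; drop tanh for the linear one
```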
We compare the accuracy of several algorithms in
terms of the mean F-measure (the higher, the better)
averaged over 10 runs and over all the nodes with a
number of parents and children larger than or equal to two.
The F-measure, also known as balanced F-score, is
the weighted harmonic mean of precision and recall
and is conventionally used to provide a compact mea-
sure of the quality of a network inference algorithm.
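A minimal implementation of this score over sets of inferred arcs (our own helper, not taken from the paper's code):

```python
def f_measure(true_arcs, inferred_arcs):
    """Balanced F-score: harmonic mean of precision and recall."""
    true_arcs, inferred_arcs = set(true_arcs), set(inferred_arcs)
    tp = len(true_arcs & inferred_arcs)        # true positives
    if tp == 0 or not inferred_arcs:
        return 0.0
    precision = tp / len(inferred_arcs)
    recall = tp / len(true_arcs)
    return 2 * precision * recall / (precision + recall)
```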
We considered the following algorithms for compari-
son: the IAMB algorithm (Tsamardinos et al., 2003)
implemented by the Causal Explorer software (Alif-
eris et al., 2003) which estimates for a given variable
the set of variables belonging to its Markov blanket,
the mIMR (Bontempi and Meyer, 2010) algorithm,
the mRMR (Peng et al., 2005) algorithm and three
versions of the RC algorithm with three different val-
ues λ = 0, 0.5, 1. Note that the RC algorithm with
λ = 0 boils down to a conventional wrapper algo-
rithm based on the leave-one-out assessment of the
variables’ subsets.
We also remark that the RC algorithm aims to
return, for a given node, a prioritization of the other
nodes according to their causal role, while the Causal
Explorer implementation of IAMB returns a specific
subset (for a given p-value). For the sake of comparison,
we decided to compute the F-measure by setting
the number of putative causes to the number of
variables returned by IAMB.
Tables 1 and 2 report the average F-measures for
different values of α in the linear and nonlinear case,
respectively.
The results show the potential of the criterion C
and of the RC algorithm in network inference tasks
where dependencies between parents are frequent because
of direct links or common ancestors. According
to the F-measures reported in the tables, the RC accuracy
with λ = 0.5 and λ = 1 is consistently better than
that of the mIMR, mRMR and IAMB algorithms for
all the considered degree distributions. However the
most striking result is the clear improvement with respect
to a conventional wrapper approach which targets
only prediction accuracy (λ = 0) when a causal
criterion C is taken into account together with a
predictive one (λ = 0.5). These results confirm previous
results (Bontempi and Meyer, 2010; Bontempi et al.,
2011) putting into evidence that an effective causal
inference task should combine a relevance criterion
targeting prediction accuracy with a causal term able
to prioritize direct causes and penalize effects.
7 CONCLUSIONS
Causal inference from complex high-dimensional
data is taking on growing importance in machine learning
and knowledge discovery. Currently, most of the
existing algorithms are limited by the fact that the
discovery of causal directionality is contingent on the
detection of a limited set of distinguishable patterns, like
unshielded colliders. However the scarcity of data and
the intricacy of dependencies in networks could make
the detection of such patterns so rare that the resulting
ICPRAM2013-InternationalConferenceonPatternRecognitionApplicationsandMethods
164
Table 1: Linear case: F-measure (averaged over all nodes with a number of parents and children ≥ 2 and over 10 runs) of the accuracy of the inferred networks on the basis of N = 100 observations.

α     IAMB    mIMR    mRMR    RC(λ=0)  RC(λ=0.5)  RC(λ=1)
2.2   0.375   0.324   0.319   0.386    0.421      0.375
2.3   0.378   0.337   0.333   0.387    0.437      0.401
2.4   0.376   0.342   0.342   0.385    0.441      0.414
2.5   0.348   0.322   0.313   0.358    0.422      0.413
2.6   0.347   0.318   0.311   0.355    0.432      0.414
2.7   0.344   0.321   0.311   0.352    0.424      0.423
2.8   0.324   0.304   0.293   0.334    0.424      0.422
2.9   0.342   0.333   0.321   0.353    0.448      0.459
3.0   0.321   0.319   0.297   0.326    0.426      0.448
Table 2: Nonlinear case: F-measure (averaged over all nodes with a number of parents and children ≥ 2 and over 10 runs) of the accuracy of the inferred network on the basis of N = 100 observations.

α     IAMB    mIMR    mRMR    RC(λ=0)  RC(λ=0.5)  RC(λ=1)
2.2   0.312   0.310   0.304   0.314    0.356      0.324
2.3   0.317   0.328   0.316   0.320    0.375      0.349
2.4   0.304   0.317   0.304   0.306    0.366      0.351
2.5   0.321   0.327   0.328   0.325    0.379      0.359
2.6   0.306   0.325   0.306   0.309    0.379      0.365
2.7   0.313   0.319   0.303   0.316    0.380      0.359
2.8   0.297   0.326   0.300   0.300    0.392      0.382
2.9   0.310   0.329   0.313   0.313    0.389      0.377
3.0   0.299   0.324   0.300   0.303    0.399      0.392
precision would be unacceptable. This paper shows
that it is possible to identify new statistical measures
helping to reduce indistinguishability under the
assumption of equal variances of the unexplained
variations of the three variables. Though this assumption
could be questioned, we deem that it is important to
define new statistics to help discriminate between
causal structures for completely connected triplets in
linear causal modeling. Future work will focus on
assessing whether such a statistic is useful in reducing
indeterminacy also when the assumption of equal
variance is not satisfied.
REFERENCES
Aliferis, C., Tsamardinos, I., and Statnikov, A. (2003).
Causal explorer: A probabilistic network learning
toolkit for biomedical discovery. In The 2003 Inter-
national Conference on Mathematics and Engineer-
ing Techniques in Medicine and Biological Sciences
(METMBS ’03).
Anderson, R. and Vastage, G. (2004). Causal modeling
alternatives in operations research: overview and
application. European Journal of Operational Research,
156:92–109.
Bollen, K. (1989). Structural equations with latent vari-
ables. John Wiley and Sons.
Bontempi, G., Haibe-Kains, B., Desmedt, C., Sotiriou, C.,
and Quackenbush, J. (2011). Multiple-input multiple-
output causal strategies for gene selection. BMC
bioinformatics, 12(1):458.
Bontempi, G. and Meyer, P. (2010). Causal filter selection
in microarray data. In Proceedings of the ICML2010
Conference.
Bowden, R. and Turkington, D. (1984). Instrumental Vari-
ables. Cambridge University Press.
Brown, G. (2009). A new perspective for information theo-
retic feature selection. In Proceedings of the 12th In-
ternational Conference on Artificial Intelligence and
Statistics (AISTATS).
Graybill, F. (1976). Theory and Application of the Linear
Model. Duxbury Press.
Guyon, I., Aliferis, C., and Elisseeff, A. (2007). Compu-
tational Methods of Feature Selection, chapter Causal
Feature Selection, pages 63–86. Chapman and Hall.
Hershberger, S. (2006). Structural equation modeling: a
second course, chapter The problems of equivalent
structural models, pages 13–41. Springer.
Janzing, D., Hoyer, P. O., and Scholkopf, B. (2010). Telling
cause from effect based on high-dimensional observations.
In Proceedings of the ICML2010 Conference.
Janzing, D., Sgouritsa, E., Stegle, O., Peters, J., and
Scholkopf, B. (2011). Detecting low-complexity un-
observed causes. In Conference on Uncertainty in Ar-
tificial Intelligence (UAI2011).
Koller, D. and Friedman, N. (2009). Probabilistic graphical
models. The MIT Press.
AStatisticCriterionforReducingIndeterminacyinLinearCausalModeling
165
Mulaik, S. (2009). Linear Causal Modelling with Structural
Equations. CRC Press.
Peng, H., Long, F., and Ding, C. (2005). Fea-
ture selection based on mutual information: Cri-
teria of max-dependency,max-relevance, and min-
redundancy. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 27(8):1226–1238.
Spirtes, P., Glymour, C., and Scheines, R. (2000). Causa-
tion, Prediction and Search. Springer Verlag, Berlin.
Stelzl, I. (1986). Changing a causal hypothesis without
changing the fit: Some rules for generating equivalent
path models. Multivariate Behavioral Research,
21:309–331.
Tsamardinos, I., Aliferis, C., and Statnikov, A. (2003). Al-
gorithms for large scale markov blanket discovery. In
Proceedings of the 16th International FLAIRS Con-
ference (FLAIRS 2003).
Watkinson, J., Liang, K., Wang, X., Zheng, T., and Anas-
tassiou, D. (2009). Inference of regulatory gene in-
teractions from expression data using three-way mu-
tual information. Annals of N.Y. Academy of Sciences,
1158:302–313.
APPENDIX
Let

$$A = \begin{bmatrix} 0 & b_1 & 0 \\ 0 & 0 & 0 \\ b_3 & b_2 & 0 \end{bmatrix} \qquad (15)$$
be the matrix associated to the collider pattern with
an edge heading from $x_2$ to $x_1$. We compute the quantity
$C$ for such a generative model under the assumption
$\sigma_1 = \sigma_2 = \sigma_3$.

If data have been generated according to the structure
(15) and we fit hypothesis 1 we obtain
$$\hat{S}_1[2{:}3,2{:}3] = \begin{bmatrix}
\frac{b_1^2}{(b_1^2+1)^2} + 1 & b_2 + \frac{b_1(b_3 b_1^2 + b_2 b_1 + b_3)}{(b_1^2+1)^2} \\
b_2 + \frac{b_1(b_3 b_1^2 + b_2 b_1 + b_3)}{(b_1^2+1)^2} & \frac{(b_3 b_1^2 + b_2 b_1 + b_3)^2}{(b_1^2+1)^2} + b_2^2 + 1
\end{bmatrix} \qquad (16)$$
If we fit hypothesis 2 we obtain

$$\hat{S}_2[2{:}3,2{:}3] = \begin{bmatrix}
\frac{b_2^2}{(b_1^2+b_2^2+1)^2} + \frac{b_1^2}{(b_1^2+1)^2} + 1 & \frac{b_2}{b_1^2+b_2^2+1} + \frac{b_1(b_3 b_1^2 + b_2 b_1 + b_3)}{(b_1^2+1)^2} \\
\frac{b_2}{b_1^2+b_2^2+1} + \frac{b_1(b_3 b_1^2 + b_2 b_1 + b_3)}{(b_1^2+1)^2} & \frac{(b_3 b_1^2 + b_2 b_1 + b_3)^2}{(b_1^2+1)^2} + 1
\end{bmatrix} \qquad (17)$$
It follows that

$$C(x_1, x_2, x_3) = \hat{S}_1[3,3] - \hat{S}_2[3,3] + \hat{S}_1[2,2] - \hat{S}_2[2,2] = b_2^2 - \frac{b_2^2}{(b_1^2 + b_2^2 + 1)^2} > 0 \qquad (18)$$

In other words, the sign is positive also in the case of a
link from $x_2$ to $x_1$.
Let us consider now the fork pattern described by
the matrix

$$A = \begin{bmatrix} 0 & b_1 & b_3 \\ 0 & 0 & b_2 \\ 0 & 0 & 0 \end{bmatrix} \qquad (19)$$

If data have been generated according to the structure
(19) and we fit hypothesis 1 we obtain,
with the shorthand $D = b_1^2 b_2^2 + b_1^2 + 2 b_1 b_2 b_3 + b_3^2 + 1$,

$$\hat{S}_1[2{:}3,2{:}3] = \begin{bmatrix}
\frac{(b_1 b_2^2 + b_3 b_2 + b_1)^2}{D^2} + 1 & \frac{b_2 - b_1 b_3}{b_2^2 + b_3^2 + 1} + \frac{(b_3 + b_1 b_2)(b_1 b_2^2 + b_3 b_2 + b_1)}{D^2} \\
\frac{b_2 - b_1 b_3}{b_2^2 + b_3^2 + 1} + \frac{(b_3 + b_1 b_2)(b_1 b_2^2 + b_3 b_2 + b_1)}{D^2} & \frac{(b_3 + b_1 b_2)^2}{D^2} + \frac{(b_2 - b_1 b_3)^2}{(b_2^2 + b_3^2 + 1)^2} + 1
\end{bmatrix} \qquad (20)$$

If we fit hypothesis 2 we obtain

$$\hat{S}_2[2{:}3,2{:}3] = \begin{bmatrix}
\frac{(b_2 - b_1 b_3)^2}{(b_1^2+1)^2} + \frac{(b_1 b_2^2 + b_3 b_2 + b_1)^2}{D^2} + 1 & \frac{b_2 - b_1 b_3}{b_1^2+1} + \frac{(b_3 + b_1 b_2)(b_1 b_2^2 + b_3 b_2 + b_1)}{D^2} \\
\frac{b_2 - b_1 b_3}{b_1^2+1} + \frac{(b_3 + b_1 b_2)(b_1 b_2^2 + b_3 b_2 + b_1)}{D^2} & \frac{(b_3 + b_1 b_2)^2}{D^2} + 1
\end{bmatrix} \qquad (21)$$
It follows that

$$C(x_1, x_2, x_3) = \hat{S}_1[3,3] - \hat{S}_2[3,3] + \hat{S}_1[2,2] - \hat{S}_2[2,2] = \frac{(b_2 - b_1 b_3)^2}{(b_2^2 + b_3^2 + 1)^2} - \frac{(b_2 - b_1 b_3)^2}{(b_1^2 + 1)^2} = (b_2 - b_1 b_3)^2 \left(\frac{1}{(b_2^2 + b_3^2 + 1)^2} - \frac{1}{(b_1^2 + 1)^2}\right) \qquad (22)$$

Note that this quantity is negative when $(b_2^2 + b_3^2) > b_1^2$.
ICPRAM2013-InternationalConferenceonPatternRecognitionApplicationsandMethods
166