INFERENCE OF GENE REGULATORY NETWORKS BY

EXTENDED KALMAN FILTERING USING GENE EXPRESSION

TIME SERIES DATA

Ramouna Fouladi

1

, Emad Fatemizadeh

1

and S. Shahriar Arab

2

1

Department of Electrical Engineering, Sharif University of Technology, Azadi ave., Tehran, Iran

2

Department of Biophysics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran, Iran

Keywords: Gene expression, Extended Kalman filtering, Gene regulatory network modelling.

Abstract: In this paper, the Extended Kalman filtering (EKF) approach has been used to infer gene regulatory

networks using time-series gene expression data. Gene expression values are considered stochastic

processes and the gene regulatory network, a dynamical nonlinear stochastic model. Using these values and

a modified Kalman filtering approach, the model’s parameters and consequently the interactions amongst

genes are predicted. In this paper, each gene-gene interaction is modeled using a linear term, a nonlinear

one, and a constant term. The linear and nonlinear term coefficients are included in the state vector together

with the gene expressions’ true values. Through the extended Kalman filtering process, these coefficients

are updated in such a way that the predicted gene expressions follow the ones observed. Finally,

connections between each two genes are inferred based on these coefficients.

1 INTRODUCTION

Gene expression is a process in which gene products

are synthesized using inherent information in genes.

Regarding different expression levels of different

genes in a cell, proteins present in the cell will vary

both in amount and the kinds. Thus, the cell can be

in different states, e.g. growth or death. Different

genes’ products can affect the rate of expression of a

specific gene in a direct or indirect way. Gene

Regulatory Networks map these interactions in the

form of a network. One of the important challenges

is the development of efficient algorithms to infer

these underlying connections using gene expression

time series data without performing complicated

time-consuming laboratory experiments.

One of the methods for modelling gene

expression data is dynamic modelling of gene

regulatory networks. Some of these models are

Boolean network models (Chen and Aihara, 1999);

(D'haeseleer et al., 1999); (Holter et al., 2001),

(Wang et al., 2008a), Bayesian model (Ghahramani,

1998); (Liu et al., 2006); (Murphy and Mian, 1999),

state space models ; (Rangel et al., 2004); (Wu et al.,

2004)) and stochastic model (Cook et al., 1998);

(Tian and Burrage, 2003); (Wang et al., 2008b).

Several factors should be considered in

proposing methods for modelling gene regulatory

networks. First of all, it is widely accepted that gene

expression is a stochastic process, so the model

defining the interactions should be able to handle the

stochastic nature of the network. Nonlinearity of the

interactions is another issue which should be taken

into account. In addition, gene regulatory networks

are usually a function of a large number of variables

but the available time series data only consists of a

small number of observations. Another issue is the

inherent noise in gene expression data due to the

nature of the process in which DNA microarray

experiments are performed. A comprehensive model

is the one which handles all these issues. Still, most

available methods in the literature have not

considered all. The use of extended Kalman filtering

seems to be a proper solution.

In this paper, gene expression values are

considered as stochastic processes .each gene-gene

interaction is modelled using a linear term, a

nonlinear one, and a constant term. The linear and

nonlinear term coefficients are included in the state

vector together with the gene expressions’ values.

Through the extended Kalman filtering process,

these coefficients are updated in such a way that the

150

Fouladi R., Fatemizadeh E. and Shahriar Arab S..

INFERENCE OF GENE REGULATORY NETWORKS BY EXTENDED KALMAN FILTERING USING GENE EXPRESSION TIME SERIES DATA.

DOI: 10.5220/0003754801500155

In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms (BIOINFORMATICS-2012), pages 150-155

ISBN: 978-989-8425-90-4

Copyright

c

2012 SCITEPRESS (Science and Technology Publications, Lda.)

predicted gene expressions follow the ones

observed. Finally, connections between each two

genes are inferred based on these predicted

coefficients. Four real-world gene expression data

sets are used to demonstrate the effectiveness of the

proposed algorithm.

The paper is organized as follows: section 2

describes the parameter estimation using EKF. Our

proposed method is discussed in section 3. The

experimental evaluation and discussions are given in

section 4, followed by conclusion and future works

in the final section.

2 PARAMETER ESTIMATION

USING EKF

In general, the nonlinear system dynamics and a

measurement are described by (Wang et al., 2009):

( 1) ((), ) ()

x

kfxk k

(1)

() ((), ) ()zk gxk vk

(2)

()k

and

()vk

are the process and measurement

noises which are assumed to be drawn from zero

mean multivariate normal distributions with

covariances.

k

Q

and

k

R

respectively. These two

noises are two independent white noises.

is the

vector of the unknown parameters and is included in

the state vector

() [(), ()]

T

X

kxkk

(3)

In order to use the Kalman or Extended Kalman

filters, some assumptions should be made.

Alongside the properties said for process and

measurement noises, we should assume Gaussian

probability distributions for the state variables. The

resulting dynamic equations are

(1) (()) ()

X

kFXkwk

(4)

() ( ()) ()zk GX k vk

(5)

Where

() [ (),0]

TT

wk k

(6)

( ()) [ ((), ()), ()]

TTT

F

Xk f xk k k

(7)

( ( )) ( ( ), ( ))GXk gxk k

(8)

Equations (4) and (5) serve as the state transition and

observation models respectively. Through a two

phase estimation process, the state vector is updated

in each step regarding the observations available.

A gene regulatory network containing

n

genes is

described by the following discrete-time nonlinear

stochastic dynamical system (Chen and Aihara,

1999), Where

ij

a

identifies the linear regulatory

relationship between genes

i

and

j

and

ij

b

identifies

the nonlinear relationship between genes

i

and

j

,

11

(1) () ((),)

( )

1, 2,..., 0,1, 2,..., 1

nn

iijjijjjj

jj

oi i

xk ax k bf x k

Ik

ink m

(9)

1

(, )

1

j

jj j

x

fx

e

(10)

function

f

is a sigmoid function and is easily

differentiable. When a detailed description is

lacking, a sigmoid function is often used.

3 THE PROPOSED METHOD

In our algorithm, the model (9) is written for each

pair of genes in the network. So equation (9) for

genes 1 and 2 turns into:

11111221111

12 2 2 01 1

(1) () () (())

( ( )) ( )

x

kaxkaxkbfxk

bf xk I k

22112222111

22 2 2 02 2

(1) () () (())

( ( )) ( )

x

kaxkaxkbfxk

bfxk I k

(11)

Setting

11 2 2

( ( )) [ ( ( )), ( ( ))]

T

fxk fxk f x k

model (9)

can be written in vector form as follows:

0

(1) () (()) ()

x

k Axk Bf xk I k

(12)

() () ()

y

kxkvk

(13)

if

11 21 12 22

[, , , ]

T

Aaaaa

and

11 21 12 22

[, , , ]

T

Bbbbb

,

the vector of unknown parameters would be

0

[ ]

TTT

A

BI

. Regarding equation (11), the

expression value of gene 1 at time step

k

is a linear

and a nonlinear function of the expression value of

the same gene at time-step

1k

and a linear and

nonlinear function of the expression value of gene 2

at time-step

1k

. In this paper, the coefficients

11

a

,

22

a

,

11

b

,

22

b

are set to zero in each time-step so that

each gene is bound to construct its expression values

at each time step from the expression values of the

other gene at the previous step, not its own. After

running the algorithm, we would have 4 time-series,

12

a

,

21

a

,

12

b

,

21

b

. For deducing the effect of gene 2

on gene 1, we first added up the absolute values of

INFERENCE OF GENE REGULATORY NETWORKS BY EXTENDED KALMAN FILTERING USING GENE

EXPRESSION TIME SERIES DATA

151

12

a and

12

b and then took an average over all time-

steps. The calculated number is indicative of the

strength of the regulatory influence of gene 2 on

gene 1. We did the same for finding the effect of

gene 1 on gene 2. After performing this process on

all pairs of genes, we would have an

nn

matrix (n

is the number of genes in the network), let’s call it

M

.

(, )

M

ij

denotes the effect of gene

j

on gene

i

. The final interactions between genes are deduced

from the elements of the matrix

M

. By setting a

threshold, directed interactions would be inferred

based on these numbers. We should assert that in

each run of the algorithm (for each pair of genes),

the initial condition of the state vector and standard

deviation of the process and measurement noises are

kept constant so that the conditions are equal for all

cases.

The threshold is set in a way that at most

12%A

upper values of the elements of the matrix

are chosen. A is the percentile of true connections to

all possible connections. So, with an approximate

knowledge of the number of connections, nearly all

of them can be extracted by our method, See table 1.

Table 1: Threshold derivation based on percentile of real

interactions.

Data set name

percentile of

True

connections

percentile of

chosen

elements

Yeast Data Set 27.27% 25%

E-coli first Data Set 12.5% 18.75%

E-coli 2nd Data Set 51% 50%

IRMA (Switch on) 32% 44%

IRMA (Switch off) 32% 44%

4 RESULTS AND DISCUSSIONS

Our algorithm was evaluated and compared with

ARACNE (Margolin et al., 2006), TDARACNE

(Zoppoli et al., 2010), dynamical Bayesian Networks

implemented in the Banjo package (Yu et al., 2004)

and ODE implemented in the TSNI package (Bansal

et al., 2006), with gene expression data of yeast cell

cycle (Spellman et al., 1998), two SOS signalling

pathways in E. coli (Ronen et al., 2002); (Gardner,

2003) and an in vivo synthetic network, called

IRMA (Cantone et al., 2009).

The performance is measured in terms of

Positive Predictive Value (PPV), Recall and F-score.

PPV is the percentage of inferred connections which

are correct and Recall is the percentage of true

connections which are correctly inferred by the

algorithm. Suppose TP = number of True Positives,

FP = number of false positives and FN = number of

false negatives,

TP TP

PPV= Recall=

TP+FP TP+FN

The overall performance depends on both the PPV

and Recall. The F- score is the geometric mean of

PPV and Recall and is a good indicator of

performance:

2(PPV.Recall)

F-score

PPV+Recall

(14)

4.1 Yeast Data Set

Next, we selected an eleven gene network from

yeast S. Cerevisiae cell cycle.

Selected genes are Cln3, Cdc28, Mbp1, Swi4, Clb6,

Cdc6, Sic1, Swi6, Cln1, Cln2, and Clb5. Here the

cdc15 dataset was used as it has the maximum

number of gene expression measurements.After data

normalization and interpolation using cubic-spline

interpolation, the algorithm was run. The results

were evaluated using Pathway studio software and

summarized in Table 2.

Table 2: Comparison of our algorithm with previous methods. The displayed values are in percent.

Our method TD ARACNE TSNI BANJO

PPV Recall F-Score PPV Recall F-Score PPV Recall F-Score PPV Recall F-Score

Yeast Data set

41 38 39 41 22 29 29 19 23 43 28 34

E-coli SOS pathway (first data set)

33 50 40 85 75 80 13 25 17 18 38 24

IRMA network

Our Algorithm TD-ARACNE TSNI BANJO ARACNE

PPV Recall F-score PPV Recall F-score PPV Recall F-score PPV Recall F-score PPV Recall F-score

Switch-ON data

54 88 67 71 67 69 80 50 61 30 25 27 50 60 54

Switch-OFF data

55 75 63 37 60 46 60 38 46 60 38 46 25 33 28

BIOINFORMATICS 2012 - International Conference on Bioinformatics Models, Methods and Algorithms

152

Figure 1: Yeast cell cycle pathways in Pathway Studio and inferred by 4 algorithms; from top left to bottom right. Pathways

in Pathway Studio software, inferred graph by our algorithm, by TSNI, by TD-ARACNE, by Banjo. True connections are

shown with direct lines. The conections inferred with false direction are considered False positives and not displayed.

The network built by Pathway Studio Software and

the inferred network is displayed in Figure 1.The

color density of the lines define the number of

credible references acclaiming the connection. As

can be seen, our algorithm mostly recovers the most

confident interactions

4.2 e-Coli SOS Pathway (First Data

Set)

We also tested the proposed algorithm using eight

genes in E. coli SOS pathway. The SOS pathway is

activated in response to DNA damage in which the

cell cycle is arrested and DNA repair is induced. The

selected genes for this experiment are polB, uvrA,

lexA, uvrD, recA, uvrY, ruvA and umuDC. The

results are displayed in Table 2. The true network

and the inferred network are displayed in Figure 3.

4.3 e-Coli SOS Pathway (Second Data

Set)

We also tested the algorithm on nine other genes of

the SOS pathway in E-Coli. Selected genes are dinI,

rpoS, rpoD, umuDC, Ssb, recA, lexA, recF and

rpoH. We compared the network that we found with

the one that was identified in (Gardner, 2003) and

with a literature survey of the known interactions

among these nine genes (Fig. 2). Apart from self

feedbacks, the network has 43 connections. TSNI

algorithm could find 20 connections correctly

(Bansal et al., 2006) while NIR found 22

connections correctly out of 43 known connections.

We could predict 30 of the connections correctly

with PPV=69.7%, recall=71% and F-score=70%

Figure 2: Gene regulatory interactions between nine genes

of SOS network in E-Coli (second dataset) known in

literature. Positive effects are shown as line, and negative

ones as dashed lines.

4.4 IRMA Network

In (Cantone et al., 2009), a synthetic network was

built in the yeast Saccharomyces cerevisiae.

In this study, they tested the transcription of

network genes when culturing cells in galactose or

glucose. There are two sets of gene profiles, Switch

ON and Switch OFF. The first one corresponds to

shifting of the growing cells from glucose to

galactose and the second one corresponds to the

reverse shift. The inferred graph and true network

are displayed in Figure 4. The results are also

displayed in Table 2.

INFERENCE OF GENE REGULATORY NETWORKS BY EXTENDED KALMAN FILTERING USING GENE

EXPRESSION TIME SERIES DATA

153

Figure 3: e-Coli SOS true pathway and inferred by 4 algorithms; from top left to bottom right, Original pathways, inferred

graph by our algorithm, by Banjo by TD-ARACNE, by TSNI. True positives are shown by direct lines and false positives

by dashed lines. Missing verse on the connection means that the algorithm recovers the wrong verse.

Figure 4: Left: Yeast synthetic network, right: network by

our algorithm (switch OFF dataset). Missing verse on the

connection means that the algorithm recovers wrong verse

As can be seen, in case of Yeast, we have a

considerable increase in Recall, which means the

proposed method can infer more of the true

interactions. It should be noted that this increase in

Recall not only hasn’t caused the PPV to decrease

but also has led to a larger F-Score. In case of the

first E-Coli dataset, PPV, Recall and F-Score have

increased considerably comparing the results of

TSNI and Banjo. In case of IRMA switch ON data,

PPV, Recall and F-score is greater than those of

ARACNE, Banjo and TSNI. Although the proposed

method has an F-Score almost equal to that of TD

ARACNE, the Recall value is greater. In case of

IRMA Switch OFF dataset, F-Score has an increase

around 20% compared to the best result by other

methods.

5 CONCLUSIONS

An algorithm was developed in this paper using

extended Kalman filtering. Results were good for

medium networks, but as said, the interactions are

deduced based on a two by two process. The

algorithm should be extended so that inference of

much larger networks is possible without much

computational cost. Using a clustering method prior

to running the algorithm and performing the

algorithm in each cluster separately seems a good

solution. In addition, in the expression profiles, only

the mRNA concentration is measured, while with

taking into account other biological data, better

results can be gained.

REFERENCES

Bansal, M., Della Gatta, G. and DI Bernardo, D., 2006.

Inference of Gene Regulatory Networks and

Compound Mode of Action from Time Course Gene

Expression Profiles. Bioinformatics, vol. 22, no. 7,

815-822.

Cantone, L., Marucci, L., Lorio, F., Ricci, M., Belcastro,

V., Bansal, M., Santini, S., DI Bernardo, M., DI

Bernardo, D. and Cosma, M., 2009. A Yeast Synthetic

Network for In Vivo Assesment of Reverse-

Engineering and Modeling Approaches. Cell, 137,

172-181.

Chen, T. and Aihara, K. Year. Modeling Gene Expression

with Differential Equations. In: proc. pacific symp.

Biocomputing, 1999. 29-40.

Cook, D. L., Gerber, A. N. and Tapscott, S. J. Year.

Modeling Stochastic Gene Expression: Implications

for Haploinsufficiency. In: Proc. Nat'l Academy of

Science, USA, 1998. 15641-15646.

D'haeseleer, P., Wen, X., Fuhrman, S. and SOMOGYI, R.

Year. "Linear Modeling of mRNA Expression Levels

BIOINFORMATICS 2012 - International Conference on Bioinformatics Models, Methods and Algorithms

154

during CNS Development and Injury,". In: Proc.

Pacific Symp. Biocomputing, 1999. 41-52.

Gardner, T. 2003. Inferring genetic networks and

identifying compound mode of action via expression

profiling. Science, vol. 301, 102-105.

Ghahramani, Z. 1998. Learning Dynamic Bayesian

Networks. Adaptive Processing of Sequences and

Data Structures, Springer-Verlag, 168-197.

Holter, N. S., Maritanm, A., Cieplak, M., Fedoroff, N. V.

and Banavar, J. R. 2001. "Dynamic Modeling of Gene

Expression Data,". Proc. Nat'l Academy of Science.

USA.

Liu, T., Sung, W. and Mittal, A. 2006. Model Gene

Network by Semi-Fixed Bayesian Network. Expert

Systems with Applications, vol. 30, no.1, 42-49.

Margolin, A. A., Nemenman, I., Basso, K., Wiggins, C.,

Stolovitzky, G., Dalla Favera, R. and Califano, A.

2006. ARACNE: an algorithm for the reconstruction

of gene regulatory networks in a mammalian cellular

context. BMC Bioinformatics, 7 Suppl 1, S7.

Murphy, K. and Mian, S. 1999. Modeling Gene

Expression Data Using Dynamic Bayesian Networks.

technical report, Univ. of California.

Rangel, C., Angus, J., Ghahramani, Z., Lioumi, M.,

Sotheran, E. A., Gaiba, A., Wild, D. L. and Falciani,

F. 2004. Modeling T-Cell Activation Using Gene

Expression Profiling and State Space Models.

Bioinformatics, vol. 20, no. 9, 1361-1372.

Ronen, M., Rosenberg, R., Shraiman, B. and Alon, U.

Year. Assigning Numbers to the Arrows:

Parameterizing a Gene Regulation Network by Using

Accurate Expression Kinetics. In: Proc Nat'l Academy

Science, USA, 2002. 10555-10560.

Spellman, P., Sherlock, G., Zhang, M., Iyer, V., Anders,

K., Eisen, M., Brown, P., Botsein, D. and B, F. 1998.

Comprehensive Identification of Cell Cycleregulated

Genes of the Yeast Saccharomyces cerevisiae by

Microarray Hybridization. Molecular Biology of the

Cell, vol. 9, no. 12, 3273-3297.

Tian, T. and Burrage, K. Year. Stochastic Neural Network

Models for Gene Regulatory Networks. In: Proc. 2003

IEEE Congress Evolutionary Computation, 2003. 162-

169.

Wang, Z., Gao, H., Cao, J. and Liu, X., 2008a. “On

Delayed Genetic Regulatory Networks with Polytopic

Uncertainties: Robust Stability Analysis,”. IEEE

Trans. NanoBioscience, vol. 7, no. 2, 154-163.

Wang, Z., Liu, X., Liang, J. and Vinciotti, V., 2009. An

Extended Kalman Filtering Approach to Modeling

Nonlinear Dynamic Gene Regulatory Networks via

Short Gene Expression Time Series. IEEE/ACM

Transactions on Computational Biology and

BioinformaticS, vol. 6, no. 3, 410-419.

Wang, Z., Yang, F., Ho, D. W. C., Swift, S., Tucker, A.

and Liu, X. 2008b. Stochastic Dynamic Modeling of

Short Gene Expression Time Series Data. IEEE Trans.

NanoBioscience, vol. 7, no. 1, 44-55.

Wu, F., Zhang, W. and Kusalik, A. J. Year. Modeling

Gene Expression from Microarray Expression Data

with State-Space Equations. In: Proc. Pacific Symp.

Biocomputing, 2004. 581-592.

Yu, J., Smith, V., Wang, P. and Hartemink, A. 2004.

Advances to Bayesian Network Inference for

Generating Causal Networks from Observational

Biological Data. Bioinformatics, vol. 20, no. 18, 3594-

3603.

Zoppoli, P., Morganella, S. and Ceccarelli, M. 2010.

TimeDelay-ARACNE: Reverse engineering of gene

networks from time-course data by an information

theoretic approach. BMC Bioinformatics, 11, 154.

INFERENCE OF GENE REGULATORY NETWORKS BY EXTENDED KALMAN FILTERING USING GENE

EXPRESSION TIME SERIES DATA

155