Can Evolutionary Rate Matrices be Estimated from Allele Frequencies?

Conrad J. Burden

Mathematical Sciences Institute, Australian National University, Canberra, ACT 2601, Australia

Keywords:

Evolutionary Rate, Wright-Fisher Model, Fokker-Plank Equation.

Abstract:

This paper is a work in progress in which aims to combine the principles of population genetics and

continuous-time Markovian evolutionary models to estimate evolutionary rate matrices from the current ob-

served state of a single genome. A model is proposed in which sections of the genome which are not suscep-

tible to natural selection are considered to be a statistical ensemble of individual genomic sites. Each site is a

representative from a stationary distribution of allele frequencies 0 ≤ θ ≤ 1 within the population. Simulations

of this distribution via a ﬁnite-state Markov model based on a ﬁnite effective population size are compared

with the stationary solution to the continuum Fokker-Planck equation. Parameters of the evolutionary rate

matrix introduced via mutation rates within the Fokker-Planck equation are estimated for simulated data in a

number of exploratory examples.

1 INTRODUCTION

The rapidly reducing cost of high-throughput se-

quencing now allows for the acquisition of genome-

wide data proﬁling allele frequencies within pop-

ulations across large numbers of polymorphic

sites (Lynch, 2009). This paper is an exploratoryanal-

ysis of the feasibility of estimating evolutionary rate

matrices solely from the current observed state of al-

lele frequencies within a genome.

The evolutionary rate matrix at a given genomic

site is known to depend strongly on the site’s con-

text, that is, the nucleotide content of its ﬂanking

bases (Nevarez et al., 2010; Zhao and Boerwinkle,

2002). Herein we will assume the available data to

be sufﬁciently extensive that (1) individual mutation

rates can be ﬁtted independently for each context, and

(2) that we can restrict ourselves to sites expected to

evolve via a “neutral” theory in which the effects of

natural selection can be ignored. Within these con-

straints, the assumed data consists of a set of 4 num-

bers (θ

,θ

) at any given site in the genome

giving the relativeabundance 0 ≤ θ

≤ 1 of nucleotide

a ∈ {A,C, G, T} within the population at that site. For

the vast majority of sites this vector is observed to be

(immeasurably close to) one of (1,0,0,0), (0,1,0,0),

(0,0,1,0) or (0,0,0,1). Sites for which two or more

components are non-zero are referred to as single nu-

cleotide polymorphisms (SNPs), and for most SNPs

only two components are observed to be non-zero.

The nucleotides for which θ

is non-zero at a given

SNP are referrred to as alleles.

A similar approach to that set out here has also

been taken by Messer (Messer, 2009). However,

Messer’s approach differs in that he restricts the data

to alleles occurring with low frequency in the pop-

ulation, that is θ close to 0 or 1, in order to reduce

the effects of selection. In our approach we assume

the set of genomic sites can be reduced to those not

subject to selection and use the entire range of allele

frequencies.

2 THE MODEL

Our starting point is a discrete-time Markov model

which combines two fundamental ideas of population

genetics.

The ﬁrst of these is the discrete stochastic model

of genetic drift (see for instance (Ewens, 2004), Chap-

ter 3), deﬁned by a square, time-independent tran-

sition matrix p

deﬁned as follows: If, at a SNP

within the genome with two alleles A

and A

, Y(τ)

is the number of individuals in a diploid population

of size N with the allele A

at time-step (or genera-

tion) τ = 1, 2,..., then

Prob (Y(τ+1) = j|Y(τ) = i) = p

, i, j = 0,1, . . .,M,

where M = 2N. The canonical model generally con-

sidered is the Wright-Fisher model (Wright, 1931) for

183

J. Burden C..

Can Evolutionary Rate Matrices be Estimated from Allele Frequencies?.

DOI: 10.5220/0005253701830188

In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms (BIOINFORMATICS-2015), pages 183-188

ISBN: 978-989-758-070-3

 2015 SCITEPRESS (Science and Technology Publications, Lda.)

a monoecious population, for which









1−



(M− j)

This view of genetic drift readily generalises to

an alphabet L = {A,C,G,T} and a vector of random

variables Y(τ) = (Y

) with a probability

distribution

Prob (Y(τ) = i) = φ(i

;τ),

where

∑

a∈L

= M, and i

= 0,...,M. (1)

In general, for an alphabet of size d, the vector φ has



M+d−1

d−1



components. We interpret this distribution

as the relative frequency of genomic sites at which

alleles are present at the population frequency i/M.

Most of the components of φ will be very close to

zero in practice as SNPs with more than two alleles

are extremely rare within the genome. Furthermore

since the vast majority of genomic sites are not SNPs,

the distribution will be heavilydominated by the com-

ponents φ(M, 0, 0,0), φ(0,M, 0, 0), etc. The Wright-

Fisher model thus generalises to

= Prob(Y(τ+ 1) = j|Y(τ) = i) =



















The second fundamental idea is that genomic mu-

tations are modelled via a site-independent Markov

transition matrix

Π(t) = exp(tQ)

where the elements q

of Q, satisfying

∑

b∈L

= 0,

represent the instantaneous mutation rate from allele

a to allele b. Our ultimate aim is to estimate Q from

allele frequencies θ within the population at each of

the sites within the restricted set of genomic sites de-

scribed in the Introduction.

To make contact with the above model of genetic

drift, it is instructive to re-visit the Wright-Fisher

model. Assuming a two-step process in which inheri-

tance is followed by mutation at each generation, the

transition matrix of the Wright-Fisher model becomes

((Ewens, 2004), Chapter 3)





ψ(i)

(1− ψ(i))

M− j

, (2)

where

ψ(i) =

(1− u) +



1−



where u is the probability of mutation from allele A

to allele A

and v the probability of mutation from A

to A

in one generation. Here u and v are assumed to

be O(1/M). If we scale the continuous time accord-

ing to t = τ/M, this correspondsto a mutation Markov

matrix over one generation of

Π(1/M) =



1− u u

v 1− v



+ O(M

−2

where the ﬁrst row and column correspondto allele A

and the second row and column correspond to allele

By making the analogous substitutions in the gen-

eralised Wright-Fisher model, and deﬁning u

, a,b ∈

L , to be the rate of mutation from allele a to allele b

in one generation, one arrives at

= Prob(Y(τ+ 1) = j|Y(τ) = i) =





∏

a∈{A,C,G,T}

ψ(i,a)

, (3)

where

ψ(i,a) =

1−

∑

b6=a

∑

b6=a

The off-diagonal elements of the instantaneous evolu-

tionary rate matrix are

= Mu

The observed data, as described in the Introduction

are assumed to correspond to the stationary distribu-

tion of the matrix P

, thus allowing for an estimate of

the parameters q

3 TOY MODEL: 2-LETTER

ALPHABET

To explore the feasibility of estimating parameters of

the rate matrix from a data set, we begin with the

case of a 2-letter alphabet, described by the evolution

matrix P

deﬁned by Equation (2). In this case the

only parameters of the model are the off-diagonal el-

ements of the continuous-time evolutionary rate ma-

trix, q

= Mu and q

= Mv and (twice the) effective

population M. Figure 1 shows the stationary distribu-

tion

φ(i) = Prob(Y(τ = ∞) = i), (4)

obtained numerically for the parameter values stated

in the ﬁgure caption. The parameters q

and q

have

been chosen to some extent so the the distribution is

dominated by the end points i = 0 and M to mimic the

behaviour of real genomes in which the vast majority

BIOINFORMATICS2015-InternationalConferenceonBioinformaticsModels,MethodsandAlgorithms

184

0.0 0.2 0.4 0.6 0.8 1.0

0 20 40 60

θ = i M

Mφ(i), f(θ, ∞)

0.0 0.2 0.4 0.6 0.8 1.0

-1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0

θ = i M

log

(Mφ(i)), log

f(θ, ∞)

Figure 1: Numerical stationary distribution φ(i) of the

Wright-Fisher model with two-way mutations for the case

M = 100, q

= 0.02 and q

= 0.005 plotted on a linear

(top) and logarithmic (bottom) scale. Also shown in red

is the stationary solution f (θ, ∞) of the continuum M → ∞

Fokker-Planck equation with the end points of the distribu-

tion approximated as explained in the text. The extra factor

of M in the vertical axis scale set to enable a comparison

with the continuum stationary distribution.

of sites are not SNPs. This point is discussed further

below.

The effective population size of 100 in this nu-

merical simulation is of course unrealistically low,

and we next consider the limit M → ∞. There is a

well established literature on building partial differ-

ential equations, known as Fokker-Planck or forward

Kolmogorov diffusion equations, to describe the time

evolution of the probability density of allele frequen-

cies θ for a given SNP. The continuum Fokker-Planck

equation for the probability density f(θ,t) is obtained

by setting θ = i/M, t = τ/M and taking the limit

M → ∞ of the discrete model to obtain (Ewens, 2004)

∂f

∂t

= −

∂

∂θ

(a(θ) f(θ,t)) +

∂

∂θ

(b(θ) f(θ,t)), (5)

where, for the current model,

a(θ) = −q

θ+ q

(1− θ),

b(θ) = θ(1− θ).

Setting the time derivative to zero to obtain

the stationary distribution and normalising so that

f(θ)dθ = 1 yields the well-known beta distribution

f(θ,∞) = B(2q

,2q

)θ

−1

(1− θ)

−1

. (6)

where

B(α,β) =

Γ(α+ β)

Γ(α)Γ(β)

This function is superimposed over the ﬁnite M solu-

tion over the range 1/M ≤ θ ≤ 1 − 1/M in Figure 1,

and clearly illustrates that the continuum limit is ap-

proached very rapidly. To obtain the end points plot-

ted at θ = 0 and 1 we used the approximations

Prob (Y(∞) = 0) ≈

1/M

f(θ,∞)dθ

≈

B(2q

,2q

)

−1

, (7)

and

Prob (Y(∞) = M) ≈

1−1/M

f(θ,∞)dθ

≈

B(2q

,2q

)

−1

. (8)

In summary, we see that the continuum limit M → ∞

is reached rapidly, the precise value of M is in some

sense irrelevant, and that for practical purposes its role

is to provide a lower bound 1/M on the frequency of

a rare allele before a genomic site is deemed to be a

SNP.

4 PARAMETER ESTIMATION

We consider next the problem of estimating the pa-

rameters of the rate matrix for the toy model described

in the previous section.

4.1 Entire Population Surveyed

We start with the following artiﬁcially contrived sit-

uation concerning a hypothetical life form whose

CanEvolutionaryRateMatricesbeEstimatedfromAlleleFrequencies?

185

Table 1: Mutation rates ˆq

and ˆq

estimated from syn-

thetic data generated from the two-letter alphabet model as-

suming the entire population of M chromosomes is geno-

typed at n

site

genomic sites.

M q

site

ˆq

100 0.02 0.005 10

0.0194 0.00462

0.0207 0.00519

0.0206 0.00504

100 0.05 0.005 10

0.0508 0.00485

0.0541 0.00488

0.0529 0.00509

200 0.02 0.005 10

0.0253 0.00607

0.0209 0.00517

0.0206 0.00509

genome is built from a two-letter alphabet: Suppose

we have a small monoecious, diploid population of

effective size M/2, and we are able to genotype the

entire population of M chromosomes at a complete

set of n

site

independent genomic sites, each assumed

chosen to be a site not susceptible to selective pres-

sures. The data from which we are to estimate the rate

parameters q

and q

consist of a set of observed al-

lele frequencies θ

,θ

,...,θ

site

each taking a value in

the set {0,1/M,2/M,...,1}. For the vast majority of

these sites we will observe θ = 0 or 1, however it is

important to retain these non-SNP sites in the data.

Table 1 gives maximum likelihood estimates ˆq

and ˆq

of mutation rates from synthetic data gener-

ated from the numerically determined stationary dis-

tribution of the Wright-ﬁsher model with mutation,

namely Equation (4). The log likelihood is calculated

from these data using the continuum limit stationary

distribution, Equations (6) to (8).

4.2 Population Sampled

More realistically one expects the effective population

to be large, and that the data will consist of a relatively

small read coverage at each site obtained by sequenc-

ing a sample of the population. We will assume a

uniform read coverage n

read

across n

site

independent

genomic sites, each assumed chosen to be a site not

susceptible to selective pressures.

In this case the data will consist of a set of num-

bers K

,...K

site

of type A

alleles, each taking a

value in the set {0,1, . . . , n

read

}, observed at the n

site

genomic sites. At any given site, conditional on the

population fraction θ of A

-type alleles at that site,

the observed number of A

alleles K will be a bino-

mial random variable:

K|θ ∼ bin(n

read

,θ),

where θ has the beta distribution Equation (6). Thus

Table 2: Mutation rates ˆq

and ˆq

estimated from syn-

thetic data generated from the two-letter alphabet model

with rate matrix parameters q

= 0.02, q

= 0.005 assum-

ing a sample of size n

read

is genotyped at n

site

genomic sites.

read

site

ˆq

20 10

0.0203 0.00579

0.0188 0.00474

0.0191 0.00484

50 10

0.0204 0.00510

0.0200 0.00480

0.0200 0.00508

Table 3: Same as Table 2, except using rate matrix parame-

ters q

= 0.002, q

= 0.0005.

read

site

ˆq

20 10

0.00218 0.000527

0.00156 0.000406

0.00187 0.000458

50 10

0.00138 0.000352

0.00160 0.000409

0.00192 0.000474

the unconditional distribution of K is beta-binomial

(see (Johnson et al., 1992), Chapter 6),

Prob (K = k) =



site



B(2q

+ k

,2q

+ n

site

− k)

B(2q

,2q

)

Tables 2 and 3 give maximum likelihood estimates

of ˆq

and ˆq

for synthetic data generated from

the above beta-binomial distribution for realistic read

coverages n

read

= 20 and 40, assuming the number of

independent genomic sites sampled is n

site

= 10

, 10

and 10

In both examples above, we see that reasonable

estimates of mutation rates are obtained with exper-

imentally feasible values for n

site

and n

read

, and that

the estimate generally improves slightly with the the

number of genomic sites observed.

5 FULL MODEL WITH

HASEGAWA-KISHINO-YANO

RATE MATRIX

Finally we demonstrate the form of the stationary so-

lution for the full model with a 4-letter alphabet for

the case of the Hasegawa-Kishino-Yano (HKY) rate

matrix (Hasegawa et al., 1985):

Q = α







... βπ

βπ

... βπ

βπ

... βπ

βπ

...







BIOINFORMATICS2015-InternationalConferenceonBioinformaticsModels,MethodsandAlgorithms

186

with the diagonal elements set so that each row sums

to zero. This is a 5 parameter model, which, for sim-

plicity we will reduce to a 3-parameter model by as-

suming symmetry between the nucleotides A and T

and between the nucleotides C and G, that is, we set

= π

and π

= π

The stationary distribution of the matrix P

de-

ﬁned by Equation (3) for the HKY matrix with param-

eters α = 0.2, β = 0.5, π

= π

= 0.2 and π

= π

0.3 for an effective population size M = 30 is illus-

trated in Figure 2. Equation (1) implies that the argu-

ment of the stationary distribution φ(i

;∞)

lies on a simplex which, for a 4-letter alphabet, can

be represented as a tetrahedron. In Figure 2 the distri-

bution is represented by a small sphere at each set of

integer coordinates, the volume of each sphere being

proportional to the probability mass function at that

coordinate.

Figure 2: Stationary distribution of allele frequencies for the

HKY model with parameters α = 0.2, β = 0.5, π

= π

0.2 and π

= π

= 0.3. The corners labelled A, C, G and

T correspond to the coordinates i = (1,0,0,0), (0,1,0, 0),

(0,0, 1,0) and (0,0, 0,1) respectively, and the volume of the

sphere at each coordinate point is proportional to the prob-

ability mass function.

The distribution is dominated by the corners of the

tetrahedron, indicating that the majority of genomic

sites are not SNPs. Edges of the tetrahedron corre-

spond to 2-allele SNPs, the interiors of the four faces

correspond to 3-allele SNPs and the interior volume

of the tetrahedron corresponds to 4-allele SNPs. As

is observed in real genomes, 3- and 4-allele SNPs are

extremely rare. For illustrative purposes the parame-

ter α has been chosen larger than what one might ex-

pect in nature by a couple of orders of magnitude to

ensure the SNP probabilities are visible in the ﬁgure.

The stationary distribution along the edges for each

of the six possible 2-allele Figure 3 SNPs is plotted in

Figure 3.

In the above example, the small effective popu-

lation size M = 30 was chosen to enable a numeri-

cally tractable solution. For more realistic values of M

the size of the matrix P

grows asymptotically as M

However, this does not present a problem as is should

be feasible to develop a continuum Fokker-Planck

equation analogous to Equation (5) with a tactible so-

lution analogous to Equations (6) to (8) from which

to calculate a log-likelihood.

6 DISCUSSION AND

CONCLUSIONS

We have proposed an approach to estimating mutation

rates from observed allele frequencies across a popu-

lation at any set of independent genomic sites which

are believed not to be susceptible to the effects of nat-

ural selection. Our numerical estimates based on the

canonical textbook Wright-Fisher model of popula-

tion genetics suggest that reasonable estimates of mu-

tation rates can be obtained from as few as ∼ 10

such

sites.

The next step in this analysis is the technical prob-

lem of extending the continuum Fokker-Planck equa-

tion from the 2-letter genomic alphabet described in

Section 3 to an analogous equation deﬁned over the

higher dimensional simplex relevant to the 4-letter

genomic alphabet described in Section 5, and solv-

ing to ﬁnd the steady state solution. This should be

straightforward, at least for the Wright-Fisher model,

and will enable maximum likelihood estimates of evo-

lutionary rate matrices speciﬁed by parameterisations

such as the HKY model. Going beyond Wright-Fisher

to deal with species which are not diploid and mo-

noecious will presumably not present insurmountable

challenges provided the appropriate functions a(θ)

and b(θ) analogous to those occurring in Equation (5)

can be modelled.

An issue not addressed here at any level of rigour

is that of context-dependence. As mentioned in the

introduction, neutral mutation rates at a given site are

known to be subject to the nucleotide content of ﬂank-

ing bases. We have assumed that the set of indepen-

dent sites used to estimate rates are simply chosen to

share the same context. However, this ignores the fact

that the ﬂanking bases may themselves mutate, and

do so over timescales similar to the mutation rates we

seek to estimate. We end on a note of caution that

developing a non-local model which takes this into

account may prove to be a formidable mathematical

problem.

CanEvolutionaryRateMatricesbeEstimatedfromAlleleFrequencies?

187

0.0 0.2 0.4 0.6 0.8 1.0

0 1 2 3 4

A to C

i M

Mφ(i)

0.0 0.2 0.4 0.6 0.8 1.0

0 1 2 3 4

A to G

i M

Mφ(i)

0.0 0.2 0.4 0.6 0.8 1.0

0.0 0.5 1.0 1.5 2.0 2.5

A to T

i M

Mφ(i)

0.0 0.2 0.4 0.6 0.8 1.0

0 1 2 3 4

C to G

i M

Mφ(i)

0.0 0.2 0.4 0.6 0.8 1.0

0 1 2 3 4

C to T

i M

Mφ(i)

0.0 0.2 0.4 0.6 0.8 1.0

0 1 2 3 4

G to T

i M

Mφ(i)

Figure 3: Distribution along each of the edges of the tetrahedron in Figure 2 illustrating the stationary distribution of allele

frequencies in 2-allele SNPs within a population generated from the HKY model.

REFERENCES

Ewens, W. J. (2004). Mathematical population genetics,

volume 27 of Interdisciplinary Applied Mathematics.

Springer, New York, 2nd edition.

Hasegawa, M., Kishino, H., and Yano, T. (1985). Dating of

the human-ape splitting by a molecular clock of mito-

chondrial dna. J Mol Evol, 22(2):160–74.

Johnson, N. L., Kotz, S., and Kemp, A. W. (1992). Uni-

variate Discrete Distributions. Wiley, New York, 2nd

edition.

Lynch, M. (2009). Estimation of allele frequencies from

high-coverage genome-sequencing projects. Genetics,

182(1):295–301.

Messer, P. W. (2009). Measuring the rates of spontaneous

mutation from deep and large-scale polymorphism

data. Genetics, 182(4):1219–32.

Nevarez, P. A., DeBoever, C. M., Freeland, B. J., Quitt,

M. A., and Bush, E. C. (2010). Context dependent

substitution biases vary within the human genome.

BMC Bioinformatics, 11:462.

Wright, S. (1931). Evolution in mendelian populations. Ge-

netics, 16:97–159.

Zhao, Z. and Boerwinkle, E. (2002). Neighboring-

nucleotide effects on single nucleotide polymor-

phisms: a study of 2.6 million polymorphisms across

the human genome. Genome Res, 12(11):1679–86.

BIOINFORMATICS2015-InternationalConferenceonBioinformaticsModels,MethodsandAlgorithms

188