Can Evolutionary Rate Matrices be Estimated from Allele Frequencies?
Conrad J. Burden
Mathematical Sciences Institute, Australian National University, Canberra, ACT 2601, Australia
Keywords:
Evolutionary Rate, Wright-Fisher Model, Fokker-Plank Equation.
Abstract:
This paper is a work in progress in which aims to combine the principles of population genetics and
continuous-time Markovian evolutionary models to estimate evolutionary rate matrices from the current ob-
served state of a single genome. A model is proposed in which sections of the genome which are not suscep-
tible to natural selection are considered to be a statistical ensemble of individual genomic sites. Each site is a
representative from a stationary distribution of allele frequencies 0 θ 1 within the population. Simulations
of this distribution via a finite-state Markov model based on a finite effective population size are compared
with the stationary solution to the continuum Fokker-Planck equation. Parameters of the evolutionary rate
matrix introduced via mutation rates within the Fokker-Planck equation are estimated for simulated data in a
number of exploratory examples.
1 INTRODUCTION
The rapidly reducing cost of high-throughput se-
quencing now allows for the acquisition of genome-
wide data profiling allele frequencies within pop-
ulations across large numbers of polymorphic
sites (Lynch, 2009). This paper is an exploratoryanal-
ysis of the feasibility of estimating evolutionary rate
matrices solely from the current observed state of al-
lele frequencies within a genome.
The evolutionary rate matrix at a given genomic
site is known to depend strongly on the site’s con-
text, that is, the nucleotide content of its flanking
bases (Nevarez et al., 2010; Zhao and Boerwinkle,
2002). Herein we will assume the available data to
be sufficiently extensive that (1) individual mutation
rates can be fitted independently for each context, and
(2) that we can restrict ourselves to sites expected to
evolve via a “neutral” theory in which the effects of
natural selection can be ignored. Within these con-
straints, the assumed data consists of a set of 4 num-
bers (θ
A
,θ
C
,θ
G
,θ
G
) at any given site in the genome
giving the relativeabundance 0 θ
a
1 of nucleotide
a {A,C, G, T} within the population at that site. For
the vast majority of sites this vector is observed to be
(immeasurably close to) one of (1,0,0,0), (0,1,0,0),
(0,0,1,0) or (0,0,0,1). Sites for which two or more
components are non-zero are referred to as single nu-
cleotide polymorphisms (SNPs), and for most SNPs
only two components are observed to be non-zero.
The nucleotides for which θ
a
is non-zero at a given
SNP are referrred to as alleles.
A similar approach to that set out here has also
been taken by Messer (Messer, 2009). However,
Messer’s approach differs in that he restricts the data
to alleles occurring with low frequency in the pop-
ulation, that is θ close to 0 or 1, in order to reduce
the effects of selection. In our approach we assume
the set of genomic sites can be reduced to those not
subject to selection and use the entire range of allele
frequencies.
2 THE MODEL
Our starting point is a discrete-time Markov model
which combines two fundamental ideas of population
genetics.
The first of these is the discrete stochastic model
of genetic drift (see for instance (Ewens, 2004), Chap-
ter 3), defined by a square, time-independent tran-
sition matrix p
ij
defined as follows: If, at a SNP
within the genome with two alleles A
1
and A
2
, Y(τ)
is the number of individuals in a diploid population
of size N with the allele A
1
at time-step (or genera-
tion) τ = 1, 2,..., then
Prob (Y(τ+1) = j|Y(τ) = i) = p
ij
, i, j = 0,1, . . .,M,
where M = 2N. The canonical model generally con-
sidered is the Wright-Fisher model (Wright, 1931) for
183
J. Burden C..
Can Evolutionary Rate Matrices be Estimated from Allele Frequencies?.
DOI: 10.5220/0005253701830188
In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms (BIOINFORMATICS-2015), pages 183-188
ISBN: 978-989-758-070-3
Copyright
c
2015 SCITEPRESS (Science and Technology Publications, Lda.)
a monoecious population, for which
P
ij
=
M
j
i
M
j
1
i
M
(M j)
.
This view of genetic drift readily generalises to
an alphabet L = {A,C,G,T} and a vector of random
variables Y(τ) = (Y
A
,Y
C
,Y
G
,Y
T
) with a probability
distribution
Prob (Y(τ) = i) = φ(i
A
,i
C
,i
G
,i
T
;τ),
where
aL
i
a
= M, and i
a
= 0,...,M. (1)
In general, for an alphabet of size d, the vector φ has
M+d1
d1
components. We interpret this distribution
as the relative frequency of genomic sites at which
alleles are present at the population frequency i/M.
Most of the components of φ will be very close to
zero in practice as SNPs with more than two alleles
are extremely rare within the genome. Furthermore
since the vast majority of genomic sites are not SNPs,
the distribution will be heavilydominated by the com-
ponents φ(M, 0, 0,0), φ(0,M, 0, 0), etc. The Wright-
Fisher model thus generalises to
P
ij
= Prob(Y(τ+ 1) = j|Y(τ) = i) =
M
j
A
j
C
j
G
j
T
i
A
M
j
A
i
C
M
j
C
i
G
M
j
G
i
T
M
j
T
.
The second fundamental idea is that genomic mu-
tations are modelled via a site-independent Markov
transition matrix
Π(t) = exp(tQ)
where the elements q
ab
of Q, satisfying
bL
q
ab
= 0,
represent the instantaneous mutation rate from allele
a to allele b. Our ultimate aim is to estimate Q from
allele frequencies θ within the population at each of
the sites within the restricted set of genomic sites de-
scribed in the Introduction.
To make contact with the above model of genetic
drift, it is instructive to re-visit the Wright-Fisher
model. Assuming a two-step process in which inheri-
tance is followed by mutation at each generation, the
transition matrix of the Wright-Fisher model becomes
((Ewens, 2004), Chapter 3)
P
ij
=
M
j
ψ(i)
j
(1 ψ(i))
M j
, (2)
where
ψ(i) =
i
M
(1 u) +
1
i
M
v,
where u is the probability of mutation from allele A
1
to allele A
2
and v the probability of mutation from A
2
to A
1
in one generation. Here u and v are assumed to
be O(1/M). If we scale the continuous time accord-
ing to t = τ/M, this correspondsto a mutation Markov
matrix over one generation of
Π(1/M) =
1 u u
v 1 v
+ O(M
2
),
where the first row and column correspondto allele A
1
and the second row and column correspond to allele
A
2
.
By making the analogous substitutions in the gen-
eralised Wright-Fisher model, and defining u
ab
, a,b
L , to be the rate of mutation from allele a to allele b
in one generation, one arrives at
P
ij
= Prob(Y(τ+ 1) = j|Y(τ) = i) =
M
j
A
j
C
j
G
j
T
a∈{A,C,G,T}
ψ(i,a)
j
a
, (3)
where
ψ(i,a) =
i
a
M
1
b6=a
u
ab
!
+
b6=a
i
b
M
u
ba
.
The off-diagonal elements of the instantaneous evolu-
tionary rate matrix are
q
ab
= Mu
ab
.
The observed data, as described in the Introduction
are assumed to correspond to the stationary distribu-
tion of the matrix P
ij
, thus allowing for an estimate of
the parameters q
ab
.
3 TOY MODEL: 2-LETTER
ALPHABET
To explore the feasibility of estimating parameters of
the rate matrix from a data set, we begin with the
case of a 2-letter alphabet, described by the evolution
matrix P
ij
defined by Equation (2). In this case the
only parameters of the model are the off-diagonal el-
ements of the continuous-time evolutionary rate ma-
trix, q
12
= Mu and q
21
= Mv and (twice the) effective
population M. Figure 1 shows the stationary distribu-
tion
φ(i) = Prob(Y(τ = ) = i), (4)
obtained numerically for the parameter values stated
in the figure caption. The parameters q
12
and q
21
have
been chosen to some extent so the the distribution is
dominated by the end points i = 0 and M to mimic the
behaviour of real genomes in which the vast majority
BIOINFORMATICS2015-InternationalConferenceonBioinformaticsModels,MethodsandAlgorithms
184
0.0 0.2 0.4 0.6 0.8 1.0
0 20 40 60
θ = i M
Mφ(i), f(θ, )
0.0 0.2 0.4 0.6 0.8 1.0
-1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0
θ = i M
log
10
(Mφ(i)), log
10
f(θ, )
Figure 1: Numerical stationary distribution φ(i) of the
Wright-Fisher model with two-way mutations for the case
M = 100, q
12
= 0.02 and q
21
= 0.005 plotted on a linear
(top) and logarithmic (bottom) scale. Also shown in red
is the stationary solution f (θ, ) of the continuum M
Fokker-Planck equation with the end points of the distribu-
tion approximated as explained in the text. The extra factor
of M in the vertical axis scale set to enable a comparison
with the continuum stationary distribution.
of sites are not SNPs. This point is discussed further
below.
The effective population size of 100 in this nu-
merical simulation is of course unrealistically low,
and we next consider the limit M . There is a
well established literature on building partial differ-
ential equations, known as Fokker-Planck or forward
Kolmogorov diffusion equations, to describe the time
evolution of the probability density of allele frequen-
cies θ for a given SNP. The continuum Fokker-Planck
equation for the probability density f(θ,t) is obtained
by setting θ = i/M, t = τ/M and taking the limit
M of the discrete model to obtain (Ewens, 2004)
f
t
=
∂θ
(a(θ) f(θ,t)) +
1
2
2
∂θ
2
(b(θ) f(θ,t)), (5)
where, for the current model,
a(θ) = q
12
θ+ q
21
(1 θ),
b(θ) = θ(1 θ).
Setting the time derivative to zero to obtain
the stationary distribution and normalising so that
R
1
0
f(θ)dθ = 1 yields the well-known beta distribution
f(θ,) = B(2q
12
,2q
21
)θ
2q
21
1
(1 θ)
2q
12
1
. (6)
where
B(α,β) =
Γ(α+ β)
Γ(α)Γ(β)
.
This function is superimposed over the finite M solu-
tion over the range 1/M θ 1 1/M in Figure 1,
and clearly illustrates that the continuum limit is ap-
proached very rapidly. To obtain the end points plot-
ted at θ = 0 and 1 we used the approximations
Prob (Y() = 0)
Z
1/M
0
f(θ,)dθ
B(2q
12
,2q
21
)
2q
21
M
2q
21
1
, (7)
and
Prob (Y() = M)
Z
1
11/M
f(θ,)dθ
B(2q
12
,2q
21
)
2q
12
M
2q
12
1
. (8)
In summary, we see that the continuum limit M
is reached rapidly, the precise value of M is in some
sense irrelevant, and that for practical purposes its role
is to provide a lower bound 1/M on the frequency of
a rare allele before a genomic site is deemed to be a
SNP.
4 PARAMETER ESTIMATION
We consider next the problem of estimating the pa-
rameters of the rate matrix for the toy model described
in the previous section.
4.1 Entire Population Surveyed
We start with the following artificially contrived sit-
uation concerning a hypothetical life form whose
CanEvolutionaryRateMatricesbeEstimatedfromAlleleFrequencies?
185
Table 1: Mutation rates ˆq
12
and ˆq
21
estimated from syn-
thetic data generated from the two-letter alphabet model as-
suming the entire population of M chromosomes is geno-
typed at n
site
genomic sites.
M q
12
q
21
n
site
ˆq
12
ˆq
21
100 0.02 0.005 10
3
0.0194 0.00462
10
4
0.0207 0.00519
10
5
0.0206 0.00504
100 0.05 0.005 10
3
0.0508 0.00485
10
4
0.0541 0.00488
10
5
0.0529 0.00509
200 0.02 0.005 10
3
0.0253 0.00607
10
4
0.0209 0.00517
10
5
0.0206 0.00509
genome is built from a two-letter alphabet: Suppose
we have a small monoecious, diploid population of
effective size M/2, and we are able to genotype the
entire population of M chromosomes at a complete
set of n
site
independent genomic sites, each assumed
chosen to be a site not susceptible to selective pres-
sures. The data from which we are to estimate the rate
parameters q
12
and q
21
consist of a set of observed al-
lele frequencies θ
1
,θ
2
,...,θ
n
site
each taking a value in
the set {0,1/M,2/M,...,1}. For the vast majority of
these sites we will observe θ = 0 or 1, however it is
important to retain these non-SNP sites in the data.
Table 1 gives maximum likelihood estimates ˆq
12
and ˆq
21
of mutation rates from synthetic data gener-
ated from the numerically determined stationary dis-
tribution of the Wright-fisher model with mutation,
namely Equation (4). The log likelihood is calculated
from these data using the continuum limit stationary
distribution, Equations (6) to (8).
4.2 Population Sampled
More realistically one expects the effective population
to be large, and that the data will consist of a relatively
small read coverage at each site obtained by sequenc-
ing a sample of the population. We will assume a
uniform read coverage n
read
across n
site
independent
genomic sites, each assumed chosen to be a site not
susceptible to selective pressures.
In this case the data will consist of a set of num-
bers K
1
,K
2
,...K
n
site
of type A
1
alleles, each taking a
value in the set {0,1, . . . , n
read
}, observed at the n
site
genomic sites. At any given site, conditional on the
population fraction θ of A
1
-type alleles at that site,
the observed number of A
1
alleles K will be a bino-
mial random variable:
K|θ bin(n
read
,θ),
where θ has the beta distribution Equation (6). Thus
Table 2: Mutation rates ˆq
12
and ˆq
21
estimated from syn-
thetic data generated from the two-letter alphabet model
with rate matrix parameters q
12
= 0.02, q
21
= 0.005 assum-
ing a sample of size n
read
is genotyped at n
site
genomic sites.
n
read
n
site
ˆq
12
ˆq
21
20 10
3
0.0203 0.00579
10
4
0.0188 0.00474
10
5
0.0191 0.00484
50 10
3
0.0204 0.00510
10
4
0.0200 0.00480
10
5
0.0200 0.00508
Table 3: Same as Table 2, except using rate matrix parame-
ters q
12
= 0.002, q
21
= 0.0005.
n
read
n
site
ˆq
12
ˆq
21
20 10
3
0.00218 0.000527
10
4
0.00156 0.000406
10
5
0.00187 0.000458
50 10
3
0.00138 0.000352
10
4
0.00160 0.000409
10
5
0.00192 0.000474
the unconditional distribution of K is beta-binomial
(see (Johnson et al., 1992), Chapter 6),
Prob (K = k) =
n
site
k
B(2q
12
+ k
1
,2q
21
+ n
site
k)
B(2q
12
,2q
21
)
.
Tables 2 and 3 give maximum likelihood estimates
of ˆq
12
and ˆq
21
for synthetic data generated from
the above beta-binomial distribution for realistic read
coverages n
read
= 20 and 40, assuming the number of
independent genomic sites sampled is n
site
= 10
3
, 10
4
and 10
5
.
In both examples above, we see that reasonable
estimates of mutation rates are obtained with exper-
imentally feasible values for n
site
and n
read
, and that
the estimate generally improves slightly with the the
number of genomic sites observed.
5 FULL MODEL WITH
HASEGAWA-KISHINO-YANO
RATE MATRIX
Finally we demonstrate the form of the stationary so-
lution for the full model with a 4-letter alphabet for
the case of the Hasegawa-Kishino-Yano (HKY) rate
matrix (Hasegawa et al., 1985):
Q = α
... βπ
C
π
G
βπ
T
βπ
A
... βπ
G
π
T
π
A
βπ
C
... βπ
T
βπ
A
π
C
βπ
G
...
,
BIOINFORMATICS2015-InternationalConferenceonBioinformaticsModels,MethodsandAlgorithms
186
with the diagonal elements set so that each row sums
to zero. This is a 5 parameter model, which, for sim-
plicity we will reduce to a 3-parameter model by as-
suming symmetry between the nucleotides A and T
and between the nucleotides C and G, that is, we set
π
A
= π
T
and π
C
= π
G
.
The stationary distribution of the matrix P
ij
de-
fined by Equation (3) for the HKY matrix with param-
eters α = 0.2, β = 0.5, π
A
= π
T
= 0.2 and π
C
= π
G
=
0.3 for an effective population size M = 30 is illus-
trated in Figure 2. Equation (1) implies that the argu-
ment of the stationary distribution φ(i
A
,i
C
,i
G
,i
T
;)
lies on a simplex which, for a 4-letter alphabet, can
be represented as a tetrahedron. In Figure 2 the distri-
bution is represented by a small sphere at each set of
integer coordinates, the volume of each sphere being
proportional to the probability mass function at that
coordinate.
Figure 2: Stationary distribution of allele frequencies for the
HKY model with parameters α = 0.2, β = 0.5, π
A
= π
T
=
0.2 and π
C
= π
G
= 0.3. The corners labelled A, C, G and
T correspond to the coordinates i = (1,0,0,0), (0,1,0, 0),
(0,0, 1,0) and (0,0, 0,1) respectively, and the volume of the
sphere at each coordinate point is proportional to the prob-
ability mass function.
The distribution is dominated by the corners of the
tetrahedron, indicating that the majority of genomic
sites are not SNPs. Edges of the tetrahedron corre-
spond to 2-allele SNPs, the interiors of the four faces
correspond to 3-allele SNPs and the interior volume
of the tetrahedron corresponds to 4-allele SNPs. As
is observed in real genomes, 3- and 4-allele SNPs are
extremely rare. For illustrative purposes the parame-
ter α has been chosen larger than what one might ex-
pect in nature by a couple of orders of magnitude to
ensure the SNP probabilities are visible in the figure.
The stationary distribution along the edges for each
of the six possible 2-allele Figure 3 SNPs is plotted in
Figure 3.
In the above example, the small effective popu-
lation size M = 30 was chosen to enable a numeri-
cally tractable solution. For more realistic values of M
the size of the matrix P
ij
grows asymptotically as M
4
.
However, this does not present a problem as is should
be feasible to develop a continuum Fokker-Planck
equation analogous to Equation (5) with a tactible so-
lution analogous to Equations (6) to (8) from which
to calculate a log-likelihood.
6 DISCUSSION AND
CONCLUSIONS
We have proposed an approach to estimating mutation
rates from observed allele frequencies across a popu-
lation at any set of independent genomic sites which
are believed not to be susceptible to the effects of nat-
ural selection. Our numerical estimates based on the
canonical textbook Wright-Fisher model of popula-
tion genetics suggest that reasonable estimates of mu-
tation rates can be obtained from as few as 10
4
such
sites.
The next step in this analysis is the technical prob-
lem of extending the continuum Fokker-Planck equa-
tion from the 2-letter genomic alphabet described in
Section 3 to an analogous equation defined over the
higher dimensional simplex relevant to the 4-letter
genomic alphabet described in Section 5, and solv-
ing to find the steady state solution. This should be
straightforward, at least for the Wright-Fisher model,
and will enable maximum likelihood estimates of evo-
lutionary rate matrices specified by parameterisations
such as the HKY model. Going beyond Wright-Fisher
to deal with species which are not diploid and mo-
noecious will presumably not present insurmountable
challenges provided the appropriate functions a(θ)
and b(θ) analogous to those occurring in Equation (5)
can be modelled.
An issue not addressed here at any level of rigour
is that of context-dependence. As mentioned in the
introduction, neutral mutation rates at a given site are
known to be subject to the nucleotide content of flank-
ing bases. We have assumed that the set of indepen-
dent sites used to estimate rates are simply chosen to
share the same context. However, this ignores the fact
that the flanking bases may themselves mutate, and
do so over timescales similar to the mutation rates we
seek to estimate. We end on a note of caution that
developing a non-local model which takes this into
account may prove to be a formidable mathematical
problem.
CanEvolutionaryRateMatricesbeEstimatedfromAlleleFrequencies?
187
0.0 0.2 0.4 0.6 0.8 1.0
0 1 2 3 4
A to C
i M
Mφ(i)
0.0 0.2 0.4 0.6 0.8 1.0
0 1 2 3 4
A to G
i M
Mφ(i)
0.0 0.2 0.4 0.6 0.8 1.0
0.0 0.5 1.0 1.5 2.0 2.5
0.0 0.2 0.4 0.6 0.8 1.0
0 1 2 3 4
C to G
i M
Mφ(i)
0.0 0.2 0.4 0.6 0.8 1.0
0 1 2 3 4
C to T
i M
Mφ(i)
0.0 0.2 0.4 0.6 0.8 1.0
0 1 2 3 4
Figure 3: Distribution along each of the edges of the tetrahedron in Figure 2 illustrating the stationary distribution of allele
frequencies in 2-allele SNPs within a population generated from the HKY model.
REFERENCES
Ewens, W. J. (2004). Mathematical population genetics,
volume 27 of Interdisciplinary Applied Mathematics.
Springer, New York, 2nd edition.
Hasegawa, M., Kishino, H., and Yano, T. (1985). Dating of
the human-ape splitting by a molecular clock of mito-
chondrial dna. J Mol Evol, 22(2):160–74.
Johnson, N. L., Kotz, S., and Kemp, A. W. (1992). Uni-
variate Discrete Distributions. Wiley, New York, 2nd
edition.
Lynch, M. (2009). Estimation of allele frequencies from
high-coverage genome-sequencing projects. Genetics,
182(1):295–301.
Messer, P. W. (2009). Measuring the rates of spontaneous
mutation from deep and large-scale polymorphism
data. Genetics, 182(4):1219–32.
Nevarez, P. A., DeBoever, C. M., Freeland, B. J., Quitt,
M. A., and Bush, E. C. (2010). Context dependent
substitution biases vary within the human genome.
BMC Bioinformatics, 11:462.
Wright, S. (1931). Evolution in mendelian populations. Ge-
netics, 16:97–159.
Zhao, Z. and Boerwinkle, E. (2002). Neighboring-
nucleotide effects on single nucleotide polymor-
phisms: a study of 2.6 million polymorphisms across
the human genome. Genome Res, 12(11):1679–86.
BIOINFORMATICS2015-InternationalConferenceonBioinformaticsModels,MethodsandAlgorithms
188