Search of Periodicity Regions in the Genome A.thaliana
Periodicity Regions in the A.thaliana Genomes
E. V. Korotkov
1
, F. E. Frenkel
1
and M. A. Korotkova
2
1
Institute of Bioengineering, Research Center of Biotechnology of the Russian Academy of Sciences,
Leninsky Ave. 33, bld. 2, 119071, Moscow, Russia
2
National Research Nuclear University “MEPhI”, Kashirskoe shosse, 31. Moscow 115409, Russia
Keywords: Sequence, Dynamic Programming, Repeat, Genome, Matrix.
Abstract: A mathematical method was developed in this study to determine tandem repeats in a DNA sequence. A
multiple alignment of periods was calculated by direct optimization of the position-weight matrix (PWM)
without using pairwise alignments or searching for similarity between periods. Random PWMs were used to
develop a new mathematical algorithm for periodicity search. The developed algorithm was applied to
analyze the DNA sequences of A.thaliana genome. 13997 regions having a periodicity with length of 2 to 50
bases were found. The average distance between regions with periodicity is ~9000 nucleotides. A
significant portion of the revealed regions have periods consisting of 2 nucleotide, 10-11 nucleotides and
periods in the vicinity of 30 nucleotides. No more than ~30% of the periods found were discovered early.
The sequences found were collected in a data bank from the website: http://victoria.biengi.ac.ru/cgi-
in/indelper/index.cgi. This study discussed the origin of periodicity with insertions and deletions.
1 INTRODUCTION
Periodicity is one of the structural regularities of
sequences and is widely represented in DNA
sequences (Korotkov et al. 2003). A periodicity is
considered as latent, if the similarity between any
two periods is not statistically significant or if it
belongs to the twilight zone (Durbin et al. 1998).
Perfect periodicity can become latent periodicity, if
it accumulates over 1.0 mutation per nucleotide in
the studied DNA sequence (Suvorova et al. 2014).
The distinctive property of latent periodicity is that it
cannot be detected by pairwise comparisons of
nucleotide sequences. However, latent periodicity
can be found if a mathematical method is applied to
directly detect the multiple alignment of nucleotide
sequences without constructing pairwise alignments.
The periods of a sequence with latent periodicity are
sequences for multiple alignment and this multiple
alignment may be the statistically significant without
the statistical importance of any pair alignment. The
aim of this study was to develop a mathematical
method which allows finding the periodicity of DNA
sequences as well as latent periodicity.
At present, there is a significant gap in the
mathematical approaches developed in search for
periodicities in symbolic and numeric sequences
(sequence-based methods). Spectral approaches
enable the finding of adequate "fuzzy" periodicity in
nucleotide sequences without the insertion(s) or
deletion(s) of nucleotides. Fourier transform,
Wavelet transform, information decomposition and
some other methods can be attributed to the number
of spectral methods (Lobzin & Chechetkin 2000;
Kravatskaya et al. 2011; Korotkov et al. 2003; Meng
et al. 2013; Afreixo et al. 2004; Kumar et al. 2006).
However, these approaches have a significant
limitation – they do not allow the detection of a
periodicity with insertions and deletions.
On the other hand, methods based on pairwise
alignment can accurately find insertions and
deletions (Benson 1999; Parisi et al. 2003).
However, these methods cannot detect a latent
periodicity, in a situation where the statistical
significance of similarity between any two periodic
sequences is small (Korotkov et al. 2003; Turutina et
al. 2006). This is due to the fact that the periodicity
of DNA sequences (with the number of periods
greater than or equal to 4) is detected by pairwise
similarity between periods. In the absence of
statistically significant pairwise similarity, these
approaches are incapable of finding latent
periodicity. First, it involves algorithms and
Korotkov E., Frenkel F. and Korotkova M.
Search of Periodicity Regions in the Genome A.thaliana - Periodicity Regions in the A.thaliana Genomes.
DOI: 10.5220/0006106001250132
In Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2017), pages 125-132
ISBN: 978-989-758-214-1
Copyright
c
2017 by SCITEPRESS Science and Technology Publications, Lda. All rights reserved
125
programs, such as TRF (Benson 1999), Mreps
(Kolpakov et al. 2003), TRStalker (Pellegrini et al.
2010), ATRHunter (Wexler et al. 2005), T-REKS
(Jorda & Kajava 2009), IMEX (Mudunuri et al.
2010; Mudunuri & Nagarajaram 2007), CRISPRs
(Grissa et al. 2007), SWAN (Boeva et al. 2006) and
some others (Lim et al. 2013; Moniruzzaman et al.
2016), because the similarity between different
periods is very low in the case of latent periodicity.
It is true for algorithmic methods too (Domaniç &
Preparata 2007; Sokol & Tojeira 2014). This leads to
lack of seeds and identical short strings. Therefore,
this study proposes a mathematical method that
considers this gap and finds the latent periodicity of
any symbolic sequence in the presence of insertions
and deletions (in unknown positions of the analyzed
sequence) and in the absence of a known position-
weight matrix (PWM).
Any periodicity of the sequence S with length N
can be characterized by either the frequency matrix
(E. V. Korotkov et al. 2003) or created on its base,
the PWM M (Shelenkov et al. 2006). Each row of
the matrix is associated to a nucleotide and the signs
of the columns are the positions of the period. The
element of this matrix m(i,j) indicates the weight
m(i,j) which has the nucleotide i in position j of the
period. The positions of the period vary from 1 to n.
The sequence S
1
of length N, which is an artificial
periodic sequence 1,2,...,n,1,2,...,n,... is introduced.
Here, the numbers are treated as symbols and
columns in the matrix M are consistent with them.
For period equal to n, the sequence S corresponds to
a certain frequency matrix and PWM M(4,n). The
problem is formulated as follows: This study has a
sequence S with length N. It is necessary to find such
optimal PWM M
0
, where the local alignment
(Durbin et al. 1998) of sequences S
1
and S have the
greatest statistical significance. Under the statistical
significance, the probability P is that F
r
>mF
max
,
where mF
max
is the maximum weight of a local
alignment of sequences S and S
1
, using the optimal
matrix M
0
. Here, F
r
represents the maximum weight
of a local alignment randomly mixed sequence S and
sequence S
1,
using the optimal matrix M
r
. The search
is for matrix M
0
, which has the lowest probability P.
It is always possible to set the threshold level of the
probability P
0
and if the probability P(F
r
>mF
max
)
will be less than P
0,
then the local alignment found
of sequences S and S
1,
using the optimum matrix M
0
can be considered as statistically significant. It is
possible to use a local alignment algorithm for
alignment of the nucleotide sequence S and an
artificial periodic sequence S
1
, relative to the known
PWM (Smith & Waterman 1981). It is necessary to
find the optimal PWM M
0
by any means. Therefore,
the aim of this study was to develop a mathematical
approach for finding the matrix M
0
, as well as a
method for assessing the probability P. To determine
the optimal PWM, an optimization procedure was
used, as well as a local alignment algorithm in order
to account for insertions or deletions. To estimate
the probability P, the Monte Carlo method was used.
Instead of P
0
we used F
0
for which P(F
r
>F
0
)P
0
.
A mathematical method was developed in this
study to find more than 4 tandem repeats in the
DNA sequence. The multiple alignment of periods
was calculated by direct optimization of the PWM
without using pairwise alignments or a search for
similarity between periods. This means that for each
n, a matrix M
0
was found, the probability P was
estimated and the alignment of the sequences S and
S
1
was built using the M
0
matrix. It is not the goal of
this study to analyze all the known DNA sequences,
since the developed method requires large computer
resources. The developed algorithm was applied to
search for periodicity with insertions and deletions
in the A.thaliana genome. This study showed the
presence of periodicity with insertions and deletions
in the A.thaliana genome regions for which the
presence of periodicity was not previously known.
2 METHODS AND ALGORITHMS
In this study, a window which equals 630 base pairs
was used to search for periods in the chromosomes
of A.thaliana genome. This window moved with
step equal to 10 base pairs from the beginning to the
end of each chromosome of A.thaliana. The DNA
sequences in the window were denoted as S. To
search for periodicity with insertions and deletions
in sequence S, the algorithm shown in Fig. 1 was
used. As seen from the algorithm, firstly, a set of
random matrices Q
n
(Fig. 1, step 2) of size 4xn was
generated, where n is the length of the period, and 4
is the alphabet size of the studied sequence. Then,
the matrices were optimized since the distribution of
the similarity function F
max
for each of the matrices
in the set of all random sequences (set Sr, paragraph
2.5) ought to be similar. Then, a local alignment of
the studied sequence S was built relative to each
optimized random matrix (Fig. 1, step 4). Local
alignment was used to determine the similarity
function F
max
for each optimized matrix. The
optimized matrix having the highest value of the
similarity function F
max
, with the studied sequence S,
was chosen. Thereafter, this matrix was optimized to
achieve the highest value of the similarity function
BIOINFORMATICS 2017 - 8th International Conference on Bioinformatics Models, Methods and Algorithms
126
F
max
(mF
max
) with the studied sequence S (Fig. 1,
step 5) and the optimized matrix was called M
0
.
Figure 1: The main stages of the algorithm used for
calculation mF
max
(n) for analyzed sequence S.
If mF
max
(n)
is more than the cutoff level F
0
then the
sequence S contains the region with periodicity
equal to n. In this study, periodicity in the interval
from 2 to 50 base pairs was evaluated. If several
periods have mF
max
(n)>F
0,
n which has the
maximum value of mF
max
(n) was selected (Fig. 1,
step 6). Selection of the level of F
0
is considered in
paragraph 2.6. Subsequently, the window was
moved for 10 base pairs along the A.thaliana
chromosome and the calculations were repeated
(Fig. 1, step 7). As a result of the algorithm, the
dependence of mF
max
on n was obtained for
sequence S with help of a local alignment. This
means that the boundaries of the regions with
mF
max
(n) may differ from the beginning and end of
the sequence S. It also means that the values of
mF
max
(n) for different n can be obtained for different
fragments of the studied sequence S. The boundaries
of the fragments, obtained for relevant values of
mF
max
(n) are shown. Subsequently, each step of the
algorithm shown in Fig. 1 was examined in more
detail.
2.1 Creation of a Set Q
n
of Random
Matrices with Length N
Random matrices Q
n
with dimension 4xn were used,
where n is the length of the period (Fig. 1, step 2).
Each matrix can be viewed as a point in space 4xn
and elements of a matrix are real random numbers.
A set of random matrices Q
n
was created when the
distance between them in the space 4xn was not less
than a certain value. To calculate the differences
between the two matrices m
1
(i,j)
and m
2
(i,j), the
information measure was used (Kullback 1997):
20
12 1 1
1
20
22
1
20
12 12
1
12 12
112 2
(, ) (,)ln((,))
(, )ln( (, ))
((,) (,))ln((,) (,))
(() ())ln(() ())
()ln( ()) ()ln( ())
j
i
i
i
IMM mij mij
mij mij
mijmij mijmij
sj sj sj s j
sj sj sj sj



(1)
where
20
1
() (,)
kk
i
s
jmij
. 2I
j
has an asymptotic chi-
square distribution with 3-th degrees of freedom
(Kullback 1997). Then we calculated:
12 12
1
(, ) (, )
n
j
j
IM M I M M
(2)
Hence,
12
2( , )
I
MM has an approximate
2
()df
,
and df equal to 3n since
11 2
(, )
I
MM ,
212
(, )
I
MM ,…,
11 2
(, )
n
I
MM
are independent and
12
(, )
n
I
MM is
completely determined by
11 2
(, )
I
MM ,
212
(, )
I
MM
,…,
11 2
(, )
n
I
MM
(Kullback 1997). Then the chi-
square distribution was approximated by means of
the normal distribution:
12 12
(, ) 4(, ) 2 1xM M IM M df

(3)
The value
12
(, )~(0,1)xM M N , где N(0,1) is the
standard normal distribution. N(0,1) is very useful as
a measure of the differences between matrices m
1
(i,j)
and m
2
(i,j). The probability p=P(x>x(M
1
,M
2
)) shows
that differences between the matrices m
1
(i,j) and
m
2
(i,j) are determined by random factors. If the
difference between the matrices m
1
(i,j)
and m
2
(i,j)
increases, then
12
(, )
MM becomes larger. The
difference between matrices L=
12
(, )
MM not less
than 1.0 was chosen.
Here, an algorithm was used to generate the
matrices. Each element of the matrix m(i,j), i=1,…,4,
j=1,…,n was randomly filled with equal probability
of either 0 or 1. The matrix was then compared with
all matrices that were already included in the set Q
n
.
If at least one matrix has a difference less than
L=1.0, than the generated matrix was not included in
the set Q
n
. If the difference was greater than L=1.0
for all matrices from the set Q
n
, then the matrix is
included in the set Q
n
. The 10
6
of such matrices were
created for each period length n.
Search of Periodicity Regions in the Genome A.thaliana - Periodicity Regions in the A.thaliana Genomes
127
2.2 Optimizing of Random Matrixes
For every matrix M from the set Q
n,
the values R and
K
d
were calculated as:
4
22
11
(, )
n
ij
Rmij


(4)
4
11
(, ) ()( )
n
d
ij
Kmijfitj


(5)
where f(i)=b(i)/N, b(i) is the number of nucleotides
of type i in the sequence S, t(j) is the probability
symbol "j" in the sequence S
1
. In this case, t(j)=1/n.
N is the total number of nucleotides in the sequence
S, N=630. To calculate the alignment, a optimized
matrix
'M
is needed. Calculations of
'M
was
described early in (Pugacheva, V., Korotkov, A. and
Korotkov 2016; Pugacheva V.M. et al. 2016).
2.3 Alignment of Nucleotide Sequence
with Optimized Random Matrices
A local alignment of sequences S
1
and S (Durbin et
al. 1998) was conducted using the PWM (Sinha
2006) and affine function penalty for insertions and
deletions to search for F
max
and the matrix M
0
(Durbin et al. 1998). To construct the alignment, the
matrices for similarity functions F, F
1
and F
2
were
filled for each matrix M from the set Q
n
. The matrix
M changed and turned into a optimized matrix M'.
The principles of this optimization are shown in
paragraph 2.2 and local alignment was described in
(Pugacheva, V., Korotkov, A. and Korotkov 2016;
Pugacheva V.M. et al. 2016).
2.4 Optimization of a Random Matrix
with the Largest Value of
Similarity Function
For all matrices from the set Q
n
, the modified matrix
max(m'), which had the highest value of the
similarity function F
max
was determined. Let call this
value as mF
max
. Thus, the alignment was calculated
and the coordinates of the alignment were
determined (Fig. 1, step 5). However, despite the use
of a very large number of matrices, the matrix
max(m') may have the value mF
max
, which is not the
largest for a sequence S and for length of period n.
This indicates that the largest value can be achieved
for matrix M
0
, which lies at some distance from the
matrix max(m'), that is less than the chosen threshold
L=1.0 (paragraph 2.1). Therefore, approximately 10
6
matrices were created, having distance L from the
matrix max(m') from 1.0-0.1*i to 0.9-0.1*i (for i=0).
These matrices were also used as indicated in
paragraph 2 and a new matrix max(m') was chosen
which had the highest value mF
max
. This procedure
was repeated for i from 1 to 9 and max(m') for i=9
was chosen as M
0
matrix.
2.5 Generation of Random Sequences
and Selection of F
0
A set Sr of random sequences was created by
random shuffling of the sequence S and the set Sr
containing 200 sequences. To generate one random
symbolic sequence, a random number sequence of
length N=630 was generated by the random number
generator. Then, a random number sequence was
arranged in ascending order, storing the generated
permutations. The produced permutations were used
to mix the sequence S, and as a result of this mixing,
the random symbolic sequence from the set Sr was
created.
Figure 2: Length distribution of the periods found in
genome A.thaliana. Np is a number of periods, n is a
period length
.
In this study, threshold F
0
was determined as
follows: Firstly, the sequences of A.thaliana
chromosomes were obtained and mixed randomly as
carried out during the creation of set Sr. Thereafter,
using the algorithm illustrated in Fig. 1, we
determined the number of sequences Hr(F), which
have mF
max
(n)>F
for every n in the range of 2 to 50
bases. F runs from 200.0 to 500.0. The length of the
window, as in the case of the analysis of A.thaliana
chromosomes, was equal to 630 nucleotides.
Simultaneously, the number of sequences H(F),
which have mF
max
(n)>F for sequences of the
A.thaliana chromosomes was determined. After that,
F
0,
which has the ratio Hr(F
0
)/H(F
0
)0.05, was
BIOINFORMATICS 2017 - 8th International Conference on Bioinformatics Models, Methods and Algorithms
128
chosen. This choice of F
0
gives the number of false
positives (errors of the first kind) less than 5%. In
this study, F
0
=390.0 and it provides
Hr(F
0
)/H(F
0
)0.05, for analysis of the A.thaliana
genome.
This study did not analyze the period which had
3 nucleotides. This means that each window was
checked for the presence of a period which equals 3
nucleotides. To do this, the mutual information
between the sequence S and artificial periodic
sequence S
2
={123}
200
was calculated. Thereafter, the
matrix of the triplet periodicity was calculated and
with the help of this matrix, the correlation between
S and S
2
sequences was determined as shown
previously (Frenkel & Korotkov 2008). For the
measurement of correlation, the argument of normal
distribution X was selected. The higher value of X
corresponds to higher correlation between sequences
S и S
2
. It was identified that if X<3.0, it indicates the
absence of a period equal to 3 bases in the sequence
S and the search for periods was carried out using
this study's algorithm (Fig. 1). However, X3.0
indicated that the sequence S was not analyzed and
the window was shifted by 10 nucleotides.
3 RESULTS AND DISCUSSION
In general, 5 chromosomes with a total length some
more 116 million bases were analyzed in this study.
Sequences were obtained from the website
ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genb
ank/A_thaliana/OLD/. The calculations were
performed at the supercomputer cluster of the
Russian Academy of Sciences (http://
www.jscc.ru/eng/index.shtml). In A.thaliana
genome, 13997 regions having a periodicity with
length of 2 to 50 bases were found. On the average,
a periodicity of ~9000 nucleotides was found to be
associated with each region. The sequences found
were collected in a data bank from the website:
http://victoria.biengi.ac.ru/cgi-in/indelper/index.cgi.
It is interesting to consider the distribution of the
lengths of periods found in A.thaliana. This
distribution is shown in Fig. 2. From this figure, it is
obvious that the distribution is very nonuniform and
a significant portion of the revealed regions have
lengths of periods equal to 2, 11, 30 and 31
nucleotides. The small peak represents a period
equal to 35 bases. Fig. 2 also shows the absence of a
significant number of regions with period equal to 3
bases. This is due to the fact that DNA with period
equal to 3 bases was not analyzed because it related
with coding regions. In this study, some number of
regions with triplet periodicity were determined in a
situation in which the original X was less than 3.0,
and the period equal or multiple to 3 bases arose
after the creation of alignment with insertions or
deletions.
Also, the repeatability of regions with periods in
A.thaliana genome was studied using the Blast
program. To do this, there was a search for similarity
in the regions found with the A.thaliana genome
sequences having e-value equal to 10
-6
. It was found
that the 5287 regions represent a single copy, 2957
regions had a copy number which ranged from 2 to
5, and 8244 regions had more than 5 copies. We
observed maximum number of copies equal to 1585.
This shows that a significant part of the detected
sequences belongs to the dispersed repeats(Mehrotra
& Goyal 2014).
Figure 3: mF
max
(n) spectrum for fragment of the sequence
NC_003074.1 from chromosome 3 of the A.thaliana
genome. The coordinates of fragment are: 13905712-
13906329.
In this study, one region with period were
considered as examples. The region has a period
length equal to 4 nucleotides, and this period can be
detected only in the presence of deletions or
insertions. The spectrum of mF
max
(n) is shown in
Fig. 3. This region was found in the third
chromosome of the A.thaliana genome, in sequence
NC_003074.1. mF
max
(4)=660.52. This period was
not detected by TRF (Benson 1999), T-REKs (Jorda
& Kajava 2009) programs. These programs revealed
an insignificant periodicity equal to 13, 30 and 40
bases. TRF found 2.9 periods while T-REKs found 3
periods equal to 30 nucleotides. Mreps (Kolpakov et
al. 2003) found three periods equal to 5 bases In this
sequence, the program ATR hunter (Wexler et al.
2005) found 3 periods with length of 30 bases and 2
periods with length of 26 bases and completely did
not see a period equal to 4 bases. Program TRStalker
(Pellegrini et al. 2010) found 3 repeats with length
Search of Periodicity Regions in the Genome A.thaliana - Periodicity Regions in the A.thaliana Genomes
129
of 13 bases and 2.5 repeats with length 60 bases but
did not find 4 base repeats. The program Repfind
(Betley et al. 2002) found 10 dispersed perfect
repeats TCGG, 9 GATC and 11 GGAT. But these
repeats had a lower level of statistical significance.
The BWT program (Pokrzywa & Polanski 2010)
found no repeats in the sequence. According to this
study's estimates, mF
max
(4)=660.3, it corresponds to
P(mF
max
>660.3)<10
-30
, because the average value of
mF
max
for random sequences Sr is about 136.8 and σ
~ 54.2. The resulting alignment and the resulting
matrix M
0
can be received from
http://victoria.biengi.ac.ru/cgi-in/indelper/index.cgi.
A consensus period with length equal to 4
nucleotides is (T/C)CGA. This period was repeated
more than 140 times in the region found and the
period equal to 4 bases had the highest statistical
significance.
Figure 4: Influence of base changes on mF
max
(20) for
sequences 400 and 600 base pairs. X is the number of base
changes per 1 nucleotide. The period length equals to 10
b.p.
In this study, the influence of random base
substitutions on the mF
max
level was evaluated. To
do this, sequences with lengths 600 and 400
nucleotides long and period equal to 20 nucleotides
were used. Random positions were selected in these
sequences and random replacements of the
nucleotides were made on any of a, t, c, and g with
equal probability. Thereafter, mF
max
(20) was
calculated. The resulting function is shown in Fig. 4.
It can be seen that F
0
=390 is equal to approximately
1.6 and 1.0 random substitutions per nucleotide, for
sequences with lengths equal to 600 and 400
nucleotides, respectively. This result shows the
upper boundary of the accumulation of random
substitutions in the discovered regions and this
bound is 1.6 substitutions per nucleotide.
The results of this study were compared with
that of the T-REKs program. To this end, intervals
were introduced: 500-600, 900-1000, 1400-1500,
1900-2000, 2400-2500, 2900-3000. For these
intervals, all the sequences with periods found in this
study were chosen. For each sequence, the period
length n was found. Thereafter, the periods in these
sequences were searched by the program T-REKs.
T-REKs is one of the best tools for finding tandem
repeats in DNA sequences. It is believed that the T-
REKs program reveals the same period, if it detects
a period length which has a difference of no more
than ±1 base from our period. This interval was
chosen, due to the fact that we have developed a
method which may make insertions, deletions and
closed periods to have statistically important mF
max
.
It was also felt that the program T-REKs, finds the
same period, if the number of detectable periods is
not less than L/2n, where L is the length of the
sequence with period equal to n. As a result, the
proportion of regions detected by the program T-
REKs for different intervals was calculated. This
function is shown in Fig. 5. From this graph, it is
clear that before mF
max
=1500, the program T-REKs
can find less than 30% regions and only for
mF
max
>2200 did the program reveal more than 50%
of the regions.
There is a natural question about the biological
significance of the periods found. It applies
primarily to periods of 10 and 11 nucleotides long,
as well as to the nucleotides of multiple periods.
There are earlier suggestions that the periodicity
length of 10 and 11 nucleotides has a relationship
with the α-helices in proteins, as well as with the
processes of DNA compaction (Herzel et al. 1999;
Larsabal & Danchin 2005). In this study, sequences
without period equal to 3 bases were analyzed which
is specific for the protein-coding regions. This
means that most parts of the detected regions could
be linked with DNA compaction (Schieg & Herzel
2004; Kumar et al. 2006). Also, this study identified
regions with periods (with insertions and deletions)
which are impossible to detect by the methods of
searching for correlations in DNA (Herzel et al.
1999; Larsabal & Danchin 2005). It is very likely
that work regions with periods ranging from 9 to 11
bases and associated with the formation of
chromatin loops, are found in this study. If we take
into account that the number of these regions is
about 1,4x10
3
(Fig. 2) and we have analyzed about
1,16x10
8
bases, the average distance between these
regions (having periods in interval from 9 to11
nucleotides) is about 9x10
4
. This is consistent with
the size of 30 nm chromatin loops (Kadauke &
Blobel 2009). These regions could be "hot spots" for
chromosomal rearrangements also (Kantidze &
BIOINFORMATICS 2017 - 8th International Conference on Bioinformatics Models, Methods and Algorithms
130
Razin 2009). At the same time, regions were found
with periods which could be micro- and minisatellite
sequences (Richard et al. 2008). In this case, classic
micro and mini minisatellites were identified with
insertions and deletions of nucleotides which have
mF
max
>2000. According to Fig. 4, in this case the
number of substitutions is not more than 50% per
nucleotide. When mF
max
<2000, ancient copies of
micro- and minisatellite sequences were discovered
that have accumulated a considerable number of
nucleotide substitutions, insertions and deletions of
nucleotides.
Figure 5: Comparison of developed algorithm with the
program T-REKs (Jorda & Kajava 2009). ID shows the
part of periodicities regions which can find the T-REKs.
We can assume that the results are the same if the T-REKs
detects at least 50% of the number of periods and the
period length differs not by more than one base.
It is also interesting to estimate the part of the
A.thaliana genome which has period regions. The
average length of the region which was found with
the periods is 400 bases and the number of regions
found is 13997. This corresponds to a total length
equal to about 6,6x10
6
nucleotides, which is ~5% of
the total length of the A.thaliana genome.
There are the limits of applicability of the
method developed in this study. As was noted earlier
(paragraph 2.2.1), an average value,
l
=150, was
chosen using the random sequences. This means that
micro and mini satellite sequences less than this
length are detected as not very good by this method.
The fact is that these lengths can not overcome the
threshold F
0
= 390.0;thus, these sequences can be
missed by this study's method. This means that even
perfect micro- and minisatellites may be skipped, if
they have a length equal to or less than 150
nucleotides. On the basis of this limitation, a
comparison can be made between the earlier work
on the search for micro and minisatellite and the
results of this study. Previously, micro- and
minisatellite sequences from A.thaliana genome
were investigated (Richard et al., 2008; Tóth et al.,
2000) and mathematical methods for finding the
micro and mini satellites sequences shown in
Moniruzzaman et al. (2016.).
Above, the approach of this study was compared
with the main methods used, when searching for
micro and minisatellite sequences (Moniruzzaman et
al. 2016). The programs used included TRF (Benson
1999), T-REKs (Jorda & Kajava 2009), Mreps
(Kolpakov et al. 2003), BWTRs (Pokrzywa and
Polanski, 2010), ATR hunter (Wexler et al. 2005),
Repfind (Betley et al. 2002). Therefore, it can be
assumed that the developed approach misses perfect
micro and minisatellite sequences which have a
length of less than 100 bases. However, the method
used in this study was able to find a highly diverged
periodic region which have a considerable length
(200 or more bases) and which passed by previously
developed approaches. This study's method is
suitable when it comes to searching for highly
divergent tandem repeats, having a total length of
more than 200 nucleotides.
This work was supported by the Russian Science
Foundation.
REFERENCES
Afreixo, V., Ferreira, P.J.S.G. & Santos, D., 2004. Fourier
analysis of symbolic data: A brief review. Digital
Signal Processing, 14(6), pp.523–530.
Benson, G., 1999. Tandem repeats finder: a program to
analyze DNA sequences. Nucleic acids research,
27(2), pp.573–580.
Betley, J.N. et al., 2002. A ubiquitous and conserved
signal for RNA localization in chordates. Current
biology: CB, 12(20), pp.1756–61.
Boeva, V. et al., 2006. Short fuzzy tandem repeats in
genomic sequences, identification, and possible role in
regulation of gene expression. Bioinformatics (Oxford,
England), 22(6), pp.676–84.
Domaniç, N.O. & Preparata, F.P., 2007. A novel approach
to the detection of genomic approximate tandem
repeats in the Levenshtein metric. Journal of
computational biology a journal of computational
molecular cell biology, 14(7), pp.873–891.
Durbin, R. et al., 1998. Biological Sequence Analysis:
Probabilistic Models of Proteins and Nucleic Acids,
Cambridge University Press.
Frenkel, F.E. & Korotkov, E. V, 2008. Classification
analysis of triplet periodicity in protein-coding regions
of genes. Gene, 421(1-2), pp.52–60.
Grissa, I., Vergnaud, G. & Pourcel, C., 2007.
CRISPRFinder: a web tool to identify clustered
regularly interspaced short palindromic repeats.
Search of Periodicity Regions in the Genome A.thaliana - Periodicity Regions in the A.thaliana Genomes
131
Nucleic acids research, 35(Web Server issue),
pp.W52–7.
Herzel, H., Weiss, O. & Trifonov, E.N., 1999. 10-11 bp
periodicities in complete genomes reflect protein
structure and DNA folding. Bioinformatics, 15(3),
pp.187–193.
Jorda, J. & Kajava, A. V, 2009. T-REKS: identification of
Tandem REpeats in sequences with a K-meanS based
algorithm. Bioinformatics (Oxford, England), 25(20),
pp.2632–8.
Kadauke, S. & Blobel, G.A., 2009. Chromatin loops in
gene regulation. Biochimica et biophysica acta,
1789(1), pp.17–25.
Kantidze, O.L. & Razin, S. V, 2009. Chromatin loops,
illegitimate recombination, and genome evolution.
BioEssays: news and reviews in molecular, cellular
and developmental biology, 31(3), pp.278–86.
Kolpakov, R., Bana, G. & Kucherov, G., 2003. mreps:
Efficient and flexible detection of tandem repeats in
DNA. Nucleic acids research, 31(13), pp.3672–8.
Korotkov, E.V., Korotkova, M.A. & Kudryashov, N.A.,
2003. The informational concept of searching for
periodicity in symbol sequences. Molekuliarnaia
Biologiia, 37(3), pp.436–451.
Korotkov, Korotkova & Kudryashov, 2003. Information
decomposition method to analyze symbolical
sequences. Physics Letters, Section A: General,
Atomic and Solid State Physics, 312(3-4), pp.198–210.
Kravatskaya, G.I. et al., 2011. Coexistence of different
base periodicities in prokaryotic genomes as related to
DNA curvature, supercoiling, and transcription.
Genomics, 98(3), pp.223–231.
Kullback, S., 1997. Information Theory and Statistics S.
Kullback, ed., New York: Dover publications.
Kumar, L., Futschik, M. & Herzel, H., 2006. DNA motifs
and sequence periodicities. In silico biology, 6(1-2),
pp.71–8.
Larsabal, E. & Danchin, A., 2005. Genomes are covered
with ubiquitous 11 bp periodic patterns, the “class A
flexible patterns”. BMC bioinformatics, 6, p.206.
Lim, K.G. et al., 2013. Review of tandem repeat search
tools: a systematic approach to evaluating algorithmic
performance. Briefings in bioinformatics, 14(1),
pp.67–81.
Lobzin, V. V. & Chechetkin, V.R., 2000. Order and
correlations in genomic DNA sequences. The spectral
approach. Uspekhi Fizicheskih Nauk, 170(1), p.57.
Mehrotra, S. & Goyal, V., 2014. Repetitive sequences in
plant nuclear DNA: types, distribution, evolution and
function. Genomics, proteomics & bioinformatics,
12(4), pp.164–71.
Meng, T. et al., 2013. Wavelet analysis in current cancer
genome research: a survey. IEEE/ACM transactions
on computational biology and bioinformatics / IEEE,
ACM, 10(6), pp.1442–59.
Moniruzzaman, M. et al., 2016. Development of
Microsatellites: A Powerful Genetic Marker. The
Agriculturists, 13(1), p.152.
Mudunuri, S.B. et al., 2010. G-IMEx: A comprehensive
software tool for detection of microsatellites from
genome sequences. Bioinformation, 5(5), pp.221–3.
Mudunuri, S.B. & Nagarajaram, H.A., 2007. IMEx:
Imperfect Microsatellite Extractor. Bioinformatics
(Oxford, England), 23(10), pp.1181–7.
Parisi, V., De Fonzo, V. & Aluffi-Pentini, F., 2003.
STRING: Finding tandem repeats in DNA sequences.
Bioinformatics, 19(14), pp.1733–1738.
Pellegrini, M., Renda, M.E. & Vecchio, A., 2010.
TRStalker: an efficient heuristic for finding fuzzy
tandem repeats. Bioinformatics (Oxford, England),
26(12), pp.i358–66.
Pokrzywa, R. & Polanski, A., 2010. BWtrs: A tool for
searching for tandem repeats in DNA sequences based
on the Burrows-Wheeler transform. Genomics, 96(5),
pp.316–21.
Pugacheva V.M., Korotkov A.E & Korotkov E.V., 2016.
Search of latent periodicity in amino acid sequences
by means of genetic algorithm and dynamic
programming. Statistical application in genetics and
molecular biology, 15(4).
Pugacheva, V., Korotkov, A. and Korotkov, E., 2016.
Search for Latent Periodicity in Amino Acid
Sequences with Insertions and Deletions. In
Proceedings of the 9th International Joint Conference
on Biomedical Engineering Systems and Technologies
(BIOSTEC 2016). SCITEPRESS – Science and
Technology Publications, Lda., pp. 117–127.
Richard, G.-F., Kerrest, A. & Dujon, B., 2008.
Comparative genomics and molecular dynamics of
DNA repeats in eukaryotes. Microbiology and
molecular biology reviews: MMBR, 72(4), pp.686–
727.
Schieg, P. & Herzel, H., 2004. Periodicities of 10-11bp as
indicators of the supercoiled state of genomic DNA.
Journal of molecular biology, 343(4), pp.891–901.
Shelenkov, A., Skryabin, K. & Korotkov, E., 2006. Search
and classification of potential minisatellite sequences
from bacterial genomes. DNA research: an
international journal for rapid publication of reports
on genes and genomes, 13(3), pp.89–102.
Sinha, S., 2006. On counting position weight matrix
matches in a sequence, with application to
discriminative motif finding. In Bioinformatics.
Smith, T.F. & Waterman, M.S., 1981. Identification of
common molecular subsequences. Journal of
Molecular Biology, 147, pp.195–197.
Sokol, D. & Tojeira, J., 2014. Speeding up the detection of
tandem repeats over the edit distance. Theoretical
Computer Science, 525, pp.103–110.
Suvorova, Y.M., Korotkova, M.A. & Korotkov, E. V,
2014. Comparative analysis of periodicity search
methods in DNA sequences. Computational biology
and chemistry, 53 Pt A, pp.43–48.
Turutina, V.P. et al., 2006. Identification of Amino Acid
Latent Periodicity within 94 Protein Families. Journal
of Computational Biology, 13(4), pp.946–964.
Wexler, Y. et al., 2005. Finding approximate tandem
repeats in genomic sequences. Journal of
computational biology: a journal of computational
molecular cell biology, 12(7), pp.928–42.
BIOINFORMATICS 2017 - 8th International Conference on Bioinformatics Models, Methods and Algorithms
132