A TASTE OF YEAST MOBILOMICS
Giulia Menconi
1
, Giovanni Battaglia
2
, Roberto Grossi
2
, Nadia Pisanti
2
and Roberto Marangoni
2,3
1
Istituto Nazionale di Alta Matematica, Roma, Italia
2
Dipartimento di Informatica, Universit´a di Pisa, Pisa, Italia
3
Istituto di Biofisica, CNR, Pisa, Italia
Keywords:
Mobile elements, Multiple genome comparison.
Abstract:
Mobilomics calls for detecting all the mobile elements in a genome so as to understand their dynamic behavior.
We devise and apply a method that extends a pairwise strain comparison tool for mobile genetic elements
(MGE) inference, and perform experiments on a whole dataset of 39 complete genomes of as many yeast
(S.cerevisiae) strains. We locate a priori all the MGEs regions that are annotated in the reference sequence
at hand, and map all the putative MGEs in all the other (non-annotated) strains. Interestingly, evolutionary
relation among the strains based on the presence/absence of candidate MGEs, turns out to be quite close to that
inferred by classic phylogenetic methods based on SNPs analysis.
1 INTRODUCTION
The mobilome (Siefert, 2009) is the whole collection
of the mobile genetic elements (MGEs) hosted in a
genome. MGEs can vary in length, sequence content,
and copy number. They behave as parasites as they
replicate by exploiting the resources of the host (Kid-
well and Lisch, 2001). They can even express their
own genes, and by doing this they can destabilize
the host organism, as the mutations induced by their
jumps or replications can result in gene inactivation or
modification. The relation between MGEs and the host
genome is complex and still debated: there are also
other (than the usual parasitic relation) kinds of inter-
actions, such as direct competition or, at the opposite,
cooperation towards synergizing MGEs and their host
(see (Leonardo and Nuzhdin, 2002) for a detailed dis-
cussion on this subject).
Mobilomics, that is the task of investigating the
mobilome, is a comprehensiveapproach for providing
rapid and exhaustive methods and tools for the identi-
fication of all the mobilome elements in an organism,
for tracking their movements (including replications
and deletions) during evolution, and - as a long term
goal - for developing dynamic models able to forecast
the fate of the relations between the mobilome and the
host genome.
In the literature, population genomists study the
mobilome paying particular attention to the dynamics
of the MGEs, while evolutionary biologists attempt to
define the contribution of the mobilome in the evolu-
tion of the host organisms. Some evidence supporting
the conjecture that the mobilome has a great impact
on the fate of the host, has been found in all the liv-
ing kingdom, from prokaryotes (Rankin et al., 2010)
to higher eukaryotes (Koszul et al., 2004; Bennet-
zen, 2000; Bourque, 2009) including human (Brittten,
2010). The authors of (Menconi et al., 2011) designed
a tool for finding the mobilome of two genomes by
performing a suitable pairwise comparison and ex-
tracting and mapping the complement of the shared
(thus assumed immotile) DNA. In this paper we de-
vise a pipeline (Section 2.1) that actually extends the
mobilome inference method for two-genomes to the
case of a collection of genomes.
The method aligns homologouschromosomes and
marks the non-homologous “island” surrounded by
homologous sequences as putative mobile elements
(PMEs) to indicate the possibility of an occurrence
of an MGE m. This approach is prone to errors: on
one hand, an MGE m that did not move within a small
set of organisms, would not be marked this way; on
the other hand, a chromosomal mutation uncorrelated
with the mobilome could be marked as PME. By
progressively extending the above alignment to other
strains (or even organisms), as we do in this paper, the
set of PMEs becomes more and more populated by all
the MGEs that actually moved or replicated.
Furthermore, in this paper we show (Section 2.2)
the results of the application of our approach to the
271
Menconi G., Battaglia G., Grossi R., Pisanti N. and Marangoni R..
A TASTE OF YEAST MOBILOMICS.
DOI: 10.5220/0003702302710274
In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms (BIOINFORMATICS-2012), pages 271-274
ISBN: 978-989-8425-90-4
Copyright
c
2012 SCITEPRESS (Science and Technology Publications, Lda.)
whole data set of 39 strains of the yeast Saccha-
romyces cerevisiae, the genome sequences recently
released by (Liti et al., 2009). They have a low cov-
erage (one-to-fourfold), and so they are unannotated
and rich of unresolved regions (i.e. sequences of un-
specified bases). Our choice of studying the yeast is
motivated by the large availability of its sequenced
strains and by the observation that it is probably the
most known organism from a molecular point of view.
To have a referral point, we adopt the
S288C
strain,
called
RefSeq
hereafter, as it is fully sequenced and
annotated in the SGD database (SGD, 2010), along
with its MGEs. We obtain a mapping of all the PMEs in
all the strains (Section 2) that turns out to include all
the annotated MGEs. We remark that this method to
detect the mobilome makes use of sequence informa-
tion only, can deal with whole chromosome at a time,
and does not require any preprocessing of the data
(like for example a partition of coding and non cod-
ing DNA fragments) and does not use any database
information, and therefore it is independent from the
type of organism it deals with. In particular, for exam-
ple, the applicability of this tool are more general than
those of mGenomeSubtractor (Shao et al., 2010) that
is specifically for bacteria and requires a preprocess-
ing of the data for the different purpose of detecting
genetic variants.
Finally, we perform some mobilomics experi-
ments by doing comparative analysis of the strains
based on their PMEs. Interestingly, clustering
the binary vectors obtained by marking the pres-
ence/absence of candidate MGEs in each of the strains
provides an evolutionary relation among the strains
that is quite close to that inferred by classic phyloge-
netic methods based on SNPs analysis (Section 3).
2 MOBILOME INFERENCE ON
39 STRAINS
In this section we explain (Subsection 2.1) how the
method of REGENDER (Menconi et al., 2011) can be
extended to be applied to data sets larger than two
genomes, and we describe (Subsection 2.2) an appli-
cation of the new method to the whole available data
set of 39 yeast strains.
2.1 Finding the Mobilome in more than
Two Genomes
The pipeline we devise in this paper is aimed at ex-
tracting the putative mobilome from a vast collection
of genomes, and mapping the PMEs on each strain.
This method does not require any template sequence
for the sought MGEs and it can be applied to infer
MGEs also for low coverage genomes with unspec-
ified bases, where traditional approaches are largely
ineffective.
REGENDER (Menconi et al., 2011) was designed
for detecting mobile elements that could be inferred
from the comparison of two genomes. It performs
a two-phase processing of all the possible chromo-
somes’ pairs: first, it finds the common n-grams and,
second, it aggregates consecutive n-grams in a greedy
fashion using some user-defined parameters that con-
trol when the next conserved region begins.
The extension to more than two strains is criti-
cal due to the high computational cost of the multi-
ple alignment that should actually be performed; fur-
thermore, the presence in all the input sequences (ex-
cept
RefSeq
) of unresolved bases makes the whole
task even more arduous. Taking advantage of the effi-
ciency of REGENDER, we actually perform a progres-
sive star-like multiple alignment centered at
RefSeq
.
The main steps are the following, that are performed
separately for each one of the 16 chromosomes:
1. REGENDER is applied to
RefSeq
against each one
of the other 38 strains, in a progressive way.
2. Complement the output of Step 1 and find the ge-
nomic coordinates of non-conserved segments in
all the strains of the collection.
3. Remove from the list the non-conserved segments
(in any strains) which in
RefSeq
are in telomeric
regions.
4. Remove very short indels/mutations: non-
conserved segments shorter than 200 nucleotides
in all strains.
Step 1 performs the simultaneous extraction of
conserved regions from the whole collection, using
the
RefSeq
chromosome as an outgroup to align all
the others. Once the segment-based pairwise align-
ments between
RefSeq
and each other input chromo-
some have been computed by REGENDER, we only
report the segments that are conserved in all the in-
put chromosomes, by intersecting the conserved seg-
ments. This choice, in particular, implies that the re-
sult is independent from the order in which the strains
are taken into account.
Step 2 filters out everything that is not conserved
as this is presumed to be resident genome (i.e. not
mobilome).
Step 3 was motivated by the fact that telomeres of
any chromosome of any strain different from
RefSeq
mainly contain unresolved bases, because the pres-
ence of long repeats is a source of noise for the as-
sembly phase.
BIOINFORMATICS 2012 - International Conference on Bioinformatics Models, Methods and Algorithms
272
Finally, Step 4 is due to the fact that very short in-
dels or mutations are known not to be related to MGEs
nor to chromosomal rearrangements.
2.2 Application to the Yeast Dataset
We applied the pipeline described in Section 2.1 to
the whole dataset of 39 S.cerevisiae strains (Liti et al.,
2009). The richness of the data sets gives a new -
somehow more realistic - insight on the characteriza-
tion of PMEs as actual MGEs, with respect to the case
of complementing only the fragments of genomes that
result immotile after a single pairwise comparison.
After numbering the 16 chromosomes by N =
1, . . . , 16, we take into account the 39 homologous
chromosomes N, denoted ChrN
1
, . .., ChrN
k
, ...,
ChrN
39
for as many as strains k = 1, 2, . . . , 39 (where
strain 1 is
RefSeq
). We mark a large set of PMEs,
which vary in their length. To collect information
about their real linkage with the MGEs, and also to
deal with unspecified bases, we again refer to the ac-
curate annotation available for
RefSeq
. We therefore
map on
RefSeq
all the sequences that are detected as
PMEs, and examine their possible annotations.
The MGEs in
RefSeq
are almost all LTR-
retrotransposons, that we denote with Ty. Instead,
we simply denote with LTR (Long Terminal Repeats)
what is often called solo-LTR: the sequences of about
300b delimiting both ends of a LTR-retrotransposon.
We distinguish PMEs basing on their length: PME-LTR
candidates, having length 300b (compatible with LTR
elements), and PME-Ty candidates, longer than 4000b
(compatible with a complete Ty element).
Mapping the PME-LTR candidates to the anno-
tated LTRs in
RefSeq
leads to an uncertain situation,
since only about 44% of the known LTRs are actually
marked as putative LTRs. This might have two mo-
tivations. First, the large amount of undetected LTR
elements derives from the low probability that a LTR
moves. Second, it is not rare to have a chromosomal
mutation that spans from 300b to 4000b in a dataset
of 39 strains, and this populates the class of puta-
tive LTRs that do not match LTR annotations. There-
fore, the comparative genomics approach is ineffec-
tive for discovering LTRs, while motif-search based
approaches might perform better.
The scenario for PME-Ty candidates is much dif-
ferent instead: we are able to detect 77 non-conserved
regions longer than 4000b that are also in
RefSeq
.
Our careful inspection pays a particular attention to
the annotations involved in genomic mutations or re-
arrangements, apart from the MGE annotations al-
ready taken into account. In particular, we have
considered: meiotic recombination hotspots (Ger-
ton et al., 2000), evolutive and experimental break-
points (Di Rienzi et al., 2010), autonomously repli-
cating sequences (Di Rienzi et al., 2010), tRNA
genes (SGD, 2010), γ-H2A rich loci (Szilard et al.,
2010), and replication termination loci (Fachinetti
et al., 2010). Only 2 regions do not host any fea-
ture. Out of the remaining 75 regions, 44 of them
host at least one full-Ty annotation, 12 of them at least
one LTR annotation, and 19 of them host some of the
above markers of genome rearrangements, different
from Ty and LTR. Many regions (31) do not involve
active MGEs but correspond to loci prone to chromo-
somic recombination, rearrangement or fragility. We
remark that all the known Tys are correctly marked as
PMEs: the only Ty not recognized is the unique copy
of Ty5 that appears in the telomere of Chromosome
III and, because of its localization, it is ruled out from
this investigation.
We then inspected the frequency of movements,
by building a binary matrix B as follows: for each
PME-Ty candidate i = 1, 2, . . . , 77 and for each yeast
strain k = 1, 2, . . . , 39, we report ’1’ in B[i, k] if this
candidate occurs in that strain, and ’0’ if it does not.
Note that we report a ’1’ whenever we find a unde-
fined (or highly mutated) sequence of length compat-
ible with a Ty. If we sum up the ’1’ values in B by
candidates (rows), and sort the candidates according
to their sums, we observe that 33 out of 44 candidates
correspond to annotated Tys with score strictly less
than 39 (i.e. there is at least one strain where the can-
didate is missing), whereas 31 out of 33 of the non-
Ty annotated PMEs have a full score of 39. In other
words, the non-Ty annotated PMEs do not moveacross
all the examined strains, even though their sequence
is not conserved in the genomes. Among the anno-
tated Tys, there are some of them that appear to main-
tain their position across the genomes, possibly with
a change in their sequence, but the large gap between
the frequencies of jumps allows us to conclude that
false positive candidates tend to be resident.
3 YEAST MOBILOMICS
Our last and most interesting goal is to employ the
data gathered in Section 2 to compare and possibly
cluster the yeast strains according to the topology of
their MGEs. The fact that different sequences marked
as PME-Ty candidates have a different degree of pres-
ence in the different strains, suggests us to try to un-
derstand the dynamics of these movements. We rep-
resent the topology of the MGEs of the 39 strains by
creating as many binary vectors as follows.
Let p
N
denote the number of non-conserved
A TASTE OF YEAST MOBILOMICS
273
regions within chromosome ChrN. Let S
k
(N) =
(ChrN
k
[i
1
, j
1
], . . . , ChrN
k
[i
p
, j
p
N
]) denote the se-
quence of non-conserved regions in left-to-right
order in chromosome ChrN
k
of the kth strain. For
each k = 1, . . . , 39, and for each N = 1, . . . , 16, we
construct a binary vector
ˆ
S
k
(N) of the same size as
S
k
(N), where the nth component is
0
if the segment
ChrN
k
[i
n
, j
n
] is smaller than the user-supplied size
threshold d, and
1
otherwise. We use default thresh-
olds d = 4000 (called Ty-c threshold) and d = 300
(called LTR-c threshold) according to whether we
want to detect just Tys or also LTRs, respectively.
Finally, let
ˆ
S
k
=
ˆ
S
k
(1)
ˆ
S
k
(2)· · ·
ˆ
S
k
(16) be the bi-
nary sequence corresponding to the concatenation of
the 16 chromosomes of the kth strain. Note that,
since the number of conserved regions is the same
in each input chromosome, then also the number
of non-conserved regions is the same over all the
chromosomes. It follows that the 39 binary vectors
ˆ
S
1
,
ˆ
S
2
, . . . ,
ˆ
S
39
have the same size. Using them as rep-
resentatives of the mobilome topology of the corre-
sponding 39 yeast strains, we apply the clustering
package of the
scipy
scientific library (Jones et al.,
2001) to perform a hierarchical clustering. The cho-
sen metric is the Hamming distance, while the se-
lected linkage method is
UPGMA
. In this way, we gen-
erate a tree which we call the mobilome tree.
The resulting mobilome tree reveals the clusters
among strains obtained by minimizing the movements
of PMEs. It is really interesting to compare the mo-
bilome tree with the tree obtained by standard phy-
logenetic approaches based on SNPs comparison on
a set of suitably identified genes (Liti et al., 2009).
Almost all of the clades determined by the two trees
coincide: this would support the recently established
paradigm that Tys are able to drive the evolution of
organisms, as reported in (Kazian, 2004). It is re-
markable that the amount of information needed for
our approach is really minimal, and can be obtained
a priori. An interesting side observation is that the
mobilome tree does not change when employing the
binary vectors
ˆ
S
1
,
ˆ
S
2
, . . . ,
ˆ
S
39
using the LTR-c thresh-
old d = 300 (rather than the Ty-c threshold): also in
this case, the large majority of clades are identical to
those of the classical phylogenetic tree.
REFERENCES
Bennetzen, J. (2000). Transposable elements contribution
to plant gene and genome evolution. Plant Mol Biol,
42:251–269.
Bourque, G. (2009). Transposable elements in gene regula-
tion and in the evolution of vertebrate genomes. Cur-
rent Opinion in Genetics and Development, 19:607–
612.
Brittten, R. (2010). Transposable element insertions
have strongly affected human evolution. PNAS,
107:19945–19948.
Di Rienzi, S., Collingwood, D., Raghuraman, M., and
Brewer, B. (2010). Fragile genomic sites are associ-
ated with origins of replication. Genome Biology and
Evolution, 1(0):350.
Fachinetti, D., Bermejo, R., Cocito, A., Minardi, S., Ka-
tou, Y., Kanoh, Y., Shirahige, K., Azvolinsky, A., Za-
kian, V., and Foiani, M. (2010). Replication Termina-
tion at Eukaryotic Chromosomes Is Mediated by Top2
and Occurs at Genomic Loci Containing Pausing Ele-
ments. Molecular Cell, 39(4):595–605.
Gerton, J., DeRisi, J., Shroff, R., Lichten, M., Brown, P.,
and Petes, T. (2000). Global mapping of meiotic re-
combination hotspots and coldspots in the yeast Sac-
charomyces cerevisiae. Proceedings of the National
Academy of Sciences of the United States of America,
97(21):11383.
Jones, E., Oliphant, T., Peterson, P., et al. (2001). SciPy:
Open source scientific tools for Python.
Kazian, H. H. (2004). Mobile elements: Drivers of genome
evolution. Science, 303:1626–1632.
Kidwell, M. G. and Lisch, D. R. (2001). Perspective: trans-
posable elements, parasitic dna, and genome evolu-
tion. Evolution, 55:1–24.
Koszul, R., Caburet, S., Dujon, B., and Fischer, G. (2004).
Eucaryotic genome evolution through the spontaneous
duplication of large chromosomal segments. EMBO
J., 23:234–243.
Leonardo, T. and Nuzhdin, S. (2002). Mobile elements and
disease. Genet Res, 80:155–161.
Liti, G., Carter, D. M., Moses, A. M., and et al. (2009). Pop-
ulation genomics of domestic and wild yeast. Nature,
458:337–341.
Menconi, G., Battaglia, G., Grossi, R., Pisanti, N., and
Marangoni, R. (2011). Inferring mobile elements
in S.cerevisiae strains. In International Conference
on Bioinformatics Models, Methods and Algorithms.
SciTePress. ISBN 978-989-8425-36-2.
Rankin, D., Bichsel, M., and Wagner, A. (2010). Mobile
dna can drive lineage extinction in prokaryotic pop-
ulations. Journal of Evolutionary Biology, 23:2422
2431.
SGD (2010). Saccharomyces Genome Database. http://
www.yeastgenome.org/.
Shao, Y., He, X., Harrison, E. M., Tai, C., Ou, H.-Y., Ra-
jakumar, K., and Den, Z. (2010). mgenomesubtrac-
tor: a web-based tool for parallel in silico subtractive
hybridization analysis of multiple bacterial genomes.
Nucleic Acids Research, 38:W194–W200.
Siefert, J. L. (2009). Defining the mobilome. Methods in
Molecular Biology, 532:13–27.
Szilard, R., Jacques, P., Laram´ee, L., Cheng, B., Galicia,
S., Bataille, A., Yeung, M., Mendez, M., Bergeron,
M., Robert, F., et al. (2010). Systematic identification
of fragile sites via genome-wide location analysis of
γ-H2AX. Nature Structural & Molecular Biology.
BIOINFORMATICS 2012 - International Conference on Bioinformatics Models, Methods and Algorithms
274