Supervised Classiﬁcation of Metatranscriptomic Reads Reveals the

Existence of Light-dark Oscillations During Infection of Phytoplankton

by Viruses

Enzo Acerbi

1,∗

, Caroline Chenard

2,∗

, Stephan C. Schuster

and Federico M. Lauro

1,2

Singapore Centre on Environmental Life Sciences Engineering (SCELSE), Nanyang Technological University,

60 Nanyang Dr, 637551, Singapore

Asian School of the Environment, Nanyang Technological University, 50 Nanyang Avenue, 639798, Singapore

∗

These authors contributed equally.

Keywords:

Support Vector Machines, Empirical Mode Decomposition, Marine Microbial Ecology.

Abstract:

In the era of next generation sequencing technologies microbial species identiﬁcation is typically performed

using sequence similarity and sequence phylogeny based approaches. Particularly challenging is the discri-

mination of closely related sequences such as auxiliary metabolic genes (AMGs) in cyanobacteria and their

viruses (cyanophages). Here we developed a method which combines Support Vector Machine based clas-

siﬁcation of AMGs short fragments and Empirical Mode Decomposition of periodic features in time-series.

We applied this method to investigate the transcriptional dynamics of viral infection in the ocean, using data

extracted from a previously published metatranscriptome proﬁle of a naturally occurring oceanic bacterial

assemblage sampled Lagrangially over 3 days. We discovered the existence of light-dark oscillations in the

expression patterns of AMGs in cyanophages which follow the harmonic diel transcription of both oxygenic

photoautotrophic and heterotrophic members of the community. These ﬁndings suggest that viral infection

might provide the link between light-dark oscillations of microbial populations in the North Paciﬁc Subtropi-

cal Gyre.

1 INTRODUCTION

Like most of life sciences, marine microbial ecology

has also been revolutionized by the advent of next

generation sequencing (NGS) technologies. Taxo-

nomical identiﬁcation of DNA/RNA sequences (also

referred to as fragments) is typically performed by

checking for the existence of a similarity by alignment

to the genomes of all the known microbial species in a

brute force kind of fashion. If matches are found, phy-

logenetic analysis may subsequently be conducted to

unravel the evolutionary relationships among the spe-

cies. This similarity based identiﬁcation approach has

two main limitations: Firstly, the extensive number

of comparison to be performed makes the task com-

putationally expensive, which translate in high costs

and long processing time. Secondly, NGS technolo-

gies can generate sequences of limited length (typi-

cally 250 bp). While this may not represent an issue

with distantly related sequences, it can make the dis-

crimination of closely related sequences (high simila-

rity) difﬁcult. This is the case for auxiliary metabo-

lic genes (AMGs) in cyanobacteria and their associa-

ted viruses (also referred to as cyanophages). AMGs

are encoding key metabolic functions such as photo-

synthesis, carbon metabolism, etc. Upon infection,

viruses take over the control of the bacterial cell and

keep it alive expressing the AMGs that they are car-

rying, for which an high similarity with the bacterial

AMGs is required for a successful infection.

Being able to determine whether an AMG short

fragment (fragments that does not cover the entire

length of the AMG) belongs to a bacterial host or

its associated viruses is an extremely valuable re-

source in microbial ecology. In the past, alterna-

tive methods exploiting speciﬁc genetic content have

been used to characterize the origin of a gene (Sand-

berg et al., 2001). Recently, Tzahor et al. (Tzahor

et al., 2009) used a multi-class Support Vector Ma-

chine (SVM) to rapidly classify core-photosystem-

II gene and transcripts fragments coming from ma-

rine samples based on their oligonucleotide frequen-

cies. Here, we applied a SVM-based approach to

classify short fragments of two diffent AMGs, na-

Acerbi, E., Chenard, C., Schuster, S. and Lauro, F.

Supervised Classiﬁcation of Metatranscriptomic Reads Reveals the Existence of Light-dark Oscillations During Infection of Phytoplankton by Viruses.

DOI: 10.5220/0006763200690077

In Proceedings of the 11th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2018) - Volume 3: BIOINFORMATICS, pages 69-77

ISBN: 978-989-758-280-6

mely psbA and phoH, retrieved from a previously pu-

blished metatranscriptome proﬁling of multiple natu-

rally occurring oceanic bacterial populations sampled

in situ over 3 days (Ottesen et al., 2014). The short

psbA and phoH fragments (with length ranging bet-

ween 150 and 300bp) were classiﬁed based on their

origin (viral vs host) using GC content, mono-, di-,

tri-, tetra-nucleotide frequencies as discriminating fe-

atures.

The outputs were further processed using Empiri-

cal Mode Decomposition (EMD) to identify the un-

derlying frequency dynamics of transcription during

viral infection in environmental cyanobacterial popu-

lations. EMD can deconvolve natural time-series data

into composing frequencies also known as intrinsic

mode functions (IMFs). EMD has been used success-

fully to analyse time series data in other scientiﬁc

domains including the detection of periodically ex-

pressed genes in microarray data (Chen et al., 2014),

analysis of seismic data (Han and van der Baan,

2013), electrocardiograms (Chang, 2010) and ano-

maly in sea-surface height (Li et al., 2012). Our data

demonstrated an existence of light-dark oscillation in

the expression pattern of AMGs in cyanophages sug-

gesting that viral infection patterns might provide the

dynamic coupling between light-dark oscillations of

autotropic and heterotrophic microbial populations in

the North Paciﬁc Subtropical Gyre.

2 BIOLOGICAL BACKGROUND

Marine picocyanobacteria from the genera Synecho-

coccus and Prochlorococcus are major primary pro-

ducers in the ocean at low- and mid-latitude and con-

tribute signiﬁcantly to the global carbon cycle (Hess,

2004; Partensky et al., 1999). Given that light repre-

sents the main source of energy for cyanobacteria, it

also determines the tempo of carbon ﬁxation, metabo-

lic and physiological activity such as the timing of cell

division, amino acid uptake, nitrogen ﬁxation, photo-

synthesis and respiration (Ni and Zeng, 2016; Golden

et al., 1997). For example, nearly half of all Prochlo-

rococcus population in the North Paciﬁc Subtropical

Gyre demonstrated a transcriptional diel cycle (Otte-

sen et al., 2014). Viral infection in the oceans might

also be synchronized with the light cycle. Some stu-

dies suggest a diel cycle in the number of infective vi-

ruses which could mainly be attributed to UV damage

during the peak of sunlight (Wilhelm et al., 1998;

Suttle and Chen, 1992). However, the timing of vi-

ral replication can also be inﬂuenced by the presence

and absence of light. For example, a temporal study

at a station in the Indian Ocean revealed a strong in-

crease in viral abundance in the middle of the night

(Clokie et al., 2006a). Logically, the synchronicity

between light and replication is especially important

for viruses infecting cyanobacteria, the cyanophages.

Indeed, it was shown that light inﬂuences viral ﬁtness

as light is required for viral infection as it might in-

ﬂuence the degradation of host genomic DNA, viral

transcription or production of new progeny (Thomp-

son et al., 2011).

In addition, cyanophages harbour AMGs which

are usually homologs of host genes involved in pho-

tosynthesis and carbon metabolism pathways. These

AMGs likely play a role in viral infection (Breitbart

et al., 2007). For example, the psbA gene which en-

code D1 protein involved in the photosystem II (PSII)

reaction centre is prevalent in cyanophage genomes.

PSII is particularly sensitive to photodamage cau-

sing a high turnover rate for the D1 protein. It was

shown that viral psbA can highly be transcribed du-

ring infection which suggests that viral-psbA expres-

sion might maintain the photosynthetic activity of the

infected host cells and therefore provide energy for

cyanophage replication (Clokie et al., 2006b; Lin-

dell et al., 2007). This is supported by the notion

that during the lytic cycle of infection of Prochlo-

rococcus MED4 by cyanophage P-SSP7 most of the

psbA transcript in infected cells were from viral origin

after 6hrs (Lindell et al., 2007). Similarly, the phoH

gene which is involved in phosphate metabolism was

previously found in both heterotroph and autotroph

phages (Goldsmith et al., 2011).

Based on sequence phylogeny, viral psbA can

generally be distinguished from Synechococcus and

Prochlorococcus (Sullivan et al., 2006; Chenard and

Suttle, 2008), while phoH genes cluster together ba-

sed on their origin (i.e. autotroph bacteria, hetero-

troph bacteria, cyanophage, heterotroph phage and

eukaryotic viruses (Goldsmith et al., 2011; Goldsmith

et al., 2015)).

3 METHODS

3.1 Support Vector Machines

Support Vector Machines (SVMs) are a supervised le-

arning algorithm in machine learning, ﬁrst introdu-

ced by Vapnik (Stitson et al., 1996) and based on the

principle of structural risk minimization. Given la-

beled data (in our case, DNA fragment whose origin

is known), a SVM can be trained to individuate the

optimal hyperplane separating (classifying) new ex-

amples.

BIOINFORMATICS 2018 - 9th International Conference on Bioinformatics Models, Methods and Algorithms

Figure 1: (A) Training and validation of the SVM-based classiﬁer (B) Extraction of psbA fragments from the metatranscrip-

tome. (C) Classiﬁcation of new sequences using the SVM-based classiﬁer. The same procedure was performed for the phoH

gene.

3.2 Training Set Generation

Labelled DNA sequences for psbA and phoH genes

are scarce in the literature. In order to generate the

psbA training set for the SVM-based multiclassiﬁ-

cator, we collected 203 psbA DNA sequences from

NCBI (including those used by Tzahor S. et al. (Tza-

hor et al., 2009)). Sequences in the dataset were la-

belled as belonging to one of the following ﬁve cate-

gories: Synechococcus bacteria (77), Synechococcus

virus (42), Prochlorococcus bacteria high-light (38),

Prochlorococcus bacteria low-light (26) and Prochlo-

rococcus virus (20). Sequences were long, having dif-

ferent lengths ranging from 720bp to 1083bp, with a

median of 1080bp. Analogously, phoH training set

was created by collecting 84 DNA sequences from

NCBI. Sequences in the dataset were labelled as be-

longing to one of the following ﬁve categories: Au-

totrophic host (i.e. cyanobacteria) (17), cyanophage

(29), heterotrophic host (13), heterotrophic phage

Supervised Classiﬁcation of Metatranscriptomic Reads Reveals the Existence of Light-dark Oscillations During Infection of Phytoplankton

by Viruses

(14), phytoeukaryotic virus (10). PhoH sequences

were also long, having different lengths ranging from

603bp to 1770bp, with a median of 766.5 bp. In their

work, Tzahor et al. trained the SVM classiﬁer using

full length psbA sequences and subsequently used it

for classiﬁcation of other full length psbA sequences

(Tzahor et al. also tested their classiﬁer on short psbA

fragments, but only limited to binary classiﬁcation).

If the aim is to use the classiﬁer on shorter psbA frag-

ments, the approach of Tzahor et al. may not always

give the best results. In fact, classiﬁcation accuracy

may be low if the training is performed on clean sam-

ples (psbA full genes, rather than fragments) and the

real-world classiﬁcation is performed on noisy sam-

ples (psbA fragments. i.e. sequences containing a part

of the psbA gene and other base pairs not related to the

psbA gene, as in metagenomic/metatranscriptiomic

data). For this reason, psbA and phoH training sets

were generated by randomly extracting sequences of

length 300bp from the original sets. In addition, to

deal with the original dataset being slightly unbalan-

ced (not equal number of instances for each class) a

combination of undersampling and oversampling was

applied to both psbA and phoH sequences in order to

obtain balanced training sets: 100 sequences for each

of the ﬁve classes (total of 500 sequences) for psbA

training set and 50 sequences for each of the ﬁve clas-

ses (total of 250 sequences) for phoH training set.

3.3 Feature Generation

GC, mono-, di-, tri-, tetra-nucleotide frequencies were

calculated for each sequence of the training sets (for

a total of 341 features per sequence) and used as in-

put feature vector for the SVM classiﬁers. When in-

cluding penta- and/or hexa-nucleotide frequencies, no

signiﬁcant improvements in prediction accuracy were

observed (data not shown).

3.4 The SVM-based Model

Although through the document it will be referred to

as SVM-based classiﬁer, the model is composed by

5 different SVM one-against-all classiﬁers with linear

kernel, each of them trained to separate one of the

ﬁve taxonomical categories (positive class) from all

the remaining ones (collapsed as negative class). As-

sociated with the prediction on whether a given se-

quence belongs or not to the positive class for which

it was trained, each classiﬁer returns a numeric va-

lue representing the probability estimate of the pre-

diction. For each new sequence, the classiﬁer retur-

ning the prediction associated with the highest proba-

bility estimate is the one determining the category of

the sequence.

3.5 Model Training, Parameter

Optimization and Validation

The LIBSVM toolbox (v3.22) (Chang and Lin, 2011)

for MATLAB was used for the experiments. The

SVM-based classiﬁer was trained using the 500 and

250 sequences of the psbA and phoH training sets re-

spectively. In order to assess the ability of the mo-

del to correctly assign each sequence to the respective

taxonomical category, a cross validation of type leave-

one-out was performed. This validating procedure ite-

rates over all of the N sequences of a training set, each

time using N-1 sequences to train the classiﬁer and 1

sequence as a test. Performances were assessed in

terms of precision, recall and f-measure, which for

binary classiﬁcation are deﬁned as follows:

precision =

true positive

true positive + false positive

(1)

recall =

true positive

true positive + false negative

(2)

= 2 ·

precision · recall

precision + recall

(3)

In statistics the recall is referred to as sensitivity

and the precision as positive predicted value. For

psbA, cross validation mean precision resulted to be

equal to 0.95, mean recall 0.92 and mean F

0.93. In

this phase, the optimal value for the parameter Cost

(cost of misclassiﬁcation) was empirically established

as being equal to 8, while the optimal probability es-

timate cut-off resulted to be equal to 0.3 (resulting in

3.8% of the sequences being marked as unassigned).

A direct comparison of the SVM-based classiﬁer with

the previous work of Tzahor et al. is not feasible as

in their study, Tzahor et al. performed the multi-class

classiﬁcation using full length psbA sequences, while

tests with short fragments (100, 200 and 300 bp) were

limited to binary classiﬁcation only. For phoH, the

mean precision resulted to be equal to 0.98, mean

recall 0.98 and mean F

0.98. No optimal probabi-

lity estimate cut-off was applied for phoH sequences

while the parameter cost was set to 8.

3.6 Classiﬁcation of Sequences

Extracted from the

Metatranscriptiome

In order to extract psbA and phoH fragments from the

metatranscriptome data of Ottesen et al., we ﬁrstly

BIOINFORMATICS 2018 - 9th International Conference on Bioinformatics Models, Methods and Algorithms

Figure 2: Example of Empirical Mode Decomposition of a

simulated signal. Each IMF represents a different oscilla-

tory component/frequency of the original signal and can be

used to identify the different underlying processes that are

responsible for generating the original signal. Oscillatory

components are extracted in decreasing order of frequency.

built in-house PsbA and PhoH protein databases. The

PsbA database was composed by 11,962 protein se-

quences retrieved from UniProtKB, while the PhoH

database was composed by 100,501 protein sequen-

ces downloaded from NCBI. Subsequently, metat-

ranscriptome data was screened for homology against

these databases using Rapsearch2 (Zhao et al., 2011)

(which allows protein similarity search by transla-

ting DNA/RNA queries into protein), retaining mat-

ches with alignment-length greater than 150 bp and

e-value smaller than 0.000001 Figure 1B). For each

time point, psbA and phoH fragments were then clas-

siﬁed using the SVM-based classiﬁer. Counts were

normalized using the cumulative sum scaling (CSS)

method (Paulson et al., 2013).

3.7 Empirical Mode Decomposition

Empirical mode decomposition (Huang et al., 1998),

is a data-driven method for analysing of non linear

and non stationary time frequency data, such as natu-

ral signals.

The EMD process iteratively decomposes the ori-

ginal signal into a ﬁnite number of intrinsic mode

functions (IMF), that is, functions with a single

mode/frequency 2. Each IMF represents a different

component/frequency of the original signal and can

be used to identify the different underlying processes

that are responsible for generating the original signal

(Figure 2). The EMD procedure works as follows:

ﬁrst, all the time series local minima and maxima are

identiﬁed. Interpolation then is applied to connect the

local minima among them, generating the lower en-

velope of the data. The same is performed on the

local maxima, generating the upper envelope. Sub-

sequently, the mean value of the envelope is calcu-

lated and subtracted from the original signal. This

procedure takes the name of sifting and produces an

IMF. To be considered valid, an IMF needs to satisfy

the following conditions: (i) the number of extrema

and of zero-crossings must differ by no more than

one (ii) both lower and upper envelopes must have

a mean equal to zero. The sifting procedure is repe-

ated until no further IMF can be extracted from the

original signal or the speciﬁed terminating criterion

is met. IMFs are extracted in decreasing frequencies

levels. EMD is a relatively recent approach that still

holds some drawbacks. For instance, EMD may be

prone to suffer from sampling errors as those could

lead to incorrect placement of extema and therefore

lead to inaccurate IMFs. Similarly, the usage of diffe-

rent interpolation methods can also lead to slight dif-

ferences in the algorithm results, particularly in terms

of ﬂexibly and smoothness of the IMFs (Bagherzadeh

and Sabzehparvar, 2015). IMFs are also challenging

to interpret in absence of knowledge about the under-

lying system, and IMFs of different orders on diffe-

rent time series may not be capturing the same phe-

nomena. In addition, analogous information may end

up being contained in multiple IMFs and there can

sometimes be lower-order IMFs that are just spurious

ﬂuctuations (which have the purpose of correct errors

on other IMFs, so that they can sum up to the original

signal) (Chambers, 2015). In this work EMD was run

using the EMD R package (Kim and Oh, 2009), allo-

wing a maximum number of sift iterations equal to 50,

with a periodic type of boundary and constructing the

envelops by interpolation. No meaningful changes in

the results were observed when different boundaries

and interpolation methods were tested.

4 RESULTS AND DISCUSSION

A total of 20,235 psbA and 5,008 phoH short se-

quences (length ranging from 150 to 300b) extrac-

ted from the high-resolution metatranscriptome time

course (Ottesen et al., 2014) were classiﬁed using the

SVM-based classiﬁer. As expected, retrieved psbA

transcripts were most abundant in samples collected

around mid-day and less abundant in samples col-

lected around mid-night. After SVMs-based classiﬁ-

cation, most of the psbA sequences were identiﬁed as

Prochlorococcus, including high light and low light

ecotypes (54% and 12% on average respectively and

subsequently merged in one category for downstream

analysis) while only a minor fraction was classiﬁed

as Synechococcus psbA (1% on average). The re-

Supervised Classiﬁcation of Metatranscriptomic Reads Reveals the Existence of Light-dark Oscillations During Infection of Phytoplankton

by Viruses

maining sequences were classiﬁed as Synechococcus

virus (27% on average) and Prochlorococcus virus

(6% on average). Given that Synechococcus repre-

sents only a small fraction of the bacterioplankton in

the time series of the North Paciﬁc Subtropical Gyre

(Ottesen et al., 2014), we hypothesized that the viral-

psbA clustering into this group were not from Syne-

chococcus viruses but instead from viruses that in-

fect both Prochlorococcus and Synechococcus. In-

deed, some cyanophages have shown to have a broad

host range and infect strain from both genera (Sul-

livan et al., 2003). For example, the psbA found in

the cyanophage P-SSM1, which was isolated using

the Prochlorococcus strain MIT 9313 clustered with

the Synechococcus virus group (Sullivan et al., 2006).

Consequently, the two virus groups were redeﬁned as

Virus Group I (VG1; Synechococcus virus) and Virus

Group II (VG2; Prochlorococcus virus). Prochloro-

coccus and Synechococcus are referred to in the ﬁgu-

res respectively as Pro Bac and Syn Bac.

Figure 3: Empirical Mode Decomposition of Virus Group

II (VG2) proﬁle (Prochlorococcus virus).

PhoH sequences retrieved from the metatranscrip-

tome did not show a change in abundance based on

the time of the day (data not shown). The lack of

variation for the phoH gene is probably due to the

fact that light has a limited inﬂuence on the phosp-

hate uptake. After SVMs-based classiﬁcation, most

phoH transcripts were assigned to heterotrophic pha-

ges (HP, 31% on average) and autotrophic bacteria

host (A Host, 31% on average), followed by hete-

rotrophic bacterial host (H Host, 21% on average)

and autotrophic phages (AP, 14% on average). Few

phoH sequences were classiﬁed as eukaryotic phy-

toplankton viruses (2% on average) in agreement with

the notion that most of the primary production in the

North Paciﬁc Subtropical was supported by cyano-

bacteria. As a consequence, the phoH group inclu-

ding viruses infecting eukaryotic phytoplankton was

removed from downstream analysis.

After classifying the sequences into their re-

spective subgroups, we performed an empirical

mode decomposition (EMD) (see Methods) on the

transcriptional proﬁles of the subgroups, from which

different diel patterns emerge within psbA and phoH

transcripts. EMD decomposed each proﬁle of classi-

ﬁed AMGs into simpler harmonic waveforms or in-

trinsic mode functions (IMFs). While the EMD met-

hod lacks a formal procedure to associate a correspon-

ding meaning/phenomena to each IMFs, complicating

their interpretation in situations of total absence of

knowledge about the system, in this case the under-

lying driving forces were not completely unknown.

We identiﬁed a plausible biological explanation to

each IMF to support our ﬁndings (in Figure 3 the

complete EMD results for VG2 is shown as example).

We hypothesize that the 1

order of IMFs captured

most of the stochastic noise in each time series. The

IMFs of the 3

order could clearly identify a diel pat-

tern in the expression of all AMGs (Figure 4), while

the 2

order IMF is able to detect what we believe

are differences in the population heterogeneity among

different groups (Figure 5).

4.1 Diel Pattern Difference between

Synechoccocus and Prochloroccocus

The IMFs of the 3

order identiﬁed a diel pat-

tern in the expression of all AMGs (Figure 4).

The peak of expression of psbA from Prochloro-

coccus consistently occurred in the morning while

a less-pronounced cycling of Synechococcus psbA

transcripts occurred later in the afternoon (Figure

4A). This is consistent with a published dataset obtai-

ned using real-time PCR (Mella-Flores et al., 2012).

The remarkable difference between the psbA expres-

sion levels of Synechococcus and Prochlorococcus is

probably caused by the fact that Prochlorococcus is

not able to withstand solar irradiance as high as Sy-

nechococcus. Consequently, as Prochlorococcus D1

protein is more sensitive to light, psbA needs to be

highly expressed during the time of the day with the

more irradiance (Mella-Flores et al., 2012).

Both VG1 and VG2 peaked during the daytime but

the peak for VG2 was precisely matched to the peak

in Prochlorococcus psbA. On the other hand, maxi-

mal expression of VG1 was more variable, possibly

as a result of the variable proportions of the infected

strains of Prochlorococcus and Synechococcus hosts.

While transcripts from psbA-carrying virus had a very

tight coupling to the time of the day, much less diel

BIOINFORMATICS 2018 - 9th International Conference on Bioinformatics Models, Methods and Algorithms

Figure 4: Intrinsic mode functions of 3

order of psbA

and phoH groups. X-Axis shows the time of the day, with

the colored bar indicating the light-dark cycle. (A) psbA

groups. (B) phoH groups.

effect was observed in phoH transcripts (Figure 4B) .

VG1 and VG2 peaked at similar times as their hosts.

On the contrary the HP transcripts consistently pea-

ked later than the hosts homologs and, in general, the

cycling of the phoH transcripts was less strong in tune

with the daily light cycle.

4.2 The Population Structure Shows

Different Viral Groups

Underlying the primary diel harmonics of the 3

or-

der IMF, the 2

order was able to detect differences in

the population heterogeneity among different groups.

This was particularly evident during the ﬁnal 30 hours

of the time series, when a signiﬁcant increase in tem-

perature and salinity had been previously associated

to a change in the transcriptional proﬁle of SAR324

(Ottesen et al., 2014). Here, the shift in environmental

conditions is detected as increased frequency in IMF2

for both Prochlorococcus and Synechococcus (Figure

5A), suggesting that the population heterogeneity of

the 2 groups has expanded to include more ecotypes.

Differences in population heterogeneity are also

observed as differences in IMF2 frequencies in VG1,

VG2, AP and HP (Figure 5B). Interestingly the hig-

hest population heterogeneity (i.e. the highest IMF2

frequency) was observed in VG2, which also had the

strongest correlation between AMG peak and the time

Figure 5: Intrinsic mode functions of 2

order of psbA

and phoH groups. X-Axis shows the time of the day, with

the colored bar indicating the light-dark cycle. (A) popula-

tion heterogeneity for Synechoccocus and Prochloroccocus

groups. (B) Differences in population heterogeneity for all

viral groups (both psbA and phoH).

of the day. Taken together, these observations suggest

that VG2 includes many viral quasispecies with very

narrow host ranges and a very precise control replica-

tion timing as a function of the time of the day and

the low stability of the Prochlorococcus-type D1 pro-

tein. VG1 was less diverse but also appeared to have

broader host range and the expression peak shifted in

response to increased host diversity towards the end

of the time-series.

The timing of the peak expression of the viral

psbA also provided insights into the timing of cell ly-

sis. Based on studies conducted in the laboratory, the

viral psbA is transcribed at the end of the lytic cycle

(Clokie et al., 2006b) and if replication of viral DNA

is delayed until light, viral psbA expression should be

minimal in the morning and slowly increase throug-

hout the day.

Therefore, notwhitstanding the differences bet-

ween VG1 and VG2, our ﬁndings are consistent with

the hypothesis that most of the cell lysis and viral

shedding occurs in the end of the day. Using SeaF-

low cytometry which gives real-time continuous ob-

servations of cells abundance, Ribalet et al. (Ribalet

et al., 2015) showed that Prochlorococcus cell num-

bers were higher in the day and sharply decrease at

night, suggesting predation (viral and grazing) as the

cause of the oscillation. Our data further support their

Supervised Classiﬁcation of Metatranscriptomic Reads Reveals the Existence of Light-dark Oscillations During Infection of Phytoplankton

by Viruses

hypothesis that viral infection might be a player in the

synchronized oscillations of Prochlorococcus abun-

dance from surface water of a Paciﬁc Gyre (Ribalet

et al., 2015).

5 CONCLUSIONS

Using a combination of SVMs based classiﬁcation of

short DNA sequences and EMD, we identiﬁed the ex-

istence of light-dark oscillations in the viral infection

of cyanobacterial populations in the North Paciﬁc

Subtropical Gyre which can affect Prochlorococcus

cell numbers and activity from surface water. Ottesen

et al. (Ottesen et al., 2014) have observed diel cycling

in the expression of genes from heterotrophic bacteri-

oplankton and suggested that factors other than light

might be linking the diel behaviours of autotrophs and

heterotrophs.

Here we expand on that study to include cycling

of expression of the viral assemblages and, based on

the expression patterns of psbA and phoH, are able to

identify major viral groups that differed in their re-

sponse to light-dark cycles, population structure and

their host range. Undoubtedly many other groups of

cyanophages exist which do not carry those AMGs

and which might or might not display circadian cy-

cling, but a previous study suggests that as much as

88% of cyanophages do indeed carry a copy of the

psbA gene (Sullivan et al., 2006). Because cyanop-

hage ecotypes carrying similar AMGs are likely to

occupy very similar ecological niches, it is conceiva-

ble that the existence of different replication patterns

might allow coexistence of multiple ecotypes by sup-

pressing competitive exclusion. We posit that cyanop-

hages of the VG1-type overcome competititve exclu-

sion with a broad host range and being able to initi-

ate the replication cycle at different times of the day.

On the other hand, the tight coupling between repli-

cation and lower stability of the Prochlorococcus-like

D1 protein requires VG2 to have a much larger gene-

tic diversity and a higher degree of specialization in

host targets.

In addition to the ecological insights that this ap-

proach has provided in understanding cyanophage po-

pulations and their hosts, similar classiﬁcation and

decomposition analyses may be used to identify fun-

damental frequencies of natural processes from other

time series data that otherwise would be overlooked.

ACKNOWLEDGEMENTS

The authors would like to acknowledge ﬁnancial sup-

port from Singapores Ministry of Education Acade-

mic Research Fund Tier 3 under the research grant

MOE2013-T3-1-013, Singapores National Research

Foundation under its Marine Science Research and

Development Programme (Award No. MSRDP-P13)

and the Singapore Centre on Environmental Life

Sciences Engineering (SCELSE), whose research is

supported by the National Research Foundation Sin-

gapore, Ministry of Education, Nanyang Technolo-

gical University and National University of Singa-

pore, under its Research Centre of Excellence Pro-

gram. The authors would like to thank Fabio Stella,

Rohan Williams and James Houghton for their valua-

ble feedbacks.

REFERENCES

Bagherzadeh, S. A. and Sabzehparvar, M. (2015). A local

and online sifting process for the empirical mode de-

composition and its application in aircraft damage de-

tection. Mechanical Systems and Signal Processing,

54:68–83.

Breitbart, M., Thompson, L. R., Suttle, C. A., and Sulli-

van, M. (2007). Exploring the vast diversity of marine

viruses. Oceanography, 20(SPL. ISS. 2):135–139.

Chambers, D. P. (2015). Evaluation of empirical mode de-

composition for quantifying multi-decadal variations

and acceleration in sea level records. Nonlinear Pro-

cesses in Geophysics, 22(2):157–166.

Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library

for support vector machines. ACM Transactions on

Intelligent Systems and Technology, 2:27:1–27:27.

Chang, K.-M. (2010). Ensemble empirical mode

decomposition for high frequency ecg noise re-

duction. Biomedizinische Technik/Biomedical Engi-

neering, 55(4):193–201.

Chen, C.-R., Shu, W.-Y., Chang, C.-W., and Hsu, I. C.

(2014). Identiﬁcation of under-detected periodicity in

time-series microarray data by using empirical mode

decomposition. PloS one, 9(11):e111719.

Chenard, C. and Suttle, C. A. (2008). Phylogenetic diversity

of sequences of cyanophage photosynthetic gene psba

in marine and freshwaters. Applied and environmental

microbiology, 74(17):5317–5324.

Clokie, M. R., Millard, A. D., Mehta, J. Y., and Mann, N. H.

(2006a). Virus isolation studies suggest short-term va-

riations in abundance in natural cyanophage populati-

ons of the indian ocean. Journal of the Marine Biolo-

gical Association of the United Kingdom, 86(03):499–

505.

Clokie, M. R., Shan, J., Bailey, S., Jia, Y., Krisch, H. M.,

West, S., and Mann, N. H. (2006b). Transcription

of a photosynthetict4-type phage during infection of

a marine cyanobacterium. Environmental Microbio-

logy, 8(5):827–835.

BIOINFORMATICS 2018 - 9th International Conference on Bioinformatics Models, Methods and Algorithms

Golden, S. S., Ishiura, M., Johnson, C. H., and Kondo, T.

(1997). Cyanobacterial circadian rhythms. Annual

review of plant biology, 48(1):327–354.

Goldsmith, D. B., Crosti, G., Dwivedi, B., McDaniel, L. D.,

Varsani, A., Suttle, C. A., Weinbauer, M. G., Sandaa,

R.-A., and Breitbart, M. (2011). Development of phoh

as a novel signature gene for assessing marine phage

diversity. Applied and environmental microbiology,

77(21):7730–7739.

Goldsmith, D. B., Parsons, R. J., Beyene, D., Salamon, P.,

and Breitbart, M. (2015). Deep sequencing of the viral

phoh gene reveals temporal variation, depth-speciﬁc

composition, and persistent dominance of the same vi-

ral phoh genes in the sargasso sea. PeerJ, 3:e997.

Han, J. and van der Baan, M. (2013). Empirical mode de-

composition for seismic time-frequency analysis. Ge-

ophysics, 78(2):O9–O19.

Hess, W. R. (2004). Genome analysis of marine photosynt-

hetic microbes and their global role. Current opinion

in biotechnology, 15(3):191–198.

Huang, N. E., Shen, Z., Long, S. R., Wu, M. C., Shih, H. H.,

Zheng, Q., Yen, N.-C., Tung, C. C., and Liu, H. H.

(1998). The empirical mode decomposition and the

hilbert spectrum for nonlinear and non-stationary time

series analysis. In Proceedings of the Royal Society of

London A: Mathematical, Physical and Engineering

Sciences, volume 454, pages 903–995. The Royal So-

ciety.

Kim, D. and Oh, H.-S. (2009). Emd: a package for empiri-

cal mode decomposition and hilbert spectrum. The R

Journal, 1(1):40–46.

Li, F., Jo, Y.-H., Liu, W. T., and Yan, X.-H. (2012). A dipole

pattern of the sea surface height anomaly in the north

atlantic: 1990s–2000s. Geophysical Research Letters,

39(15).

Lindell, D., Jaffe, J. D., Coleman, M. L., Futschik, M. E.,

Axmann, I. M., Rector, T., Kettler, G., Sullivan, M. B.,

Steen, R., Hess, W. R., et al. (2007). Genome-wide

expression dynamics of a marine virus and host reveal

features of co-evolution. Nature, 449(7158):83–86.

Mella-Flores, D., Six, C., Ratin, M., Partensky, F., Boutte,

C., Le Corguill

e, G., Marie, D., Blot, N., Gourvil, P.,

Kolowrat, C., et al. (2012). Prochlorococcus and sy-

nechococcus have evolved different adaptive mecha-

nisms to cope with light and uv stress.

Ni, T. and Zeng, Q. (2016). Diel infection of cyanobacteria

by cyanophages. Frontiers in Marine Science, 2:123.

Ottesen, E. A., Young, C. R., Gifford, S. M., Eppley, J. M.,

Marin, R., Schuster, S. C., Scholin, C. A., and De-

Long, E. F. (2014). Multispecies diel transcriptional

oscillations in open ocean heterotrophic bacterial as-

semblages. Science, 345(6193):207–212.

Partensky, F., Hess, W. R., and Vaulot, D. (1999). Prochlo-

rococcus, a marine photosynthetic prokaryote of glo-

bal signiﬁcance. Microbiology and molecular biology

reviews, 63(1):106–127.

Paulson, J. N., Stine, O. C., Bravo, H. C., and Pop, M.

(2013). Differential abundance analysis for microbial

marker-gene surveys. Nature methods, 10(12):1200–

1202.

Ribalet, F., Swalwell, J., Clayton, S., Jim

enez, V., Sudek,

S., Lin, Y., Johnson, Z. I., Worden, A. Z., and Arm-

brust, E. V. (2015). Light-driven synchrony of pro-

chlorococcus growth and mortality in the subtropical

paciﬁc gyre. Proceedings of the National Academy of

Sciences, 112(26):8008–8012.

Sandberg, R., Winberg, G., Br

anden, C.-I., Kaske,

A., Ernberg, I., and C

oster, J. (2001). Captu-

ring whole-genome characteristics in short sequences

using a naive bayesian classiﬁer. Genome research,

11(8):1404–1409.

Stitson, M., Weston, J., Gammerman, A., Vovk, V., and

Vapnik, V. (1996). Theory of support vector machi-

nes. Technical Report, CSD-TR-96–17, Computatio-

nal Intelligence Group, University of London.

Sullivan, M. B., Lindell, D., Lee, J. A., Thompson, L. R.,

Bielawski, J. P., and Chisholm, S. W. (2006). Preva-

lence and evolution of core photosystem ii genes in

marine cyanobacterial viruses and their hosts. PLoS

Biol, 4(8):e234.

Sullivan, M. B., Waterbury, J. B., and Chisholm, S. W.

(2003). Cyanophages infecting the oceanic cyano-

bacterium prochlorococcus. Nature, 424(6952):1047–

1051.

Suttle, C. A. and Chen, F. (1992). Mechanisms and rates

of decay of marine viruses in seawater. Applied and

Environmental Microbiology, 58(11):3721–3729.

Thompson, L. R., Zeng, Q., Kelly, L., Huang, K. H., Sin-

ger, A. U., Stubbe, J., and Chisholm, S. W. (2011).

Phage auxiliary metabolic genes and the redirection of

cyanobacterial host carbon metabolism. Proceedings

of the National Academy of Sciences, 108(39):E757–

E764.

Tzahor, S., Man-Aharonovich, D., Kirkup, B. C., Yogev, T.,

Berman-Frank, I., Polz, M. F., B

a, O., and Mandel-

Gutfreund, Y. (2009). A supervised learning appro-

ach for taxonomic classiﬁcation of core-photosystem-

ii genes and transcripts in the marine environment.

BMC genomics, 10(1):229.

Wilhelm, S. W., Weinbauer, M. G., Suttle, C. A., and Jef-

frey, W. H. (1998). The role of sunlight in the removal

and repair of viruses in the sea. Limnology and Ocea-

nography, 43(4):586–592.

Zhao, Y., Tang, H., and Ye, Y. (2011). Rapsearch2: a fast

and memory-efﬁcient protein similarity search tool

for next-generation sequencing data. Bioinformatics,

28(1):125–126.

Supervised Classiﬁcation of Metatranscriptomic Reads Reveals the Existence of Light-dark Oscillations During Infection of Phytoplankton

by Viruses