Ab initio Splice Site Prediction with

Simple Domain Adaptation Classiﬁers

Nic Herndon and Doina Caragea

Computing and Information Sciences, Kansas State University, Manhattan, KS 66506, U.S.A.

Keywords:

Splice Site Prediction, Domain Adaptation, Imbalanced Data, Logistic Regression, Na

ıve Bayes.

Abstract:

The next generation sequencing technologies (NGS) have made it affordable to sequence any organism, open-

ing the door to assembling new genomes and annotating them, even for non-model organisms. One option

for annotating a genome is to assemble RNA-Seq reads into a transcriptome and aligning the transcriptome to

the genome assembly to identify the protein-encoding genes. However, there are a couple of problems with

this approach. RNA-Seq is error prone and therefore the gene models generated with this technique need to

be validated. In addition, this method can only capture the genes expressed at the time of sequencing. Ma-

chine learning can help address both of these problems by generating ab initio gene models that can provide

supporting evidence to the models generated with RNA-Seq, as well as predict additional genes that were not

expressed during sequencing. However, machine learning algorithms need large amounts of labeled data to

learn accurate classiﬁers, and newly sequenced, non-model organisms have insufﬁcient labeled data. This can

be addressed by leveraging the abundant labeled data from a related model-organism (the source domain) and

use it in conjunction with the little labeled data from the organism of interest (the target domain) to train a

classiﬁer in a domain adaptation setting. The method we propose uses this approach and generates accurate

classiﬁcation on the task of splice site prediction – a difﬁcult and essential step in gene prediction. It is sim-

ple – it combines source and target labeled data, with different weights, into one dataset, and then trains a

supervised classiﬁer on the combined dataset. Despite its simplicity it is surprisingly accurate, with highest

areas under the precision-recall curve between 53.33% and 83.57%. Out of the domain adaptation classiﬁers

evaluated (SVM, na

ıve Bayes, and logistic regression) this method produced the best results in 12 out of the

16 cases studied.

1 INTRODUCTION

Recently a number of domain adaptation algorithms

have been proposed to address the lack of labeled data

in the domain of interest, the target domain, by lever-

aging plentiful labeled data from a related domain, the

source domain, and in some cases the large volume of

unlabeled data available from the target domain. One

application that meets this criteria – lacking labeled

data and with abundant labeled data in a similar do-

main – that is tackled by these algorithms is splice

site prediction. This was enabled by the next genera-

tion sequencing technologies, which allow faster and

cheaper sequencing of DNA and RNA than the previ-

ously used Sanger technology, leading to advances in

the ﬁeld of genomics.

With NGS, short DNA read fragments are used to

generate genome assemblies. Similarly, RNA frag-

ments are assembled into transcriptomes. The tran-

scriptome is then used as evidence when annotating a

genome, by mapping it along that genome. This helps

determine the location and structure of the protein-

encoding genes. One of the disadvantages of this

method is that RNA-Seq reads are generated only

from the genes expressed at the time of sequencing in

the tissue analyzed, leaving out of the transcriptome

some of the protein-encoding genes.

In addition, assembling the transcriptome is not

error proof. NGS technologies speed up the sequenc-

ing of DNA and RNA molecules, but do so at the

expense of read length and accuracy. They gener-

ate shorter reads than previous sequencing technolo-

gies (e.g., Sanger) with much higher error rates. The

common practice to address these issues is to trim the

low quality ends of the reads, remove reads with low

scores, and require higher depth of coverage. The re-

maining reads are then assembled into a genome (for

DNA reads) or transcriptome (for RNA reads). These

assemblies are not 100% accurate. Therefore, anno-

tating a genome with a transcriptome generated from

Herndon, N. and Caragea, D.

Ab initio Splice Site Prediction with Simple Domain Adaptation Classiﬁers.

DOI: 10.5220/0005710502450252

In Proceedings of the 9th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2016) - Volume 3: BIOINFORMATICS, pages 245-252

ISBN: 978-989-758-170-0

245

RNA-Seq reads should be validated by independent

methods (Steijger et al., 2013).

Domain adaptation algorithms can provide a

means to validate the gene models produced with this

technique, as well as generate gene models for genes

missed by RNA-Seq. Such classiﬁers can accom-

plish this despite the difﬁcult nature of the splice site

prediction problem – a highly imbalanced problem,

where only a small ratio of the GT and AG dimers

within a genome are splice sites (Sonnenburg et al.,

2007). These are the donor and acceptor canonical

splice sites, respectively. Even though this type of

problem is difﬁcult for every type of classiﬁer – su-

pervised or semi-supervised – not just domain adap-

tation, the existing algorithms, discussed in Section 2,

achieved good results.

In this work, we propose an algorithm that sur-

passes these results. It is a simple, yet surprisingly

accurate method, presented in detail in Section 3.1.

With this method we combine data from two organ-

isms into one dataset. Each organism is assigned a

different weight. The resulting dataset is then used

to train a supervised classiﬁer for the organism of in-

terest. To evaluate this algorithm we tested it with

data from C.elegans, the source domain, and four tar-

get domains at increasing evolutionary distance from

this source: C.remanei, P.paciﬁcus, D.melanogaster,

and A.thaliana – data described in Section 3.2. From

the results shown in Section 4, we can infer that this

method is a viable ab initio splice site prediction tech-

nique. It generated highest average areas under the

precision-recall curve (auPRC) for the positive class

between 53.33% for distantly related organisms and

83.57% for closely related ones.

2 RELATED WORK

Most of the research on ab initio splice prediction fo-

cused on supervised learning approaches. Most meth-

ods proposed used either support vector machines,

(Baten et al., 2006; Li et al., 2012; Sonnenburg et al.,

2007; Zhang et al., 2006), Bayesian networks, (Cai

et al., 2000), hidden Markov models, (Baten et al.,

2007), or Bahadur expansion truncated at the second

order, (Arita et al., 2002). However, as these methods

employ supervised classiﬁers, they generally require

lots of labeled data to generate accurate predictions.

Other methods evaluated the use of semi-

supervised classiﬁers for this task. One study in-

vestigated one of the main factors that affects the

performance of an expectation-maximization semi-

supervised algorithm – the highly imbalanced class

distribution (Stanescu and Caragea, 2014b). The au-

thors studied the effects of the level of imbalance

on the accuracy of the classiﬁer, and recommended

different ways to address it: adding only instances

from the minority class at each iteration, balancing

the class ratio through oversampling, and splitting the

data into balanced subsets by undersampling and then

training an ensemble of classiﬁers on these datasets.

In their subsequent study (Stanescu and Caragea,

2014a), they further analyzed ensemble-based semi-

supervised learning approaches, and recommended

using an ensemble of self-training classiﬁers that add

at each iteration only instances from the minority

class. However, considering the highly imbalanced

nature of the problem and the lack of sufﬁcient la-

beled data, the accuracy of these classiﬁers was not

very high, with the highest auPRC of 54.78% for the

best classiﬁer.

For domain adaptation setting, there are several

studies. One proposed an iterative domain adaptation

algorithm derived from na

ıve Bayes, that used source

data, and target labeled and unlabeled data (Herndon

and Caragea, 2014b). Although it performed well

on the task of protein localization, it produced un-

satisfactory results for splice site prediction. Their

ﬁrst updated version (Herndon and Caragea, 2014a)

produced promising results for splice site prediction

with highest auPRC between 43.20% for distant do-

mains and 78.01% for related domains. Later, they

achieved even better results, with best auPRCs be-

tween 50.83% and 82.61%, (Herndon and Caragea,

2015). Another study for splice site prediction pro-

posed a modiﬁed version of the k-means cluster-

ing algorithm that considered the commonalities be-

tween the source and target domains (Giannoulis

et al., 2014). This algorithm was not very accurate

though. Its best area under receiver operating char-

acteristic curve (auROC) was below 70%. One of

the best methods for splice site prediction in this set-

ting, used a support vector machine classiﬁer, with

highest auPRC values between 49.75% and 79.02%,

(Schweikert et al., 2009).

There are also evidence-based methods, such as

TWINSCAN (Korf et al., 2001), CONTRAST (Gross

et al., 2007), TrueSight (Li et al., 2013), and us-

ing single-molecule transcript sequencing (Minoche

et al., 2015). It is however unfair to compare these

with ab initio methods, as they use mRNA evidence

to generate their models, whereas ab initio methods

do not.

BIOINFORMATICS 2016 - 7th International Conference on Bioinformatics Models, Methods and Algorithms

246

3 METHOD AND MATERIALS

3.1 Proposed Method

Let the set of independently generated training in-

stances be represented by X ∈ R

m×n

and their corre-

sponding labels by y ∈ Y

, Y = {0, 1}, where m is

the number of training instances and n is the number

of features.

Given a set of training instances from the source

domain, D

= (X

, y

), where X

∈ R

×n

and y

∈

, and a set of training instances from the tar-

get domain, D

= (X

, y

), where X

∈ R

×n

and

∈ Y

, create an empty dataset D = (X, y), where

X ∈ R

)×n

and y ∈ Y

. For each instance

, y

) ∈ D

multiply its weight by w

, then add

this instance to the new dataset, D. Similarly, for

each instance (x

, y

) ∈ D

multiply its weight by

, then add it to the new dataset, D. Then, train

a supervised classiﬁer on this combined dataset, D.

In our experiments we used WEKA implementations

of the regularized logistic regression (Le Cessie and

Van Houwelingen, 1992), and na

ıve Bayes (John and

Langley, 1995) classiﬁers.

3.2 Data Sets

We evaluated our proposed method using the same

dataset

as in the previous related domain adap-

tation studies, dataset that was ﬁrst introduced in

(Schweikert et al., 2009). It contains DNA sequences

from one source organism, C.elegans, and four target

organisms at increasing evolutionary distance from

source, C.remanei, P.paciﬁcus, D.melanogaster, and

A.thaliana. For the source organism there is one

dataset with 100,000 instances, and for each target or-

ganism there are three folds of 1,000, 2,500, 6,500,

16,000, 25,000, 40,000, and 100,000 instances used

for training, and three folds of 20,000 instances used

for testing. Each instance is a 141 nucleotides long

DNA sequence, with the AG dimer at the sixty-ﬁrst

position, and the label for each instance indicates

whether this dimer is an acceptor splice site (positive)

or not (negative). In each ﬁle about 1% of instances

are positive and the remaining are negative.

3.3 Experimental Setup

From this dataset, to compare our proposed method

with previous methods, we used the three folds of

Downloaded from ftp://ftp.tuebingen.mpg.de/

fml/cwidmer/

2,500, 6,500, 16,000, and 40,000 instances as tar-

get labeled data. Note that the method proposed by

(Herndon and Caragea, 2014a), also uses the three

folds of 100,000 instances from the target organisms

as unlabeled data, which have the potential to increase

the accuracy of the classiﬁer.

We use the same representation for the data as in

(Herndon and Caragea, 2014a; Herndon and Caragea,

2015). Namely, we convert the DNA sequences into

two types of features, nucleotides and trimers, along

with their position within the sequence. In one set

of experiments we represent the data with nucleotide

features only, and in the other we represent it with

both types of features. For example, using nucleotide

and trimer features, a DNA sequence starting with

TTCTAAGCG. . . and class 1 would be represented in

WEKA ARFF format as:

@RELATION rel

@ATTRIBUTE NUCLEOTIDE 1 {A,C,G,T}

@ATTRIBUTE NUCLEOTIDE 141 {A,C,G,T}

@ATTRIBUTE TRIMER 1 {AAA,AAC,. . .,TTT}

@ATTRIBUTE TRIMER 139 {AAA,AAC,. . .,TTT}

@ATTRIBUTE cls {1,-1}

@DATA

T,T,C,T,A,A,G,C,G,. . .,TTC,TCT,CTA,. . .,1

We would like to note that the trimer features are

not independent of each other. Each trimer has nu-

cleotides in common with the overlapping neighbor-

ing trimers – two to ﬁve neighbors, depending on the

position of the trimer. The trimers at each end of a se-

quence have nucleotides in common with two neigh-

boring trimers. The trimers in the middle, have nu-

cleotides in common with at most ﬁve neighbors. This

does not violate the independence assumption of the

ıve Bayes classiﬁers. These classiﬁers still assume

that all features are independent of each other.

To ﬁnd the optimal parameters’ values we did a

grid search for w

, w

∈ {0.1, 0.2, . . . , 1}, using the

target datasets of 100,000 instances for validation

(same as was done in the method proposed by (Hern-

don and Caragea, 2015)). For our proposed method

we:

1. Trained the classiﬁer with labeled data from the

source and target domains.

2. Evaluated on the validation dataset and picked the

values for w

and w

that generated best auPRC.

3. Tested the classiﬁer with these parameters’ values

on the target domain.

For the source domain we used the only dataset, with

Ab initio Splice Site Prediction with Simple Domain Adaptation Classiﬁers

247

100,000 instances. For the target domain, for each

organism we used:

• For training, one of the three folds of 2,500, 6,500,

16,000 or 40,000 instances.

• For validation, the corresponding fold of 100,000

instances.

• For testing, the corresponding fold of 20,000 in-

stances.

As baselines, we used the na

ıve Bayes and the

logistic regression with regularized parameters clas-

siﬁers, trained on either 100,000 from C.elegans, or

one of the three folds of 2,500, 6,500, 16,000, or

40,000 from the target organisms, and tested them

on the corresponding fold for that organism. We ex-

pect the results of the baseline classiﬁers will be the

lower bound for our proposed method, as we hypoth-

esize that adding data from a related organism should

improve the accuracy of the classiﬁer. Note, that

whenever we used the logistic regression classiﬁer,

for baselines or for our proposed method, we set the

ridge parameter to 1,000, as this value led to the best

results in (Herndon and Caragea, 2015).

All results are reported in Table 1 as averages of

three random train-test splits, to ensure the results are

not biased. For evaluating the classiﬁers we used the

area under the precision-recall curve for the minority

class, which is the class of interest, since the data are

so highly imbalanced (Davis and Goadrich, 2006).

This experimental setup allowed us to evaluate:

1. How the following factors inﬂuence the perfor-

mance of our classiﬁer: features, amount of tar-

get labeled data, distance between domains, and

weights used for source and target data.

2. How our proposed method (when using na

ıve

Bayes or regularized logistic regression) com-

pares to other domain adaptation classiﬁers for the

task of splice site prediction, namely, the SVM

classiﬁer proposed by (Schweikert et al., 2009),

the na

ıve Bayes classiﬁer proposed by (Herndon

and Caragea, 2014a), and the regularized logis-

tic regression proposed by (Herndon and Caragea,

2015).

4 RESULTS AND DISCUSSION

In Table 1 we show the auPRC averages over three

folds and their standard deviations for the four target

organisms for:

• Our proposed method (LR SL

S+T

when using the

regularized logistic regression and NB SL

S+T

when

using the na

ıve Bayes classiﬁer).

• Supervised classiﬁers used as baselines (LR SL

and LR SL

when using the regularized logistic

regression classiﬁer trained on source and target

data, respectively, and NB SL

and NB SL

when

using the na

ıve Bayes classiﬁer trained on source

and target data, respectively).

• The domain adaptation with na

ıve Bayes classi-

ﬁer proposed by (Herndon and Caragea, 2014a)

(NB DA

S+T+U

). Note that this is the only classiﬁer,

from the ones we compared, that used the target

unlabeled data in addition to the source and target

labeled data.

• The domain adaptation with regularized logistic

regression proposed by (Herndon and Caragea,

2015) (LR cc).

• The domain adaptation with SVM classiﬁer pro-

posed by (Schweikert et al., 2009) (SVM). Note

that this classiﬁer used other features to represent

the DNA sequences (i.e., it did not represent them

with nucleotides and trimers along with their po-

sitions).

Based on these results we make the following ob-

servations:

1. In terms of the different factors that inﬂuence the

performance of the classiﬁer:

(a) Features: we notice a similar trend for our

proposed method as with previous classiﬁers

(Herndon and Caragea, 2014a; Herndon and

Caragea, 2015), namely, using simple features

(the nucleotides) leads to more accurate clas-

siﬁers when the source and target domains are

distant and there is scarce labeled data in the

target domain. Using a combination of simple

and complex features (nucleotides and trimers)

leads to more accurate classiﬁers when the

source and target domains are closed and there

is enough target labeled data. This is expected

as trimer features are sparser than nucleotide

features, and with less labeled data the classi-

ﬁer performs worse with trimer features as it

does not have enough data to learn an accurate

classiﬁer.

(b) Amount of Target Labeled Data: as the

amount of target labeled data increases the

accuracy of our proposed method increases

as well, with one exception, though. For

D.melanogaster, when using nucleotide and

trimer features, we observe the auPRC de-

creases as the amount of target labeled data in-

creases from 16,000 to 40,000, regardless of

the type of supervised classiﬁer we used, na

ıve

Bayes, or regularized logistic regression. It is

BIOINFORMATICS 2016 - 7th International Conference on Bioinformatics Models, Methods and Algorithms

248

Table 1: auPRC values for the minority (i.e., positive) class for four target organisms based on the number of labeled target

instances used for training: 2,500, 6,500, 16,000, and 40,000. The LR SL classiﬁer is the logistic regression classiﬁer trained

on 100,000 instances from the source domain, C.elegans (ﬁrst and tenth rows); trained on target labeled data (second and

eleventh rows); and a combination of source and target labeled data (third and twelfth rows), respectively. LR cc (rows

forth and thirteenth) is the domain adaptation classiﬁer trained on a combination of source labeled and target labeled data in

(Herndon and Caragea, 2015). SVM (ninth rows) is the best overall classiﬁer in (Schweikert et al., 2009), namely SVM

S,T

Note that the SVM classiﬁer used different features. The NB SL classiﬁer is the na

ıve Bayes classiﬁer trained on 100,000

instances from the source domain (ﬁfth and thirteenth rows); trained on target labeled data (sixth and fourteenth rows); and a

combination of source and target labeled data (seventh and ﬁfteenth rows), respectively. NB DA

S+T+U

is the best overall domain

adaptation classiﬁer in (Herndon and Caragea, 2014a), A1, trained on a combination of source labeled, target labeled, and

target unlabeled data. The best average values for each type of features used is shown in bold font. (Note that for P.paciﬁcus

the best auPRC value when using nucleotides and trimers with 2,500 and 6,500 target labeled instances is 67.10, obtained

with the na

ıve Bayes classiﬁer trained on source data, NB

.) We would like to highlight that when the source and target

domains are close (C.remanei and P.paciﬁcus are close to C.elegans), the best overall classiﬁer is logistic regression trained

on the combination of source labeled and target labeled data (i.e., best auPRC values in six out of eight cases). When the

source and target domains are distant (D.melanogaster and A.thaliana are far from C.elegans), the best overall classiﬁer is

ıve Bayes trained on the combination of source labeled and target labeled data (i.e., best auPRC values in ﬁve out of eight

cases).

(a) C.remanei

Features Classiﬁer 2,500 6,500 16,000 40,000

nucleotides

LR SL

77.63±1.37

LR SL

31.07±8.72 54.20±3.97 65.73±2.76 72.93±1.70

LR SL

S+T

77.65±1.34 77.88±1.16 78.32±1.29 79.00±0.97

LR cc 77.64±1.39 77.75±1.25 77.88±1.42 78.10±1.15

NB SL

63.77±1.30

NB SL

23.42±7.39 45.44±4.01 54.57±2.63 59.68±1.62

NB SL

S+T

75.49±1.39 75.56±1.46 75.63±1.45 75.82±1.32

NB DA

S+T+U

59.18±1.17 63.10±1.23 63.95±2.08 63.80±1.41

SVM 77.06±2.13 77.80±2.89 77.89±0.29 79.02±0.09

nucleotides and trimers

LR SL

81.37±2.27

LR SL

26.93±9.91 55.26±2.21 68.30±1.91 77.33±2.78

LR SL

S+T

81.40±2.25 81.73±1.90 82.62±2.28 83.57±1.76

LR cc 81.39±2.30 81.47±2.19 81.78±2.08 82.61±2.00

NB SL

77.67±2.24

NB SL

22.94±4.37 58.39±3.94 68.40±3.37 75.75±1.32

NB SL

S+T

81.11±0.73 81.38±0.34 81.51±0.87 82.73±0.52

NB DA

S+T+U

45.29±2.62 72.00±4.16 74.83±4.32 77.07±4.45

(b) P.paciﬁcus

Features Classiﬁer 2,500 6,500 16,000 40,000

nucleotides

LR SL

64.20±1.91

LR SL

29.87±3.58 49.03±4.90 59.93±2.74 69.10±2.25

LR SL

S+T

64.72±1.85 65.63±1.82 67.09±1.29 70.76±2.08

LR cc 64.70±1.85 65.31±2.10 66.76±0.89 70.18±2.12

NB SL

49.12±1.58

NB SL

19.22±3.39 37.33±2.65 45.33±2.28 52.84±2.06

NB SL

S+T

60.67±1.97 61.96±2.04 63.04±0.33 65.17±2.09

NB DA

S+T+U

45.32±2.68 49.82±2.58 52.09±2.04 54.62±1.51

SVM 64.72±3.75 66.39±0.66 68.44±0.67 71.00±0.38

nucleotides and trimers

LR SL

62.37±0.84

LR SL

28.40±4.49 49.67±2.83 62.97±3.32 74.60±2.85

LR SL

S+T

64.14±0.83 66.14±0.55 70.97±2.03 76.89±1.75

LR cc 64.18±1.10 65.49±1.84 69.76±2.08 75.82±2.00

NB SL

67.10±1.94

NB SL

26.39±3.97 48.54±3.42 59.29±2.80 68.78±1.52

NB SL

S+T

64.51±0.70 66.32±0.71 69.29±2.00 72.54±0.42

NB DA

S+T+U

20.21±1.17 53.29±3.08 62.33±3.60 69.88±4.04

Ab initio Splice Site Prediction with Simple Domain Adaptation Classiﬁers

249

Table 1: (Continued)

Features Classiﬁer 2,500 6,500 16,000 40,000

nucleotides

LR SL

35.87±2.32

LR SL

19.97±3.48 31.80±3.86 42.37±2.15 50.53±1.80

LR SL

S+T

41.35±1.40 43.66±3.20 49.96±2.09 54.02±0.95

LR cc 39.70±2.82 42.19±3.41 49.72±2.01 53.43±0.89

NB SL

31.23±1.03

NB SL

14.90±2.80 26.05±4.79 35.21±2.43 39.42±2.90

NB SL

S+T

45.43±0.87 47.12±3.86 51.73±1.24 52.74±2.43

NB DA

S+T+U

33.31±3.71 36.43±2.18 40.32±2.04 42.37±1.51

SVM 40.80±2.18 37.87±3.77 52.33±0.91 58.17±1.50

nucleotides and trimers

LR SL

32.23±2.76

LR SL

15.07±4.11 28.30±5.45 44.67±3.23 38.43±32.36

LR SL

S+T

34.97±2.59 37.22±4.30 49.16±5.11 43.03±22.03

LR cc 37.24±2.20 40.93±3.79 50.54±3.91 45.89±22.25

NB SL

34.09±2.44

NB SL

13.87±2.97 25.00±5.59 35.28±2.14 45.85±3.32

NB SL

S+T

46.85±1.41 50.84±4.39 56.57±2.37 50.15±14.84

NB DA

S+T+U

25.83±2.35 32.58±5.83 39.10±1.82 47.49±3.44

(b) A.thaliana

Features Classiﬁer 2,500 6,500 16,000 40,000

nucleotides

LR SL

16.93±0.21

LR SL

13.87±2.63 26.03±3.29 38.43±6.18 49.33±4.07

LR SL

S+T

22.79±0.92 31.70±2.70 41.28±2.64 49.91±2.38

LR cc 20.67±0.58 27.19±1.30 40.56±3.26 49.75±2.82

NB SL

11.97±0.23

NB SL

7.21±0.90 17.90±1.93 28.10±4.68 34.82±4.77

NB SL

S+T

23.30±1.18 30.97±2.31 39.18±2.79 44.88±3.13

NB DA

S+T+U

18.46±1.13 25.04±0.72 31.47±3.56 36.95±3.39

SVM 24.21±3.41 27.30±1.46 38.49±1.59 49.75±1.46

nucleotides and trimers

LR SL

14.07±0.31

LR SL

8.87±1.84 21.10±4.45 38.53±8.08 49.77±2.77

LR SL

S+T

15.87±0.36 23.65±1.49 39.97±4.39 50.60±2.11

LR cc 16.42±1.20 26.44±2.49 41.35±6.49 50.83±2.28

NB SL

13.98±0.71

NB SL

3.10±0.35 8.76±1.65 28.21±7.58 40.92±3.78

NB SL

S+T

21.62±1.02 27.89±2.19 43.52±6.16 53.33±3.77

NB DA

S+T+U

3.99±0.43 13.96±2.42 33.62±6.31 43.20±3.78

interesting to note that for this combination of

features used and target domain, the auPRC

for the regularized logistic regression classi-

ﬁer also decreases when the amount of target

labeled data increases from 16,000 to 40,000.

This partially explains this exception for our

proposed method when using the logistic re-

gression classiﬁer. Another factor, suggested

by the large standard deviation, is that the fre-

quency of features is very different between

training and test datasets, especially for trimers,

since using only nucleotide features does not

exhibit this behavior.

between the source and target domains in-

creases, the contribution from the source data

decreases, and the accuracy of our method de-

creases, which is expected.

(d) Weight Assigned to Source and Target Data:

in regards to the weight assigned to the target

labeled data, the best results are obtained when

is set to one, or close to one. For the weight

assigned to the source labeled data, when the

domains are closely related the best results are

for high values of w

, but as the distance be-

tween domains increases the value for w

de-

creases. It only makes sense to decrease the

weight assigned to source data when the dis-

tance between domains increases, so these re-

sults conﬁrm our intuition.

2. In terms of performance, our proposed method

produced the best results out of all domain adap-

tation classiﬁers compared, when the source and

target domains are closely related (for C.remanei

BIOINFORMATICS 2016 - 7th International Conference on Bioinformatics Models, Methods and Algorithms

250

and P.paciﬁcus)), using logistic regression with

nucleotide and trimer features. It also produced

the best results when the domains are distant

(for D.melanogaster and A.thaliana), using na

ıve

Bayes with nucleotide and trimer features, in ﬁve

out of eight cases. This is a similar behavior to the

one observed in (Ng and Jordan, 2001), namely

that a generative classiﬁer performs better than a

discriminative one when there is a small amount

of training labeled data. For domain adaptation,

when the domains are close the source labeled

data contributes a lot to the classiﬁer so a discrim-

inative classiﬁer performs better than a generative

one. When the domains are distant, the source la-

beled data contributes less and a generative classi-

ﬁer performs better than a discriminative one. An-

other case for which our method produced the best

results is for very distant domains (A.thaliana),

using logistic regression with nucleotide features,

when there is somewhat scarce target labeled data

(6,500 instances). There are only two cases in

which another domain adaptation classiﬁer, the

SVM proposed by (Schweikert et al., 2009), out-

performed our proposed method.

5 CONCLUSIONS AND FUTURE

WORK

In this paper we proposed a simple domain adapta-

tion method to address the lack of or limited amount

of labeled data for a target domain, by leveraging the

large amount of labeled data from a related domain.

We evaluated this method on a biological problem,

splice site prediction, a critical step for gene annota-

tion, since many organisms have limited to no labeled

data, whereas related, more studied model organisms

have large amounts of labeled data.

From our experimental results we made a few ob-

servations, such as, in some cases simple features are

preferred over complex ones when the latter can lead

to sparse representations and decreased accuracy, and

vice versa; using more labeled data increases the ac-

curacy of the classiﬁer; and that as the distance be-

tween the domains increases the contribution of the

source data decreases. More importantly, we ob-

served that our proposed method performed better

than previously proposed methods with only a cou-

ple of exceptions, recommending it for ab initio splice

site prediction.

For future work we would like to explore ways to

further increase its accuracy. For example, we would

like to create balanced subsamples, through under-

sampling, and then training an ensemble of classiﬁers

on these subsamples. In addition, we would like to

experiment with ensembles of classiﬁers produced by

the different methods proposed, on balanced datasets.

Another direction for future work is to combine data

from multiple organisms and train a classiﬁer for a

target organism, i.e., use multiple source domains.

ACKNOWLEDGEMENTS

This work was supported by an Institutional Devel-

opment Award (IDeA) from the National Institute of

General Medical Sciences of the National Institutes

of Health under grant number P20GM103418. The

content is solely the responsibility of the authors and

does not necessarily represent the ofﬁcial views of

the National Institute of General Medical Sciences or

the National Institutes of Health. The computing for

this project was performed on the Beocat Research

Cluster at Kansas State University, which is funded

in part by grants MRI-1126709, CC-NIE-1341026,

MRI-1429316, CC-IIE-1440548.

REFERENCES

Arita, M., Tsuda, K., and Asai, K. (2002). Modeling splic-

ing sites with pairwise correlations. Bioinformatics,

18(suppl 2):S27–S34.

Baten, A. K., Chang, B. C., Halgamuge, S. K., and Li, J.

(2006). Splice site identiﬁcation using probabilistic

parameters and svm classiﬁcation. BMC bioinformat-

ics, 7(Suppl 5):S15.

Baten, A. K., Halgamuge, S. K., Chang, B., and Wickrama-

rachchi, N. (2007). Biological sequence data prepro-

cessing for classiﬁcation: A case study in splice site

identiﬁcation. In Advances in Neural Networks–ISNN

2007, pages 1221–1230. Springer.

Cai, D., Delcher, A., Kao, B., and Kasif, S. (2000). Model-

ing splice sites with bayes networks. Bioinformatics,

16(2):152–158.

Davis, J. and Goadrich, M. (2006). The relationship be-

tween precision-recall and roc curves. In Proceed-

ings of the 23rd international conference on Machine

learning, pages 233–240. ACM.

Giannoulis, G., Krithara, A., Karatsalos, C., and Paliouras,

G. (2014). Splice site recognition using transfer learn-

ing. In SETN, pages 341–353. Springer.

Gross, S. S., Do, C. B., Sirota, M., and Batzoglou, S.

(2007). Contrast: a discriminative, phylogeny-free ap-

proach to multiple informant de novo gene prediction.

Genome biology, 8(12):R269.

Herndon, N. and Caragea, D. (2014a). Empirical Study of

Domain Adaptation Algorithms on the Task of Splice

Site Prediction. Communications in Computer and In-

formation Science (CCIS 2014). Springer-Verlag.

Ab initio Splice Site Prediction with Simple Domain Adaptation Classiﬁers

251

Herndon, N. and Caragea, D. (2014b). Predicting pro-

tein localization using a domain adaptation approach.

In Biomedical Engineering Systems and Technologies,

pages 191–206. Springer.

Herndon, N. and Caragea, D. (2015). Domain adaptation

with logistic regression for the task of splice site pre-

diction. In 11th International Symposium on Bioin-

formatics Research and Applications, ISBRA 2015,

pages 125–137.

John, G. H. and Langley, P. (1995). Estimating continuous

distributions in bayesian classiﬁers. In Proceedings

of the Eleventh conference on Uncertainty in artiﬁcial

intelligence, pages 338–345. Morgan Kaufmann Pub-

lishers Inc.

Korf, I., Flicek, P., Duan, D., and Brent, M. R. (2001). Inte-

grating genomic homology into gene structure predic-

tion. Bioinformatics, 17(suppl 1):S140–S148.

Le Cessie, S. and Van Houwelingen, J. C. (1992). Ridge

estimators in logistic regression. Applied statistics,

pages 191–201.

Li, J., Wang, L., Wang, H., Bai, L., and Yuan, Z. (2012).

High-accuracy splice site prediction based on se-

quence component and position features. Genet Mol

Res, 11(3):3431–3451.

Li, Y., Li-Byarlay, H., Burns, P., Borodovsky, M., Robin-

son, G. E., and Ma, J. (2013). Truesight: a new algo-

rithm for splice junction detection using rna-seq. Nu-

cleic acids research, 41(4):e51–e51.

Minoche, A. E., Dohm, J. C., Schneider, J., Holtgr

awe, D.,

Vieh

over, P., Montfort, M., S

orensen, T. R., Weis-

shaar, B., and Himmelbauer, H. (2015). Exploiting

single-molecule transcript sequencing for eukaryotic

gene prediction. Genome biology, 16(1):1–13.

Ng, A. Y. and Jordan, M. I. (2001). On discriminative

vs. generative classiﬁers: a comparison of logistic

regression and naive bayes. In Proceedings of the

Neural Information Processing Systems Conference,

pages 841–848.

Schweikert, G., R

atsch, G., Widmer, C., and Sch

olkopf,

B. (2009). An empirical analysis of domain adap-

tation algorithms for genomic sequence analysis. In

Advances in Neural Information Processing Systems,

pages 1433–1440.

Sonnenburg, S., Schweikert, G., Philips, P., Behr, J., and

atsch, G. (2007). Accurate splice site prediction us-

ing support vector machines. BMC bioinformatics,

8(Suppl 10):S7.

Stanescu, A. and Caragea, D. (2014a). Ensemble-

based semi-supervised learning approaches for im-

balanced splice site datasets. In Bioinformatics and

Biomedicine (BIBM), 2014 IEEE International Con-

ference on, pages 432–437. IEEE.

Stanescu, A. and Caragea, D. (2014b). Semi-supervised

self-training approaches for imbalanced splice site

datasets. In Proceedings of the 6th International Con-

ference on Bioinformatics and Computational Biol-

ogy, BICoB, pages 131–136.

Steijger, T., Abril, J. F., Engstr

om, P. G., Kokocinski, F.,

Hubbard, T. J., Guig

o, R., Harrow, J., Bertone, P.,

Consortium, R., et al. (2013). Assessment of transcript

reconstruction methods for rna-seq. Nature methods,

10(12):1177–1184.

Zhang, Y., Chu, C.-H., Chen, Y., Zha, H., and Ji, X. (2006).

Splice site prediction using support vector machines

with a bayes kernel. Expert Systems with Applications,

30(1):73–81.

BIOINFORMATICS 2016 - 7th International Conference on Bioinformatics Models, Methods and Algorithms

252