Ab initio Splice Site Prediction with Simple Domain Adaptation Classifiers

Nic Herndon, Doina Caragea


Next-generation sequencing (NGS) technologies have made it affordable to sequence any organism, opening the door to assembling and annotating new genomes, even for non-model organisms. One option for annotating a genome is to assemble RNA-Seq reads into a transcriptome and align the transcriptome to the genome assembly to identify the protein-encoding genes. However, this approach has two problems. RNA-Seq is error-prone, so the gene models generated with this technique need to be validated. In addition, this method can only capture the genes expressed at the time of sequencing. Machine learning can help address both problems by generating ab initio gene models that provide supporting evidence for the models generated with RNA-Seq and predict additional genes that were not expressed during sequencing. However, machine learning algorithms need large amounts of labeled data to learn accurate classifiers, and newly sequenced, non-model organisms have insufficient labeled data. This can be addressed by using the abundant labeled data from a related model organism (the source domain) in conjunction with the little labeled data from the organism of interest (the target domain) to train a classifier in a domain adaptation setting. The method we propose uses this approach and produces accurate classifications on the task of splice site prediction, a difficult and essential step in gene prediction. The method is simple: it combines source and target labeled data, with different weights, into one dataset, and then trains a supervised classifier on the combined dataset. Despite its simplicity, it is surprisingly accurate, with highest areas under the precision-recall curve between 53.33% and 83.57%. Of the domain adaptation classifiers evaluated (SVM, naïve Bayes, and logistic regression), this method produced the best results in 12 of the 16 cases studied.
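The weighting idea described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy 4-nucleotide sequences, the weight values, the function names (`combine`, `train_weighted_nb`, `predict`), and the choice of a position-specific naive Bayes model as the supervised classifier. The sketch only shows how source and target examples might be merged with different instance weights before a single classifier is trained.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def combine(source, target, w_source, w_target):
    """Merge labeled source and target examples into one weighted dataset."""
    return ([(seq, label, w_source) for seq, label in source] +
            [(seq, label, w_target) for seq, label in target])

def train_weighted_nb(dataset, pseudocount=1.0):
    """Fit a position-specific naive Bayes model from weighted examples.

    Hypothetical stand-in for "a supervised classifier"; each example
    contributes its weight (instead of 1) to every count.
    """
    length = len(dataset[0][0])
    class_weight = defaultdict(float)
    counts = {}  # counts[label][position][nucleotide] -> summed weight
    for seq, label, w in dataset:
        class_weight[label] += w
        table = counts.setdefault(
            label, [defaultdict(float) for _ in range(length)])
        for pos, nuc in enumerate(seq):
            table[pos][nuc] += w
    total = sum(class_weight.values())
    model = {}
    for label, table in counts.items():
        denom = class_weight[label] + pseudocount * len(ALPHABET)
        log_prior = math.log(class_weight[label] / total)
        log_like = [{n: math.log((table[pos][n] + pseudocount) / denom)
                     for n in ALPHABET}
                    for pos in range(length)]
        model[label] = (log_prior, log_like)
    return model

def predict(model, seq):
    """Return the label with the highest log-posterior score."""
    def score(label):
        log_prior, log_like = model[label]
        return log_prior + sum(log_like[pos][nuc]
                               for pos, nuc in enumerate(seq))
    return max(model, key=score)

# Toy data (assumed): label 1 = splice site, label 0 = not a splice site.
source = [("AGTA", 1), ("CGTA", 1), ("AGCA", 0), ("CACA", 0)]
target = [("TGTG", 1), ("TCTG", 0)]

# Same data, two weightings: target examples emphasized vs. nearly ignored.
model_target_heavy = train_weighted_nb(combine(source, target, 1.0, 3.0))
model_source_heavy = train_weighted_nb(combine(source, target, 1.0, 0.1))

print(predict(model_target_heavy, "TCTG"))
print(predict(model_source_heavy, "TCTG"))
```

In this toy setup the two weightings disagree on the target sequence `TCTG`: with the target weight at 3.0 the model follows the target label (0), while with the target weight at 0.1 the source data dominates and the prediction becomes 1. How the weights should be chosen in practice is not stated in the abstract; a validation set from the target domain would be one plausible way to select them.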



Paper Citation

in Harvard Style

Herndon N. and Caragea D. (2016). Ab initio Splice Site Prediction with Simple Domain Adaptation Classifiers. In Proceedings of the 9th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3: BIOINFORMATICS, (BIOSTEC 2016), ISBN 978-989-758-170-0, pages 245-252. DOI: 10.5220/0005710502450252
