Splice Site Prediction: Transferring Knowledge Across Organisms

Simos Kazantzidis, Anastasia Krithara, George Paliouras

Abstract

As more genomes are sequenced, the need for automated gene prediction grows. One subproblem of gene prediction is splice site recognition. In eukaryotic genes, splice sites mark the boundaries between exons and introns. Although some organisms are well studied and their splice sites are known, many others have not been studied well enough. In this work, we propose two transfer learning approaches for the splice site recognition problem that exploit the knowledge available from well-studied organisms. We use different representations for the sequences, such as the n-gram graph representation and a representation based on biological motifs. Furthermore, we study the case where more than one organism is available for training, and we incorporate information from the phylogenetic analysis between organisms. An extensive evaluation indicates that the proposed representations and approaches are promising.
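
To make the n-gram graph representation mentioned above more concrete, the following is a minimal sketch in Python, not the implementation used in the paper. It assumes a simplified form of the idea: every n-gram of a nucleotide sequence is a node, and an edge links an n-gram to each n-gram that starts within a fixed window after it, weighted by how often the pair co-occurs. The function names, the parameters `n` and `window`, and the containment-style similarity are illustrative assumptions.

```python
from collections import Counter


def ngram_graph(seq, n=3, window=2):
    """Build a simplified n-gram graph for a sequence (illustrative only).

    Returns a Counter mapping (source n-gram, target n-gram) edges to the
    number of times the target starts within `window` positions after the
    source.
    """
    grams = [seq[i:i + n] for i in range(len(seq) - n + 1)]
    edges = Counter()
    for i, gram in enumerate(grams):
        last = min(i + window, len(grams) - 1)
        for j in range(i + 1, last + 1):
            edges[(gram, grams[j])] += 1
    return edges


def containment_similarity(g1, g2):
    """Fraction of the smaller graph's edges that also occur in the other."""
    if not g1 or not g2:
        return 0.0
    small, large = (g1, g2) if len(g1) <= len(g2) else (g2, g1)
    shared = sum(1 for edge in small if edge in large)
    return shared / len(small)


# Hypothetical toy sequences, not real splice site data.
graph_a = ngram_graph("ACGT", n=2, window=1)
graph_b = ngram_graph("ACGA", n=2, window=1)
similarity = containment_similarity(graph_a, graph_b)
```

In a classification setting, similarities of this kind between a candidate sequence and graphs built from known positive and negative examples could serve as features; how the paper actually derives its features is not specified in the abstract.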


Paper Citation


in Harvard Style

Kazantzidis S., Krithara A. and Paliouras G. (2017). Splice Site Prediction: Transferring Knowledge Across Organisms. In Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3: BIOINFORMATICS, (BIOSTEC 2017), ISBN 978-989-758-214-1, pages 160-167. DOI: 10.5220/0006164401600167


in Bibtex Style

@conference{bioinformatics17,
author={Simos Kazantzidis and Anastasia Krithara and George Paliouras},
title={Splice Site Prediction: Transferring Knowledge Across Organisms},
booktitle={Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3: BIOINFORMATICS, (BIOSTEC 2017)},
year={2017},
pages={160-167},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006164401600167},
isbn={978-989-758-214-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3: BIOINFORMATICS, (BIOSTEC 2017)
TI - Splice Site Prediction: Transferring Knowledge Across Organisms
SN - 978-989-758-214-1
AU - Kazantzidis S.
AU - Krithara A.
AU - Paliouras G.
PY - 2017
SP - 160
EP - 167
DO - 10.5220/0006164401600167