Authors: Nic Herndon and Doina Caragea

Affiliation: Kansas State University, United States

Keyword(s): Splice Site Prediction, Domain Adaptation, Imbalanced Data, Logistic Regression, Naïve Bayes.

Related Ontology Subjects/Areas/Topics: Bioinformatics ; Biomedical Engineering ; Data Mining and Machine Learning ; Sequence Analysis

Abstract: Next-generation sequencing (NGS) technologies have made it affordable to sequence any organism, opening the door to assembling and annotating new genomes, even for non-model organisms. One option for annotating a genome is to assemble RNA-Seq reads into a transcriptome and align the transcriptome to the genome assembly to identify the protein-encoding genes. However, this approach has two problems. RNA-Seq is error prone, so the gene models generated with this technique need to be validated. In addition, this method can only capture the genes expressed at the time of sequencing. Machine learning can help address both problems by generating ab initio gene models that provide supporting evidence for the models generated with RNA-Seq and that predict additional genes not expressed during sequencing. However, machine learning algorithms need large amounts of labeled data to learn accurate classifiers, and newly sequenced, non-model organisms have insufficient labeled data. This can be addressed by leveraging the abundant labeled data from a related model organism (the source domain) in conjunction with the little labeled data from the organism of interest (the target domain) to train a classifier in a domain adaptation setting. The method we propose uses this approach and generates accurate classifications on the task of splice site prediction, a difficult and essential step in gene prediction. It is simple: it combines source and target labeled data, with different weights, into one dataset, and then trains a supervised classifier on the combined dataset. Despite its simplicity it is surprisingly accurate, with highest areas under the precision-recall curve between 53.33% and 83.57%. Out of the domain adaptation classifiers evaluated (SVM, naïve Bayes, and logistic regression), this method produced the best results in 12 out of the 16 cases studied.
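
The combine-and-weight idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the one-hot encoding, the weight values (0.5 for source, 2.0 for target), the toy 4-base windows, and the helper names `one_hot`, `combine`, `train`, and `score` are all assumptions made for illustration, and a plain instance-weighted logistic regression stands in for whichever supervised classifier is actually used.

```python
import math

ALPHABET = "ACGT"

def one_hot(window):
    """Encode a DNA window as a flat one-hot feature vector (assumed encoding)."""
    return [1.0 if base == letter else 0.0 for base in window for letter in ALPHABET]

def combine(source, target, source_weight, target_weight):
    """Merge both domains into one list of (features, label, instance weight)."""
    combined = [(one_hot(seq), label, source_weight) for seq, label in source]
    combined += [(one_hot(seq), label, target_weight) for seq, label in target]
    return combined

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score(model, x):
    """Predicted probability that feature vector x is a positive example."""
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train(data, epochs=1000, lr=0.5, l2=1e-4):
    """Batch gradient descent on the instance-weighted logistic loss."""
    n = len(data[0][0])
    w, b = [0.0] * n, 0.0
    total = sum(weight for _, _, weight in data)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for x, y, weight in data:
            # Each example's error is scaled by its domain weight.
            err = weight * (score((w, b), x) - y)
            grad_b += err
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
        w = [wi - lr * (g / total + l2 * wi) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / total
    return w, b

# Made-up toy data: label 1 if the window ends in "AG", else 0.
source = [("CCAG", 1), ("TTAG", 1), ("GCAG", 1), ("CCAT", 0),
          ("TTGG", 0), ("GCCC", 0), ("AATT", 0), ("ACGT", 0)]
target = [("ATAG", 1), ("ATAC", 0)]

# Hypothetical weights: scarce target examples count more than source ones.
combined = combine(source, target, source_weight=0.5, target_weight=2.0)
model = train(combined)

p_pos = score(model, one_hot("ATAG"))
p_neg = score(model, one_hot("ATAC"))
print(f"P(site | ATAG) = {p_pos:.3f}, P(site | ATAC) = {p_neg:.3f}")
```

In this sketch the domain weights are the only domain adaptation mechanism: raising `target_weight` relative to `source_weight` makes the few target examples dominate the loss, while lowering it lets the abundant source data drive the fit. How the weights should be chosen in practice is not stated in the abstract.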

CC BY-NC-ND 4.0


Paper citation in several formats:
Herndon, N. and Caragea, D. (2016). Ab initio Splice Site Prediction with Simple Domain Adaptation Classifiers. In Proceedings of the 9th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2016) - BIOINFORMATICS; ISBN 978-989-758-170-0; ISSN 2184-4305, SciTePress, pages 245-252. DOI: 10.5220/0005710502450252

@conference{bioinformatics16,
author={Nic Herndon and Doina Caragea},
title={Ab initio Splice Site Prediction with Simple Domain Adaptation Classifiers},
booktitle={Proceedings of the 9th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2016) - BIOINFORMATICS},
year={2016},
pages={245-252},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005710502450252},
isbn={978-989-758-170-0},
issn={2184-4305},
}

TY - CONF

JO - Proceedings of the 9th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2016) - BIOINFORMATICS
TI - Ab initio Splice Site Prediction with Simple Domain Adaptation Classifiers
SN - 978-989-758-170-0
IS - 2184-4305
AU - Herndon, N.
AU - Caragea, D.
PY - 2016
SP - 245
EP - 252
DO - 10.5220/0005710502450252
PB - SciTePress