Authors:
Semyon Grigorev and Polina Lunina
Affiliation:
St. Petersburg State University, 7/9 Universitetskaya nab., St. Petersburg, Russia; JetBrains Research, Universitetskaya emb., 7-9-11/5A, St. Petersburg, Russia
Keyword(s):
Dense Neural Network, DNN, Machine Learning, Secondary Structure, Genomic Sequences, Proteomic Sequences, Formal Grammars, Parsing.
Related Ontology Subjects/Areas/Topics:
Algorithms and Software Tools; Bioinformatics; Biomedical Engineering; Data Mining and Machine Learning; Pattern Recognition, Clustering and Classification; Sequence Analysis
Abstract:
We propose a way to combine formal grammars and artificial neural networks for biological sequence processing. Formal grammars encode the secondary structure of the sequence, and neural networks deal with mutations and noise. In contrast to the classical approach, in which probabilistic grammars are used for secondary structure modeling, we propose to use arbitrary (not probabilistic) grammars, which simplifies grammar creation. Instead of modeling the structure of the whole sequence, we create a grammar that describes only features of the secondary structure. Then we use undirected matrix-based parsing to extract features: the fact that some substring can be derived from some nonterminal is a feature. After that, we use a dense neural network to process the features. In this paper, we describe in detail all the parts of our recipe: a grammar, a parsing algorithm, and a network architecture. We discuss possible improvements and future work. Finally, we provide the results of tRNA and 16S rRNA processing, which show the applicability of our idea to real problems.
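The feature-extraction step described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy grammar (a stem of complementary pairs around a three-nucleotide loop), the names parse_relations and feature_vector, and the sparse set-of-pairs encoding of the boolean parsing matrices are hypothetical and are not taken from the paper. The sketch computes, for every nonterminal, which substrings it derives, and then flattens the result for one nonterminal into a 0/1 vector of the kind a dense network could take as input.

```python
# Toy grammar in binary normal form. Hypothetical: NOT the grammar from the paper.
# S derives a stem: complementary pairs (a-u, c-g) wrapped around a 3-symbol loop.
TERMINALS = {"A": "a", "C": "c", "G": "g", "U": "u"}  # N (below) matches any symbol
BINARY_RULES = [
    ("L2", "N", "N"), ("L3", "N", "L2"),   # loop of exactly 3 symbols
    ("S", "A", "SU"), ("S", "U", "SA"),    # a ... u and u ... a
    ("S", "C", "SG"), ("S", "G", "SC"),    # c ... g and g ... c
    ("SU", "S", "U"), ("SU", "L3", "U"),
    ("SA", "S", "A"), ("SA", "L3", "A"),
    ("SG", "S", "G"), ("SG", "L3", "G"),
    ("SC", "S", "C"), ("SC", "L3", "C"),
]


def parse_relations(seq):
    """Return {nonterminal: set of (i, j)} such that the nonterminal derives seq[i:j]."""
    seq = seq.lower()
    rel = {name: set() for name in TERMINALS}
    rel["N"] = set()
    for head, _, _ in BINARY_RULES:
        rel.setdefault(head, set())
    # Initialise the relations from single symbols.
    for i, symbol in enumerate(seq):
        rel["N"].add((i, i + 1))
        for name, terminal in TERMINALS.items():
            if symbol == terminal:
                rel[name].add((i, i + 1))
    # Repeat head |= left x right for every rule until nothing changes.
    changed = True
    while changed:
        changed = False
        for head, left, right in BINARY_RULES:
            ends_by_start = {}
            for k, j in rel[right]:
                ends_by_start.setdefault(k, []).append(j)
            # Sparse form of the boolean matrix product rel[left] x rel[right].
            derived = {(i, j) for i, k in rel[left] for j in ends_by_start.get(k, ())}
            if not derived <= rel[head]:
                rel[head] |= derived
                changed = True
    return rel


def feature_vector(seq, nonterminal="S"):
    """Flatten the upper triangle of one parsing matrix into a 0/1 list."""
    pairs = parse_relations(seq)[nonterminal]
    n = len(seq)
    return [1 if (i, j) in pairs else 0 for i in range(n) for j in range(i + 1, n + 1)]


if __name__ == "__main__":
    print(sorted(parse_relations("gcaaagc")["S"]))  # [(0, 7), (1, 6)]
    print(feature_vector("gcaaagc"))
```

For the input gcaaagc, the toy grammar recognises the inner stem caaag and the full sequence, so exactly two entries of the vector are set. A sequence of fixed length n always yields n(n+1)/2 features per nonterminal, which is what makes the output usable as a fixed-size network input in this sketch.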