A Sequence-Motif Based Approach to Protein Function Prediction via Deep-CNN Architecture

Vikash Kumar^1, Ashish Ranjan^2, Deng Cao^3, Gopalakrishnan Krishnasamy^3 and Akshay Deepak^1
^1 National Institute of Technology Patna, Patna, India
^2 ITER, Siksha 'O' Anusandhan Deemed to be University, Bhubaneswar, India
^3 Associate Professor, Department of Mathematics & Computer Science, Central State University, Wilberforce, Ohio, U.S.A.
Keywords:
Protein Sequence, Convolutional Neural Network, Protein Sub-Sequence, Consistency Factor.
Abstract:
The challenge of determining protein functions, inferred from the study of protein sub-sequences, is a complex problem. Little literature exists in this regard, while a broad survey shows that existing approaches are biased toward full-length protein sequences. In this paper, a CNN-based architecture is introduced to detect motif information in a sub-sequence and predict its function. The functional inferences for sub-sequences are then used to facilitate the functional annotation of the full-length protein sequence. The results of the proposed approach demonstrate a promising future for further exploration of sub-sequence based protein studies. Comparisons with ProtVecGen-Plus, a (multi-segment + LSTM) approach, demonstrate improvements of +1.24% and +4.66% for the biological process (BP) and molecular function (MF) sub-ontologies, respectively. Further, the proposed method outperformed the hybrid ProtVecGen-Plus + MLDA approach by a margin of +3.45% for the MF dataset, while ranking second for the BP dataset. Overall, the proposed method produced better results for significantly long protein sequences (having sequence length > 500 amino acids).
1 INTRODUCTION
The study of the role of proteins in (i) disease pathobiology, (ii) the examination of metagenomes, and (iii) the discovery of therapeutic targets is an important task that requires deep knowledge about the functions of proteins. In this regard, the acquisition of functional knowledge about proteins is well supported by computational approaches, which are fast and economical, though considerable effort is still needed to keep pace with the evolving dynamics of proteins: less than 1% of proteins have reviewed annotations.^1 The recent trend in inferring protein function(s) shows a bias of existing works toward protein sequences (Jiang et al., 2016), (Kumari et al., 2019), (Radivojac et al., 2013), (Fa et al., 2018), (Kulmanov and Hoehndorf, 2020), (Makrodimitris et al., 2019), (Ranjan et al., 2019), (Ranjan et al., 2021), mostly due to their abundant and cheap availability, though works based on protein structures (Yang et al., 2015), (Gligorijević et al., 2021), protein interaction networks (Kulmanov et al., 2018), and other sources (You et al., 2018) are also available.

^1 This statistic is based on information from UniProtKB (Consortium, 2015).
Protein sequences encode vital patterns that arise from interactions among amino acids, which in turn fold into protein sub-structures, for example binding sites, that perform functions. This justifies the need for sub-sequence based approaches; however, the existing approaches are primarily focused on full-length protein sequences (Cao et al., 2017), (Kulmanov et al., 2018), (Kulmanov and Hoehndorf, 2020), which makes function prediction somewhat less effective. There exist only a few notable works (Ranjan et al., 2019), (Ranjan et al., 2021) that have demonstrated the utility of a sub-sequence-based methodology. In (Ranjan et al., 2019), the proposed solution is a (multi-segmentation + LSTM) based framework. The other work (Ranjan et al., 2021) is an ensemble (multi-segmentation + tf-idf + MLDA) method. Both works utilize predicted function(s) for protein sub-sequences to infer the function(s) of the full-length protein sequence.
Convolutional neural networks (CNNs) have re-
cently gained popularity as a strong alternative to re-
current neural networks (RNNs), automating feature
representations for biological sequences, and for a va-
riety of tasks such as function prediction (Kulmanov
et al., 2018), (Kulmanov and Hoehndorf, 2020), drug-target prediction (Öztürk et al., 2018), (Öztürk et al., 2019), etc. In (Kulmanov et al., 2018) and (Kulmanov and Hoehndorf, 2020), CNNs were applied to complete protein sequences to infer protein function(s). This paper proposes a framework that uses a deep CNN-based architecture to first infer the function(s) of protein sub-sequences, and then uses the inferred function(s) for the sub-sequences to determine the function(s) of the full-length protein sequence. The proposed CNN-based architecture extracts motif information from the sub-sequence and uses it to predict the GO-term(s) for the protein sub-sequence.
Evaluations of the proposed framework, conducted on two independent datasets corresponding to the biological process (BP) and molecular function (MF) sub-ontologies,^2 demonstrated its effectiveness. With respect to the similar multi-segment based ProtVecGen-Plus (Ranjan et al., 2019), which is built on RNNs (specifically, an LSTM network), the proposed framework demonstrated improvements of +1.24% for the BP dataset and +4.66% for the MF dataset. Further, when compared to the hybrid ProtVecGen-Plus + MLDA (Ranjan et al., 2019) method, the proposed work produced an improvement of +3.45% for the MF dataset, while ranking second for the BP dataset. The proposed method showed better results for handling longer protein sequences (having sequence length > 500 amino acids).

The paper is organized as follows: Section 2 describes the datasets used for the experiments and the proposed method, Section 3 discusses the results, and Section 4 concludes.

^2 Defined by the Gene Ontology Consortium (Ashburner et al., 2000).
2 DATASETS AND METHODS
Here, we will discuss the experimental datasets, the
steps for the segmented dataset construction, and the
proposed method.
2.1 Datasets
Experiments were conducted on two datasets, corresponding to the biological process (BP) and molecular function (MF) sub-ontologies as defined by the Gene Ontology (GO) (Ashburner et al., 2000). These datasets were created by downloading reviewed protein sequences and their mapped functional annotations from the UniProtKB/SwissProt (2017) repository (Consortium, 2015). A functional annotation is a unique identifier, known as a GO-term, that indicates a distinct protein function. The biological process dataset has 58,310 protein sequences and 295 unique GO-terms. The other dataset, molecular function, has 43,218 protein sequences and 135 unique GO-terms. Each GO-term is required to annotate at least 200 protein sequences. Only proteins with a sequence length greater than 40 and less than 2,000 amino acids were chosen for this study.

Figure 1: Distribution of protein sequence lengths for the BP and MF datasets.
2.2 Steps to Construct the Protein
Sub-Sequence Dataset
The proposed framework infers protein function(s) from protein sub-sequences; hence, the steps taken to construct the segmented dataset are explained next (a code sketch follows the list). Let the training dataset be denoted as S = [s_i, Y_i]_{i=1}^{n}, where s_i and Y_i denote the i-th protein sequence and the corresponding GO-term(s), respectively. As shown in Figure 1, for both the BP and MF sub-ontologies, a large number of protein sequences have lengths of around 200 to 300. So, the maximum length for the protein sub-sequences is set to 200, with a gap of 60 amino acids between two consecutive sub-sequences.

1. For each protein sample pair (s_i, Y_i), i \in \{1, 2, ..., n\}, the protein sequence s_i is split to generate a set of protein sub-sequences of size 200. Zero-padding is applied to short protein sub-sequences.

2. The output labels for each sub-sequence are assumed to be those of the parent protein sequence.
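A minimal sketch of this construction is given below. We read the 60-amino-acid gap as the offset between the start positions of consecutive sub-sequences; the function name and the '0' padding symbol are our choices, not the paper's.

def segment_sequence(seq, sub_len=200, gap=60):
    # Split a protein sequence into sub-sequences of length sub_len,
    # zero-padding the trailing sub-sequence when it is too short.
    subs = []
    for start in range(0, len(seq), gap):
        subs.append(seq[start:start + sub_len].ljust(sub_len, "0"))
        if start + sub_len >= len(seq):  # all remaining residues covered
            break
    return subs

# Step 2: every sub-sequence inherits the parent's GO-term labels, e.g.:
# segmented = [(sub, Y_i) for (s_i, Y_i) in S for sub in segment_sequence(s_i)]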
Figure 2: Proposed model architecture with the number of sub-networks denoted as N = 8. The dashed lines indicate the residual connections. [Diagram: a one-hot encoded protein sequence (dim = 21) passes through CNN Block 1 (filter size 1 x 5, 96 filters) and CNN Block 2 (filter size 1 x 7, 96 filters), then splits into N sub-networks. Each sub-network comprises CNN Block 3,i (filter size 1 x 9, 96 filters), CNN Block 4,i (filter sizes 1 x 6, 1 x 9, ..., 1 x 27; 128 filters), a MaxPooling layer, and a Flatten layer. The flattened outputs are concatenated, passed through Dropout (0.4), and fed to the output layer. Each CNN block consists of a 1D CNN, Batch-Normalization, and LeakyReLU.]
2.3 Proposed Method
The paper introduces a two-step framework for infer-
ring protein function: (i) a deep stacked CNN-based
architecture is used to first infer function(s) of pro-
tein sub-sequences, and (ii) the inferred function(s)
for protein sub-sequences are used to determine func-
tion(s) of the full-length protein sequence. A discussion of the input sequence representation and the proposed architecture is given next.
2.3.1 One-Hot Encoding Based Input Sequence
Representation
Protein sequences are pre-processed and represented as strings of amino acids, where each amino acid is represented via the one-hot encoding scheme. The dimension of the one-hot encoding is taken as 21: the first dimension indicates the rare amino acids (O, U, X), whereas the remaining 20 dimensions correspond to the well-known 20 amino acids.

Table 1: Hyper-parameter configurations of the different CNN blocks.

S.No.  CNN Block     Filter Size  Filter Count  Note
1.     CNN Block 1   1 x 5        96
2.     CNN Block 2   1 x 7        96
3.     CNN Block 3   1 x 9        96
4.     CNN Block 4   1 x F_s      128           F_s denotes a filter size in the range 6, 9, 12, ..., (3N + 3).
2.3.2 Stacked CNN-Based Architecture
The proposed architecture employs stacked layers of Convolutional Neural Network (CNN) blocks to process protein sub-sequences character-by-character (where each amino acid is a character). The most notable feature of CNNs is their ability to capture local dependencies between amino acids through trainable filters, which aid in transforming the protein sequence into a viable representation. The complete architecture is shown in Figure 2. Two consecutive CNN blocks (CNN Block 1 and CNN Block 2) feed a set of parallel sub-networks, whose outputs are combined further down the architecture. The components of the CNN blocks and the sub-networks are discussed next.
2.1) CNN Block: The CNN block, as shown in Figure 2, has the following layers (a code sketch follows the list).
1. 1D-CNN Layer: The purpose of the 1-dimensional CNN layer is to learn local dependencies between the amino acids along the sequence. The hyper-parameters customized for the different CNN blocks are the number of filters and the filter size; their settings are detailed in Table 1. Let I_f[n] denote the output feature map after applying the convolution operation:

I_f[n] = x[n] * h[n] = \sum_{k} x[k] \cdot h[n - k]    (1)

where,
– h[n] is the kernel,
– x[n] is the input feature,
– * denotes the convolution operation.
2. Batch-Normalization Layer: This layer acts as a regularizer, normalizing activations using the statistics of the mini-batch (Ioffe and Szegedy, 2015). Let I_f denote the input feature map corresponding to the f-th filter, where 1 \le f \le 128 (given in Table 1). The output of the batch-normalization layer for the f-th input feature map, denoted I'_f, is defined as:

I'_f = \gamma_f \cdot \left( \frac{I_f - m[I_f]}{\sqrt{var[I_f]}} \right) + \beta_f    (2)

where,
– I_f is the f-th feature map,
– m[I_f] is the mean of the f-th feature map,
– var[I_f] is the variance of the f-th feature map,
– \gamma_f and \beta_f are two learnable parameters that scale and shift the normalized feature map, respectively.
3. LeakyReLU Layer: This layer transforms the output of the previous layer as given in Equation 3 and prevents units from becoming non-functional (Maas et al., 2013).

f(I_f) = \begin{cases} \alpha I_f, & \text{if } I_f < 0 \\ I_f, & \text{otherwise} \end{cases}    (3)

where,
– \alpha is a constant taken as 0.2,
– I_f is the input to the LeakyReLU layer.
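A concrete reading of this block structure in Keras is sketched below; the 'same' padding choice is our assumption, as the paper does not specify it.

from tensorflow import keras
from tensorflow.keras import layers

def cnn_block(x, filters, kernel_size):
    # One CNN block of Figure 2: 1D convolution, batch
    # normalization, then LeakyReLU with alpha = 0.2.
    x = layers.Conv1D(filters, kernel_size, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.LeakyReLU(0.2)(x)
    return x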
2.2) Components of a Sub-Network: Each sub-network is composed of the following layers:

1. Two consecutive CNN blocks (represented as CNN Block_{3,N} and CNN Block_{4,N}, where N indexes the sub-network). Different-sized filters in CNN Block_{4,N} are used to extract motifs of different sizes, as given in Table 1.

2. A MaxPooling layer, employed to extract features that emphasize the important motifs present in protein sub-sequences. This also helps prevent over-fitting by reducing the size of the feature maps:

I'_f = \frac{(I_f - k)}{s} + 1    (4)

where,
– I'_f is the length of the output feature vector,
– I_f is the length of the input feature vector,
– k is the kernel size,
– s is the stride.

3. A flatten layer, the last layer of each sub-network, which reduces the output to a 1-dimensional vector.
The combined output from all sub-networks, obtained using the concatenate layer, is then passed to a dropout layer (with dropout probability = 0.4). Finally, an output layer with sigmoid activation is used as the classification layer.

Importantly, residual connections (shown with dashed lines) are added between the CNN blocks for efficient training of the deep neural architecture, as shown in Figure 2. These significantly improve the network's ability to overcome vanishing gradients. The hyper-parameters for the proposed architecture are shown in Table 2.
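Putting the pieces together, the following sketch assembles the architecture under our reading of Figure 2 and Table 1. It reuses cnn_block from the previous sketch; the residual-connection placement is omitted and the pooling size is an assumption, as neither is fully specified in the paper.

def build_model(n_sub=8, seq_len=200, n_classes=295):
    # n_classes: 295 GO-terms for the BP dataset, 135 for MF (Section 2.1).
    inputs = keras.Input(shape=(seq_len, 21))  # one-hot sub-sequence
    x = cnn_block(inputs, 96, 5)               # CNN Block 1
    x = cnn_block(x, 96, 7)                    # CNN Block 2
    branches = []
    for i in range(n_sub):                     # N parallel sub-networks
        b = cnn_block(x, 96, 9)                # CNN Block 3,i
        b = cnn_block(b, 128, 3 * i + 6)       # CNN Block 4,i: sizes 6, 9, ..., 3N + 3
        b = layers.MaxPooling1D()(b)           # pool size 2 is our assumption
        branches.append(layers.Flatten()(b))
    x = layers.Concatenate()(branches)
    x = layers.Dropout(0.4)(x)
    outputs = layers.Dense(n_classes, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=5e-4,
                                                  clipvalue=5.0),
                  loss="binary_crossentropy")  # settings per Table 2
    return model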
2.3.3 Final Prediction for the Full-Length
Protein Sequence
The mean of the inferred protein functions obtained for all the protein sub-sequences is computed to get the final labels for the full-length protein sequence.

Table 2: Hyper-parameter configurations.

S.No.  Hyper-Parameter  Value
1.     Optimizer        Adam (Kingma and Ba, 2014)
2.     Loss Function    Binary Cross-Entropy
3.     Learning Rate    5e-4
4.     Clip Value       5.0
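A sketch of this final aggregation step, reusing segment_sequence and one_hot_encode from the earlier sketches (the 0.5 decision threshold is our assumption; the paper does not state one):

import numpy as np

def predict_full_sequence(model, seq, threshold=0.5):
    # Predict GO-term probabilities for every sub-sequence, average
    # them across sub-sequences, then threshold for multi-label output.
    subs = segment_sequence(seq)                     # Section 2.2
    x = np.stack([one_hot_encode(s) for s in subs])  # Section 2.3.1
    mean_probs = model.predict(x).mean(axis=0)       # mean over sub-sequences
    return mean_probs >= threshold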
3 RESULTS AND DISCUSSION
We trained the model using Keras with TensorFlow as the backend. The datasets discussed in Section 2.1 were split into training (75%) and testing (25%) sets. For monitoring the training, we used model check-pointing and an early-stopping criterion, along with 10% of the training data as validation data.
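In Keras, this monitoring setup might look as follows; the checkpoint path, patience, epoch count, and batch size are our choices, as the paper does not report them, and x_train / y_train denote the encoded sub-sequences and their multi-hot labels.

model = build_model(n_sub=9, n_classes=295)  # sketch from Section 2.3.2
callbacks = [
    keras.callbacks.ModelCheckpoint("best_model.h5", save_best_only=True),
    keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True),
]
model.fit(x_train, y_train, validation_split=0.1,
          epochs=100, batch_size=128, callbacks=callbacks)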
3.1 Evaluation Metrics
Let Y_i = \{y_{i1}, y_{i2}, ...\} denote the actual GO-terms and P_i = \{p_{i1}, p_{i2}, ...\} denote the predicted GO-terms for protein sequence S_i, i \in \{1, 2, ..., n\}. The metrics are defined as follows:

1. Average Recall: Recall captures the fraction of the actual GO-terms that the model predicts correctly.

Re_{avg} = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_i \cap P_i|}{|Y_i|}    (5)

2. Average Precision: Precision captures the fraction of the predicted GO-terms that are actually true.

Pr_{avg} = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_i \cap P_i|}{|P_i|}    (6)

3. Average F1-Score: The F1-score balances precision and recall, never exceeding the lower of the two.

F1_{avg} = \frac{1}{n} \sum_{i=1}^{n} \frac{2|Y_i \cap P_i|}{|Y_i| + |P_i|}    (7)

4. Consistency factor: This variance-based measure reflects a method's overall generality with regard to protein sequences of various lengths. It is defined as follows:

Consistency\ factor = \sqrt{\frac{1}{4} \sum_{r_i} (f1 - \bar{f1}_{r_i})^2}    (8)

where \bar{f1}_{r_i} is the average f1 score for the test samples in the sequence-length range r_i, and f1 is the overall f1 score for the test dataset. A low value indicates high consistency, and vice versa.
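The first three metrics translate directly into code, as in the sketch below (Y and P are lists of Python sets of GO-terms; we assume every protein has at least one actual and one predicted GO-term, so the denominators are non-zero):

def avg_metrics(Y, P):
    # Average recall, precision, and F1-score over n proteins (Eqs. 5-7).
    n = len(Y)
    rec = sum(len(y & p) / len(y) for y, p in zip(Y, P)) / n
    pre = sum(len(y & p) / len(p) for y, p in zip(Y, P)) / n
    f1 = sum(2 * len(y & p) / (len(y) + len(p)) for y, p in zip(Y, P)) / n
    return rec, pre, f1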
3.2 Baseline Comparison Methods
Much work has been done in the past to predict protein function using GO annotations and amino acid sequences. Notable works used for a fair comparison include:
3.2.1 MLDA (Wang et al., 2016)
MLDA stands for Multi-Label Linear Discriminant Analysis. This method is based on the complete protein sequence and uses tf-idf features, reduced in dimension using multi-label LDA, as the input representation for the protein sequence. To reduce the features, MLDA projects the input features into another feature space.
3.2.2 ProtVecGen-Plus (Ranjan et al., 2019)
This work of ours is the first to present a deep-learning-based method that exploits protein sub-sequences to infer functional annotation(s) for the full-length protein sequence. To infer protein functions, multiple LSTM-based network architectures are used, each trained with different-sized protein sub-sequences (i.e., 100, 120, 140).
3.2.3 ProtVecGen-Ensemble (Ranjan et al.,
2021)
Another sub-sequence-based method, this time employing the tf-idf + MLDA technique. It additionally discards protein sub-sequences found to be less informative, relying on the remaining sub-sequences to infer annotation(s) for the full-length protein sequence.
3.2.4 ProtVecGen-Plus + MLDA (Ranjan et al.,
2019)
This is an ensemble of standard machine learning and deep learning methods: the results of the MLDA approach are combined with those of the deep-learning-based ProtVecGen-Plus (Ranjan et al., 2019). This method showed great potential for predicting protein functions.
Table 3: Biological Process Dataset: Classification report with respect to different numbers of sub-networks with the proposed approach on protein sub-sequences (N = number of sub-networks; CF = consistency factor).

                              Full-Length Sequence Approach            Sub-Sequence Approach
S.No.  N  #params (Millions)  Pr_avg(%)  Rec_avg(%)  F1_avg(%)  CF     Pr_avg(%)  Rec_avg(%)  F1_avg(%)  CF
1.     6  1.79                92.75      33.53       34.44      8.101  56.25      55.20       52.97      5.229
2.     7  2.21                93.17      32.44       33.37      8.707  57.31      56.61       54.19      4.690
3.     8  2.66                92.30      33.84       34.73      8.960  58.98      57.17       55.45      4.917
4.     9  3.15                91.99      35.48       36.33      9.232  59.70      57.32       55.89      4.471
Table 4: Molecular Function Dataset: Classification report with respect to different numbers of sub-networks with the proposed approach on protein sub-sequences (N = number of sub-networks; CF = consistency factor).

                              Full-Length Sequence Approach            Sub-Sequence Approach
S.No.  N  #params (Millions)  Pr_avg(%)  Rec_avg(%)  F1_avg(%)  CF     Pr_avg(%)  Rec_avg(%)  F1_avg(%)  CF
1.     6  1.67                94.34      45.40       46.24      5.722  71.23      70.94       69.09      5.013
2.     7  2.07                93.24      50.82       51.35      5.544  71.42      71.14       69.34      4.970
3.     8  2.50                94.35      49.58       50.35      5.938  72.25      72.02       70.08      5.125
4.     9  2.97                93.68      49.84       50.84      5.481  72.72      72.21       70.57      4.057
3.3 Study of the Effect of the Number of Sub-Networks

In this sub-section, the effect of the number of sub-networks (denoted N, with N taken as 6, 7, 8, and 9) on the proposed architecture is studied for both datasets. The experiments are conducted for two different cases:

1. Full-length Protein Sequence: The proposed architecture is trained and evaluated on full-length protein sequences. Protein sequences longer than 500 amino acids are truncated.

2. Protein Sub-sequence: This represents the complete proposed framework, which is based on protein sub-sequences.

This study allows us to understand the advantages of the sub-sequence based method over methods based on a full-length sequence model. The results for both cases are reported in Table 3 (for the BP dataset) and Table 4 (for the MF dataset).
On increasing the number of sub-networks, a general increase is observed across the performance metrics Pr_avg, Re_avg, and F1_avg. Increasing the number of sub-networks also helps improve the consistency factor; a lower consistency factor indicates better generalization of the model to protein sequences of different lengths. For the sub-sequence based framework, the best F1_avg scores are 55.89% (for the BP dataset) and 70.57% (for the MF dataset), both obtained with nine sub-networks. These observations hold for the full-length sequence model as well.

Importantly, the results shown in Tables 3 and 4 clearly indicate that the proposed sub-sequence based framework is superior. The performance of the full-length sequence model is notably worse than that of the sub-sequence based framework, regardless of the number of sub-networks. Moreover, the full-length sequence model tends to favor particular protein sequence lengths, as quantified by the consistency factor. With respect to the best F1_avg, the sub-sequence based framework improves on the full-length sequence model by 19.56% for the BP dataset and 19.73% for the MF dataset.

The poor performance of the full-length sequence model can be attributed primarily to its inability to efficiently retain the useful information, which is masked by the presence of too much irrelevant information, especially for long protein sequences.
Table 5: Comparison between the state-of-the-art approaches and the proposed model (CF = consistency factor).

                                                     Biological Process                     Molecular Function
S.No.  Approach                                      Pr_avg(%)  Rec_avg(%)  F1_avg(%)  CF      Pr_avg(%)  Rec_avg(%)  F1_avg(%)  CF
1.     MLDA (Wang et al., 2016)                      52.61      49.42       49.27      10.969  60.20      58.29       57.91      8.408
2.     ProtVecGen-Plus (Ranjan et al., 2019)         56.65      56.42       54.65      5.681   67.42      66.93       65.91      4.732
3.     ProtVecGen-Ensemble (Ranjan et al., 2021)     58.59      56.09       55.34      5.056   67.69      66.32       65.47      3.279
4.     ProtVecGen-Plus + MLDA (Ranjan et al., 2019)  58.80      58.19       56.68      5.281   68.27      68.62       67.12      5.022
5.     Proposed model                                59.70      57.32       55.89      4.471   72.72      72.21       70.57      4.057
3.4 Overall Comparison with
State-of-the-Art Approaches
In this section, the proposed model is compared with the existing state-of-the-art works, which include: (i) Multi-label LDA (MLDA) (Wang et al., 2016), (ii) ProtVecGen-Plus (Ranjan et al., 2019), (iii) ProtVecGen-Ensemble (Ranjan et al., 2021), and (iv) the hybrid approach ProtVecGen-Plus + MLDA (Ranjan et al., 2019). The observed performance metrics for each method are shown in Table 5 for both the BP and MF datasets.
3.4.1 Comparison of Predictive Performance
For the BP dataset, the proposed approach clearly betters the results of the MLDA (Wang et al., 2016), ProtVecGen-Plus (Ranjan et al., 2019), and ProtVecGen-Ensemble (Ranjan et al., 2021) approaches, with respective absolute improvements in F1_avg of +6.62%, +1.24%, and +0.55%, as shown in Table 5. A similar trend is seen for the MF dataset, with the proposed approach showing improvements of +12.66%, +4.66%, and +5.10% over MLDA (Wang et al., 2016), ProtVecGen-Plus (Ranjan et al., 2019), and ProtVecGen-Ensemble (Ranjan et al., 2021), respectively. The other metrics, Pr_avg and Re_avg, follow this behavior as well.
In comparison to ProtVecGen-Plus + MLDA (Ranjan et al., 2019), the proposed method stood second for the BP dataset, while comfortably outperforming it for the MF dataset, where the increase in F1_avg is +3.45%.
3.4.2 Comparison of Consistency Factor
The consistency factor indicates a model's ability to perform consistently for protein sequences of various lengths. For the BP dataset, the consistency factor of the proposed model is reduced by 6.49, 1.21, 0.58, and 0.81 units with respect to MLDA (Wang et al., 2016), ProtVecGen-Plus (Ranjan et al., 2019), ProtVecGen-Ensemble (Ranjan et al., 2021), and ProtVecGen-Plus + MLDA (Ranjan et al., 2019), respectively. For MF, the proposed model stood second to ProtVecGen-Ensemble (Ranjan et al., 2021) (consistency factor = 3.279), while reducing the consistency factor by 4.35, 0.67, and 0.96 units relative to MLDA, ProtVecGen-Plus, and ProtVecGen-Plus + MLDA, respectively.
Figure 3: Biological Process: Length-wise performances of protein sequences. [Plot: average F1-scores vs. protein sequence length ranges (0-200, 201-500, 501-1000, >1000) for MLDA, ProtVecGen-Plus, ProtVecGen-Ensemble, ProtVecGen-Plus + MLDA, and the proposed method.]
A detailed illustration of the performance of the different methods on protein sequences of various lengths is obtained by grouping the test protein sequences into four length ranges, as shown in Figures 3 (BP) and 4 (MF). The proposed method performs particularly well for significantly long protein sequences (having sequence length > 500 amino acids).
Figure 4: Molecular Function: Length-wise performances of protein sequences. [Plot: average F1-scores vs. protein sequence length ranges (0-200, 201-500, 501-1000, >1000) for MLDA, ProtVecGen-Plus, ProtVecGen-Ensemble, ProtVecGen-Plus + MLDA, and the proposed method.]
4 CONCLUSION
In this work, a sub-sequence based method for protein function prediction is introduced. The proposed method benefits from information collected from multiple sequence motifs, captured using the CNN network, to determine the function(s) of each sub-sequence. The functional inferences for the sub-sequences are then used to facilitate the functional annotation of the full-length protein sequence. Overall, the proposed method showed great potential, especially for long protein sequences. Research focused on protein sub-sequences remains an open area and can be a great asset for improving protein studies. Future work will focus on merging additional features and putting different deep learning models to the test.
REFERENCES
Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D.,
Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K.,
Dwight, S. S., Eppig, J. T., et al. (2000). Gene ontol-
ogy: tool for the unification of biology. Nature genet-
ics, 25(1):25–29.
Cao, R., Freitas, C., Chan, L., Sun, M., Jiang, H., and Chen,
Z. (2017). Prolango: protein function prediction using
neural machine translation based on a recurrent neural
network. Molecules, 22(10):1732.
Consortium, U. (2015). Uniprot: a hub for protein informa-
tion. Nucleic acids research, 43(D1):D204–D212.
Fa, R., Cozzetto, D., Wan, C., and Jones, D. T. (2018). Pre-
dicting human protein function with multi-task deep
neural networks. PloS one, 13(6):e0198216.
Gligorijević, V., Renfrew, P. D., Kosciolek, T., Leman, J. K., Berenberg, D., Vatanen, T., Chandler, C., Taylor, B. C., Fisk, I. M., Vlamakis, H., et al. (2021). Structure-based protein function prediction using graph convolutional networks. Nature communications, 12(1):1–14.
Ioffe, S. and Szegedy, C. (2015). Batch normalization: Ac-
celerating deep network training by reducing internal
covariate shift. In International conference on ma-
chine learning, pages 448–456. PMLR.
Jiang, Y., Oron, T. R., Clark, W. T., Bankapur, A. R.,
D’Andrea, D., Lepore, R., Funk, C. S., Kahanda, I.,
Verspoor, K. M., Ben-Hur, A., et al. (2016). An
expanded evaluation of protein function prediction
methods shows an improvement in accuracy. Genome
biology, 17(1):1–19.
Kingma, D. P. and Ba, J. (2014). Adam: A
method for stochastic optimization. arXiv preprint
arXiv:1412.6980.
Kulmanov, M. and Hoehndorf, R. (2020). Deepgoplus:
improved protein function prediction from sequence.
Bioinformatics, 36(2):422–429.
Kulmanov, M., Khan, M. A., and Hoehndorf, R. (2018).
Deepgo: predicting protein functions from sequence
and interactions using a deep ontology-aware classi-
fier. Bioinformatics, 34(4):660–668.
Kumari, D., Ranjan, A., and Deepak, A. (2019). Pro-
tein function prediction: Combining statistical fea-
tures with deep learning. In Proceedings of 2nd In-
ternational Conference on Advanced Computing and
Software Engineering (ICACSE).
Maas, A. L., Hannun, A. Y., Ng, A. Y., et al. (2013). Rec-
tifier nonlinearities improve neural network acoustic
models. In Proc. icml, volume 30, page 3. Citeseer.
Makrodimitris, S., van Ham, R. C., and Reinders, M. J.
(2019). Improving protein function prediction using
protein sequence and go-term similarities. Bioinfor-
matics, 35(7):1116–1124.
Öztürk, H., Özgür, A., and Ozkirimli, E. (2018). Deepdta: deep drug–target binding affinity prediction. Bioinformatics, 34(17):i821–i829.
Öztürk, H., Ozkirimli, E., and Özgür, A. (2019). Widedta: prediction of drug-target binding affinity. arXiv preprint arXiv:1902.04166.
Radivojac, P., Clark, W. T., Oron, T. R., Schnoes, A. M.,
Wittkop, T., Sokolov, A., Graim, K., Funk, C., Ver-
spoor, K., Ben-Hur, A., et al. (2013). A large-scale
evaluation of computational protein function predic-
tion. Nature methods, 10(3):221–227.
Ranjan, A., Fahad, M. S., Fernández-Baca, D., Deepak, A., and Tripathi, S. (2019). Deep robust framework for protein function prediction using variable-length protein sequences. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 17(5):1648–1659.
Ranjan, A., Fernandez-Baca, D., Tripathi, S., and Deepak,
A. (2021). An ensemble tf-idf based approach to pro-
tein function prediction via sequence segmentation.
IEEE/ACM Transactions on Computational Biology
and Bioinformatics.
Wang, H., Yan, L., Huang, H., and Ding, C. (2016).
From protein sequence to protein function via multi-
label linear discriminant analysis. IEEE/ACM trans-
actions on computational biology and bioinformatics,
14(3):503–513.
Yang, J., Yan, R., Roy, A., Xu, D., Poisson, J., and Zhang,
Y. (2015). The i-tasser suite: protein structure and
function prediction. Nature methods, 12(1):7–8.
You, R., Zhang, Z., Xiong, Y., Sun, F., Mamitsuka, H., and
Zhu, S. (2018). Golabeler: improving sequence-based
large-scale protein function prediction by learning to
rank. Bioinformatics, 34(14):2465–2473.