Ontology-Powered Boosting for Improved Recognition of Ontology
Concepts from Biological Literature
Pratik Devkota¹, Somya D. Mohanty¹ and Prashanti Manda²
¹Department of Computer Science, University of North Carolina, Greensboro, NC, U.S.A.
²Informatics and Analytics, University of North Carolina, Greensboro, NC, U.S.A.
ORCID: Pratik Devkota, https://orcid.org/0000-0001-5161-0798; Somya D. Mohanty, https://orcid.org/0000-0002-4253-5201; Prashanti Manda, https://orcid.org/0000-0002-7162-7770
Keywords: Named Entity Recognition, Automated Ontology Curation, Deep Learning, Biological Ontology.
Abstract: Automated ontology curation involves developing machine learning models that can learn patterns from scientific literature to predict ontology concepts for pieces of text. Deep learning has been used in this area with promising results. However, these models often ignore the semantically rich information that is embedded in the ontologies and treat ontology concepts as independent entities. Here, we present a novel approach called Ontology Boosting for improving the prediction accuracy of automated curation techniques powered by deep learning. We evaluate the performance of our models using Jaccard semantic similarity, a metric designed to assess similarity between ontology concepts. Semantic similarity metrics can estimate partial similarity between ontology concepts, making them ideal for evaluating the performance of annotation systems such as deep learning where the goal is to get as close as possible to human performance. We use the CRAFT gold standard corpus for training our architectures and show that the Ontology Boosting approach results in substantial improvements in the performance of these architectures.
1 INTRODUCTION
Biological ontologies are critical for consistent
knowledge representation that can be accessed by
humans and computers alike. Since the introduc-
tion of the Gene Ontology, over 900 bio-ontologies
have been created for representing knowledge in sub-
domains such as anatomy, human disease, chemicals
and drugs, etc. Scientists and curators now use these
ontology concepts to describe various aspects of bi-
ological objects thereby creating knowledge bases of
ontology annotations. These annotations are critical
for transferring knowledge in free text such as publi-
cations into a computationally amenable format that
can power large-scale comparative analyses. Ontol-
ogy annotation is still largely accomplished via hu-
man curation - a process where scientists manually
read text and select ontology concepts that accurately
represent the information in the text. While there has
been a rapid growth in the number of ontologies as
well as the number of ontology-powered annotations,
ontology curation has not experienced the same level of advances.
Automated curation methods that can scale to the
pace of publishing and show efficiency and accuracy
are direly needed. The goal of these methods would
be to process scientific literature and mark words or
phrases in text with one or more ontology concepts
thereby conducting automated curation. Automated
curation tools can be used as stand-alone approaches
that perform annotation without supervision or pre-
liminary annotators that can make suggestions for hu-
man curators to accept or reject. In either case, it is
important for the models to replicate the process of
a human curator as closely as possible. One of the
ways in which curators select appropriate ontology
concepts is by looking at the concepts in the context
of the ontology including the hierarchy and relation-
ships in the ontology and not as concepts as individual
constructs. This indicates that for automated methods
to be successful at curation, they need to be cognizant
of the ontology structure as well as relationships be-
tween different concepts.
The ultimate goal of automated ontology concept
recognition is to develop intelligent systems that can
understand the ontology hierarchy and make predic-
tions that are cognizant of those relationships. For
example, if a phrase in text corresponds to ontology
concept X in the gold standard, the model should ide-
ally recognize that X corresponds to the phrase. How-
ever, automated models are not perfect and sometimes
make mistakes. In this case, the desired outcome
would be for the model to recognize that semantically
similar concepts to X (such as X’s parent) exist in the
ontology and that those concepts should be associated
with the text. Ontology-sentient models take the ontology structure or the ontology graph as input while training and make predictions accordingly. However, developing accurate ontology-sentient models can be challenging due to the size and complexity of the ontology graph, which often results in models that are too large or require an inordinate amount of time to train.
The automated annotation models previously de-
veloped by this team (Manda et al., 2018; Manda
et al., 2020; Devkota et al., 2022b; Devkota et al.,
2022a) have shown good accuracy in recognizing on-
tology concepts from text. In most cases, these sys-
tems are able to predict the same ontology concept as
the ground truth in the gold standard data achieving
perfect accuracy. However, ontology-based predic-
tion systems can also achieve partial accuracy. This
happens when a model might not predict the exact
ontology concept as the gold standard but a related
concept (sub-class or super-class), thereby achieving
partial accuracy. In cases where our models were not
able to predict the ground truth exactly, they failed
to achieve reasonable partial accuracy and predicted
concepts that were highly unrelated to the ground
truth. Thus, our models’ accuracy could be improved
by focusing on improving the partial accuracy for in-
stances when the model fails to make an exact predic-
tion.
With the above motivation, here, we present an al-
ternative approach called Ontology-powered Boost-
ing (OB) to improve the prediction performance
of automated curation models by using informa-
tion about the ontology hierarchy to post-process the
model’s predictions after training has completed. The
goal of OB is to combine the model’s preliminary
predictions with knowledge of the ontology hierarchy
to selectively increase the confidence of certain pre-
dictions to improve overall prediction accuracy. The
method relies on a computationally inexpensive cal-
culation and avoids bloated machine learning models
that cannot be trained or deployed without requiring
enormous resources.
Note that the contribution of this work is not in
presenting novel architectures for recognizing ontol-
ogy concepts but rather in presenting the Ontology
Boosting approach for further improving prediction
accuracy of our previously published deep learning
architectures. Hence, we will present architecture de-
tails briefly and will refer the reader to our prior work
for complete details.
2 BACKGROUND
Automated methods of recognizing ontology con-
cepts in literature have been developed in the last
decade and the approaches range from lexical anal-
ysis to traditional machine learning to deep learning
in more recent times.
Text mining tools that use traditional machine
learning based methods employ supervised learning
techniques using gold standard corpora (Beasley and
Manda, 2018). In 2018, we surveyed ontology-based Named Entity Recognition and conducted a formal comparison of methods and tools for recognizing ontology concepts from scientific literature. Three concept recognition tools (MetaMap (Aronson, 2001), NCBO Annotator (Jonquet et al., 2009), and Textpresso (Müller et al., 2004)) were compared (Beasley and Manda, 2018). These methods can form generalizable associations between text and ontology concepts, leading to improved accuracy.
The rise of deep learning in the areas of image
and speech recognition has translated into text-based
problems as well. Preliminary research has shown
that deep learning methods result in greater accu-
racy for text-based tasks including identifying ontol-
ogy concepts in text (Lample et al., 2016; Habibi
et al., 2017; Lyu et al., 2017; Wang et al., 2018;
Manda et al., 2020). Deep learning methods use vec-
tor representations that enable them to capture depen-
dencies and relationships between words using en-
riched representations of character and word embed-
dings from training data (Casteleiro et al., 2018). We
evaluated the feasibility of using deep learning for the
task of recognizing ontology concepts in a 2018 study
(Manda et al., 2018). We compared Gated Recurrent
Units (GRUs), Long Short Term Memory (LSTM),
Recurrent Neural Networks (RNNs), and Multi Layer
Perceptrons (MLPs) and evaluated their performance
on the CRAFT gold standard dataset. We also intro-
duced a new deep learning model/architecture based
on combining multiple GRUs with a character+word
based input. We used data from five ontologies in
the CRAFT corpus as a Gold Standard to evaluate
our model’s performance. Results showed that our
GRU-based model outperformed prior models across
all five ontologies. These findings indicated that deep
learning algorithms are a promising avenue to be ex-
plored for automated ontology-based curation of data.
This study also served as a formal comparison and
guideline for building and selecting deep learning
models and architectures for ontology-based curation.
In 2020, we presented new architectures based
on GRUs and LSTM combined with different in-
put encoding formats for automated Named Entity
Recognition (NER) of ontology concepts from text.
We found that GRU-based models outperform LSTM
models across all evaluation metrics. We also created
multi level deep learning models designed to incorpo-
rate ontology hierarchy into the prediction. Surpris-
ingly, inclusion of ontology semantics via subsump-
tion reasoning yielded modest performance improve-
ment (Manda et al., 2020). This result indicated that
more sophisticated approaches to take advantage of
the ontology hierarchy are needed.
Continuing this work, a 2022 study (Devkota et al., 2022b) presented state-of-the-art deep learning architectures based on GRUs for annotating text with ontology concepts. We augmented the models with additional information sources, including BioThesaurus and the Unified Medical Language System (UMLS), to supplement the information in CRAFT and increase prediction accuracy. We demonstrated that augmenting the model with additional input pipelines can substantially enhance prediction performance.
Considering that our previous attempt at creating
intelligent prediction systems that use the ontology
hierarchy was not successful, we developed a differ-
ent approach to providing the ontology as input to
the deep learning model (Manda et al., 2020). In
2022, we presented an intelligent annotation system
(Devkota et al., 2022a) that uses the ontology hierar-
chy for training and predicting ontology concepts for
pieces of text. Here, we used a vector of semantic
similarity scores to the ground truth and all ancestors
in the ontology to train the model. This representation
allowed the model to identify the target GO term fol-
lowed by “similar” GO terms that are partially accu-
rate predictions. This output label representation also
helped the model optimize the weights to target more
than one prediction label. We showed that our ontol-
ogy aware models can result in 2% - 10% (depend-
ing upon choice of embedding) improvements over a
baseline model that doesn’t use ontology hierarchies.
3 METHODS
3.1 Ontology Boosting
A key component of our approach is to combine the predictions of the deep learning architectures with the graph structure of ontological concepts. We "boost" the predictions of the deep learning model by supporting them with similar predictions while taking model uncertainty into account. This is a two-step approach: 1) identify candidates for boosting, and 2) boost the predictions with semantically similar concepts.
In the first step, we identify the candidate predictions where the deep learning models had low confidence. We calculate the uncertainty in the predictions based on the softmax output of the last layer. The layer outputs a probability vector $\nu_i = \langle p(x_0), p(x_1), \cdots, p(x_m) \rangle$, where $i$ is the $i$'th input token and $p(x_j)$ is the probability of tag $x_j$, corresponding to all possible tags $(0 \cdots m)$, for each token that is provided as input to the model. In general, we calculate $argmax(\nu_i)$ to select the tag with the highest probability as the model output. Here, we leverage the top $k$ probabilities from the $\nu_i$ vector to calculate the uncertainty in the model's predictions by evaluating the entropy $H(\nu_i^k)$ using Shannon's information entropy, where $\nu_i^k = p(argmax_k(\nu_i))$.
The highest predicted probability of the $\nu_i$ probability vector and the entropy value $H(\nu_i^k)$ are used to determine the threshold for the predictions where boosting needs to be applied. The intuition behind this is to only boost specific predictions where the model has low confidence. We choose the thresholds for the two parameters by analyzing the prediction graphs (Figures 3a and 3b), i.e., visualizing the thresholds of the parameters where the model makes the most errors. We also select the top $k$ predictions ($argmax_k(\nu_i)$) from the $\nu_i$ vector to boost. Boosting all of the possible tag predictions does not benefit the model's performance and has a detrimental effect on computational overhead.
The second step boosts the predicted probabilities of the identified candidates by combining them with the model predictions of the candidates' ancestors/subsumers. Specifically, for each token $i$, we boost the probabilities of the top $k$ tags ($p(argmax_k(\nu_i))$) with the probabilities of their subsumers using the following computation:

$$\mathcal{I}(x_j) = -\log(f_{x_j}/C)$$

$$\tilde{p}(x_j) = \beta \, p(x_j) \, \mathcal{I}(x_j) + \sum_{n=1}^{d} \frac{\alpha \, p(x_j^n) \, \mathcal{I}(x_j^n)}{n}$$
where we first calculate the information content $\mathcal{I}(x_j)$ of tag $x_j$ ($0 \leq j \leq m$, where $m$ is the number of tags) as the negative log of the concept frequency ($f_{x_j}$) over the total number of available concepts ($C$). $\mathcal{I}(x_j)$ is then utilized to calculate the boosted probability $\tilde{p}(x_j)$, which consists of two components: the modulated original probability ($\beta \, p(x_j) \, \mathcal{I}(x_j)$) and the supportive parent boosting ($\sum_{n=1}^{d} \alpha \, p(x_j^n) \, \mathcal{I}(x_j^n)/n$).
The modulated original probability combines the original probability with a weighting factor $\beta$ and the information content of the concept $\mathcal{I}(x_j)$. The second part of the computation evaluates all of the subsumer probabilities of $x_j$ by individually calculating the modulated probabilities of parents ($\alpha \, p(x_j^n) \, \mathcal{I}(x_j^n)$), where $x_j$ has $d$ ancestors, while controlling the influence by normalizing with the depth factor $n$. Here, $\alpha$ is a weighting parameter used to control the influence of the ancestor probabilities on the boosting. The calculated parent probabilities are then summed and added to the modulated probability of $x_j$.
Using the aforementioned approach, $\tilde{p}(x_j)$ combines the predicted probabilities of the tags with their ancestors' predictions from a single model. This is done only for the GO annotations, where a specific annotation's probability might be boosted to make it the top prediction if it had supporting parent predictions from the model. The information content parameter modulates the effect on the boosted probability by taking the frequency of occurrence of a concept and its hierarchy into account. The $\alpha$ and $\beta$ parameters further control the emphasis we put on the parental contribution vs. the original probability, where an $\alpha$ value of 0 nullifies the parental contribution and a value of 1 includes the parental support completely. As we go higher in the ancestor path, the depth factor $n$ enforces lower contributions from higher subsumers.
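A compact sketch of this boosting computation, under the assumption that concept frequencies and ancestor lists are available as Python dictionaries (the names here are illustrative, not from our codebase):

```python
import math

def information_content(freq, total_concepts):
    """I(x) = -log(f_x / C); rarer concepts carry more information."""
    return -math.log(freq / total_concepts)

def boost(p, tag, ancestors, ic, alpha=0.5, beta=0.5):
    """Boosted probability of `tag`, given model probabilities `p` (dict),
    `ancestors[tag]` ordered from the direct parent (depth 1) upwards,
    and precomputed information content `ic` (dict).
    """
    boosted = beta * p[tag] * ic[tag]
    for n, anc in enumerate(ancestors[tag], start=1):
        # The depth factor n damps the contribution of distant subsumers.
        boosted += alpha * p.get(anc, 0.0) * ic.get(anc, 0.0) / n
    return boosted
```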
We utilize Bayesian optimization to evaluate different values of $k$ (top $k$ predictions), the entropy threshold ($H(\nu_i^k)$), $\alpha$, and $\beta$ to maximize the model prediction accuracy. Specifically, we utilize the Tree-structured Parzen Estimators (TPE) approach (Bergstra et al., 2013) to derive the optimal value of each parameter for maximizing our objective function. The objective function in the experiment is defined to maximize the mean semantic similarity. $\alpha$ and $\beta$ are evaluated with continuous values between 0.0 and 1.0, while $k$ is evaluated for integer values between 1 and 10. The range of the entropy threshold is defined to be continuous, from 0.0 to the highest entropy value of the top $k$ predictions for each token.
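For illustration, this search could be expressed with the hyperopt library, which implements TPE. The search space below mirrors the ranges described above; the evaluation helper is hypothetical:

```python
from hyperopt import fmin, tpe, hp, Trials

# Search space mirroring the ranges described above.
space = {
    "k": hp.quniform("k", 1, 10, 1),              # top-k predictions to boost
    "alpha": hp.uniform("alpha", 0.0, 1.0),       # ancestor weighting
    "beta": hp.uniform("beta", 0.0, 1.0),         # original-probability weighting
    "h_thresh": hp.uniform("h_thresh", 0.0, 2.3), # entropy threshold (upper bound is data-dependent)
}

def objective(params):
    params["k"] = int(params["k"])  # quniform returns floats
    # Run boosting with these parameters and score against the gold standard;
    # hyperopt minimizes, so we return the negated mean semantic similarity.
    return -evaluate_boosting(**params)  # evaluate_boosting is a hypothetical helper

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=100, trials=trials)
```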
We demonstrate the efficacy of the Ontology Boosting approach on two deep learning architectures from our previous work (Devkota et al., 2022b; Devkota et al., 2022a).
3.2 Training Dataset
This study uses GO annotations from version v4.0.1
(https://github.com/UCDenver-ccp/CRAFT/releases/
tag/v4.0.1) of The Colorado Richly Annotated Full
Text Corpus (CRAFT) (Bada et al., 2012), a manually
annotated corpus containing 97 articles, each of which is annotated to 10 ontologies.
3.3 Data Preprocessing
The following preprocessing steps are performed to
translate annotations from the CRAFT corpus to the
desired input formats for the deep learning models.
Please refer to previous work (Devkota et al., 2022b)
for full details.
3.3.1 Sentence Segmentation and Tokenization
Annotations for each CRAFT article are recorded in the corresponding XML annotations file via character index spans. In order to obtain annotations per word, we utilize the sentence segmentation library SpaCy (https://spacy.io/). First, the segmenter splits the text into sentences by accounting for sentence-ending marks (such as periods, exclamation marks, and question marks) and then uses a tokenizer to split the sentences into individual words (or tokens) by accounting for word boundaries (such as spaces, hyphens, and tabs).
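A minimal sketch of this step with spaCy (the specific model name is an assumption; any English pipeline that segments sentences works):

```python
import spacy

# Load a small English pipeline; the model choice is illustrative.
nlp = spacy.load("en_core_web_sm")

doc = nlp("Gene expression was measured. The GO term applies to the second clause!")
for sent in doc.sents:
    # token.idx is the character offset, which lets us align tokens
    # with the character index spans recorded in the CRAFT XML files.
    tokens = [(tok.text, tok.idx) for tok in sent]
    print(tokens)
```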
3.3.2 IOB Tagging
Each extracted word/token is mapped to a GO term or an out-of-concept annotation. Each token is mapped to one of three tags: 1) a GO tag to indicate an annotation, 2) 'O' for a non-annotation, and 3) 'EOS' to indicate the end of a sentence.
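A sketch of how the character-level annotation spans could be projected onto per-token tags (the span format is simplified for illustration):

```python
def tag_tokens(tokens, spans):
    """Assign a tag to each (text, char_offset) token.

    tokens: list of (text, idx) pairs from the tokenizer.
    spans: list of (start, end, go_id) character-index annotations.
    Returns one tag per token: the GO id, or 'O' for non-annotations.
    """
    tags = []
    for text, idx in tokens:
        tag = "O"
        for start, end, go_id in spans:
            if start <= idx < end:  # token begins inside an annotated span
                tag = go_id
                break
        tags.append(tag)
    return tags

# Example: the span covers characters 0-15 ("Gene expression").
print(tag_tokens([("Gene", 0), ("expression", 5), ("was", 16)],
                 [(0, 15, "GO:0010467")]))
# -> ['GO:0010467', 'GO:0010467', 'O']
```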
3.3.3 POS Tagging and Token Encoding
POS tagging captures the contextual information of a word based on the words surrounding it in a sentence or phrase. Here, we used the SpaCy POS tagger to tag the tokens of sentences with 19 part-of-speech tags: adjective, adposition (such as in, to, during), adverb, auxiliary (such as is, has done, will do, should do), conjunction, coordinating conjunction, determiner, interjection, noun, numeral, particle, pronoun, proper noun, punctuation, subordinating conjunction, symbol, verb, other (not annotated to any of the others), and space.
We represent character-level aspects of a token using character encodings. These encodings represent upper-case and lower-case characters with 'C' or 'c', respectively. Numbers are represented using an 'N', and punctuation (such as commas, periods, and dashes) is retained in the encoding.
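A sketch of this encoding scheme (the treatment of characters outside these classes is our assumption):

```python
def char_encode(token):
    """Map a token to its character-shape encoding:
    upper-case -> 'C', lower-case -> 'c', digit -> 'N',
    punctuation kept as-is."""
    out = []
    for ch in token:
        if ch.isupper():
            out.append("C")
        elif ch.islower():
            out.append("c")
        elif ch.isdigit():
            out.append("N")
        else:
            out.append(ch)  # punctuation and anything else retained
    return "".join(out)

print(char_encode("GO-term1"))  # -> "CC-ccccN"
```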
3.3.4 BioThesaurus Encoding
One of the architectures tested in this study uses external information from existing large-scale knowledge bases. The first data source we use is BioThesaurus (Liu et al., 2006), a database of protein and gene names mapped to the UniProt Knowledgebase. If a token is present in the database, we record whether it is identified as a protein name, biomedical term, chemical term, and/or macromolecule.
3.3.5 Unified Medical Language System
(UMLS) Encoding
Next, we query the UMLS (Lindberg et al., 1993) database for tokens extracted from the articles. Words/tokens associated with a UMLS term are encoded as 1, and 0 otherwise. If a phrase (a sequence of tokens) is found in UMLS, all tokens in the phrase are encoded as 1.
3.4 Deep Learning Architecture
We evaluate our boosting approach on two deep learn-
ing architectures from our prior work (Devkota et al.,
2022b; Devkota et al., 2022a). We describe the two
architectures briefly here and refer the reader to the
articles for comprehensive details.
3.4.1 Architecture 1 - Externally Augmented Predictor ($A_1$)
Figure 1 shows the three key components of $A_1$: 1) Input Pipelines; 2) Embedding/Latent Representations; and 3) Sequence Modeler. This architecture was originally published in (Devkota et al., 2022b).
Figure 1: Architecture of a GRU-based model for ontology concept recognition. Figure originally published in (Devkota et al., 2022b).
Input Pipelines. Each sentence and each token are provided six different components as input: 1) token ($X^{token}_{train}$), 2) character sequence ($X^{char}_{train}$), 3) token-character representation ($X^{repr}_{train}$), 4) parts-of-speech ($X^{POS}_{train}$), 5) BioThesaurus ($X^{BIOT}_{train}$), and 6) UMLS ($X^{UMLS}_{train}$).
The token input ($X^{token}_{train}$) is a sequential tensor where each token is represented with a high-dimensional one-hot encoded vector. Similarly, the character sequence ($X^{char}_{train}$) is also a sequential tensor consisting of the character sequences present in a word/token.
Character representations ($X^{repr}_{train}$) and POS tags ($X^{POS}_{train}$) are based on words/tokens in sentences. BioThesaurus encodings ($X^{BIOT}_{train}$) contain a four-dimensional vector sequence where each token is one-hot encoded for its association with the protein, biomedical, chemical, and macromolecule categories. UMLS encodings ($X^{UMLS}_{train}$) are also provided as a one-hot encoded vector sequence, where 1 indicates a token's presence in UMLS and 0 indicates absence.
Embedding/Latent Representations. The supervised embedding is a bottleneck layer which learns to map the one-hot encoded input into a smaller-dimensional representation. We used ELMo (Peters et al., 2018) pretrained embeddings for the $X^{token}_{train}$ input. Embeddings in ELMo are learned via a bidirectional language model where the sequence of the words is also taken into account. We use the model pretrained on the 1 Billion Word Benchmark, which consists of approximately 800M tokens of news crawl data, and which outputs 1024-dimensional embedding vectors.
Sequence Modeler. We utilize Bi-GRUs in two locations in the architecture: first, to model the sequence of characters present in each token, and second, a main Bi-GRU model over the concatenated input pipelines. After the embedding of the characters, they are passed through the first Bi-GRU (consisting of 150 units), resulting in a sequence representation of the characters in a sentence. A 10% dropout is used in this pipeline to regularize the output and prevent overfitting. The character sequence representation is then concatenated with the ELMo embeddings, character representation, parts of speech, and the input tensors from BioThesaurus and UMLS. This concatenated feature map representing each sentence is then passed to a spatial dropout, which removes 30% of the 1-D sequence features from the input to the main Bi-GRU. The main Bi-GRU processes the feature maps (with 10% dropout) and outputs to a single time-distributed dense layer of 1774 nodes (representing each of the output tags).
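A skeletal Keras rendering of this sequence modeler is sketched below. The sequence dimensions, character vocabulary size, and the main Bi-GRU unit count are placeholders, and the ELMo pipeline is represented as a precomputed 1024-dimensional input; the POS, BioThesaurus, and UMLS pipelines are omitted for brevity:

```python
import tensorflow as tf
from tensorflow.keras import layers

MAX_LEN, MAX_CHARS = 100, 15   # placeholder sequence dimensions
N_CHARS, N_TAGS = 80, 1774     # placeholder char vocab; 1774 output tags

# Character pipeline: embed characters, then a per-token Bi-GRU of 150 units.
char_in = layers.Input(shape=(MAX_LEN, MAX_CHARS))
char_emb = layers.TimeDistributed(layers.Embedding(N_CHARS, 50))(char_in)
char_seq = layers.TimeDistributed(
    layers.Bidirectional(layers.GRU(150, dropout=0.1)))(char_emb)

# Precomputed ELMo embeddings for each token (1024-dimensional).
elmo_in = layers.Input(shape=(MAX_LEN, 1024))

# Concatenate the pipelines, then apply spatial dropout over the features.
merged = layers.concatenate([elmo_in, char_seq])
merged = layers.SpatialDropout1D(0.3)(merged)

# Main Bi-GRU over the sentence, then a time-distributed softmax over tags.
seq = layers.Bidirectional(
    layers.GRU(150, return_sequences=True, dropout=0.1))(merged)
out = layers.TimeDistributed(layers.Dense(N_TAGS, activation="softmax"))(seq)

model = tf.keras.Model([char_in, elmo_in], out)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="categorical_crossentropy")
```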
Architecture hyper-parameters, which include the supervised embedding shape ({20, 50, 100, 150, 200}), dropout ({.1, .2, .3, .5, .7}), number of epochs ({50, 100, 200, 300}), and class weighting, were evaluated using a grid search approach. We used Adam (Kingma and Ba, 2017) as our optimizer for all of the experiments with a default learning rate of 0.0001.
3.4.2 Architecture 2 - Intelligent Predictor ($A_2$)

$A_2$, as shown in Figure 2, uses an intelligent prediction system that leverages the ontology hierarchy structure, in contrast to $A_1$ (Devkota et al., 2022a). This architecture was originally published in (Devkota et al., 2022a). The overall structure of $A_2$ is similar to that of $A_1$ in terms of the Sequence Modeler and Embedding/Latent Representation components. It differs in the Input Pipelines provided as well as in how the target vector is represented for training the model.
Figure 2: Architecture of an intelligent ontology prediction system based on GRUs. Figure originally published in (Devkota et al., 2022a).
Input Pipelines. We provide three inputs for each word in a sentence: 1) token ($X^{token}_{train}$), 2) character sequence ($X^{char}_{train}$), and 3) parts-of-speech ($X^{POS}_{train}$).
Target Vector Representation. Target labels to be predicted are typically provided as a one-hot encoded vector where the size of the vector equals the number of output labels (as in the case of $A_1$). In our studies, the output labels correspond to the set of all GO terms. Typically, the value of the GO term to be predicted is set to 1 and the value of all other GO terms is set to 0. This approach of representing the target labels, however, does not allow the model to learn the ontology hierarchy, nor does it allow for semantically similar partial predictions.
In $A_2$, we use Jaccard semantic similarity scores as values in the label vector. The value of the GO term to be predicted is set to 1, and the value of every other GO term in the vector is set to the Jaccard similarity score between that term and the GO term to be predicted. This representation allows the model to identify the target GO term followed by "similar" GO terms that are partially accurate predictions. This output label representation also helps the model optimize the weights to target more than one prediction label. So for each output tag, the representation is as follows:

$$Y = \begin{cases} l = [1], & \text{if } T = \hat{T} \\ l = [\beta \cdot J_{sim}(T, \hat{T})], & \text{if } T \neq \hat{T} \text{ and } T \neq O \end{cases} \quad (1)$$
where $Y$ is the final target vector, $l$ is the label for the word, $T$ is the ground truth tag, $\hat{T}$ is the predicted tag, $\beta$ is the Jaccard weight, and $J_{sim}$ is the Jaccard similarity between $T$ and $\hat{T}$. The target vector $Y$ is computed by comparing the true tag ($T$) with the possible tags ($\hat{T}$): if $T = \hat{T}$, whether 'O' (no annotation) or a GO annotation, the value is set to 1. Otherwise, if the ground truth tag ($T$) is a GO term, we calculate its Jaccard similarity to all possible GO terms and create the target vector by weighting it with $\beta$. We evaluate $\beta$ values in {0, .25, .5, 1}, where a $\beta$ value of 0 corresponds to the traditional one-hot vectorization and a $\beta$ value of 1 takes the full Jaccard score into account.
The Jaccard similarity ($J_{sim}$) of the ground truth concept $T$ and a predicted concept $\hat{T}$ (Pesquita et al., 2009) is calculated as:

$$J_{sim}(T, \hat{T}) = \frac{|S(T) \cap S(\hat{T})|}{|S(T) \cup S(\hat{T})|} \quad (2)$$
where $S(T)$ is the set of ontology subsumers of $T$. Specifically, the $J_{sim}$ of two concepts ($A$, $B$) in an ontology is defined as the ratio of the number of concepts in the intersection of their subsumer sets over the number of concepts in the union of their subsumer sets (Pesquita et al., 2009).
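A small sketch of this similarity computed over subsumer sets, with a toy ontology given as a parent map (the helper names are ours):

```python
def subsumers(concept, parents):
    """All ancestors of `concept` (including itself) in a parent map."""
    seen = {concept}
    stack = [concept]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def jaccard_sim(a, b, parents):
    """|S(a) & S(b)| / |S(a) | S(b)| over subsumer sets."""
    sa, sb = subsumers(a, parents), subsumers(b, parents)
    return len(sa & sb) / len(sa | sb)

# Toy ontology: root -> x -> {y, z}
parents = {"x": ["root"], "y": ["x"], "z": ["x"]}
print(jaccard_sim("y", "z", parents))  # S(y)={y,x,root}, S(z)={z,x,root} -> 2/4 = 0.5
```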
Sequence Modeler. The sequence modeler for $A_2$ is similar to the sequence modeler in $A_1$. The differences are in the optimizer, the loss function, and the activation function in the final output layer. A softmax activation is used in the final layer, which normalizes the output of the model to a probability distribution over the output tags. We use the Adam algorithm with weight decay (Loshchilov and Hutter, 2017) as our optimizer with an initial learning rate of 0.001. The learning rate is reduced by a factor of 0.1 after the first 10000 training steps, and reduced further by a factor of 0.1 after the next 15000 steps. The weight decay starts at 0.0001 and is reduced on the same schedule, after the first 10000 steps and again after the next 15000 steps. We use sigmoid focal cross entropy (Lin et al., 2017) as the loss function. Sigmoid focal cross entropy is particularly useful for cases with highly imbalanced classes. It reduces the relative loss for easy-to-classify, higher-frequency examples, putting more focus on harder-to-classify, misclassified examples. This loss function uses a balancing factor $\alpha$ and a modulating factor $\gamma$, which are set to 0.25 and 2.0 respectively.
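As a sketch, this optimizer and loss setup could be expressed with TensorFlow Addons; the boundary steps and rates follow the description above, but this is illustrative rather than our exact training code:

```python
import tensorflow as tf
import tensorflow_addons as tfa

# Piecewise schedules: 1e-3 for 10000 steps, then 1e-4 for 15000, then 1e-5;
# the weight decay follows the same boundaries starting from 1e-4.
lr = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[10000, 25000], values=[1e-3, 1e-4, 1e-5])
wd = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[10000, 25000], values=[1e-4, 1e-5, 1e-6])

optimizer = tfa.optimizers.AdamW(weight_decay=wd, learning_rate=lr)

# Focal loss down-weights easy, frequent classes
# (alpha: balancing factor, gamma: modulating factor).
loss = tfa.losses.SigmoidFocalCrossEntropy(alpha=0.25, gamma=2.0)
```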
3.5 Performance Evaluation Metrics
Our primary evaluation metric in this study is seman-
tic similarity. Metrics such as F1 are designed for tra-
ditional information retrieval systems that either re-
trieve a piece of information or fail to do so (a binary
evaluation). However, this is not a true indication of
the performance of ontology-based retrieval or pre-
diction systems where the notion of partial accuracy
applies. A model might not predict the exact concept
as a gold standard but might predict the parent or an
ancestor of the ground truth as indicated by the ontol-
ogy. The parent will have a higher degree of similarity
to the ground truth thereby resulting in high semantic
similarity (but would count as a failure for F1 compu-
tation). Semantic similarity metrics (Pesquita et al.,
2009) designed to measure different degrees of sim-
ilarity between ontology concepts can be leveraged
to measure the similarity between the predicted con-
cept and the actual annotation to quantify the partial
prediction accuracy. Here, we use Jaccard similarity
(Pesquita et al., 2009) that measures the ontological
distance between two concepts to assess partial simi-
larity.
We also provide a modified F1 score for our ar-
chitectures. The model is tasked with predicting non-
annotations (indicated by an ‘O’ tag) or annotations
(indicated by a ‘GO’ tag). Since the majority of tags
in the training corpus are non-annotations, the model
predicts them with great accuracy. In order to avoid
biasing the F1 score, we omit accurate predictions of
‘O’ tags from the calculation to report a relatively
conservative F1 score.
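A sketch of this modified F1 computation from per-token tag lists (the helper name and counting conventions are ours):

```python
def modified_f1(gold, pred):
    """F1 over token tags, ignoring tokens correctly predicted as 'O'
    so the dominant non-annotation class does not inflate the score."""
    kept = [(g, p) for g, p in zip(gold, pred) if not (g == "O" and p == "O")]
    tp = sum(1 for g, p in kept if g == p)               # correct GO predictions
    fp = sum(1 for g, p in kept if p != "O" and g != p)  # spurious/wrong GO tags
    fn = sum(1 for g, p in kept if g != "O" and g != p)  # missed/wrong gold GO tags
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```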
4 RESULTS AND DISCUSSION
Table 1 presents a summary of how boosting affected
the results of the two architectures. First, we see that
the majority of tokens are selected for boosting via
the Bayesian selection process (Row 2). However, the
majority of tokens that are boosted remain unchanged
indicating that when the model makes correct predic-
tions, the boosting process largely retains the correct
prediction (Row 3). Reassuringly, boosting does not
change any of the GO predictions to an ‘O’ tag (Row
4). The majority of incorrect predictions made by the
models happen when the ground truth is a GO con-
cept but the model incorrectly predicts an ‘O’ (non-
annotation). Boosting makes a substantial difference in this case by changing these instances from an 'O' to a GO concept (Row 5). Of these instances, 37% ($A_1$) to 41% ($A_2$) are corrected from an 'O' prediction to a GO concept that exactly matches the ground truth (Row 6). When boosting corrects an 'O' prediction to a GO term (an exact or partial match to the ground truth), the average semantic similarity of these instances lies between 53% and 60%.
We examine the impact of our boosting approach
on improving prediction accuracy of the two architec-
tures presented above (Table 2). The base scores refer
to the output of the architecture before boosting was
applied and the boosted scores reflect performance af-
ter boosting is applied. We see that boosting improves Jaccard semantic similarity scores by 7% for $A_1$ and 5.8% for $A_2$. The F1 scores experience a modest improvement of 2.5% for $A_1$ and no improvement for $A_2$.
Ontology Boosting corrected 201 incorrect predictions ($A_1$) while changing 6 correct predictions to concepts semantically similar to the ground truth. Similarly, boosting corrected 174 incorrect predictions ($A_2$) while changing 113 correct predictions to a different (semantically similar) GO concept. The net effect of these two contributions appears to result in modest improvements or leaves the F1 score unchanged. However, the real contribution of boosting is reflected in the semantic similarity scores, which show a clear improvement.
Figures 3a and 3b show the probability of the
highest prediction and entropy (of the top 5 predic-
tions) across all tokens in the dataset. Correct predic-
tions are shown in blue and incorrect ones are shown
in red. These graphs indicate the probability and en-
tropy zones where the architectures make incorrect
predictions that can be corrected using boosting. We
use the Bayesian method described above to select in-
correct predictions on this graph to be boosted.
Figure 4a shows the incorrect predictions by $A_1$.
Like above, the majority of these instances were cases
where the ground truth is a GO concept and the
prediction is an ‘O’ term (non-annotation). Figure
4b shows instances incorrectly predicted as ‘O’ that
were corrected by boosting. In this figure, blue in-
stances represent cases where the boosted prediction
was an exact match to the ground truth (100% ac-
curacy) whereas the purple instances represent cases
where the boosted prediction was a partial match to
the ground truth. The size of the purple instances re-
flect the degree of partial relatedness to the ground
truth - larger indicates higher semantic similarity to
the ground truth. We see that boosting has a substan-
tial effect on correcting inaccurate predictions.
Figure 5a shows the incorrect predictions by $A_2$.
Table 1: Effect of ontology boosting on the two architectures.

Row | Description                                     | $A_1$ | $A_2$
1   | Total number of tokens                          | 5439  | 5495
2   | Number of tokens boosted                        | 4972  | 5197
3   | Number of tokens boosted but unchanged          | 4411  | 4587
4   | Number of tokens boosted from GO to O           | 0     | 0
5   | Number of tokens boosted from O to GO           | 534   | 366
6   | Number of tokens boosted from O to an exact GO  | 197   | 152
7   | Average semantic similarity for O to GO         | 0.53  | 0.60
Table 2: Impact of boosting on the two architectures.

Architecture |         | Semantic Similarity | Modified F1
$A_1$        | Base    | 0.84                | 0.83
$A_1$        | Boosted | 0.90                | 0.85
$A_2$        | Base    | 0.85                | 0.81
$A_2$        | Boosted | 0.90                | 0.81
Figure 3: Distribution of correct and incorrect predictions with respect to probability and entropy of predictions. (a) Architecture $A_1$; (b) Architecture $A_2$.
Figure 4: $A_1$ predictions corrected via Ontology Boosting. (a) Incorrect predictions; (b) corrected predictions after boosting.
The majority of these instances were cases where the ground truth is a GO concept and the prediction is an 'O' term (non-annotation). Figure 5b shows instances incorrectly predicted as 'O' that were corrected by boosting.
5 CONCLUSIONS
Automated methods for ontology-based annotation
of scientific literature are important for represent-
ing knowledge in a consistent machine-readable for-
mat. Deep learning models have shown promising
improvements and performance at this task. How-
ever, the majority of these architectures do not ac-
count for or take advantage of the ontology hierarchy
thereby losing valuable information. Encoding and
representing complex ontologies can lead to massive
models that are computationally and financially infea-
sible to train. Here, we presented a novel approach called Ontology Boosting that post-processes the ontology predictions of deep learning architectures to selectively improve the confidence of certain predictions using information from the ontology such as subsumers, information content, and the depth of a concept in the ontology. We show that this computationally inexpensive step can result in substantial improvements to our key performance metric, semantic similarity. Our results clearly show that the predictions made by the deep learning model are closer to the human ground truth after applying the Boosting process than before.
ACKNOWLEDGEMENTS
This work is funded by a CAREER grant from the Division of Biological Infrastructure at the National Science Foundation of the United States of America (#1942727).
(a) Incorrect predictions.
(b) Corrected predictions after boosting.
Figure 5: A
2
predictions corrected via Ontology Boosting.
REFERENCES
Aronson, A. R. (2001). Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In Proceedings of the AMIA Symposium, page 17. American Medical Informatics Association.
Bada, M., Eckert, M., Evans, D., Garcia, K., Shipley, K., Sitnikov, D., Baumgartner, W. A., Cohen, K. B., Verspoor, K., Blake, J. A., and Hunter, L. E. (2012). Concept annotation in the CRAFT corpus. BMC Bioinformatics, 13(1):161.
Beasley, L. and Manda, P. (2018). Comparison of natu-
ral language processing tools for automatic gene on-
tology annotation of scientific literature. Proceedings
of the International Conference on Biomedical Ontol-
ogy.
Bergstra, J., Yamins, D., and Cox, D. (2013). Making a
science of model search: Hyperparameter optimiza-
tion in hundreds of dimensions for vision architec-
tures. In Dasgupta, S. and McAllester, D., editors,
Proceedings of the 30th International Conference on
Machine Learning, volume 28 of Proceedings of Ma-
chine Learning Research, pages 115–123, Atlanta,
Georgia, USA. PMLR.
Casteleiro, M. A., Demetriou, G., Read, W., Prieto, M. J. F.,
Maroto, N., Fernandez, D. M., Nenadic, G., Klein,
J., Keane, J., and Stevens, R. (2018). Deep learning
meets ontologies: experiments to anchor the cardio-
vascular disease ontology in the biomedical literature.
Journal of biomedical semantics, 9(1):13.
Devkota, P., Mohanty, S., and Manda, P. (2022a). Knowl-
edge of the ancestors: Intelligent ontology-aware an-
notation of biological literature using semantic sim-
ilarity. Proceedings of the International Conference
on Biomedical Ontology.
Devkota, P., Mohanty, S. D., and Manda, P. (2022b). A
gated recurrent unit based architecture for recognizing
ontology concepts from biological literature. BioData
Mining, 15(1):1–23.
Habibi, M., Weber, L., Neves, M., Wiegandt, D. L., and
Leser, U. (2017). Deep learning with word embed-
dings improves biomedical named entity recognition.
Bioinformatics, 33(14):i37–i48.
Jonquet, C., Shah, N., H Youn, C., Musen, M., Callendar, C., and Storey, M.-A. (2009). NCBO Annotator: Semantic annotation of biomedical data.
Kingma, D. P. and Ba, J. (2017). Adam: A method for
stochastic optimization.
Lample, G., Ballesteros, M., Subramanian, S., Kawakami,
K., and Dyer, C. (2016). Neural architectures
for named entity recognition. arXiv preprint
arXiv:1603.01360.
Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. (2017). Focal loss for dense object detection.
Lindberg, D. A., Humphreys, B. L., and McCray, A. T. (1993). The Unified Medical Language System. Yearbook of Medical Informatics, 2(01):41–51.
Liu, H., Hu, Z.-Z., Zhang, J., and Wu, C. (2006). BioThesaurus: a web-based thesaurus of protein and gene names. Bioinformatics, 22(1):103–105.
Loshchilov, I. and Hutter, F. (2017). Decoupled weight de-
cay regularization.
Lyu, C., Chen, B., Ren, Y., and Ji, D. (2017). Long short-term memory RNN for biomedical named entity recognition. BMC Bioinformatics, 18(1):462.
Manda, P., Beasley, L., and Mohanty, S. (2018). Taking
a dive: Experiments in deep learning for automatic
ontology-based annotation of scientific literature. Pro-
ceedings of the International Conference on Biomedi-
cal Ontology.
Manda, P., SayedAhmed, S., and Mohanty, S. D. (2020).
Automated ontology-based annotation of scientific lit-
erature using deep learning. In Proceedings of The
International Workshop on Semantic Big Data, pages
1–6.
Müller, H.-M., Kenny, E. E., and Sternberg, P. W. (2004). Textpresso: An ontology-based information retrieval and extraction system for biological literature. PLOS Biology, 2(11).
Pesquita, C., Faria, D., Falcao, A. O., Lord, P., and Couto,
F. M. (2009). Semantic similarity in biomedical on-
tologies. PLoS computational biology, 5(7).
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M.,
Clark, C., Lee, K., and Zettlemoyer, L. (2018).
Deep contextualized word representations. CoRR,
abs/1802.05365.
Wang, X., Zhang, Y., Ren, X., Zhang, Y., Zitnik, M.,
Shang, J., Langlotz, C., and Han, J. (2018). Cross-
type biomedical named entity recognition with deep
multi-task learning. arXiv preprint arXiv:1801.09851.