Unsupervised Approach for Named Entity Recognition in Biomedical
Documents Using LDA-BERT Models
Veena G^{1,a}, Deepa Gupta^{2,b} and V D Vivek^2
^1 Department of Computer Science and Applications, Amrita Vishwa Vidyapeetham, Amritapuri, India
^2 Department of Computer Science and Engineering, Amrita Vishwa Vidyapeetham, Bengaluru, India
^a https://orcid.org/0000-0003-3513-9304
^b https://orcid.org/0000-0002-1041-5125
Keywords:
Unsupervised Approach, Named Entity Recognition, SciBERT, Biomedical, Topic Modeling, LDA.
Abstract:
Named Entity Recognition (NER) is a crucial task in biomedical text mining, enabling the identification and
extraction of entities such as genes, proteins, diseases, and drugs. Existing NER approaches often rely on
supervised learning with labeled data, which may be limited and expensive to obtain. This research proposes
a novel semantic model featuring weighted distributions focused on unsupervised NER within the biomedical
domain. Our methodology leverages an enhanced SciBERT model incorporating LDA topic modeling (SciB-
ERT+LDA) for NER. The model specifically targets identifying three significant entities: Disease, Chemical,
and Protein. The assessment of the proposed method demonstrates its effectiveness in recognizing named
entities within the biomedical domain. Our proposed approach demonstrates promising results, attaining a
macro-average F-measure of 86%. Moreover, the proposed approach can readily be expanded to encompass
the recognition of more domain-specific entities.
1 INTRODUCTION
Biomedical Named Entity Recognition (BioNER) is
a specialized task within Natural Language Process-
ing (NLP) and computational biology that involves
identifying and classifying specific entities or terms in
biomedical texts. It recognizes the major named enti-
ties present in medical records, research articles, and
other healthcare documents. Identifying these entities
is vital in downstream NLP tasks such as knowledge
graph creation, clinical decision support systems, and
drug discovery and development. Significant research
efforts have been focused on developing open-domain
NER systems that utilize cutting-edge machine learn-
ing models. Identifying named entities within spe-
cific domains poses a notably more challenging task
((Gridach, 2017), (Wei et al., 2019), (Zhang et al.,
2021)) due to the semantic complexity of the entities.
Current open NER models fail to identify domain-
specific entities due to their training on distinct cor-
pora. Furthermore, the challenges of transferability
and generalizability persist as significant obstacles. In
this work, we formulate BioNER as a classification
problem instead of a sequence labeling problem, the formulation used in state-of-the-art NER models. Our method-
ology uses the extended SciBERT coupled with LDA
(SciBERT+LDA) model to generate a Global Vector
for each entity. We first use topic modeling to identify
the latent topics within the corpus. Specifically, we
adopt the widely recognized LDA approach for topic
modeling (Blei et al., 2003a). Subsequently, these
identified topics undergo vectorization using SciB-
ERT, a contextualized word embedding model tai-
lored for the biomedical and scientific domains. SciB-
ERT captures the context and semantics of scientific
language, including the specific vocabulary and struc-
tures commonly found in academic papers, research
articles, and other scientific documents. By combin-
ing LDA and SciBERT, we capitalize on the distinc-
tive capabilities inherent in each model. Furthermore,
we extend SciBERT’s tokenizer to recognize domain-
specific keywords. Additionally, in our process, we
use the weighted scores obtained from the LDA model to calculate local input word vectors. In the final
step of our methodology, we compare these locally
generated input vectors and the Global Vectors. This
comparison uses the cosine similarity metric to ascer-
tain the ultimate entity tag for the given input phrase.
The primary contributions of our work include:
- We introduce an unsupervised approach for
named entity labeling within the Biomedical do-
main. The SciBERT+LDA model incorporates el-
ements from both SciBERT and LDA to provide
a robust solution for identifying and categorizing
named entities in biomedical texts.
- Our approach produces domain-specific word
vectors through an enhanced SciBERT, suitable
for application in tasks related to relation extrac-
tion.
The remainder of the document is organized as
follows: Section 2 provides a review of related works.
The proposed method is detailed in Section 3. Experi-
mental results and analysis are discussed in Section 4.
In conclusion, Section 5 summarizes the paper, pro-
viding concluding remarks and suggesting potential
directions for future research.
2 RELATED WORKS
In the present era, state-of-the-art performance
in NER systems is accomplished by utilising deep
learning models, which demand minimal feature en-
gineering (Brundha et al., 2023), (Srivastava et al.,
2022). We conducted an exhaustive literature review
to develop a NER system tailored to the biomedical
domain, concentrating on unsupervised approaches
for identifying named entities. These models au-
tonomously uncover latent features within raw text
and have found applications across diverse fields, ranging from biomedical research and healthcare to finance and legal domains (Shahina et al., 2019). A deep neural network-based NER
model for the Biomedical Domain is presented in (Yao et al., 2015), reporting a 71.01% F-measure on the GENIA corpus. Wei et
al.(Wei et al., 2016) present a Conditional Random
Field (CRF)-based neural system designed for identi-
fying disease entities in PubMed abstracts. Gopalakr-
ishnan et al. (Gopalakrishnan et al., 2019) proposed RNN-, LSTM-, and GRU-based approaches on the GENIA version 3.02 corpus and achieved an F-score of 90%. Chiu and Nichols (Chiu and Nichols,
2016) proposed an approach utilizing bidirectional
LSTM-CNN for named entity identification, achiev-
ing a 91.62% F-measure on the CoNLL-2003 dataset.
Li et al. (Li et al., 2020) provide a detailed survey of deep learning-based NER systems.
Several recent research works have explored Pre-
trained language models for downstream NLP ap-
plications, yielding substantial success (Veena et al.,
2023b), (G et al., 2023), (Veena et al., 2023a). Bidi-
rectional Encoder Representations from Transform-
ers for Biomedical Text Mining (BioBERT) is a
domain-specific language representation model pre-
trained on large-scale biomedical corpora (Lee et al.,
2020). BioBERT is fine-tuned on specific biomedical NLP tasks such as NER, relation extraction, and classification (Zhu et al., 2020), (KafiKang and Hendawi, 2023). ABioNER, a BERT-based model
to identify biomedical named entities in the Arabic
text, is proposed in (Boudjellal et al., 2021). A de-
tailed survey of transformer based pretrained models
for biomedical NER systems can be found in (Kalyan
et al., 2022).
Conventional supervised approaches necessitate
the availability of manually annotated datasets for
training, a process that can be both resource-intensive
and time-consuming. In contrast, unsupervised mod-
els offer a more flexible and adaptable alternative
by alleviating the dependency on annotated data, en-
abling them to be applied across diverse domains and
languages. Unsupervised methodologies are specif-
ically crafted to unveil latent patterns or structures
inherent within datasets lacking explicit labeling. A
notable illustration of such an approach is exem-
plified by Zhang and Elhadad (Zhang and Elhadad,
2013), who introduce an unsupervised technique for
biomedical named-entity recognition. The efficacy
of their method is assessed on the i2b2 and GENIA
corpora, yielding reported accuracies of 69.5% and
53.8%, respectively. Another unsupervised strategy,
CycleNER (Iovine et al., 2022), is centered around
learning the mapping between sentences and entities.
This approach undergoes evaluation on the CoNLL
and BC2GM datasets, with resulting average F mea-
sures of 68.6% and 38%, respectively. After the lit-
erature survey, we identified the following research
gaps:
- Traditional supervised approaches require manu-
ally annotated datasets for training, which can be
resource-intensive and time-consuming to create.
- Models trained on specifically labeled datasets
may have difficulty generalizing to new, unseen
data that differs significantly from the training set.
- Supervised models can be influenced by biases
present in annotated data.
Considering these gaps, we propose an unsuper-
vised approach combining SciBERT with LDA. After applying
LDA, each document can be represented as a distri-
bution over topics, and each topic is represented as a
distribution over words. These distributions can serve
as features for the documents. Detailed explanations
of the proposed architecture and the analysis of results
are provided in the following sections.
3 METHODOLOGY
The proposed model labels the entities within an input sentence, categorizing them into three distinct classes, viz., Disease, Chemical, and Protein.
Figure 1: The Pipeline model of the Proposed Unsupervised
Named Entity Recognition in Biomedical text.
3.1 Dataset Creation Module
We created distinct datasets for each Biomedical sub-
domain: Disease, Chemical, and Protein. To ex-
tract data, we used key phrases associated with Dis-
ease, Chemical, and Protein subdomains, focusing
on BC4CHEMD (Krallinger et al., 2015), NCBI-Disease (Doğan et al., 2014), and BC2GM (Smith et al., 2008) datasets. The list of key phrases con-
tains 550 disease names, 250 names of genes, and 100
chemical names. Examples of key phrases include
{‘arthralgia’, ‘atherogenesis’, ‘Duchenne muscular
dystrophy’, ‘pancreatic cysts and tumors’, ‘breast
and prostate cancer’, ‘Erythropoietin receptor’, ‘lep-
tin receptor’, ’phalloidin’, ’cyclin-dependent kinase
1’,..}. Examples of retrieved sentences related to each
subdomain are displayed in Table 1.
Table 1: Sample Sentences extracted using Keywords

Domain | Phrases
Disease | Duchenne muscular dystrophy (DMD) is a severe type of muscular dystrophy affecting boys. Retinoblastoma (Rb) is a rare cancer developing in retinal cells. Soft-tissue sarcoma (STS) is a malignant tumor in soft tissue, often growing painlessly over months or years.
Protein | The long-chain fatty acyl-CoA ligase enzyme activates fatty acid oxidation. ACSL5 gene encodes the long-chain-fatty-acid—CoA ligase 5 enzyme.
Chemical | The glycine transporter-1 inhibitor SSR103800 exhibits an antipsychotic-like profile in mice. Schizophrenia is linked to dopamine dysfunction. High doses of catecholamines, sedatives, and relaxants via a central venous catheter were ineffective.
In the process of constructing the training set, 80%
of the key phrases are used. Detailed information re-
garding the data statistics for the combined training
and test datasets within each subdomain is available
in Table 2.
Table 2: Statistics of the Dataset
Sub-domain Sentences Avg. Length Unique Words
Disease 2906 12.40 7836
Chemical 2837 11.88 16060
Protein 3503 11.90 18113
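To make the dataset creation step concrete, the following minimal Python sketch illustrates the key-phrase-based sentence retrieval described above. It is an illustrative sketch, not the exact implementation: the seed phrases shown are a small subset of the paper's lists (550 disease names, 250 gene names, 100 chemical names), and `all_sentences` is assumed to hold sentence-split text from the source corpora.

```python
import re

# Hypothetical seed key phrases per subdomain (a small illustrative subset).
SEED_PHRASES = {
    "Disease": ["duchenne muscular dystrophy", "retinoblastoma", "arthralgia"],
    "Chemical": ["phalloidin", "glycine"],
    "Protein": ["erythropoietin receptor", "cyclin-dependent kinase 1"],
}

def build_subdomain_corpus(sentences, phrases):
    """Collect sentences that mention at least one seed key phrase."""
    patterns = [re.compile(r"\b" + re.escape(p) + r"\b", re.IGNORECASE)
                for p in phrases]
    return [s for s in sentences if any(p.search(s) for p in patterns)]

# `all_sentences` is assumed to hold sentence-split text from the
# BC4CHEMD, NCBI-Disease, and BC2GM corpora.
# disease_corpus = build_subdomain_corpus(all_sentences, SEED_PHRASES["Disease"])
```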
Following data acquisition, we undertake funda-
mental pre-processing procedures. In the subsequent
subsection, a detailed explanation of the Entity Iden-
tification process is provided.
3.2 Entity Identification Module
The test sentences undergo an Entity Identification
module, which recognizes meaningful noun phrases
representing entities within the input sentence. For
entity extraction, we employed Open Information Ex-
traction systems (Del Corro and Gemulla, 2013). Af-
ter obtaining the entities, the subsequent step involves
labeling each entity into one of the three categories or
assigning it to the ’OTHER’ category. This process is
further detailed in the subsequent subsection.
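The ClausIE system cited above is Java-based; as a rough stand-in for the argument-extraction step, the sketch below uses spaCy noun chunks to produce candidate noun phrases. This substitution is an assumption for illustration only, not the system used in the paper.

```python
import spacy

# Rough stand-in for the OIE argument-extraction step: spaCy noun chunks
# approximate the candidate noun phrases to be labeled. Assumes the
# en_core_web_sm model has been downloaded.
nlp = spacy.load("en_core_web_sm")

def candidate_entities(sentence):
    """Return noun phrases that may denote biomedical entities."""
    return [chunk.text for chunk in nlp(sentence).noun_chunks]

# candidate_entities("Retinoblastoma is a rare cancer developing in retinal cells.")
# -> noun phrases such as "Retinoblastoma", "a rare cancer", "retinal cells"
```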
3.3 Biomedical Entity Recognition
using the extended SciBERT with
LDA model
The three major entities in the biomedical docu-
ments, namely, Disease, Chemical, and Protein en-
tities, are identified and labeled using the SciB-
ERT+LDA model. The overall architecture of the
SciBERT+LDA model is depicted in Figure 1. The
entire process is divided into four submodules and ex-
plained in the following subsections.
3.3.1 Subdomain Word Vectorization
The training datasets from the three biomedical sub-
domains are fed into the extended SciBERT and LDA
models. The extended SciBERT model produces
domain-specific word vectors that capture semantic
information. At the same time, the LDA model cre-
ates domain-specific topic distributions and word dis-
tributions, offering insights into the thematic content
of the respective subdomains.
Figure 2: Comparison of different versions of BERT Mod-
els
SciBERT (Beltagy et al., 2019) is an extension of
the BERT model tailored for scientific and biomedi-
cal text. Its architecture retains the transformer-based
design of BERT but incorporates domain-specific pre-
training and an expanded vocabulary to better han-
dle scientific language characteristics. As depicted in Figure 2, the number of tokens generated by the other BERT models for the word list 'cystic fibrosis', 'alpha and beta-thalassemias', 'Marfan syndrome', 'fragile X syndrome', 'Huntington disease', 'hemochromatosis' is higher than for SciBERT, because those models split such terms into more subwords. Consequently, for our experiments, we opted
for the SciBERT model. However, the SciBERT
model presents challenges, particularly concerning
the inclusion of domain-specific terms. To address
this issue, in our approach we augment the SciBERT
tokenizer’s vocabulary with domain-specific terms
(Tai et al., 2020). Once the training dataset is vec-
torized, the resulting subdomain word vectors are
stored in separate dictionaries named WE Disease,
WE Protein, and WE Chemical. These dictionaries
are then used to generate both Global Vectors and Lo-
cal Vectors, as detailed in the following sections.
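A minimal sketch of the tokenizer extension and subdomain word vectorization, assuming the Hugging Face allenai/scibert_scivocab_uncased checkpoint. The domain terms listed are illustrative, and newly added tokens receive randomly initialized embeddings that would require further training in practice (as in the exBERT approach cited above).

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

# Add domain terms that the stock vocabulary would fragment into many
# subwords, then resize the embedding matrix to match. Newly added rows
# are randomly initialized and would need further training in practice.
domain_terms = ["hemochromatosis", "thalassemias"]  # illustrative subset
tokenizer.add_tokens(domain_terms)
model.resize_token_embeddings(len(tokenizer))

def word_vector(word):
    """Mean of the last-hidden-state subword vectors for a single word."""
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0, 1:-1]  # drop [CLS]/[SEP]
    return hidden.mean(dim=0)

# WE_Disease = {w: word_vector(w) for w in disease_vocabulary}
```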
3.3.2 Topic Modeling using LDA Model
As shown in Figure 1, to identify the topics and as-
sociated word distributions, we employ topic model-
ing on the training datasets within each subdomain.
The main objective of topic modeling is to identify
the thematic structures within an extensive corpus of
text data. Common techniques for topic modeling
include Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization (NMF), Latent Semantic Analysis (LSA), and Probabilistic Latent Semantic Analysis (pLSA). We have used the most popular
approach, LDA (Blei et al., 2003b), for topic model-
ing. Subsequently, these topics undergo vectorization
using SciBERT, a semantic model leveraging the dis-
tributional properties of words.
Figure 3: Topic Distribution of the Three LDA Models: (a) Chemical, (b) Disease, (c) Protein.
We used two crucial hyperparameters in our experimental setup: α and
η. The parameter α governs document-topic density,
while η controls the topic-word density. For these hy-
perparameters, we opted for the default values pro-
vided by the Gensim library (https://pypi.org/project/gensim/). Additionally, we
also focused on the determination of two critical pa-
rameters: the number of topics required (T ) and the
number of words required in a topic (K). In the con-
text of LDA, the optimal number of topics (T ) is se-
lected based on coherence scores. The topics identi-
fied for each subdomain are presented in Figure 3.
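The topic-modeling step can be sketched with Gensim as follows. The candidate range for T and the c_v coherence measure are illustrative assumptions, with α and η left at Gensim's defaults as described above.

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

def fit_lda(token_lists, num_topics):
    """Fit an LDA model, keeping Gensim's default alpha/eta priors."""
    dictionary = Dictionary(token_lists)
    bow = [dictionary.doc2bow(tokens) for tokens in token_lists]
    lda = LdaModel(corpus=bow, id2word=dictionary,
                   num_topics=num_topics, random_state=0, passes=10)
    coherence = CoherenceModel(model=lda, texts=token_lists,
                               dictionary=dictionary,
                               coherence="c_v").get_coherence()
    return lda, coherence

# Choose T by coherence score; the candidate range is illustrative.
# best_T = max(range(2, 15), key=lambda t: fit_lda(token_lists, t)[1])
# lda.show_topic(topic_id, topn=K) then yields the (word, weight) pairs.
```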
3.3.3 Weighted Local Vector Generation
The Entity Identification module identifies the in-
put arguments to be labeled. Subsequently, these
input arguments undergo vectorization through the
weighted Local Vector creation module. We employ
a weighted scoring mechanism to adjust the signifi-
cance of words relative to each subdomain. Specif-
ically, if Arg represents an input argument with m
words (w_1, w_2, ..., w_m), we calculate the Local Vector (LV) as the weighted average of all m words in Arg using Equation (1):

LV(Arg) = \frac{\sum_{i=1}^{m} W_{w_i} \, Vec(w_i)}{m} \quad (1)

= \begin{cases} LV_C, & \text{if } W_{w_i} \in WW_C \text{ and } Vec(w_i) \in WE_C \\ LV_D, & \text{if } W_{w_i} \in WW_D \text{ and } Vec(w_i) \in WE_D \\ LV_P, & \text{if } W_{w_i} \in WW_P \text{ and } Vec(w_i) \in WE_P \end{cases}

GV(subdomain) = \frac{\sum_{j=1}^{T} \big( \sum_{i=1}^{K} W_{w_{ij}} \, Vec(w_{ij}) \big)}{T} \quad (2)

= \begin{cases} GV_C, & \text{if } W_{w_{ij}} \in TV_C \text{ and } Vec(w_{ij}) \in WE_C \\ GV_D, & \text{if } W_{w_{ij}} \in TV_D \text{ and } Vec(w_{ij}) \in WE_D \\ GV_P, & \text{if } W_{w_{ij}} \in TV_P \text{ and } Vec(w_{ij}) \in WE_P \end{cases}

In Equation (1), w_i represents the i-th word, and W_{w_i} denotes the weight associated with the word w_i. The vector corresponding to each of the m words is derived from the domain-specific word vectors stored in the dictionaries WE_C, WE_D, and WE_P. The weights are obtained from the word-weight dictionaries WW_C (Chemical), WW_D (Disease), and WW_P (Protein).
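A minimal sketch of the weighted Local Vector computation of Equation (1). The handling of words missing from the dictionaries is an assumption, since the paper does not specify out-of-vocabulary behavior.

```python
import torch

def local_vector(arg_words, word_weights, word_embeddings):
    """Weighted average of word vectors per Equation (1).

    word_weights is a WW_* dictionary (word -> LDA-derived weight) and
    word_embeddings the matching WE_* dictionary (word -> vector). Words
    missing from either dictionary are skipped; this is an assumption, as
    the paper does not specify out-of-vocabulary handling.
    """
    vecs = [word_weights[w] * word_embeddings[w] for w in arg_words
            if w in word_weights and w in word_embeddings]
    if not vecs:
        return None
    return torch.stack(vecs).sum(dim=0) / len(arg_words)

# LV_D = local_vector(["colorectal", "cancers"], WW_D, WE_D)
```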
3.3.4 Global Vector(GV) Generation
The Global Vector creation commences with the ap-
plication of Topic modeling to the subdomain dataset,
where major topics are identified based on coherence
scores. The topics undergo vectorization, and the
Global Vector is created utilizing Equation (2). To
break down the steps further, the Global Vector is
derived by first calculating the average of all subdo-
main Topic Vectors, denoted as TV_C, TV_D, and TV_P. This involves computing the vector representation for each word w_{ij} within a topic, where T signifies the total number of topics, K represents the number of words in a topic, and W_{w_{ij}} denotes the weight assigned to that word as defined by the LDA model in a subdomain. The function Vec(w_{ij}) extracts the embedding vector associated with the word w_{ij} from the subdomain dictionaries. This ensures that
the contextual embeddings of the words, as captured
in the subdomain-specific dictionaries, contribute to
the overall composition of the Global Vector. This
entire process is repeated for each subdomain, lead-
ing to the generation of three distinct Global Vectors,
viz., GV_C, GV_D, and GV_P.
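The Global Vector computation of Equation (2) can be sketched as follows, assuming a fitted Gensim LDA model whose show_topic method supplies the (word, weight) pairs per topic.

```python
import torch

def global_vector(lda, word_embeddings, T, K):
    """Global Vector per Equation (2): average over the T topics of the
    weighted sums of the top-K word vectors within each topic."""
    acc = torch.zeros_like(next(iter(word_embeddings.values())))
    for topic_id in range(T):
        for word, weight in lda.show_topic(topic_id, topn=K):
            if word in word_embeddings:  # skip OOV words (assumption)
                acc += weight * word_embeddings[word]
    return acc / T

# GV_D = global_vector(lda_disease, WE_D, T=best_T, K=10)
```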
3.3.5 Bio Named Entity Labeling- Chemical,
Disease, and Protein Entities
In this section, the labeling of the input argument is
determined based on its resemblance to subdomains.
As depicted in Figure 1, the input argument under-
goes a vectorization procedure, generating Local Vec-
tors tailored to each subdomain. Subsequently, these
Local Vectors are compared with the corresponding
set of three Global Vectors, namely GV_C, GV_D, and GV_P. The rationale behind this comparison
is to evaluate the relevance of the input argument
within the distinct contexts represented by the global
characteristics of each subdomain. In Equation (3),
Cos(LV, GV ) calculates the cosine similarity between
the Local and Global Vectors. Following the compu-
tation of the cosine similarity scores, the next step in-
volves identifying the Global Vector that exhibits the
maximum similarity with the Local Vector. The maxi-
mum similarity score is computed using Equation (4),
which shows the resemblance between the input argu-
ment and the identified subdomain’s Global Vector.
The final named entity label is determined by assign-
ing the tag (label) associated with the Global Vector,
as specified in Equation (5).
Label = \arg\max_{i} \{ Cos(LV_i, GV_i) : i \in \{C, D, P\} \} \quad (3)

Score = \max_{i} \{ Cos(LV_i, GV_i) : i \in \{C, D, P\} \} \quad (4)

ArgLabel = \begin{cases} C, & \text{if } Label = C \text{ and } Score \geq \theta_{chemical} \\ D, & \text{if } Label = D \text{ and } Score \geq \theta_{disease} \\ P, & \text{if } Label = P \text{ and } Score \geq \theta_{protein} \\ OTHER, & \text{otherwise} \end{cases} \quad (5)
Individual threshold values are assigned to establish precise criteria for each subdomain. The threshold parameter, θ, is the average similarity score between the training seed phrases and their corresponding Global Vectors. To illustrate, consider the disease domain, where the threshold value θ_disease = 0.65 is derived. This value is computed by determining the average cosine similarity score across one thousand training phrases compared to the Global Vector GV_D representative of the disease domain. Similar procedures are employed to find threshold values for the other domains. In the chemical domain, the threshold is set at θ_chemical = 0.61, representing the average similarity score between training phrases and the corresponding Global Vector GV_C for chemicals. Likewise, in the protein domain, the threshold is determined as
θ_protein = 0.60, reflecting the average cosine similarity score obtained from the training phrases in relation to the Global Vector GV_P specific to proteins.
4 RESULTS ANALYSIS
This section provides a detailed examination of
the outcomes and assessment of the introduced un-
supervised BioNER model. A test corpus comprising
600 sentences from each subdomain is utilized to con-
duct the evaluation. The entities are identified using
the OpenIE system. The similarity scores of sample phrases are presented in Table 3.
Table 3: Similarity Score of Global and Local Vectors

No. | Phrase | Cos Sim. GV_D | Cos Sim. GV_C | Cos Sim. GV_P | Pred. | Actual
1 | Abnormal color vision | 0.74 | 0.54 | 0.10 | DISEASE | DISEASE
2 | Colorectal cancers | 0.67 | 0.53 | 0.00 | DISEASE | DISEASE
3 | Glycine | 0.00 | 0.62 | 0.30 | CHEMICAL | CHEMICAL
4 | Palinurin | 0.00 | 0.00 | 0.00 | CHEMICAL | OTHER
5 | 5-hydroxytryptamin | 0.53 | 0.63 | 0.43 | PROTEIN | PROTEIN
6 | 5-kinase | 0.53 | 0.63 | 0.43 | CHEMICAL | PROTEIN
The model’s performance is rigorously assessed
using key metrics, including Precision, Recall, Accu-
racy, and F Score. The comprehensive performance
metrics of the proposed model are depicted in Figure
4. For a more in-depth understanding, we delve into
an entity-level analysis, the findings of which are pre-
sented in Figure 5. This analysis allows us to explore
the model’s performance at a granular level, offering
insights into how well it performs for each specific
entity.
Figure 4: Average performance metrics including Accuracy,
Precision, Recall, and F-Measure of the Proposed Unsuper-
vised BioNER Model.
Figure 5: Performance Metrics of the Proposed Unsuper-
vised BioNER Model for Identifying Disease, Chemical,
and Protein Entities
5 CONCLUSION
The presented methodology introduces an inno-
vative unsupervised approach for NER within the
Biomedical domain, known as the unsupervised
BioNER Model. This model specifically targets and
labels biomedical entities, including Disease, Chemi-
cal, and Protein. The core of the proposed approach
lies in the SciBERT+LDA model, which plays a piv-
otal role in the NER process. This model is designed
to create Global Vectors, essentially representative
vectors, for each distinct biomedical entity. These
Global Vectors encapsulate the overall characteristics
and significance of the respective entities within the
biomedical context. To enhance the model’s adapt-
ability to various domain-specific entities, a crucial
suggestion is made to extend the SciBERT tokenizer.
For the recognition of new inputs, the model gener-
ates Local Vectors using a sophisticated mechanism
involving a weighted score derived from the LDA
model and uses the cosine similarity of Local vec-
tors with Global vectors. Moreover, the proposed
BioNER Model offers seamless integration with any
Open Information Extraction (OIE) System. This
means that the model can be effortlessly incorporated
into existing OIE frameworks to carry out the label-
ing of arguments. The presented approach represents
a comprehensive and effective strategy for unsuper-
vised biomedical entity recognition. The suggested
methodology yielded an average F Measure of 86%, a
noteworthy achievement, particularly within the con-
text of an unsupervised setting, and indicates a commendable level of accuracy and effectiveness on the given task.
In our future endeavours, we aim to identify
entities, properties and relations that extend beyond
the confines of two sentences. The outcomes hold
significant promise for downstream applications,
particularly in creating knowledge bases. By in-
corporating the extracted information into existing
healthcare systems, professionals can access a wealth
of structured and interrelated data. This, in turn, can
contribute to decision-making regarding patient care
and treatment plans.
REFERENCES
Beltagy, I., Lo, K., and Cohan, A. (2019). SCIBERT: A pre-
trained language model for scientific text. In EMNLP-
IJCNLP 2019 - 2019 Conference on Empirical Meth-
ods in Natural Language Processing and 9th Inter-
national Joint Conference on Natural Language Pro-
cessing, Proceedings of the Conference.
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003a). Latent
Dirichlet allocation. Journal of Machine Learning Re-
search, 3(4-5).
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003b). Latent
Dirichlet allocation. Journal of Machine Learning Re-
search, 3(4-5).
Boudjellal, N., Zhang, H., Khan, A., Ahmad, A., Naseem,
R., Shang, J., and Dai, L. (2021). ABioNER: A BERT-
Based Model for Arabic Biomedical Named-Entity
Recognition. Complexity, 2021.
Brundha, J., Nair, P. C., Gupta, D., and Agarwal, J. (2023).
Name entity recognition for Air Traffic Control tran-
scripts using deep learning based approach. In 2023
IEEE 20th India Council International Conference,
INDICON 2023.
Chiu, J. P. and Nichols, E. (2016). Named Entity Recogni-
tion with Bidirectional LSTM-CNNs. Transactions of
the Association for Computational Linguistics, 4.
Del Corro, L. and Gemulla, R. (2013). ClausIE: Clause-
based open information extraction. In WWW 2013 -
Proceedings of the 22nd International Conference on
World Wide Web.
Doğan, R. I., Leaman, R., and Lu, Z. (2014). NCBI disease
corpus: A resource for disease name recognition and
concept normalization. Journal of Biomedical Infor-
matics, 47.
G, V., Gupta, D., and Kanjirangat, V. (2023). Semi Super-
vised Approach for Relation Extraction in Agriculture
Documents.
Gopalakrishnan, A., Soman, K. P., and Premjith, B. (2019).
A Deep Learning-Based Named Entity Recognition in
Biomedical Domain. In Lecture Notes in Electrical
Engineering, volume 545.
Gridach, M. (2017). Character-level neural network for
biomedical named entity recognition. Journal of
Biomedical Informatics, 70.
Iovine, A., Fang, A., Fetahu, B., Rokhlenko, O., and Mal-
masi, S. (2022). CycleNER: An Unsupervised Train-
ing Approach for Named Entity Recognition. In
WWW 2022 - Proceedings of the ACM Web Confer-
ence 2022.
KafiKang, M. and Hendawi, A. (2023). Drug-Drug Inter-
action Extraction from Biomedical Text Using Rela-
tion BioBERT with BLSTM. Machine Learning and
Knowledge Extraction, 5(2).
Kalyan, K. S., Rajasekharan, A., and Sangeetha, S. (2022).
AMMU: A survey of transformer-based biomedical
pretrained language models.
Krallinger, M., Rabal, O., Leitner, F., Vazquez, M., Salgado, D., Lu, Z., Leaman, R., Lu, Y., Ji, D., Lowe, D. M., Sayle, R. A., Batista-Navarro, R. T., Rak, R., Huber, T., Rocktäschel, T., Matos, S., Campos, D., Tang, B., Xu, H., Munkhdalai, T., Ryu, K. H., Ramanan, S. V., Nathan, S., Žitnik, S., Bajec, M., Weber, L., Irmer, M., Akhondi, S. A., Kors, J. A., Xu, S., An, X., Sikdar, U. K., Ekbal, A., Yoshioka, M., Dieb, T. M., Choi, M., Verspoor, K., Khabsa, M., Giles, C. L., Liu, H., Ravikumar, K. E., Lamurias, A., Couto, F. M., Dai, H. J., Tsai, R. T. H., Ata, C., Can, T., Usié, A., Alves, R., Segura-Bedmar, I., Martínez, P., Oyarzabal, J., and Valencia, A. (2015). The CHEMDNER corpus of chemicals and drugs and its annotation principles. Journal of Cheminformatics, 7.
Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., and
Kang, J. (2020). BioBERT: A pre-trained biomedi-
cal language representation model for biomedical text
mining. Bioinformatics, 36(4).
Li, J., Sun, A., Han, J., and Li, C. (2020). A Survey on
Deep Learning for Named Entity Recognition. IEEE
Transactions on Knowledge and Data Engineering.
Shahina, K. K., Jyothsna, P. V., Prabha, G., Premjith, B.,
and Soman, K. P. (2019). A Sequential Labelling Ap-
proach for the Named Entity Recognition in Arabic
Language Using Deep Learning Algorithms. In 2019
International Conference on Data Science and Com-
munication, IconDSC 2019.
Smith, L., Tanabe, L. K., Ando, R., Kuo, C. J., Chung, I. F.,
Hsu, C. N., Lin, Y. S., Klinger, R., Friedrich, C. M.,
Ganchev, K., Torii, M., Liu, H., Haddow, B., Stru-
ble, C. A., Povinelli, R. J., Vlachos, A., Baumgartner,
W. A., Hunter, L., Carpenter, B., Tsai, R. T. H., Dai,
H. J., Liu, F., Chen, Y., Sun, C., Katrenko, S., Adri-
aans, P., Blaschke, C., Torres, R., Neves, M., Nakov,
P., Divoli, A., Maña-López, M., Mata, J., and Wilbur,
W. J. (2008). Overview of BioCreative II gene men-
tion recognition.
Srivastava, S., Paul, B., and Gupta, D. (2022). Study
of Word Embeddings for Enhanced Cyber Security
Named Entity Recognition. In Procedia Computer
Science, volume 218.
Tai, W., Kung, H. T., Dong, X., Comiter, M., and Kuo, C. F.
(2020). exBERT: Extending pre-trained models with
domain-specific vocabulary under constrained train-
ing resources. In Findings of the Association for
Computational Linguistics Findings of ACL: EMNLP
2020.
Veena, G., Gupta, D., and Kanjirangat, V. (2023a). Semi-
Supervised Bootstrapped Syntax-Semantics-Based
Approach for Agriculture Relation Extraction for
Knowledge Graph Creation and Reasoning. IEEE Ac-
cess, 11.
Veena, G., Kanjirangat, V., and Gupta, D. (2023b).
AGRONER: An unsupervised agriculture named en-
tity recognition using weighted distributional seman-
tic model. Expert Systems with Applications, 229.
Wei, H., Gao, M., Zhou, A., Chen, F., Qu, W., Wang, C.,
and Lu, M. (2019). Named Entity Recognition from
Biomedical Texts Using a Fusion Attention-Based
BiLSTM-CRF. IEEE Access, 7.
Wei, Q., Chen, T., Xu, R., He, Y., and Gui, L. (2016). Dis-
ease named entity recognition by combining condi-
tional random fields and bidirectional recurrent neural
networks. Database, 2016.
Yao, L., Liu, H., Liu, Y., Li, X., and Anwar, M. W. (2015).
Biomedical Named Entity Recognition based on Deep
Neutral Network. International Journal of Hybrid In-
formation Technology, 8(8).
Zhang, Q., Sun, Y., Zhang, L., Jiao, Y., and Tian, Y. (2021).
Named entity recognition method in health preserving
field based on BERT. In Procedia Computer Science,
volume 183.
Zhang, S. and Elhadad, N. (2013). Unsupervised biomedi-
cal named entity recognition: Experiments with clini-
cal and biological texts. Journal of Biomedical Infor-
matics, 46(6).
Zhu, Y., Li, L., Lu, H., Zhou, A., and Qin, X. (2020).
Extracting drug-drug interactions from texts with
BioBERT and multiple entity-aware attentions. Jour-
nal of Biomedical Informatics, 106.