Diversifying Knowledge Enhancement of Biomedical Language Models

Using Adapter Modules and Knowledge Graphs

Juraj Vladika, Alexander Fichtl and Florian Matthes

Department of Computer Science, Technical University of Munich,

Boltzmannstraße 3, 85748 Garching bei München, Germany

Keywords: Natural Language Processing (NLP), Pre-Trained Language Models, Knowledge Graphs, Domain Knowledge, Knowledge Enhancement, Adapters, Biomedicine, Biomedical NLP.

Abstract:

Recent advances in natural language processing (NLP) owe their success to pre-training language models on large amounts of unstructured data. Still, there is an increasing effort to combine the unstructured nature of LMs with structured knowledge and reasoning. Particularly in the rapidly evolving field of biomedical NLP, knowledge-enhanced language models (KELMs) have emerged as promising tools to bridge the gap between large language models and domain-specific knowledge, considering the available biomedical knowledge graphs (KGs) curated by experts over the decades. In this paper, we develop an approach that uses lightweight adapter modules to inject structured biomedical knowledge into pre-trained language models (PLMs). We use two large KGs, the biomedical knowledge system UMLS and the novel biochemical ontology OntoChem, with two prominent biomedical PLMs, PubMedBERT and BioLinkBERT. The approach includes partitioning knowledge graphs into smaller subgraphs, fine-tuning adapter modules for each subgraph, and combining the knowledge in a fusion layer. We test the performance on three downstream tasks: document classification, question answering, and natural language inference. We show that our methodology leads to performance improvements in several instances while keeping requirements in computing power low. Finally, we provide a detailed interpretation of the results and report valuable insights for future work.

1 INTRODUCTION

The ﬁeld of natural language processing (NLP) has

been marked by impressive advancements in recent

years. The appearance of new model architectures,

including the emergence of generative transformers

and pre-trained language models (PLMs), has brought

along with it widespread usage and attention. Still,

most of these models were trained on large amounts

of web content, and while they excel at tasks in a

general-purpose setting, there is still a performance

gap when it comes to domain-speciﬁc challenges.

One of these challenging domains is biomedicine, which centers around the study of the human body, diseases, drugs, and treatments. Biomedical text is often characterized as highly complex because of its advanced terminology, which frequently includes names of chemical compounds, long-spanning relations, and other jargon not commonly used in everyday language. For NLP models trained on general corpora to work well in the biomedical domain, researchers have turned to transfer learning methods and domain adaptation. The most common approach to domain adaptation is to continue the initial general pre-training of language models with data from domain-specific medical corpora. Examples of models adapted in this way are BioBERT (Lee et al., 2019) and SciBERT (Beltagy et al., 2019), which drew the additional training data from biomedical and computer science research abstracts. Dropping the mixed-domain approach from previous frameworks, models like PubMedBERT (Gu et al., 2020) and BioLinkBERT (Yasunaga et al., 2022) were instead trained solely on PubMed research articles, with BioLinkBERT even leveraging links (citations) to other research articles.

While domain fine-tuning of whole PLMs has proven to increase the performance on downstream biomedical NLP tasks, additional pre-training can often be resource-intensive and infeasible for smaller research groups and situations where computing power is limited. A promising research direction has emerged in the form of knowledge-enhanced language models (KELMs) (Hu et al., 2023). The term refers to any set of methods that try to incorporate external knowledge into language models, usually by injecting it into the model’s input, architecture, or output. Among the many knowledge-enhancement methods, an especially interesting one is the utilization of adapters.

Vladika, J., Fichtl, A. and Matthes, F.
Diversifying Knowledge Enhancement of Biomedical Language Models Using Adapter Modules and Knowledge Graphs.
DOI: 10.5220/0012395200003636
Paper published under CC license (CC BY-NC-ND 4.0)
In Proceedings of the 16th International Conference on Agents and Artificial Intelligence (ICAART 2024) - Volume 2, pages 376-387
ISBN: 978-989-758-680-4; ISSN: 2184-433X

Broadly speaking, adapters are small bottleneck feed-forward layers inserted within each layer of a transformer-based language model (Houlsby et al., 2019; Pfeiffer et al., 2020b). The small number of additional parameters allows for the injection of new data or knowledge without requiring the whole model to be fine-tuned. Adapters added to a large language model often contain only around 1% as many trainable parameters as the transformer itself. The transformer model’s learned parameters (weights) are frozen and left unchanged, and only the adapter is fine-tuned. Other than being lightweight on resources, this approach also helps avoid the problem of catastrophic forgetting, where language models forget their existing knowledge from the pre-training corpora when they are fine-tuned on a new, smaller corpus (Colon-Hernandez et al., 2021).
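To make the bottleneck idea concrete, the following is a minimal pure-Python sketch of a single adapter: a down-projection, a non-linearity, an up-projection, and a residual connection. The hidden size of 768, the bottleneck size of 48, the ReLU activation, and the zero initialization of the up-projection are illustrative assumptions and not settings reported in this paper; the function names are hypothetical.

```python
import random

def init_adapter(hidden_size, bottleneck_size, seed=0):
    """Create the weights of one bottleneck adapter (hypothetical helper)."""
    rng = random.Random(seed)
    return {
        # Down-projection: hidden_size -> bottleneck_size
        "W_down": [[rng.gauss(0.0, 0.02) for _ in range(bottleneck_size)]
                   for _ in range(hidden_size)],
        "b_down": [0.0] * bottleneck_size,
        # Up-projection starts at zero, so the adapter is initially an identity map
        "W_up": [[0.0] * hidden_size for _ in range(bottleneck_size)],
        "b_up": [0.0] * hidden_size,
    }

def adapter_forward(h, p):
    """Apply the adapter to one hidden vector h (a list of floats)."""
    # Down-project and apply ReLU
    z = [max(0.0, sum(h[i] * p["W_down"][i][j] for i in range(len(h))) + p["b_down"][j])
         for j in range(len(p["b_down"]))]
    # Up-project back to the hidden size
    up = [sum(z[j] * p["W_up"][j][i] for j in range(len(z))) + p["b_up"][i]
          for i in range(len(h))]
    # Residual connection: the frozen model's representation passes through unchanged
    return [hi + ui for hi, ui in zip(h, up)]

def count_params(p):
    """Number of trainable parameters in the adapter."""
    return (len(p["W_down"]) * len(p["W_down"][0]) + len(p["b_down"])
            + len(p["W_up"]) * len(p["W_up"][0]) + len(p["b_up"]))

adapter = init_adapter(hidden_size=768, bottleneck_size=48)
h = [random.Random(1).gauss(0.0, 1.0) for _ in range(768)]
out = adapter_forward(h, adapter)
print(count_params(adapter))
```

Under these assumed sizes, one adapter has 74,544 parameters. Placing one such adapter in each of 12 transformer layers would add roughly 0.9 million parameters, which is below 1% of a BERT-base-sized model with about 110 million parameters.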

This paper speciﬁcally focuses on using adapters

to inject structured biomedical knowledge from large

knowledge graphs into PLMs. We provide an

overview of existing adapter approaches for the

biomedical domain, as well as existing biomedi-

cal language models. We perform extensive ex-

periments to test the performance of knowledge-

enhanced, adapter-based biomedical language models

on a number of representative biomedical classiﬁca-

tion tasks (document classiﬁcation, question answer-

ing, natural language inference). We show that the

model performance is improved in several instances

on downstream tasks and provide a deeper look into

the resulting change in model predictions. Finally,

our experiments demonstrate that the OntoChem on-

tology (Irmer et al., 2013), which has not been used

for knowledge enhancement yet, is a viable alterna-

tive to other prominent knowledge sources.

2 RELATED WORK

2.1 Knowledge-Enhanced PLMs

PLMs are trained on enormous corpora of training

data, ranging from 3.3 billion tokens in the case of the

original BERT (Devlin et al., 2019), all the way to 3.5

trillion tokens in the case of the recent Falcon-180B

model (Almazrouei et al., 2023). The power of the

model architecture, combined with transfer learning,

has led to these models showing impressive capabili-

ties on most NLP tasks. While the textual data used

for the model training is usually completely unstruc-

tured in nature, research has shown that models like

BERT do encode, to some extent, syntactic structures,

hierarchical concepts, and certain semantic concep-

tual relations (Rogers et al., 2021). Still, other studies

have shown weakness in modeling tasks dealing with

structured knowledge, such as hyponymy relations

(Ravichander et al., 2020) or preserving the associ-

ation between text and meaning (Di Sciullo, 2018).

In most cases, the knowledge we ﬁnd and gather,

especially scientiﬁc knowledge, can be represented in

a structured manner. This is the underlying idea of

knowledge graphs (KGs), a data structure that mod-

els concepts (entities) and relations between them

in a graph-like format (Ji et al., 2021). KGs have

been used in the ﬁeld of NLP to enhance the per-

formance of NLP models in many downstream NLP

tasks (Schneider et al., 2022). There are multiple

ways to combine KGs with PLMs. The knowledge

triples from KGs can be embedded as vector represen-

tations such as TransE (Wang et al., 2014) or TuckER

(Balazevic et al., 2019) and then combined with the

vectors encoding text. Alternatively, the triples from

KGs can be converted to sentences, and, in turn, these

textual representations can then be used to ﬁne-tune

PLMs in the same way as with any other text. This

approach was followed by COMET (Bosselut et al.,

2019), which utilized the knowledge graph Concept-

Net (Speer et al., 2017) to enhance the performance

on commonsense reasoning tasks. Besides knowledge

graphs, lexicons are sometimes used for knowledge

enhancement (Hoang et al., 2022).

While there are numerous ways to inject struc-

tured knowledge into PLMs such as adding it to

the input and output of models (Wei et al., 2021),

an especially promising approach is adding adapters

to the architecture of the model (Colon-Hernandez

et al., 2021). Adapters are small layers that are in-

serted within a language model and are subsequently

ﬁne-tuned to a speciﬁc task. The major beneﬁt of

adapters is that they add a minimal amount of ad-

ditional parameters, thus signiﬁcantly reducing the

needed training time. Combined with freezing orig-

inal model weights, adapters can avoid catastrophic

forgetting, where the PLM’s performance deteriorates

when all of its weights are ﬁne-tuned with a new

knowledge source. Adapters have been used for nu-

merous purposes such as learning hierarchical rep-

resentation (Chronopoulou et al., 2022), transferring

models trained on English to low-resource languages

(Wang et al., 2021), and in the domain of efﬁcient

transformers as low-rank adapters (LoRA) (Hu et al.,

2022). General knowledge-enhanced PLMs utiliz-

ing adapters include, for example, KnowBERT (Pe-


ters et al., 2019) and K-Adapter (Wang et al., 2020).

A practical tool emerged that combines well-known

adapter architectures in one place, called AdapterHub

(Pfeiffer et al., 2020b).

2.2 Biomedical Knowledge-Enhanced

PLMs

A major focus of knowledge enhancement in PLMs is domain adaptation to expert domains such as the biomedical domain. So far, most of the advancements have focused on utilizing the knowledge graph UMLS (Bodenreider, 2004) for this purpose. Examples include BERT-MK (He et al., 2020) and KeBioLM (Yuan et al., 2021), which both fine-tune all weights of the base language model by using masked language modeling of triples from UMLS. Biomedical PLMs can then be used for various NLP tasks, such as biomedical text summarization (Abacha et al., 2021), named entity recognition (Sung et al., 2022), medical fact-checking (Vladika and Matthes, 2023), information retrieval (Luo et al., 2022), or health question answering (Vladika et al., 2023).

There are also existing approaches using adapters for biomedical knowledge enhancement. Representative works are DAKI (Lu et al., 2021), which fine-tunes the adapters with an entity prediction task, and KEBLM (Lai et al., 2023), which fine-tunes the adapters on three different knowledge types from UMLS and PubChem (Kim et al., 2019), namely entity descriptions, entity-entity relations, and entity synonyms. The approach most similar to ours, and a direct inspiration, is the Mixture-of-Partitions (MoP) approach (Meng et al., 2021), where the adapters were fine-tuned on smaller subgraphs of UMLS.

Figure 1: Triplet from the OntoChem Fact Finder
https://sciwalker.com/analytics/factfinder

In essence, our work builds on the present foundations of adapter-based biomedical models and uses the yet unexplored knowledge graph OntoChem, which is rich with chemical knowledge. For our experiments, we use the well-known biomedical PLM PubMedBERT as well as the yet unexplored but powerful BioLinkBERT base model. Following the suggestions of (Meng et al., 2021), we use only the triplets corresponding to the 20 most frequent relations of OntoChem for the knowledge injection. An example of an OntoChem triplet can be seen in Figure 1. Finally, we provide a deeper qualitative analysis of learned structured knowledge on a specific dataset. Notably, our work achieves state-of-the-art (SOTA) performance, averaged over multiple runs, on the question-answering BioASQ-7b dataset.

3 METHODOLOGY

In this section, we will explain the training method-

ology we used for the experiments in this paper. It is

depicted in Figure 2.

3.1 Knowledge Graph Representation

A central element of our method is the knowledge

graph (KG). This KG is a structured representation of

information denoted as a collection of ordered triples

(Ji et al., 2021). We denote these triples as (s, r, o),

where s is a subject, r is a relation, and o is an object.

Both s and o are entities that come from an entity set

E, while relations come from a relation set R. Each

entity and relation in the KG is associated with its cor-

responding textual surface form. This form can take

the shape of a single word or a compound term (e.g.,

for names of chemicals) or even a concise phrase, es-

pecially in case of relations. This textual association

is critical as it bridges the gap between the structured

KG and natural language, allowing for easier injec-

tion of KG knowledge into the language models and

associated ﬁne-tuning.

The primary objective is to enhance the capabilities of a pre-trained language model, denoted as LM, by integrating the knowledge contained within the KG. To achieve this, we need a training objective that effectively incorporates the KG knowledge into the model. Most encoder-only PLMs based on the original BERT use masked language modeling (MLM) as one of their pre-training objectives. This task consists of masking a certain word in a given sentence and having the model predict which word would fit best in the place of the masked token. We follow the established approach of using an entity prediction objective, where we mask one of the entities and have the model predict which entity would best fit in its place. In this way, the model incorporates the structured knowledge of (s, r, o) triples into its internal weights.


Figure 2: Methodology used to construct the ﬁnal model and run the experiments.

3.2 Graph Partitioning

During the prediction of the masked token, the model produces a probability distribution (with a softmax function) over all of the entities from the KG’s entity set E. Considering the massive size of the biomedical KGs we use in the paper, computing the softmax function over all their entities is computationally expensive. This issue can significantly slow down model training and inference. To address this challenge, some approaches have been suggested in the literature. We opt for the approach of (Meng et al., 2021), which involves partitioning the KG into smaller subgraphs, training on each of them independently, and later combining their knowledge into unified knowledge representations.

The process of dividing a KG yields smaller subgraphs that we denote as G_1, G_2, ..., G_k. We set k to 20 in the final experiments, following empirical observations and previous literature, as this value balances efficiency and graph coverage well. Ideally, these 20 smaller subgraphs should be almost equal-sized components, meaning nodes are balanced across partitions. Additionally, the capacity of edges between different components should be minimized to maximize the retention of factual knowledge. This problem is known as balanced graph partitioning and is NP-complete (Andreev and Räcke, 2004). Several good approximation algorithms have been developed to find near-optimal solutions despite this computational complexity. We opt for the METIS algorithm (Karypis and Kumar, 1997), which was used in other works dealing with large-scale KG partitioning (Zheng et al., 2020).
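The two goals of balanced graph partitioning, equal-sized parts and few edges crossing between parts, can be illustrated with a small sketch. The code below is not METIS; it is a naive stand-in that orders entities by breadth-first search and cuts the ordering into k chunks, and it is meant only to show what a partition assignment and its edge cut look like. The toy triples and function names are hypothetical.

```python
from collections import Counter, defaultdict, deque

def greedy_balanced_partition(triples, k):
    """Assign each entity to one of k near-equal parts (naive heuristic, not METIS)."""
    neighbors = defaultdict(set)
    for s, _, o in triples:
        neighbors[s].add(o)
        neighbors[o].add(s)
    entities = sorted(neighbors)
    capacity = -(-len(entities) // k)  # ceiling division: maximum entities per part
    # A BFS ordering keeps connected entities next to each other
    order, seen = [], set()
    for start in entities:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            order.append(node)
            for nb in sorted(neighbors[node]):
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
    # Cut the ordering into consecutive chunks of size `capacity`
    return {entity: idx // capacity for idx, entity in enumerate(order)}

def edge_cut(triples, assignment):
    """Count triples whose subject and object fall into different parts."""
    return sum(1 for s, _, o in triples if assignment[s] != assignment[o])

# Hypothetical toy triples, for illustration only
triples = [
    ("aspirin", "treats", "headache"),
    ("aspirin", "inhibits", "COX-1"),
    ("ibuprofen", "treats", "headache"),
    ("insulin", "treats", "diabetes"),
    ("insulin", "binds to", "insulin receptor"),
    ("metformin", "treats", "diabetes"),
]
assignment = greedy_balanced_partition(triples, k=2)
part_sizes = Counter(assignment.values())
cut = edge_cut(triples, assignment)
print(part_sizes, cut)
```

Every triple that crosses a partition boundary is lost to the per-subgraph training, which is why the edge cut should be small. A dedicated partitioner such as METIS optimizes this objective far better than the heuristic shown here.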

3.3 Adapter Module Learning

Once the KG is appropriately partitioned, the pro-

cess of ﬁne-tuning the LM can be started. We deploy

adapter modules for this purpose. As mentioned pre-

viously, adapters are newly initialized feed-forward

networks inserted between the transformer model’s

layers. Notably, the training of adapter modules does

not require ﬁne-tuning the existing parameters of the

pre-trained model. Instead, it focuses solely on updat-

ing the parameters within the adapters. This strategy

ensures that the pre-trained model’s core knowledge

remains intact while enabling the model to specialize

in the biomedical domain by adapting to the speciﬁc

knowledge contained in the KG.

There are multiple adapter module configurations, such as (Houlsby et al., 2019) and (Bapna and Firat, 2019). The adapter module configuration used in the paper is based on the one by (Pfeiffer et al., 2020a), the so-called Pfeiffer architecture. In this configuration, only one adapter module, consisting of a down-projection and an up-projection, is added to each transformer layer, unlike the Houlsby architecture, where two such modules are added per layer. While the Houlsby architecture has more learning capacity, it comes with training and inference speed costs. Previous studies showed no significant difference in performance between the two architectures, making the Pfeiffer architecture a lightweight choice that retains powerful learning capabilities.

As already mentioned, masked language modeling is used to fine-tune the adapter modules. More precisely, it is a task of entity prediction since a missing entity from the graph triple is being predicted. Given a subgraph G_i and its triples (s, r, o), each triple has a textual representation. The object entity o is


removed from each triple, and the remaining two elements of the triple are transformed into a textual representation like: “[CLS] s [SEP] r [SEP]”. The adapter module is then trained to predict the missing object entity using the representation of the [CLS] token. The parameters of the adapter module are optimized by minimizing the cross-entropy loss.
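The entity prediction setup described above can be sketched in a few lines. The example builds the textual input for a triple with the object removed and computes a softmax cross-entropy loss over the entities of one subgraph. The logits would in practice come from a classification head on top of the [CLS] representation; here they are placeholder values, and the entity vocabulary, the triple, and the function names are hypothetical.

```python
import math

def build_example(triple, entity_to_id):
    """Turn a (s, r, o) triple into an input string and the id of the object to predict."""
    s, r, o = triple
    text = f"[CLS] {s} [SEP] {r} [SEP]"  # the object entity is left out of the input
    return text, entity_to_id[o]

def cross_entropy(logits, target_id):
    """Numerically stable softmax cross-entropy for a single example."""
    m = max(logits)
    log_normalizer = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_normalizer - logits[target_id]

# Hypothetical entity set of one subgraph
entity_to_id = {"headache": 0, "COX-1": 1, "diabetes": 2}
text, label = build_example(("aspirin", "treats", "headache"), entity_to_id)

# Placeholder logits: an untrained model that scores all entities equally
logits = [0.0, 0.0, 0.0]
loss = cross_entropy(logits, label)
print(text, label, loss)
```

Because the softmax runs only over the entities of a single subgraph, its cost grows with the subgraph size rather than with the size of the full KG, which is the motivation for the partitioning step.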

3.4 Knowledge Fusion

Finally, with a set of knowledge-encapsulated adapter

modules at hand, we need to fuse their knowledge to-

gether into a ﬁnal representation. For this, we use

the so-called AdapterFusion mixture layers (Pfeiffer

et al., 2020a). These layers serve the purpose of com-

bining knowledge from various adapters to enhance

the model’s performance on downstream tasks. It is

a relatively recent approach designed to effectively

learn how to combine information from a set of task-

speciﬁc adapters. It does so by employing a softmax

attention mechanism that assigns contextual mixture

weights over the adapters. These weights are then

used to predict task labels in the ﬁnal layer. The

composition of these layers and their interactions ulti-

mately contribute to the model’s ability to generalize

and perform well on a range of tasks.
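The attention-based mixing can be sketched as follows. This is a simplified illustration rather than the AdapterFusion implementation: the learned query, key, and value projections are omitted (treated as identity), so the score of each adapter is simply the dot product between the layer's hidden vector and that adapter's output. The vectors and function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(hidden, adapter_outputs):
    """Mix adapter outputs with attention weights computed from the hidden vector."""
    # One score per adapter: dot product of the query (hidden) and the key (adapter output)
    scores = [sum(h * z for h, z in zip(hidden, out)) for out in adapter_outputs]
    weights = softmax(scores)
    # Weighted sum of the adapter outputs, dimension by dimension
    fused = [sum(w * out[i] for w, out in zip(weights, adapter_outputs))
             for i in range(len(hidden))]
    return weights, fused

# Hypothetical 2-dimensional example with three adapters
hidden = [1.0, 0.0]
adapter_outputs = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
weights, fused = fuse(hidden, adapter_outputs)
print(weights, fused)
```

In this toy case, the first adapter receives the largest weight because its output aligns with the hidden vector, which mirrors how the mixture weights depend on the context of each input.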

4 EXPERIMENTS

In this section, we describe our approach to leveraging data from OntoChem’s SciWalker platform together with adapters to improve existing approaches to biomedical KELMs. For reproducibility, we made the code for the experiment runs available on GitHub.

https://github.com/alexander-fichtl/diversifying
KELMs.git

4.1 Datasets

All of our datasets, with the exception of MedNLI, originate from a collection of common biomedical NLP tasks known as BLURB (Biomedical Language Understanding and Reasoning Benchmark). Inspired by a similar suite of tasks for general-purpose natural language understanding (NLU) known as GLUE (Wang et al., 2018), BLURB covers a wide range of tasks related to biomedical NLU. None of these tasks involve text generation; all are essentially classification tasks, which makes them convenient to evaluate with common classification metrics such as precision, recall, accuracy, and F1 score. The four datasets are described below.

https://microsoft.github.io/BLURB/index.html

MedNLI (Romanov and Shivade, 2018) is a dataset

for natural language inference (NLI). It consists of

14,049 unique sentence pairs, where one sentence is

a hypothesis, and the other one is a premise. The task

is to infer whether the premise entails the hypothesis,

contradicts it, or is in a neutral relation with respect

to it. The premises were collected from MIMIC-III

(Johnson et al., 2016), the largest repository of pub-

licly available clinical data (patient notes).

BioASQ-7b (Nentidis et al., 2020) is a biomedical question answering (QA) benchmark dataset containing questions in English, along with gold standard (reference) answers and related material. It has been designed to reflect real information needs of biomedical experts. In addition to exact answers, the BioASQ dataset also includes ideal answers (summaries). Researchers working on paraphrasing and textual entailment can also measure the degree to which their methods improve the performance of biomedical QA systems. The dataset is part of the ongoing shared challenge of the same name (Tsatsaronis et al., 2015); the version we use (7b) is from the 2019 challenge.

PubMedQA (Jin et al., 2019) is a different QA

dataset collected from PubMed abstracts, the largest

collection of biomedical research papers (White,

2020). The task of PubMedQA is to answer research

questions with yes/no/maybe using the corresponding

abstracts. The dataset has 1,000 expert-annotated in-

stances of question-answer pairs. Each PubMedQA

instance is composed of a question, a context (abstract

without the conclusion), a long answer (conclusion of

the abstract), and a yes/no/maybe label that summa-

rizes the conclusion.

The Hallmarks of Cancer (HoC) Corpus (Baker et al., 2015) consists of 1,852 PubMed publication abstracts manually annotated by experts according to a taxonomy. The taxonomy consists of 37 classes in a hierarchy. Zero or more class labels are assigned to each sentence in the corpus. These hallmarks refer to the alterations in cell behavior that characterize the cancer cell. Proposed as a strategy to capture the complexity of cancer in a few basic principles, the hallmarks provide an organized framework comprising ten hallmarks (Baker et al., 2017).

4.2 Knowledge Sources

The Unified Medical Language System (UMLS) is a set of resources and tools developed by the US National Library of Medicine (NLM) to facilitate the integration and retrieval of biomedical and clinical information from various sources (Bodenreider, 2004). Created in 1986 and continuously developed over the decades, it can be viewed as a comprehensive thesaurus and ontology of biomedical concepts, making it easier to connect and use medical terminology in research, clinical practice, and healthcare information systems. We use the most recent SNOMED CT, US Edition vocabulary from September 2023.

Table 1: Twenty most common relations in each of the three KGs used in the experiments.

UMLS20 | #Triples | Onto20Fused | #Triples | Onto20Type | #Triples
has finding site | 367,237 | relates to | 708,076 | [protein] relates to [disease] | 295,841
has method | 275,398 | induces | 502,512 | [substance] induces [physiology] | 282,721
has associated morphology | 269,729 | modulates | 326,534 | [food] contains [compound] | 269,211
has procedure site | 129,686 | treats | 225,279 | [substance] treats [disease] | 247,348
possibly equivalent to | 91,446 | inhibits | 219,720 | [biomarker] of [disease] | 205,604
has causative agent | 86,780 | is analyzed by | 195,291 | [substance] is analyzed by [method] | 130,275
interprets | 84,533 | produces | 173,979 | [plant] produces [compound] | 102,270
has direct procedure site | 83,749 | increases activity of | 148,673 | [protein] induces [physiology] | 85,411
has active ingredient | 63,792 | contains | 133,241 | [compound] increases activity of [protein] | 85,196
has pathological process | 54,639 | increases | 110,803 | [compound] decreases activity of [protein] | 72,311
has occurrence | 40,154 | detects | 93,373 | [substance] inhibits [physiology] | 68,728
has dose form | 30,940 | decreases activity of | 85,425 | [protein] is a [biomarker] | 65,558
has direct morphology | 29,667 | prevents | 82,574 | [anatomy] produces [protein] | 64,206
has intent | 25,907 | increases expression of | 80,771 | [substance] prevents [disease] | 60,260
has interpretation | 24,624 | expresses | 62,142 | [protein] induces [disease] | 59,577
has direct substance | 23,042 | attenuates | 54,865 | [substance] modulates [protein] | 54,533
has direct device | 17,726 | decreases expression of | 51,152 | [protein] is analyzed by [method] | 54,250
moved to | 17,507 | binds to | 49,206 | [method] treats [disease] | 35,768
has temporal context | 17,195 | is a | 47,435 | [method] detects [physiology] | 33,504
has subject relationship context | 16,926 | affects expression of | 37,399 | [protein] modulates [physiology] | 24,332
Total | 1,750,677 | | 3,388,450 | | 2,296,904

The second knowledge graph, more precisely an ontology, that we use is the OntoChem Ontology (Irmer et al., 2013). The ontology contains more than 900 complex relationships between two or more named entities. Entities include chemical compounds, diseases, drug combinations, chemical reactions, biological activities, adverse reactions, etc. Relationships can be downloaded as RDF files. The data originates from MedLine, a bibliographic database from the US National Library of Medicine (NLM) that contains more than 30 million journal articles focusing on medicine and life sciences. The KG triples can be interactively queried and also downloaded from the SciWalker platform with the Fact Finder tool.

https://www.nlm.nih.gov/healthit/snomedct/
us edition.html
https://www.nlm.nih.gov/medline/index.html
https://sciwalker.com/analytics/factfinder

4.3 KG Subsets

The versions of the KGs from the two knowledge sources we use in this work are subsets of their respective full KGs: we use versions including only the top 20 most common relations. This was done to increase the efficiency of training but also because initial experiments showed this smaller version does not hurt the performance on downstream tasks. For UMLS, the list of most common relations was taken from MoP and SNOMED, a systematically organized collection of medical terms providing codes, terms, synonyms and definitions used in clinical documentation and reporting. We label this KG as UMLS20.

The relations provided by OntoChem are unique to the type of entities that the relation connects, so there can be several types of the same relation. For example, the relation “induces” can have a “substance” as a subject and a “disease” as an object, so the full relation becomes “[substance] induces [disease]”, while another one has a “physiology” as a subject and a “disease” as an object, producing “[physiology] induces [disease]”. To compare the performance of these two variants, we produce both a KG with the top 20 fused relations (independent of entity types) and one with the top 20 typed relations (dependent on entity types). We call these two KGs Onto20Fused and Onto20Type.
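The difference between fused and typed relations comes down to which key is used when counting triples. The sketch below illustrates this under the assumption that each triple carries the entity types of its subject and object; the tuple layout, the toy triples, and the function name are hypothetical and do not reflect the actual OntoChem export format.

```python
from collections import Counter

def top_relations(triples, k, typed):
    """Return the k most frequent relations, either fused or typed.

    Each triple is (subject_type, subject, relation, object_type, object).
    """
    counts = Counter()
    for s_type, _, relation, o_type, _ in triples:
        # Typed: the entity types are part of the relation key; fused: they are ignored
        key = f"[{s_type}] {relation} [{o_type}]" if typed else relation
        counts[key] += 1
    return counts.most_common(k)

# Hypothetical toy triples, for illustration only
triples = [
    ("substance", "aspirin", "induces", "disease", "gastric ulcer"),
    ("physiology", "inflammation", "induces", "disease", "fibrosis"),
    ("substance", "aspirin", "treats", "disease", "headache"),
    ("substance", "ibuprofen", "treats", "disease", "headache"),
    ("substance", "ethanol", "induces", "disease", "hepatitis"),
]
fused = top_relations(triples, k=2, typed=False)
typed = top_relations(triples, k=3, typed=True)
print(fused)
print(typed)
```

In the fused view, all “induces” triples count toward one relation, whereas the typed view splits them by entity types, so a top-k cutoff selects a different set of triples in each case. This is consistent with the different totals of Onto20Fused and Onto20Type in Table 1.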

The top 20 relations in each of the three KGs are shown in Table 1. This also gives a good insight into what kind of structured knowledge is actually contained in these manually curated biomedical knowledge bases. While there are certain overlaps between the top relations of UMLS and OntoChem, a lot of them refer to different types of interactions between entities. Therefore, a promising research avenue that we did not explore in this work would be to merge these two knowledge bases into a unified KG and use both to fine-tune the adapters.

4.4 Setup

Task-specific fine-tuning is carried out for the four chosen benchmark downstream tasks. We aligned our hyperparameters with the settings recommended by the BLURB creators (Gu et al., 2020): we deploy the Adam optimizer (Zhang, 2018) alongside the typical slanted triangular learning rate schedule, with a warm-up for the initial 10 percent of steps and a cool-down for the subsequent 90 percent, and set the dropout probability at 0.1. Furthermore, we followed (Pfeiffer et al., 2020a) and (Meng et al., 2021) by introducing mixture layers and AdapterFusion to route valuable knowledge from the adapters to downstream tasks automatically. Given the random initialization of the task-specific model and dropout, outcomes can fluctuate based on different random seeds, particularly for the small PubMedQA and BioASQ7b datasets. For a more reliable estimate, we present average results from ten runs for BioASQ7b and PubMedQA, five runs for HoC, and three for MedNLI, as done in related biomedical NLP papers benchmarking these tasks.
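The learning rate schedule described above can be written as a small function. The sketch assumes a linear warm-up from zero to the peak learning rate over the first 10 percent of steps, followed by a linear decay to zero over the remaining 90 percent; whether the original runs decay all the way to zero is an assumption, and the function name and step count are hypothetical.

```python
def slanted_triangular_lr(step, total_steps, peak_lr, warmup_fraction=0.1):
    """Linear warm-up to peak_lr, then linear cool-down to zero."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        # Warm-up phase: grow linearly from 0 to peak_lr
        return peak_lr * step / warmup_steps
    # Cool-down phase: shrink linearly from peak_lr to 0
    remaining = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / remaining)

# Example with the HoC learning rate from Table 2 and a hypothetical 1,000 training steps
schedule = [slanted_triangular_lr(s, total_steps=1000, peak_lr=1e-5)
            for s in (0, 50, 100, 550, 1000)]
print(schedule)
```

The peak is reached at step 100 in this example, after which the rate falls linearly, reaching half of the peak value at step 550 and zero at the final step.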

The training was carried out on Google Colab,

with V100 and T4 GPUs provided on the platform.

Speciﬁc hyperparameters and settings used in our ex-

periments are shown in Table 2. Run seeds are re-

ported on GitHub.

Table 2: Settings and hyperparameters used for training each of the datasets of the downstream tasks.

Setting/Task | HoC | PubMedQA | BioASQ7b | MedNLI
repeat runs | 5 | 10 | 10 | 3
epochs | 20 | 30 | 25 | 20
patience | 3 | 4 | 5 | 3
batch size | 16 | 4 | 4 | 8
learning rate | 1e-5 | 0.5e-5 | 0.5e-5 | 0.5e-5
max. seq. len. | 128 | 512 | 512 | 256

5 RESULTS

This section describes the detailed experiment results.

We provide both a numerical analysis and a qualita-

tive analysis of the results.

5.1 Numerical Analysis

Table 3 shows the final results of the experiments. Each section first shows the performance of the base biomedical model on its own, namely SciBERT (Beltagy et al., 2019), BioBERT (Lee et al., 2019), PubMedBERT (Gu et al., 2020), and BioLinkBERT (Yasunaga et al., 2022). Afterwards, indented rows show the performance of knowledge-enhanced versions of the models. For SciBERT and BioBERT, we report on competing approaches that use structured knowledge integration: MoP (Meng et al., 2021), DAKI (Lu et al., 2021), and KEBLM (Lai et al., 2023). For PubMedBERT and BioLinkBERT, we report on the knowledge-enhanced versions as described in our paper, augmented with structured knowledge from knowledge graphs UMLS20, Onto20Fused, and Onto20Type. It should be noted that the BioLinkBERT results differ from the ones in the original publication because we report on averaged experiment results over multiple runs, unlike the best single run in the original paper.

The results demonstrate that our knowledge enhancement approach improved PubMedBERT in six instances and the BioLinkBERT model in eight instances, either with the UMLS data or the OntoChem data. Notably, the margin of improvement differs between the datasets. For HoC, the improvement is either negligible or 1% in the best case. This suggests that classifying document abstracts according to cancer properties depends mostly on the document context itself and does not noticeably benefit from external knowledge. The case is similar for MedNLI, where performance either deteriorates or improves by less than 1%, suggesting that entailment recognition is mostly tied to the reasoning capabilities of a language model rather than to deeper medical knowledge.

On the other hand, the two question-answering datasets show noticeable improvements. This makes sense considering the knowledge-intensive nature of QA, where factual knowledge is at its core. Especially for PubMedQA, both base PLMs gain about 7 percentage points in accuracy with different KGs. An impressive result is the BioLinkBERT-base + Onto20Type model achieving state-of-the-art performance on the BioASQ7b dataset (when looking at the averaged performance over 10 runs). Comparing the two styles of OntoChem relations on BioLinkBERT, the fused version was superior for PubMedQA (by about 3 percentage points), while the more detailed, typed version performed better for BioASQ (by about 1.5 percentage points). We attribute this to the slight difference in the domain of these two datasets: BioASQ contains more questions relating to chemical knowledge, where specific types could come into play, while PubMedQA covers diverse medical diagnoses and treatments.
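As a rough plausibility check of the significance markers in Table 3, a t statistic can be computed from the reported summary statistics alone. The sketch below uses Welch's unequal-variance form for PubMedBERT versus PubMedBERT + UMLS20 on PubMedQA. It assumes 10 runs per configuration, that the ± values are sample standard deviations, and an unpaired test; the paper only states "T-test", so the variant actually used may differ.

```python
import math


def welch_t_from_summary(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t statistic and degrees of freedom from means, stds, and run counts."""
    var_a = std_a**2 / n_a  # squared standard error of mean A
    var_b = std_b**2 / n_b  # squared standard error of mean B
    t = (mean_b - mean_a) / math.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (var_a + var_b) ** 2 / (var_a**2 / (n_a - 1) + var_b**2 / (n_b - 1))
    return t, df


# PubMedQA accuracy from Table 3; n = 10 runs is an assumption.
t_stat, dof = welch_t_from_summary(55.84, 1.78, 10, 62.84, 2.71, 10)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

With roughly 15.5 degrees of freedom, the two-sided critical value at p = 0.05 is about 2.13, so a statistic near 6.8 is consistent with the † marker for this cell.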

An interesting result that we have to investigate further is the relatively worse performance of our approach with OntoChem KGs on PubMedBERT compared to BioLinkBERT, even when factoring in the stronger base performance of BioLinkBERT. When the base models do not match, it is hard to distinguish whether performance gains or losses come from the difference in base models or the difference in the adapter-based approaches. Here, the base BioLinkBERT model generally performs better than PubMedBERT or SciBERT over a variety of tasks. Therefore, whenever we use BioLinkBERT, we cannot say how much of the performance gain comes from the superiority of our approach versus the superiority of the base model.

ICAART 2024 - 16th International Conference on Agents and Artificial Intelligence

Table 3: Final results of the model experiments. The metric for HoC is Micro F1, while for the other three datasets it is accuracy. The best results for every task are in bold. "↑" denotes that improvements are observed when compared to the base model. "†" denotes a statistically significantly better result than the base model (T-test, p < 0.05). The results in italic are taken from previous works, while the remaining results come from our experiments.

| Model | HoC | PubMedQA | BioASQ7b | MedNLI |
|---|---|---|---|---|
| SciBERT-base | 80.52 ±0.60 | 57.38 ±4.22 | 75.93 ±4.20 | 81.19 ±0.54 |
| + MoP | 81.79† ±0.66 ↑ | 54.66 ±3.10 | 78.50† ±4.06 ↑ | 81.20 ±0.37 |
| + KEBLM | / | 59.0 | / | 82.14 |
| BioBERT-base | 81.41 ±0.59 | 60.24 ±2.32 | 77.50 ±2.92 | 82.42 ±0.59 |
| + MoP | 82.53† ±1.08 ↑ | 61.04 ±4.81 ↑ | 80.79† ±4.40 ↑ | 82.93 ±0.55 ↑ |
| + KEBLM | / | 68.00 ↑ | / | 84.24 ↑ |
| + DAKI | / | / | / | 83.41 ↑ |
| PubMedBERT-base | 82.25 ±0.46 | 55.84 ±1.78 | 87.71 ±4.25 | 84.18 ±0.19 |
| + UMLS20 | 83.26† ±0.32 ↑ | 62.84† ±2.71 ↑ | 90.64† ±2.43 ↑ | 84.70 ±0.19 ↑ |
| + Onto20Type | 82.17 ±0.62 | 55.40 ±5.57 | 86.36 ±3.07 | 83.94 ±0.63 |
| + Onto20Fused | 82.39 ±0.65 ↑ | 56.12 ±2.91 ↑ | 84.36 ±4.73 | 83.97 ±0.59 |
| BioLinkBERT-base | 82.21 ±0.87 | 56.76 ±3.00 | 91.29 ±3.18 | 84.1 ±0.03 |
| + UMLS20 | 82.36 ±0.57 ↑ | 63.62† ±5.31 ↑ | 91.50 ±2.25 ↑ | 83.78 ±0.09 |
| + Onto20Type | 82.37 ±0.42 ↑ | 60.46 ±5.81 ↑ | 92.14 ±2.30 ↑ | 82.84 ±0.34 |
| + Onto20Fused | 82.24 ±1.25 ↑ | 63.28† ±4.46 ↑ | 90.57 ±3.14 | 83.69 ±0.55 |
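Table 3 reports Micro F1 for HoC. As a reminder of what this metric aggregates, the sketch below computes micro-averaged F1 by pooling true positives, false positives, and false negatives over all documents and labels. It assumes HoC is treated as a multi-label task with one label set per abstract; the label sets shown are hypothetical and only borrow names of cancer hallmarks for readability.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 for multi-label data given as lists of label sets."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)  # labels predicted and correct
        fp += len(p - g)  # labels predicted but not in the gold set
        fn += len(g - p)  # gold labels that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold and predicted label sets for three abstracts.
gold = [
    {"sustaining proliferative signaling"},
    {"resisting cell death", "genomic instability and mutation"},
    set(),
]
pred = [
    {"sustaining proliferative signaling"},
    {"resisting cell death"},
    {"inducing angiogenesis"},
]

score = micro_f1(gold, pred)
print(f"micro F1 = {score:.4f}")
```

Because counts are pooled before computing precision and recall, frequent labels influence the score more than rare ones, unlike macro averaging.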

5.2 Qualitative Analysis

To investigate the performance of our knowledge-enhanced models on a deeper level, we looked at the classification performance on an instance level and singled out some interesting examples. Table 4 shows two instances from the BioASQ dataset where our knowledge-enhanced model predicted the answer correctly, unlike the base model. Instances in BioASQ consist of a question and context, and the goal is to answer the question with a yes/no verdict.

The first row contains a question on the relationship between Diazepam and traumatic brain injury. While the vanilla BioLinkBERT answered the question incorrectly, our knowledge-enhanced BioLinkBERT + Onto20Type model gave the correct answer. Diazepam (first marketed as Valium) is listed as an entity in the OntoChem KG, where it has a direct relation to brain injuries: the full triple is "diazepam [substance] treats [disease] brain injury" (see also Figure 1). It is likely that, thanks to the injection of this knowledge, the enhanced model was able to deduce the answer, while the base model was not.

The second row shows a question about axitinib and its relation to pancreatic cancer. Here, the base version of BioLinkBERT incorrectly predicted that axitinib does prolong the survival of pancreatic cancer patients, while our BioLinkBERT + Onto20Type model gave the correct negative answer. This time, there is no relation between axitinib and any form of cancer listed in the KG. Therefore, our enhanced model might have been able to rely on its injected knowledge and deduce that there are no such connections between the entities in question.
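The typed triple format quoted above can be made concrete with a small data structure. The sketch below is purely illustrative: the `Triple` class and `relations_between` helper are hypothetical names, the toy graph contains only the single triple quoted in the text, and the enhanced model does not perform an explicit lookup like this at inference time, since the knowledge is injected into the adapter weights during training.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """A typed KG triple: head [head_type] relation [tail_type] tail."""
    head: str
    head_type: str
    relation: str
    tail_type: str
    tail: str


# Toy graph holding only the triple quoted in the text.
KG = {
    Triple("diazepam", "substance", "treats", "disease", "brain injury"),
}


def relations_between(kg, head, tail_keyword):
    """Relations linking `head` to any tail entity containing `tail_keyword`."""
    return sorted(
        t.relation for t in kg if t.head == head and tail_keyword in t.tail
    )


print(relations_between(KG, "diazepam", "brain injury"))
print(relations_between(KG, "axitinib", "cancer"))
```

The first query finds the "treats" relation, while the second returns an empty list, mirroring the two cases discussed for Table 4.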

Table 4: Examples of two instances from the BioASQ dataset (with a question, context, and verdict) where the knowledge-enhanced model performed correctly, unlike its vanilla counterpart.

| Question | Context | Predictions |
|---|---|---|
| Can Diazepam be beneficial in the treatment of traumatic brain injury? | The present experiment examined the effects of diazepam, a positive modulator at the GABA(A) receptor, on survival and cognitive performance in traumatically brain-injured animals. | BioLinkBERT: no; BLBERT+Onto20Type: yes; Gold Label: yes |
| Does axitinib prolong the survival of pancreatic cancer patients? | Axitinib/gemcitabine, while tolerated, did not provide survival benefit over gemcitabine alone in patients with advanced pancreatic cancer from Japan or other regions [...]. | BioLinkBERT: yes; BLBERT+Onto20Type: no; Gold Label: no |

6 CONCLUSION

This paper investigated the performance of biomedical pre-trained language models when enhanced with structured domain-specific biomedical knowledge. For this purpose, we utilized two biomedical PLMs (PubMedBERT and BioLinkBERT) and external knowledge from two large KGs, UMLS and OntoChem. The KGs were partitioned into smaller subgraphs and later fused into a common knowledge representation. The knowledge was injected into the PLMs by using lightweight but powerful adapter modules. We tested the performance on four downstream biomedical NLP datasets and showed that the knowledge-enhanced models improved the results in 14 of the 24 evaluated settings, indicating a benefit of infusing external structured knowledge into unstructured PLMs. By updating only the adapter weights, which amount to about 1–2% of the PLM weights, the performance (in the best setting) increased by about 1 percentage point on HoC, 0.5 on MedNLI, 3 on BioASQ, and 7 on PubMedQA. This demonstrates the power of using adapter modules to fine-tune PLMs for domain-specific purposes. Moreover, we have demonstrated that OntoChem is a viable alternative to UMLS and other knowledge sources in the field of biomedical knowledge enhancement.
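The small share of adapter weights can be checked with back-of-the-envelope arithmetic. The sketch below counts the parameters of a standard bottleneck adapter (a down-projection and an up-projection, each with a bias). All constants are assumptions for illustration: a BERT-base-sized encoder (hidden size 768, 12 layers, roughly 110M parameters) and a hypothetical reduction factor of 16; the configuration actually used in the experiments may differ.

```python
def bottleneck_adapter_params(hidden_size: int, bottleneck_size: int) -> int:
    """Parameters of one bottleneck adapter: down- and up-projection with biases."""
    down = hidden_size * bottleneck_size + bottleneck_size
    up = bottleneck_size * hidden_size + hidden_size
    return down + up


HIDDEN_SIZE = 768          # BERT-base hidden size
NUM_LAYERS = 12            # BERT-base transformer layers
PLM_PARAMS = 110_000_000   # approximate BERT-base parameter count
REDUCTION_FACTOR = 16      # hypothetical; bottleneck = hidden size / factor

bottleneck = HIDDEN_SIZE // REDUCTION_FACTOR
per_adapter = bottleneck_adapter_params(HIDDEN_SIZE, bottleneck)


def adapter_share(adapters_per_layer: int) -> float:
    """Adapter parameters as a percentage of the PLM parameters."""
    total = adapters_per_layer * NUM_LAYERS * per_adapter
    return 100 * total / PLM_PARAMS


for k in (1, 2):
    print(f"{k} adapter(s) per layer: {adapter_share(k):.2f}% of PLM weights")
```

Under these assumptions the share is roughly 0.8% with one adapter per layer and 1.6% with two, which is in line with the 1–2% figure.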

Future Research. In future work, we would like to further investigate the potential of the OntoChem ontology. Besides entities and relations, every data triple comes with the source sentence from which it was extracted. Drawing inspiration from works like K-Adapter (Wang et al., 2020), this linguistic knowledge could be extracted and used in additional adapters to enhance the models. Moreover, merging the data from OntoChem with subgraphs from MSI (Ruiz et al., 2021), UMLS (Bodenreider, 2004), or PubChem (Kim et al., 2020) presents a promising direction. Finally, future work could be more human-centric and have medical professionals curate the KGs. This way, the resulting KELMs would be tailored directly by those who use them.

Limitations. Our research did not come without certain challenges and limitations. A portion of the data from OntoChem was not usable due to incomplete ID mappings. As a result, only a fraction of the available knowledge was integrated into the experimental segment of this work, which has likely led to less thoroughly connected KGs. Additionally, medical professionals often raise concerns regarding ethical questions and the development and use of LLMs in biomedicine. While our methodology and models will likely not be used in practice without further research and improvements, we did not specifically address the medical community's concerns in our work. We tried to improve the overall model performance and factual accuracy to reduce hallucinations, but there is no way to entirely eliminate the risk of wrong predictions and other critical issues. At the time of writing, we are conducting a survey involving clinicians to address their concerns in our future work.

REFERENCES

Abacha, A. B., M’rabet, Y., Zhang, Y., Shivade, C., Langlotz, C., and Demner-Fushman, D. (2021). Overview of the MEDIQA 2021 shared task on summarization in the medical domain. In Proceedings of the 20th Workshop on Biomedical Language Processing, pages 74–85.

Almazrouei, E., Alobeidli, H., Alshamsi, A., Cappelli, A., Cojocaru, R., Debbah, M., Goffinet, E., Heslow, D., Launay, J., Malartic, Q., Noune, B., Pannier, B., and Penedo, G. (2023). Falcon-40B: an open large language model with state-of-the-art performance.

Andreev, K. and Räcke, H. (2004). Balanced graph partitioning. In Proceedings of the Sixteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’04, pages 120–124, New York, NY, USA. Association for Computing Machinery.

Baker, S., Ali, I., Silins, I., Pyysalo, S., Guo, Y., Högberg, J., Stenius, U., and Korhonen, A. (2017). Cancer Hallmarks Analytics Tool (CHAT): a text mining approach to organize and evaluate scientific literature on cancer. Bioinformatics, 33(24):3973–3981.

Baker, S., Silins, I., Guo, Y., Ali, I., Högberg, J., Stenius, U., and Korhonen, A. (2015). Automatic semantic classification of scientific literature according to the hallmarks of cancer. Bioinformatics, 32(3):432–440.

Balazevic, I., Allen, C., and Hospedales, T. (2019). TuckER: Tensor factorization for knowledge graph completion. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5185–5194, Hong Kong, China. Association for Computational Linguistics.

Bapna, A. and Firat, O. (2019). Simple, scalable adaptation for neural machine translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1538–1548, Hong Kong, China. Association for Computational Linguistics.

Beltagy, I., Lo, K., and Cohan, A. (2019). SciBERT: A pretrained language model for scientific text. In Conference on Empirical Methods in Natural Language Processing.

Bodenreider, O. (2004). The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32(suppl 1):D267–D270.

Bosselut, A., Rashkin, H., Sap, M., Malaviya, C., Celikyilmaz, A., and Choi, Y. (2019). COMET: Commonsense transformers for automatic knowledge graph construction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4762–4779, Florence, Italy. Association for Computational Linguistics.

Chronopoulou, A., Peters, M., and Dodge, J. (2022). Efficient hierarchical domain adaptation for pretrained language models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1336–1351, Seattle, United States. Association for Computational Linguistics.

Colon-Hernandez, P., Havasi, C., Alonso, J. B., Huggins, M., and Breazeal, C. (2021). Combining pre-trained language models and structured knowledge. ArXiv, abs/2101.12294.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Di Sciullo, A. M. (2018). Knowledge of language and knowledge science.

Gu, Y., Tinn, R., Cheng, H., Lucas, M. R., Usuyama, N., Liu, X., Naumann, T., Gao, J., and Poon, H. (2020). Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH), 3:1–23.

He, B., Zhou, D., Xiao, J., Jiang, X., Liu, Q., Yuan, N. J., and Xu, T. (2020). BERT-MK: Integrating graph contextualized knowledge into pre-trained language models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2281–2290, Online. Association for Computational Linguistics.

Hoang, S. N., Nguyen, B., Nguyen, N. P., Luu, S. T., Phan, H. T., and Nguyen, H. D. (2022). Enhanced task-based knowledge for lexicon-based approach in Vietnamese hate speech detection. In 2022 14th International Conference on Knowledge and Systems Engineering (KSE), pages 1–5.

Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., and Gelly, S. (2019). Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR.

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. (2022). LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations.

Hu, L., Liu, Z., Zhao, Z., Hou, L., Nie, L., and Li, J. (2023). A survey of knowledge enhanced pre-trained language models. IEEE Transactions on Knowledge and Data Engineering.

Irmer, M., Bobach, C., Böhme, T., Püschel, A., and Weber, L. (2013). Using a chemical ontology for detecting and classifying chemical terms mentioned in texts. Proceedings of Bio-Ontologies 2013.

Ji, S., Pan, S., Cambria, E., Marttinen, P., and Yu, P. S. (2021). A survey on knowledge graphs: Representation, acquisition, and applications. IEEE Transactions on Neural Networks and Learning Systems, 33(2):494–514.

Jin, Q., Dhingra, B., Liu, Z., Cohen, W., and Lu, X. (2019). PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2567–2577, Hong Kong, China. Association for Computational Linguistics.

Johnson, A. E., Pollard, T. J., Shen, L., Lehman, L.-w. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Anthony Celi, L., and Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. Scientific Data, 3(1):1–9.

Karypis, G. and Kumar, V. (1997). METIS: A software package for partitioning unstructured graphs, partitioning meshes, and computing fill-reducing orderings of sparse matrices.

Kim, S., Chen, J., Cheng, T., Gindulyte, A., He, J., He, S., Li, Q., Shoemaker, B., Thiessen, P., Yu, B., Zaslavsky, L., Zhang, J., and Bolton, E. (2020). PubChem in 2021: New data content and improved web interfaces. Nucleic Acids Research, 49.

Kim, S., Chen, J., Cheng, T., Gindulyte, A., He, J., He, S., Li, Q., Shoemaker, B. A., Thiessen, P. A., Yu, B., et al. (2019). PubChem 2019 update: improved access to chemical data. Nucleic Acids Research, 47(D1):D1102–D1109.

Lai, T. M., Zhai, C., and Ji, H. (2023). KEBLM: Knowledge-enhanced biomedical language models. Journal of Biomedical Informatics, 143:104392.

Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., and Kang, J. (2019). BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.

Lu, Q., Dou, D., and Nguyen, T. H. (2021). Parameter-efficient domain knowledge integration from multiple sources for biomedical pre-trained language models. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3855–3865. Association for Computational Linguistics.


Luo, M., Mitra, A., Gokhale, T., and Baral, C. (2022). Improving biomedical information retrieval with neural retrievers. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 11038–11046.

Meng, Z., Liu, F., Clark, T., Shareghi, E., and Collier, N. (2021). Mixture-of-partitions: Infusing large biomedical knowledge graphs into BERT. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4672–4681, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Nentidis, A., Bougiatiotis, K., Krithara, A., and Paliouras, G. (2020). Results of the seventh edition of the BioASQ challenge. In Cellier, P. and Driessens, K., editors, Machine Learning and Knowledge Discovery in Databases, pages 553–568, Cham. Springer International Publishing.

Peters, M. E., Neumann, M., Logan IV, R. L., Schwartz, R., Joshi, V., Singh, S., and Smith, N. A. (2019). Knowledge enhanced contextual word representations. In Conference on Empirical Methods in Natural Language Processing.

Pfeiffer, J., Kamath, A., Rücklé, A., Cho, K., and Gurevych, I. (2020a). AdapterFusion: Non-destructive task composition for transfer learning. ArXiv, abs/2005.00247.

Pfeiffer, J., Rücklé, A., Poth, C., Kamath, A., Vulić, I., Ruder, S., Cho, K., and Gurevych, I. (2020b). AdapterHub: A framework for adapting transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 46–54, Online. Association for Computational Linguistics.

Ravichander, A., Hovy, E., Suleman, K., Trischler, A., and Cheung, J. C. K. (2020). On the systematicity of probing contextualized word representations: The case of hypernymy in BERT. In Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics, pages 88–102, Barcelona, Spain (Online). Association for Computational Linguistics.

Rogers, A., Kovaleva, O., and Rumshisky, A. (2021). A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866.

Romanov, A. and Shivade, C. (2018). Lessons from natural language inference in the clinical domain. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1586–1596, Brussels, Belgium. Association for Computational Linguistics.

Ruiz, C., Zitnik, M., and Leskovec, J. (2021). Identification of disease treatment mechanisms through the multiscale interactome. Nature Communications, 12:1–15.

Schneider, P., Schopf, T., Vladika, J., Galkin, M., Simperl, E., and Matthes, F. (2022). A decade of knowledge graphs in natural language processing: A survey. In He, Y., Ji, H., Li, S., Liu, Y., and Chang, C.-H., editors, Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 601–614, Online only. Association for Computational Linguistics.

Speer, R., Chin, J., and Havasi, C. (2017). ConceptNet 5.5: An open multilingual graph of general knowledge. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31.

Sung, M., Jeong, M., Choi, Y., Kim, D., Lee, J., and Kang, J. (2022). BERN2: an advanced neural biomedical named entity recognition and normalization tool. Bioinformatics, 38(20):4837–4839.

Tsatsaronis, G., Balikas, G., Malakasiotis, P., Partalas, I., Zschunke, M., Alvers, M. R., Weissenborn, D., Krithara, A., Petridis, S., Polychronopoulos, D., et al. (2015). An overview of the BioASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics, 16(1):1–28.

Vladika, J. and Matthes, F. (2023). Scientific fact-checking: A survey of resources and approaches. In Rogers, A., Boyd-Graber, J., and Okazaki, N., editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 6215–6230, Toronto, Canada. Association for Computational Linguistics.

Vladika, J., Schneider, P., and Matthes, F. (2023). HealthFC: A dataset of health claims for evidence-based medical fact-checking.

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. (2018). GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.

Wang, R., Tang, D., Duan, N., Wei, Z., Huang, X., Ji, J., Cao, G., Jiang, D., and Zhou, M. (2020). K-Adapter: Infusing knowledge into pre-trained models with adapters. In Findings.

Wang, X., Tsvetkov, Y., Ruder, S., and Neubig, G. (2021). Efficient test time adapter ensembling for low-resource language varieties. In Conference on Empirical Methods in Natural Language Processing.

Wang, Z., Zhang, J., Feng, J., and Chen, Z. (2014). Knowledge graph embedding by translating on hyperplanes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 28.

Wei, X., Wang, S., Zhang, D., Bhatia, P., and Arnold, A. O. (2021). Knowledge enhanced pretrained language models: A compreshensive survey. ArXiv, abs/2110.08455.

White, J. (2020). PubMed 2.0. Medical Reference Services Quarterly, 39(4):382–387.

Yasunaga, M., Leskovec, J., and Liang, P. (2022). LinkBERT: Pretraining language models with document links. In Annual Meeting of the Association for Computational Linguistics.

Yuan, Z., Liu, Y., Tan, C., Huang, S., and Huang, F. (2021). Improving biomedical pretrained language models with knowledge. In Proceedings of the 20th Workshop on Biomedical Language Processing, pages 180–190, Online. Association for Computational Linguistics.

Zhang, Z. (2018). Improved Adam optimizer for deep neural networks. In 2018 IEEE/ACM 26th International Symposium on Quality of Service (IWQoS), pages 1–2. IEEE.

Zheng, D., Ma, C., Wang, M., Zhou, J., Su, Q., Song, X., Gan, Q., Zhang, Z., and Karypis, G. (2020). DistDGL: Distributed graph neural network training for billion-scale graphs. In 2020 IEEE/ACM 10th Workshop on Irregular Applications: Architectures and Algorithms (IA3), pages 36–44, Los Alamitos, CA, USA. IEEE Computer Society.
