Harena Semantics: A Framework to Support Semantic Annotation in Citizen Science Systems

Fagner Leal Pantoja¹, Marco Antonio de Carvalho Filho² and André Santanchè¹

¹ Institute of Computing, University of Campinas, Campinas, Brazil

² Faculty of Veterinary Medical Education, Utrecht University, Utrecht, The Netherlands

Keywords:

Semantic Web, Natural Language Processing, Machine Learning, Citizen Science, Clinical Reasoning.

Abstract:

We propose a new approach to support human agents in annotating semantic concepts in free-text sentences in the biomedical domain. Using our markdown-derived language called Versum, authors can easily annotate relevant terms while producing content for Citizen Science systems. In addition, an embedded Automatic Annotation Mechanism suggests semantic concepts to the author. It implements a Named Entity Recognition task using a hybrid approach: (1) a Transformer-based Deep Neural Network and (2) an Ontology-based method. We conducted a case study running over content produced in the Harena e-learning system, which is intended to teach Clinical Reasoning to students using Clinical Cases. Results of this pilot evaluation suggest the potential of Harena Semantics to engage volunteers in the production of semantic, agent-centered resources on crowdsourcing systems.

1 INTRODUCTION

In clinical learning environments, professors use

Clinical Cases as pedagogical resources to teach stu-

dents to solve problems and, consequently, to develop

their clinical reasoning capacities. Usually, Clini-

cal Cases have ﬁctional narratives inspired by real

situations interconnected in a network of unantici-

pated events commonly occurring in a clinical envi-

ronment. This complex information comprises a valu-

able health knowledge source.

Despite the potential that Clinical Cases have to

become an unprecedented Knowledge Base, there are

open challenges concerning: (1) how to handle and

process free-text information contained in the case

narrative; and (2) how to integrate and interrelate

complex information fragmented across a plethora of

documents on the Web.

To address these challenges, we propose Harena Semantics to construct and curate Clinical Cases, delivering two main contributions:

• Versum: a markdown-based script language that enables authors to annotate semantic concepts inside natural language texts. Via Versum, the semantic annotation may be done manually by a human agent (e.g., professors, learners), automatically by some computer-assisted method, or by a mixture of these methods.

• An Annotation Mechanism which automatically recognizes relevant clinical concepts within a given sentence following a hybrid approach composed of two independent algorithms: (1) a Transformer-based Named Entity Recognition (NER) task implemented as a Deep Neural Network (Vaswani et al., 2017) and (2) an Ontology-based NER to link terms from free-text sentences to ontology-related concepts, formally defined as a knowledge graph, which comprises a network of interconnected semantic resources.

Author ORCIDs, in author order: https://orcid.org/0000-0003-1784-5512, https://orcid.org/0000-0001-7008-4092, https://orcid.org/0000-0002-1756-4852

We conducted a case study of our framework running over the Harena system (https://jacinto.harena.org/) (de Menezes Mota et al., 2019), an e-learning environment based on case resolution, which is used as a supporting pedagogical tool in Emergency Medicine courses. Harena represents Clinical Cases in a Virtual Patient format (Cook and Triola, 2009).

The Harena environment comprises a Knowledge Base of clinical cases and two complementary Web-based modes: (1) Interface Author, to enable one to create and curate artificial Clinical Cases inspired by real-life clinical encounters; and (2) Interface Player, to execute the simulations of the Clinical Case narratives.

Pantoja, F., Filho, M. and Santanchè, A.

Harena Semantics: A Framework to Support Semantic Annotation in Citizen Science Systems.

DOI: 10.5220/0010785300003123

In Proceedings of the 15th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2022) - Volume 5: HEALTHINF, pages 336-343

ISBN: 978-989-758-552-4; ISSN: 2184-4305

Through our approach, the Harena system adheres

to the Semantic Web, enabling authors to produce

semantic annotations over free-text content of their

Clinical Cases. The authors of Clinical Cases (i.e.,

the Harena system users) are also Citizen Scientists

who embed medical knowledge in Clinical Case nar-

ratives.

Citizen Science projects promote the collection

and analysis of scientiﬁc data by members of the gen-

eral public and professional scientists. By adhering

to the Semantic Web, Citizen Science systems enable

content producers to add a higher level of abstraction

to the crowdsourced information. Harena Semantics

enables volunteer users to produce agent-centered re-

sources (e.g., Clinical Cases and beyond) and there-

fore cooperate on the gradual building of an intercon-

nected network of Knowledge Bases. Agent-centered

resource engagement is one of the three types of en-

gagement in citizen science projects (Jackson et al.,

2020) which have the potential to increase engage-

ment in the early stages of training in a volunteer

learning scenario. Preliminary results of this study

reinforce the claim about the need for mechanisms to

engage users in the production of agent-centered re-

sources.

The remainder of this paper is organized as follows: Section 2 gives some background foundations

and related work; Section 3 describes the Harena Se-

mantics framework and some results of the NER task

evaluation; Section 4 presents a case study of our ap-

proach running over the Harena system; Section 5

presents our concluding remarks.

2 FOUNDATIONS AND RELATED

WORK

The scope of this review section is twofold. First, we give a background about Clinical Cases, Virtual Patients, and the Semantic Web applied to the Clinical Reasoning research field. Second, we briefly present some concepts of Natural Language Processing (NLP), Named Entity Recognition (NER), and the recently introduced Word Embeddings.

2.1 Virtual Patient

In the health context, there is a wide spectrum of

strategies to simulate patients for students’ train-

ing (Cook and Triola, 2009). The adopted strategy

depends on the available resources, the goal expected

from the training, the level of structure in the data and

the desired expressiveness of the clinical narrative of

simulation.

Virtual Patients (VP) are designed to present sce-

narios and narratives of a Clinical Case, guided by

computers. They represent the Clinical Case in a

graph of states affording structured guidance (Cook

and Triola, 2009).

By taking advantage of Semantic Web abstraction,

OLabX (extended OpenLabyrinth) uses mEducator

schema to discover, retrieve, share, and reuse medi-

cal educational resources (Daﬂi et al., 2015). Hege et

al. present a tool to foster the acquisition of clinical

reasoning skills through Virtual Patients and Concept

Maps (Hege et al., 2017).

Our approach differs from related work, as it starts from a markdown-derived language, apt for human writing, reading, and annotation, combining it with automatically generated superimposed annotations.

2.2 Named Entity Recognition

Named Entity Recognition (NER) is a Natural Language Processing (NLP) task to identify and classify entity types, such as People, Organization and Location. In the biomedical domain, research works focus on Gene, Protein, Disease, Chemical, Anatomy, etc.

There are many approaches to implement NER tasks. Recent works using statistical approaches have advanced the NER state of the art by using Deep Neural Networks to learn Word Embeddings (Collobert and Weston, 2008; Devlin et al., 2018). These neural language models encode syntactic and semantic information in vectors known as embeddings, in such a way that words with similar meanings have similar representations. The embeddings feed the algorithm that decides whether it should tag a term within the given sentence as a named entity.
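The idea that words with similar meanings receive similar vectors can be illustrated with cosine similarity. The sketch below is only an illustration: the three-dimensional vectors are invented for this example and are not taken from any trained model (real embeddings have hundreds of dimensions).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, invented for illustration only.
toy_embeddings = {
    "aorta": [0.9, 0.1, 0.0],
    "artery": [0.8, 0.2, 0.1],
    "aspirin": [0.1, 0.9, 0.2],
}

# Two anatomy-like terms should be closer to each other than to a chemical.
sim_related = cosine_similarity(toy_embeddings["aorta"], toy_embeddings["artery"])
sim_unrelated = cosine_similarity(toy_embeddings["aorta"], toy_embeddings["aspirin"])
print(sim_related > sim_unrelated)
```

In this toy setting the two anatomy-like terms score higher than the anatomy/chemical pair, which is the property a downstream tagger relies on.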

Google released BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2018), an implementation of the Transformer architecture introduced in the paper “Attention is all you need” (Vaswani et al., 2017). BERT is pre-trained on the Masked Language Model (MLM) objective, an unsupervised NLP task whose goal is to predict the hidden word in a given input sentence. MLM is an expensive task since it requires millions of sample sentences. The pre-training phase produces the Word Embeddings as a by-product of the task objective. One can reuse BERT embeddings by Fine-Tuning them (a Transfer Learning technique) in a downstream task (e.g., NER, Question-Answering, Natural Language Inference). Generally, Fine-Tuning is an inexpensive method since it requires only a small labeled dataset.

BERT produces context-aware Word Embeddings through the Attention Mechanism, which detects the most representative parts of the whole sentence. The Attention Mechanism is a procedure to capture the sentence context based on the statistical relationship between the current word and every other word in the input sentence, providing a bidirectional context, i.e., interpreting the sentence considering both the preceding context (left-to-right direction) and the following context (right-to-left direction). This bidirectional context advanced the state of the art of NER methods, since previous works were capable of dealing with the context in just one direction (Devlin et al., 2018). The NLP community sees the rise of transformer-based Word Embeddings as a revolution in this research field.
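To make the bidirectional context concrete, the following is a minimal sketch of single-head scaled dot-product self-attention (Vaswani et al., 2017). It assumes identity query, key, and value projections and toy two-dimensional token vectors; an actual BERT layer uses learned projections and multiple heads, so this is an illustration of the mechanism rather than of BERT itself.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections.

    Returns one context vector per token and the attention weights,
    where weights[i][j] is how much token i attends to token j.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        context = [
            sum(w * value[i] for w, value in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(context)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy token vectors (assumed values, for illustration only).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contexts, weights = self_attention(tokens)
```

Each row of `weights` is a probability distribution over all tokens in the sentence, so the middle token draws on both its left and its right neighbor, which is the bidirectional behavior described above.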

There are works (Alsentzer et al., 2019; Lee et al., 2020; Akhtyamova et al., 2020) specializing BERT embeddings to the biomedical and clinical domains. Lee et al. (2020) introduce BioBERT, a BERT-based model specialized in biomedical language. BioBERT is pre-trained on large-scale biomedical corpora composed of PubMed abstracts and PMC full-text articles. BioBERT outperformed the state-of-the-art models in several experiments over NER tasks. The BioBERT authors trained several models to recognize different named entities (e.g., one model to handle diseases, another one to deal with proteins, and so on). We based our NER implementation on the BioBERT model, extending it to recognize multiple entities (anatomy, chemicals, and disease) in a single model.

Recent works attempted to join Word Embeddings and Semantic Web ideas. We intend to contribute to this endeavor by investigating how to superimpose NER annotations produced by these distinct, although related, research areas. According to our literature search, there is no standardized, established definition of ontology-based NER methods. Some works define them as Concept Normalization (Miftahutdinov et al., 2021; Doğan et al., 2014), Entity Linking (Basaldella et al., 2020), Entity Typing (Choi et al., 2018), and so on.

Kim et al. (2019) present BERN, a neural biomedical multi-type NER tool based on

the BioBERT model. BERN is equipped with

probability-based decision rules to treat overlapping

entities (polysemy – for instance, one can tag andro-

gen as gene or chemical) and synonyms (i.e., terms

described by multiple names). BERN normalizes the

recognized entities assigning an ID (linking to con-

trolled vocabularies) to each recognized entity.

Other initiatives also address NER by extending BERT, refining the embeddings, or fine-tuning with specialized datasets (Basaldella et al., 2020; Lyu and Zhong, 2021; Miftahutdinov et al., 2021).

3 HARENA SEMANTICS

This section presents Harena Semantics, a framework consisting of two complementary components: (1) a markdown-based language called Versum to enable Citizen Scientists to create and curate Clinical Cases that adhere to the Semantic Web; and (2) a hybrid approach to perform a Named Entity Recognition (NER) task to annotate Clinical Cases with semantic concept labels.

3.1 Versum

Versum enables one to add semantic structure to the free-text content of Clinical Cases, aiming to allow easy integration of semantic annotations into clinical narratives. By making the semantics of Clinical Cases explicit, Versum creates pedagogical resources that adhere to the Semantic Web while providing a step toward a machine-interpretable representation of natural human language.

In a previous research paper, Menezes et al. devel-

oped the ﬁrst version of Versum following the Narra-

tive Design approach, which provides elements to en-

able scenario building and ﬂow control of narratives.

In this paper, we release the Annotation Mecha-

nism as an improvement feature of Versum. The pro-

cess of semantic annotation using the Versum syntax

is straightforward through a predeﬁned set of reserved

markups to add high-level structured information to

the Clinical Case.

Using Versum markups, one can annotate a text fragment as a semantic concept by enclosing it between curly braces { } followed by the concept label in parentheses, e.g., {heart attack}(disease). Moreover, one can link a free-text entity to Knowledge Bases (e.g., ontologies, controlled vocabularies, taxonomies, thesauri) through the namespaces markup. As an example of Versum usage, Figure 1 shows:

1. A Clinical Case narrative in free-text format. This

Clinical Case was authored in the Harena system.

2. The Semantic Clinical Case produced from the

original free-text Clinical Case. By annotating

with Versum, the narrative becomes more struc-

tured and semantically enriched. These annota-

tions could be manually made by a human agent

HEALTHINF 2022 - 15th International Conference on Health Informatics


or by an automatic process (such as the Automatic Annotation Mechanism described in Section 3.2).

3. A visual representation of the Semantic Clinical

Case highlighting the semantic concepts anno-

tated.

Figure 1: (1) A Clinical Case narrative in free-text format;

(2) The Semantic Clinical Case produced from 1; (3) A vi-

sual representation of the Semantic Clinical Case.
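The inline markup described in this section, a fragment in curly braces followed by a label in parentheses, can be recognized with a short regular expression. The sketch below is an assumption-laden illustration, not the Versum parser: it covers only the `{text}(label)` form, and the function names `extract_annotations` and `strip_annotations` are hypothetical.

```python
import re

# Matches {fragment}(label); nested braces or parentheses are not handled.
ANNOTATION = re.compile(r"\{([^{}]+)\}\(([^()]+)\)")


def extract_annotations(text):
    """Return (fragment, label) pairs in order of appearance."""
    return [
        (match.group(1).strip(), match.group(2).strip())
        for match in ANNOTATION.finditer(text)
    ]


def strip_annotations(text):
    """Recover the plain free-text narrative by dropping the markup."""
    return ANNOTATION.sub(lambda match: match.group(1), text)


sample = "The patient had a {heart attack}(disease) last year."
print(extract_annotations(sample))
print(strip_annotations(sample))
```

Keeping extraction and stripping as separate functions mirrors the two views in Figure 1: the annotated script and the original free-text narrative.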

3.2 Automatic Annotation Mechanism

Harena Semantics provides a mechanism to automat-

ically annotate the concepts within the Clinical Case

through a hybrid approach:

1. Transformer-based Named Entity Recognition

(NER) task to assign labels to clinical terms

within a given sentence. This method is based

on the Transformer architecture, a Deep Neural

Network capable of capturing linguistic features

based on statistical inferences.

2. Ontology-based Named Entity Recognition task

to link from free-text terms to concepts formally

deﬁned on biomedical ontologies.

As output, the Automatic Annotation Mechanism produces the class labels for the terms of the sentence given as input, as depicted in Figure 2. The labels may be (1) ontology concepts, (2) named entities, or (3) a combination of them.

Figure 2: The hybrid approach to NER.

3.2.1 Transformer-based Named Entity

Recognition

We developed a Transformer-based NER that attends

to the contextual information – i.e., syntactic and se-

mantic aspects related to the context of a sentence –

to decide if it should label or not a given term as a

named entity.

This method classifies the sentence terms according to the clinical domain-specific labels Anatomy, Chemical, and Disease. To train the model, we generated a small labeled corpus called ACD (Anatomy, Chemical, Disease) by concatenating two pre-existing labeled datasets: BC5CDR (Li et al., 2016) and AnatEM (Pyysalo and Ananiadou, 2014). The Disease class in BC5CDR also includes disease mentions (Doğan and Lu, 2012), which comprise Signs and Symptoms. According to Crichton et al. (2017), BC5CDR and AnatEM do not exhibit a significant overlap between the training sentences of one dataset and the test sentences of the other (such an overlap would expose the training algorithm to sentences of the validation set), which indicates the feasibility of concatenating them. Table 1 provides some metrics about the ACD corpus.

Table 1: ACD corpus statistics.

Corpus   # Sentences           Entities   # Annotations
         Training   Test                  Training   Test
ACD      21,223     10,867     Anatomy     9,085     4,616
                               Chemical   10,550     5,378
                               Disease     8,428     4,424

We reused the BioBERT embeddings, which are pre-trained on an unsupervised Masked Language Modeling task specialized in the biomedical domain (see Section 2.2 for more details). Then, we trained our supervised NER algorithm by adjusting the BioBERT embeddings through a Transfer Learning technique called Fine-Tuning. Therefore, this NER algorithm is a semi-supervised method, since it reuses the embeddings produced by an unsupervised source task (i.e., Masked Language Modeling) in a supervised target task (i.e., NER).

This proposed Deep Neural Network, called Envoy, is a stack of 12 BioBERT layers plus an extra Fine-Tuning layer responsible for specializing the network (by adjusting the parameters) to recognize the label of each term inside the given input sentence.

The Fine-Tuning process involves adding an extra linear layer on top of the pre-trained neural model and adjusting its weights for each sentence in the training dataset. Each neuron in the first layer processes a token of the given input sentence and forwards it to the next neuron layer. This process is repeated for each sample sentence in the training dataset.

We released our NER model (specialized to recognize anatomy, chemicals, and diseases) on the Hugging Face Model Hub: https://huggingface.co/fagner/envoy. The open-source code to extend the BioBERT language model by fine-tuning it into a multi-class NER model is available in our fork of BioBERT: https://github.com/faguim/biobert-pytorch.

A REST-based implementation of the Semantics framework can be deployed as a Docker container: https://github.com/datasci4health-incubator/harena-semantics.

Model Evaluation. To validate our approach, we

conducted an intrinsic evaluation (Velupillai et al.,

2018) measuring the performance of the model at per-

forming the NER task objective. The experimental

setup is as follows:

• Pre-trained Model: We experimented with both biobert base cased (https://huggingface.co/dmis-lab/biobert-base-cased-v1.1; 768 hidden states, 12 neuron layers, and a total of 100 million parameters) and biobert large cased (https://huggingface.co/dmis-lab/biobert-large-cased-v1.1/tree/main; 24 layers, 1024 hidden states).

• Learning rate: 5e-6 using the AdamW optimizer, chosen as a trade-off between performance and stability, since a high learning rate increases performance while incurring instability in training (Chiu et al., 2016).

• Batch size: 32 sentences/batch (i.e., 4 sentences ∗

128 tokens = 512 tokens/batch).

Among the two versions adopted in the evaluation, the envoy large version achieved an F1 score of 85.8%, while envoy base achieved 85.5%.
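For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it over exact-match entity spans; whether the reported figures were computed at the entity or the token level is not stated here, so the entity-level choice and the toy gold/predicted spans are assumptions made for illustration.

```python
def f1_score(gold, predicted):
    """F1 over exact-match (start, end, label) entity spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy spans (token offsets and labels invented for illustration).
gold_spans = {(0, 2, "Disease"), (5, 6, "Anatomy"), (8, 9, "Chemical")}
predicted_spans = {(0, 2, "Disease"), (5, 6, "Anatomy"), (10, 11, "Chemical")}

score = f1_score(gold_spans, predicted_spans)
```

Here two of three predictions are correct and two of three gold entities are found, so precision, recall, and F1 all equal 2/3.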

Although envoy large presents better results (lower error rate and higher F1 score), we released the biobert base version as the official version of Harena Semantics due to its smaller model size (biobert base is 432 MB, while biobert large is 1.5 GB), which facilitates the deployment of Harena Semantics on personal computers. Another relevant feature of smaller models is their stability in the training process (Mosbach et al., 2020).

3.2.2 Ontology-based Named Entity Recognition

This rule-based NER method uses ontologies as source information to label the sentence terms with the concept labels formally defined in biomedical ontologies. The algorithm looks for matches (exact or partial) between the free-text sentence terms and ontology concepts. It provides two modes to match against ontologies:

• External ontologies: This mode uses the biontology annotator to match terms against ontologies stored in the open repository BioPortal (Noy et al., 2009).

• Local ontologies: This mode uses an RDF database to store RDF triples. We developed an API called OntoMatch to query the RDF triples through the RDFLib Python library. OntoMatch enables matches based on a range of metrics such as Levenshtein, Jaccard, and Cosine.
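The exact and partial matching described above can be sketched with two of the named metrics. This is not the OntoMatch API: the functions `levenshtein`, `jaccard`, and `best_match` and the in-memory concept dictionary are hypothetical stand-ins for a query over RDF triples; the concept identifiers are the ones used in the case study of Section 4.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def jaccard(a, b):
    """Jaccard similarity between the lowercase word sets of two strings."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def best_match(term, concepts):
    """Return (label, concept_id, distance) for the closest concept label."""
    label = min(concepts, key=lambda c: levenshtein(term.lower(), c.lower()))
    return label, concepts[label], levenshtein(term.lower(), label.lower())


# Hypothetical local vocabulary standing in for an RDF store.
concepts = {
    "Chest pain": "loinc:LA28842-5",
    "Aortic Dissection": "mesh:D000784",
}

match = best_match("chest pian", concepts)
```

A deployment would likely add a distance threshold so that terms far from every label are left unannotated; that threshold is a design choice not covered by this sketch.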

4 CASE STUDY ON THE HARENA

SYSTEM

To obtain a pilot evaluation of Harena Semantics, we conducted a case study running it over the Harena system (de Menezes Mota et al., 2019). The Medicine

course from University of Campinas uses Harena as

a supporting pedagogical tool to situate the individ-

uals in an e-learning environment, which simulates

the context of an Emergency Care Unit (de Araujo

Guerra Grangeia et al., 2016).

This section presents a process to construct se-

mantically rich Virtual Patients (VPs). We intend to

reinforce the feasibility of a global knowledge net-

work connecting the information scattered in different

Virtual Patient systems.

This research paper intends to explore ways of increasing the underlying structure of Virtual Patients toward the vision of the Semantic Web. More

structured information can be interpreted by ma-

chines, expanding the possibilities of application: (i)

it becomes easier to ﬁnd, reuse, and group cases and

parts of cases – e.g., it becomes possible to query:

cases in which the patient experienced shortness of

breath; cases where the ECG was fundamental to di-

agnose a heart disease; (ii) data from cases can be

used beyond the scope of training as a Citizen Sci-

ence data source.

Our approach focuses on building a Semantic Vir-

tual Patient from free-text Clinical Case narratives.

The deployment of a Semantic Virtual Patient poten-

tially facilitates intelligent searches, complex queries,

and easy exchange between institutions. As detailed

in previous sections, Harena Semantics identiﬁes clin-

ical concepts and links them to ontology concepts

through the Automatic Annotation Mechanism and

Versum tags. Therefore, it creates an RDF graph representing key knowledge about the virtual patient and integrating it into an interconnected network of concepts envisioned by the Semantic Web research area.

To evaluate the feasibility of creating and curating Semantic Virtual Patients using our framework, we started from Virtual Patients manually annotated by doctors. These annotations are part of the case rationale; they relate relevant symptoms to the problem (disease) narrated in the Clinical Case.

During the authoring process, the author tags relevant symptoms and indicates whether they are directly related to the clinical case (e.g., arterial hypertension and acute onset of chest pain), or key to the diagnosis (e.g., pain radiating to neck and back), or just distractors that may mislead the learner in a wrong direction (e.g., symmetric radial pulses is a specific sign but present in only one third of the patients). The diagnosis, which will be presented at the end of the case presentation as feedback, is also annotated.

Harena can superimpose several layers of annotation on the same text content and combine them through superimposed contexts.

It is possible to assign a context to any segment of text by surrounding it with double curly braces {{ }}. Each context can receive an identifier prefixed by the at sign @. Segments with the same identifier must refer to the same textual content, even though they can carry distinct superimposed annotations. For example, the three following contexts refer to the same text fragment through the identifier @symp01:

{{@symp01 /evidence:finding relevance
mesh:D000784
A man, 52 years old, reports he is
feeling {very strong chest pain}/
evidence:corroborate finding/.
}}

{{@symp01
A man, 52 years old, reports he is
{feeling}(loinc:MTHU021518) {very
strong}(loinc:LA28441-6)
{chest pain}(loinc:LA28842-5).
}}

{{@symp01
A man, 52 years old, reports he is
feeling very strong {chest}(anatomy)
{pain}(disease)
}}

The ﬁrst copy of the segment was annotated by

physicians, as previously described. Besides the con-

text id, it is possible to specify the target of the anno-

tations. In this case, evidence:finding relevance

mesh:D000784 indicates that the following annota-

tions point to the relevance of the symptom to the

Aortic Dissection disease (mesh:D000784). The sec-

ond copy was annotated by the ontology-based anno-

tation mechanism and the third by the Transformer-

based mechanism.

These superimposed annotations are transformed

into an RDF Graph (Schreiber and Raimond, 2014)

as shown in Figure 3. A uniquely identiﬁed RDF re-

source (node in the RDF graph) is associated with

each word. When the annotation refers to a word –

e.g., the chest is annotated with Anatomy (which in

turn refers to mesh:D000715) – the related RDF node

is connected by a skos:related association. SKOS (Simple Knowledge Organization System) is a data model for knowledge organization (Miles and Bechhofer, 2009).

When annotations refer to a text span with more than one word (e.g., the span chest pain annotated by the concept Chest pain in the LOINC Document Ontology, loinc:LA28842-5), a node aggregating the span's words is created and related to the concept. An aggregation will reuse already aggregated nodes whenever possible, as in the case of the node that aggregates the “very strong” and “chest pain” aggregated nodes.

Figure 3: The RDF graph representing a case of Aortic Dissection, built with the support of Harena Semantics.

Figure 4: Key elements of a Semantic Virtual Patient RDF Graph extracted from a Clinical Case.
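The word-level and aggregated-node linking can be sketched with plain subject-predicate-object tuples. This is a simplified illustration, not the framework's implementation: the node naming scheme, the `ex:aggregates` predicate, and the function names are hypothetical, aggregates here are reused only when the same word span is annotated again (aggregates of aggregates are not modeled), and a real deployment would build the graph with an RDF library such as RDFLib.

```python
SKOS_RELATED = "skos:related"
AGGREGATES = "ex:aggregates"  # hypothetical predicate, not named in the paper


def build_word_nodes(context_id, words):
    """One uniquely identified node per word of the context."""
    return [f"{context_id}/w{i}" for i in range(len(words))]


def annotate_span(triples, aggregates, context_id, nodes, start, end, concept):
    """Relate words[start:end] to a concept, aggregating multi-word spans."""
    if end - start == 1:
        subject = nodes[start]
    else:
        key = (start, end)
        if key not in aggregates:
            # Create the aggregating node once and link it to its words.
            aggregates[key] = f"{context_id}/agg{start}-{end}"
            for node in nodes[start:end]:
                triples.add((aggregates[key], AGGREGATES, node))
        subject = aggregates[key]
    triples.add((subject, SKOS_RELATED, concept))
    return subject


words = "feeling very strong chest pain".split()
nodes = build_word_nodes("symp01", words)
triples, aggregates = set(), {}

# Single word "chest" linked to the Anatomy concept from the example.
chest = annotate_span(triples, aggregates, "symp01", nodes, 3, 4, "mesh:D000715")
# Two-word span "chest pain" linked to the LOINC concept.
chest_pain = annotate_span(triples, aggregates, "symp01", nodes, 3, 5, "loinc:LA28842-5")
```

Annotating the same span again with another concept reuses the existing aggregated node instead of creating a new one.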

Following this process, annotated content will converge to a semantic RDF profile of a virtual patient, as shown in Figure 4. The diagram shows a simplified version of the graph, presenting an overview of what we call a Semantic Virtual Patient. The case in the example is related to an aortic dissection, a dangerous injury to the innermost layer of the aorta, which puts the life of the patient at risk.

In the long term, the produced semantic clinical

cases could be used to grasp knowledge from the un-

structured text within the clinical narrative. The nar-

rative scripts in a machine-interpretable format enable

sharing, versioning, and crowdsourcing. Such capa-

bilities are needed for a system with clinical case data.

5 CONCLUSION

This paper presented Harena Semantics, a framework

to enable Citizen Scientists to create semantic annota-

tions directly into the text narrative of Clinical Cases.

By adopting our approach, the data crowdsourced in

Citizen Science systems may incorporate the informa-

tion gathered in the Knowledge Network envisioned

by the Semantic Web research area.

The introduced Automatic Annotation Mechanism benefits from both Rule-based (the ontology-based NER) and Statistical Learning (the Transformer-based NER) approaches. Our NER task presents results comparable to state-of-the-art works in this research area. Our approach to superimposing annotations makes it possible to combine human and automatic annotations to produce a knowledge network representing our Semantic Virtual Patient.

The technology stack presented in this paper could serve several purposes. As future work, we intend to implement a search engine to retrieve Clinical Cases aided by semantic information. In addition, the Semantic Virtual Patient can also be used to train inference systems that automatically generate feedback for the users of Learning Environments. These educational resources must adhere to pedagogical practices; therefore, it is necessary to develop approaches to involve experts, professors, and scientists in the creation of these resources. Harena Semantics is an initiative engaged in such an effort.

ACKNOWLEDGEMENTS

This study was financed in part by CNPq grant number 428459/2018-8. We thank Dr. Tiago Grangeia for

the production and annotation of the Clinical Cases

addressed in this work.

REFERENCES

Akhtyamova, L., Martínez, P., Verspoor, K., and Cardiff, J. (2020). Testing contextualized word embeddings to

improve ner in spanish clinical case narratives. IEEE

Access, 8:164717–164726.

Alsentzer, E., Murphy, J. R., Boag, W., Weng, W.-H.,

Jin, D., Naumann, T., and McDermott, M. (2019).

Publicly available clinical bert embeddings. arXiv

preprint arXiv:1904.03323.

Basaldella, M., Liu, F., Shareghi, E., and Collier, N. (2020).

Cometa: A corpus for medical entity linking in the

social media. arXiv preprint arXiv:2010.03295.

Chiu, B., Crichton, G., Korhonen, A., and Pyysalo, S.

(2016). How to train good word embeddings for

biomedical nlp. In Proceedings of the 15th workshop

on biomedical natural language processing, pages

166–174.

Choi, E., Levy, O., Choi, Y., and Zettlemoyer, L.

(2018). Ultra-ﬁne entity typing. arXiv preprint

arXiv:1807.04905.

Collobert, R. and Weston, J. (2008). A uniﬁed architec-

ture for natural language processing: Deep neural net-

works with multitask learning. In Proceedings of the

25th international conference on Machine learning,

pages 160–167.


Cook, D. A. and Triola, M. M. (2009). Virtual patients:

a critical literature review and proposed next steps.

Medical Education, 43(4):303–311.

Crichton, G., Pyysalo, S., Chiu, B., and Korhonen, A.

(2017). A neural network multi-task learning ap-

proach to biomedical named entity recognition. BMC

bioinformatics, 18(1):1–14.

Daﬂi, E., Antoniou, P., Ioannidis, L., Dombros, N., Topps,

D., and Bamidis, P. D. (2015). Virtual patients on the

semantic web: a proof-of-application study. Journal

of medical Internet research, 17(1):e3933.

de Araujo Guerra Grangeia, T., de Jorge, B., Franci,

D., Martins Santos, T., Vellutini Setubal, M. S.,

Schweller, M., and de Carvalho-Filho, M. A. (2016).

Cognitive load and self-determination theories applied

to e-learning: impact on students’ participation and

academic performance. PloS one, 11(3):e0152462.

de Menezes Mota, M. F., Pantoja, F. L., Mota, M. S.,

Grangeia, T. d. A. G., de Carvalho Filho, M. A., and

Santanchè, A. (2019). Analytical design of clinical

cases for educational games. In Joint International

Conference on Entertainment Computing and Serious

Games, pages 353–365. Springer.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.

(2018). Bert: Pre-training of deep bidirectional trans-

formers for language understanding. arXiv preprint

arXiv:1810.04805.

Doğan, R. I. and Lu, Z. (2012). An improved corpus of

disease mentions in PubMed citations. In Proc. 2012

Workshop on Biomedical Natural Language Process-

ing, pages 91–99.

Doğan, R. I., Leaman, R., and Lu, Z. (2014). NCBI disease

corpus: a resource for disease name recognition and

concept normalization. Journal of biomedical infor-

matics, 47:1–10.

Hege, I., Kononowicz, A. A., and Adler, M. (2017). A clin-

ical reasoning tool for virtual patients: design-based

research study. JMIR medical education, 3(2):e8100.

Jackson, C. B., Østerlund, C., Crowston, K., Harandi, M.,

and Trouille, L. (2020). Shifting forms of engage-

ment: Volunteer learning in online citizen science.

Proceedings of the ACM on Human-Computer Inter-

action, 4(CSCW1):1–19.

Kim, D., Lee, J., So, C. H., Jeon, H., Jeong, M., Choi, Y.,

Yoon, W., Sung, M., and Kang, J. (2019). A neural

named entity recognition and multi-type normaliza-

tion tool for biomedical text mining. IEEE Access,

7:73729–73740.

Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H.,

and Kang, J. (2020). Biobert: a pre-trained biomedi-

cal language representation model for biomedical text

mining. Bioinformatics, 36(4):1234–1240.

Li, J., Sun, Y., Johnson, R. J., Sciaky, D., Wei, C.-H.,

Leaman, R., Davis, A. P., Mattingly, C. J., Wiegers,

T. C., and Lu, Z. (2016). Biocreative v cdr task cor-

pus: a resource for chemical disease relation extrac-

tion. Database, 2016.

Lyu, Y. and Zhong, J. (2021). Dsmer: A deep semantic

matching based framework for named entity recog-

nition. In European Conference on Information Re-

trieval, pages 419–432. Springer.

Miftahutdinov, Z., Kadurin, A., Kudrin, R., and Tutubalina,

E. (2021). Drug and disease interpretation learn-

ing with biomedical entity representation transformer.

arXiv preprint arXiv:2101.09311.

Miles, A. and Bechhofer, S. (2009). SKOS Simple Knowl-

edge Organization System Reference. Technical re-

port.

Mosbach, M., Andriushchenko, M., and Klakow, D. (2020).

On the stability of ﬁne-tuning bert: Misconceptions,

explanations, and strong baselines. arXiv preprint

arXiv:2006.04884.

Noy, N. F., Shah, N. H., Whetzel, P. L., Dai, B., Dorf, M.,

Grifﬁth, N., Jonquet, C., Rubin, D. L., Storey, M.-A.,

Chute, C. G., et al. (2009). Bioportal: ontologies and

integrated data resources at the click of a mouse. Nu-

cleic acids research, 37(suppl 2):W170–W173.

Pyysalo, S. and Ananiadou, S. (2014). Anatomical entity

mention recognition at literature scale. Bioinformat-

ics, 30(6):868–875.

Schreiber, G. and Raimond, Y. (2014). RDF 1.1 Primer.

Technical report.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,

L., Gomez, A. N., Kaiser, L., and Polosukhin, I.

(2017). Attention is all you need. arXiv preprint

arXiv:1706.03762.

Velupillai, S., Suominen, H., Liakata, M., Roberts, A.,

Shah, A. D., Morley, K., Osborn, D., Hayes, J., Stew-

art, R., Downs, J., et al. (2018). Using clinical natu-

ral language processing for health outcomes research:

overview and actionable suggestions for future ad-

vances. Journal of biomedical informatics, 88:11–19.
