Implementing Open-Domain Question-Answering in a College Setting:
An End-to-End Methodology and a Preliminary Exploration
Augusto Gonzalez-Bonorino¹,², Eitel J. M. Lauría¹ and Edward Presutti³
¹School of Computer Science and Mathematics, Marist College, Poughkeepsie, New York, U.S.A.
²Data Science & Analytics Dept., Information Technology, Marist College, Poughkeepsie, New York, U.S.A.
³Enrollment Management, Marist College, Poughkeepsie, New York, U.S.A.
Keywords:
Deep Learning, Natural Language Processing, Transformers, AI in Higher-education, Open Domain
Question-Answering, BERT, roBERTa, ELECTRA, Minilm, Information Retrieval.
Abstract:
Advances in Artificial Intelligence and Natural Language Processing (NLP) can be leveraged by higher-ed
administrators to augment information-driven support services. But due to the incredibly rapid innovation rate
in the field, it is challenging to develop and implement state-of-the-art systems in such institutions. This work
describes an end-to-end methodology that educational institutions can utilize as a roadmap to implement open
domain question-answering (ODQA) to develop their own intelligent assistants on their online platforms. We
show that applying a retriever-reader framework composed of an information retrieval component that encodes
sparse document vectors, and a reader component based on BERT -Bidirectional Encoder Representations
from Transformers- fine-tuned with domain-specific data, provides a robust, easy-to-implement architecture
for ODQA. Experiments are carried out using variations of BERT fine-tuned with a corpus of questions and
answers derived from our institution’s website.
1 INTRODUCTION
Open-domain question-answering (ODQA) systems
have been a topic of research for many years. Such
architectures have been divided into two categories:
open-book and closed-book. The former encom-
passes models that are able to refer back to the cor-
pus of text they were trained on, as well as exter-
nal knowledge, to output answers. Moreover, the
most common framework for such systems is known
as a retriever-reader architecture and it was first pro-
posed by researchers at Stanford and Facebook AI
in 2017 (Chen et al., 2017). The latter focuses on
fine-tuning question-answering (QA) models for pre-
cise and rapid answer retrieval from a predetermined
dataset containing all possible answers. In this case,
the model focuses on finding the answer as efficiently
as possible in the curated dataset of question-answer
pairs instead of attempting to extract, or generate, it
from external knowledge.
We believe that educational institutions could
greatly benefit from adopting Natural Language Pro-
cessing (NLP) technologies such as ODQA sys-
tems and intelligent chatbots. Potential benefits in-
clude higher efficiency when processing administra-
tive tasks, such as providing information about the in-
stitution or guidance on using specific services, tasks
traditionally conducted by staff. Furthermore, such a
technology can aid in improving the interaction be-
tween the students and the different areas of the in-
stitution, whether educational or administrative, and
in making information easily available to both current
and prospective students. Case in point, the ongoing
COVID crisis has propelled the use of online envi-
ronments. The challenge in these environments is the
frequent changes in content and content sources. If
those virtual environments are enhanced through the
use of artificial intelligence, they have the potential of
streamlining processes and making them simpler or
even seamless. Also, as it relates to the prospective stu-
dent, enrollment management must embrace the cur-
rent and emerging technologies of social media and
online content to maintain its focus on the next gener-
ation of students. Online platforms assisted by digital
assistants can certainly enhance the recruitment pro-
cess. Still, and due in part to the rapid evolution of
the field of NLP, we notice that such benefits have
not been a strong enough incentive to motivate in-
stitutions to allocate the necessary resources to im-
plement state-of-the-art architectures. For this rea-
son and with the purpose of showing the feasibility
of implementing such systems and encouraging others
to follow a similar path, we have researched ODQA
architectures. In this paper we present an end-to-end
methodology depicting how to implement such an ar-
chitecture in an educational institution. The purpose
of this work is twofold: a) create a proof of concept of
a production-level ODQA system prototype based on
a sparse retriever-reader architecture that could be im-
plemented at our institution; b) develop a methodol-
ogy and guideline, and empirically test different com-
ponents of a retriever-reader architecture. We have
set this framework as our benchmark, using the Best
Match 25 (BM25) sparse text retrieval function
(Robertson and Zaragoza, 2009) as the retriever, and
a reader based on BERT -Bidirectional Encoder Rep-
resentations from Transformers (Devlin et al., 2019),
fine-tuned with domain-specific data. We believe that
institutions can make use of this guideline to imple-
ment these architectures and take advantage of these
promising technologies.
The paper is organized in the following manner:
First, we provide a review of the existing liter-
ature emphasizing the rapid evolution of the field of
question-answering since the 1960s and the various
techniques and model architectures that have been de-
veloped throughout the past decades. We then move
on to discuss the core methodology we have designed
and implemented at our institution. We follow with
the experimental setup, providing an in-depth dis-
cussion of the QA dataset, methods, computational
platform and software used throughout the experi-
ments, as well as the performance metrics used to
assess the performance of the different variations of
the retriever-reader architecture tested throughout our
research. The experiments include the fine-tuning of
the BERT-based reader using a domain-specific QA
dataset extracted from the information found in our
institution’s website. We present and discuss our find-
ings. Finally, we close the paper with comments on
the limitations of the research, our conclusions, and
future research avenues.
2 LITERATURE REVIEW
Question-answering is a challenging sub-discipline of
NLP concerned with understanding how computers
can be taught to answer questions asked in natural
language. Most QA systems are focused on answer-
ing questions that can be answered with short-length
texts, also known as factoids (Jurafsky and Martin,
2021). Before the deep learning boom of the 21st cen-
tury, researchers adopted various approaches to tackle
the challenging task of QA. Two paradigms, informa-
tion retrieval QA (also known as open-domain QA or
ODQA) and knowledge-based QA, have been driv-
ing the field since the early 1960s. Information re-
trieval (IR) focuses on gathering massive amounts of
unstructured data from different sources (i.e. social
media, forums, websites) and encoding those doc-
uments in vector representations such that, when a
question is received, the retriever component is able
to find relevant documents, which are then input to a
reading comprehension algorithm (typically referred
as the reader) that returns the span of text most likely
to contain the answer. Some QA systems focus solely
on the reader component: the reader is given a question
and a text fragment that may or may not contain the
answer, and it returns a predicted answer. It is the
combination of both components in the so-called
retriever-reader architecture that has be-
come the dominant paradigm for ODQA. The reason
is simple: whereas in the all-by-itself reader architec-
ture the reader has to deal with the whole document
corpus to find an answer to the question, in the case of
a retriever-reader architecture the reader uses only the
top ranked documents selected by the retriever com-
ponent. This improves the system efficiency by re-
ducing run-time and increasing accuracy. Knowledge
based, or closed-domain QA, on the other hand, con-
sists of collecting structured data and formulating a
query over a structured database. One of the first ex-
amples of this approach is BASEBALL (Green et al.,
1961), a question-answering system that was able to
answer questions about baseball games and statistics
from a structured database. Until the 1980s, knowledge-
based QA systems grew rapidly by developing larger
and more detailed structured databases that targeted
narrower domains of knowledge. Nevertheless, since
these databases had to be hand-written by experts,
they were costly and time-consuming. Thus, the fo-
cus of research slowly drifted towards ODQA which
leverages statistical processing of large unstructured
databases.
Traditional approaches for ODQA consisted of
the use of sequence-to-sequence models which sep-
arate sequences of text (i.e. the question and the cor-
pus of text) via a special token often denoted [SEP].
Research in machine comprehension and question-
answering started with one-directional Long Short-Term
Memory (LSTM) models (Wang and Jiang, 2016),
then moved on to bi-directional architectures
(Seo et al., 2018). Currently, it is embracing the trans-
former architecture introduced in 2017 by Google re-
searchers (Vaswani et al., 2017) which provides a sig-
nificant advantage over complex recurrent and con-
volutional neural networks by focusing solely on at-
tention mechanisms, thereby reducing training time
and increasing efficiency. Until then, the inherent se-
quential nature of those recurrent neural nets limited
parallelization within training examples. The trans-
former architecture replaces the recurrent neural net-
work layer with a self-attention layer used to compute
a richer representation of a word in an input sequence
given all other words in the sequence at once, lead-
ing to simple matrix computations that can be easily
parallelized, therefore enhancing performance.
One of the biggest limitations QA-systems faced
in the early days was not being able to encode context.
This issue was successfully addressed by Google’s bi-
directional transformer model BERT which was ”de-
signed to pre-train deep bidirectional representations
from unlabeled text by jointly conditioning on both
left and right context in all layers” (Devlin et al.,
2019). Furthermore, the publication of the SQuAD
2.0 dataset (Rajpurkar et al., 2018), developed specif-
ically for fine-tuning question-answering models, en-
abled researchers to train on both answerable and
unanswerable questions. This novel QA dataset, the
open-source release of BERT and the development of
transfer learning have skyrocketed the interest and re-
sources allocated to the discipline. Note that trans-
fer learning and domain adaptation ”refer to the situa-
tion where what has been learned in one setting is ex-
ploited to improve generalization in another setting”
(Goodfellow et al., 2016). Consequently, question-
answering -ODQA in particular- has been evolving
rapidly. For this reason, we have focused our work
on ODQA systems, specifically on retriever + BERT-
based reader architectures fine-tuned with domain-
specific data.
3 METHODOLOGY
The methodology we describe in this paper is com-
prised of several steps:
Step 1 - Gather Domain-specific Data: The first
step of our project, as in any AI project, is to collect
and pre-process data; in this case, the corpus of doc-
uments on which questions are formulated and their
corresponding answers are produced and delivered by
the QA system. For this, we developed a web scraper
which pulls the text information contained in every
official website from our institution. The scraper ex-
tracts all the data and adds it to a json file, from which
we then draw out the scraped text for pre-processing.
This pre-processing task includes eliminating special
characters, stop words, lemmatizing and tokenizing
activities. Next, we load all the clean data into a re-
lational database to optimize retrieval performance
(on a side note, we initially experimented with Mon-
goDB to exploit its NoSQL document model, but we
did not find any significant advantage; in fact, at scale,
it under-performed in terms of efficacy and retrieval
time).
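As an illustration of this step, the sketch below scrapes a couple of pages and applies the cleaning operations described above. It is a minimal sketch assuming the requests, BeautifulSoup (bs4) and NLTK libraries; the URLs, JSON layout and cleaning rules are placeholders rather than the exact ones used by our scraper.

# Minimal scraping and pre-processing sketch (hypothetical URLs and file names).
# NLTK resources may need to be downloaded first:
#   nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")
import json
import re
import requests
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

PAGES = ["https://www.example.edu/admission",
         "https://www.example.edu/financial-aid"]

def scrape(url):
    """Pull the visible text of a single page."""
    html = requests.get(url, timeout=30).text
    return BeautifulSoup(html, "html.parser").get_text(separator=" ")

def preprocess(text):
    """Strip special characters, remove stop words, lemmatize, tokenize."""
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text).lower()
    lemmatizer = WordNetLemmatizer()
    stops = set(stopwords.words("english"))
    tokens = [lemmatizer.lemmatize(t) for t in word_tokenize(text)
              if t not in stops]
    return " ".join(tokens)

# Dump the raw scrape to JSON, then keep the cleaned text for loading into
# the database (a relational table of documents in our case).
raw = [{"url": u, "text": scrape(u)} for u in PAGES]
with open("scraped_site.json", "w") as f:
    json.dump(raw, f)
clean_docs = [{"url": d["url"], "text": preprocess(d["text"])} for d in raw]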
Step 2 - Choose ODQA Architecture: This step im-
plied some exploration of the existing ODQA
technologies, as well as some trial-and-error work to
find and implement the architecture and compo-
nents that best fit our goals. In the context of a
retriever-reader architecture this implies identifying
the following items:
a) the information retrieval (IR) system, search en-
gine, or retriever component. The system in
which the document corpus is stored, encoded and
indexed. Encoding in this context means vector
encoding; at query time (i.e. when a question
is delivered by an end-user) questions are trans-
formed online into vector representations, and the
resulting vector-encoded questions are compared
by the IR system with the vector-encoded corpus
to retrieve a subset of the top ranked documents.
b) the reading comprehension algorithm, or
reader component, typically a deep learning,
transformer-based encoder model such as BERT,
that takes the top ranked document(s) fed by
the IR system and scans each document to find
segments of text - spans - that can probably
answer the question.
c) the software platforms available to implement
these components.
Let’s consider each of these items in more detail:
Retriever Component: Information retrieval systems
use the vector space model to identify and rank likely
documents given a query, in which documents and
queries are vector-encoded and matched by similarity
using a distance metric in the respective vector space
(Salton, 1971). We score (i.e., measure the similarity of)
a document d and a query q by computing the cosine
of their respective vector encodings d and q. If the
cosine is 1, there is a perfect match.
\mathrm{score}(q, d) = \frac{\mathbf{q} \cdot \mathbf{d}}{|\mathbf{q}|\,|\mathbf{d}|}    (1)
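For illustration, Eq. (1) reduces to a one-line computation; the toy NumPy vectors below are made up and simply stand in for a vector-encoded query and document.

# Cosine similarity between a toy query vector and a toy document vector (Eq. 1)
import numpy as np

q = np.array([0.0, 1.0, 2.0, 0.0])   # vector-encoded query
d = np.array([0.0, 2.0, 4.0, 1.0])   # vector-encoded document

score = np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d))
print(round(score, 3))   # approx. 0.976; a score of 1 would be a perfect match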
IR systems can be classified as dense and sparse de-
pending on the nature of their vector encoding. We
considered both types of retriever systems:
Sparse Retrievers: Traditional IR systems encode
documents using word frequencies or similar derived
metrics such as TF-IDF (Term Frequency Inverse
CSEDU 2022 - 14th International Conference on Computer Supported Education
68
Document Frequency) and BM25. These types of IR
systems are known as sparse retrievers since words
are considered independently of their position and
vector encoding is sparse (there is no concept of space
embedding based on context). Sparse retrievers are
fast both in terms of document encoding (sparse vec-
tors are easy and fast to compute) and query encoding
(transforming the query into a vector on the fly is a
trivial task).
The TF-IDF retriever is a popular algorithm used
in information retrieval tasks such as finding similar
books or documents given a query. It derives its name
from the TF-IDF metric, an enhanced replacement of
the Bag of Words (BOW) approach to encode text.
TF-IDF rewards larger frequency of terms while pe-
nalizing terms which occur frequently in documents
(e.g. stop words). The metric computes the frequency
of a term in a document times the logarithm of the
ratio between the number of documents in the corpus
and the number of documents where the term occurs.
In other words we calculate TF-IDF as follows:
\mathrm{TF\text{-}IDF}(d, t) = \frac{n_t}{n} \cdot \log \frac{N}{N_t}    (2)
where n_t is the frequency (count) of term t in document d, n is the total count
of all terms in document d, N is the total count of documents, and N_t is the
count of documents containing term t.
To compute the distance between document d and
query q using TF-IDF we apply Eq.(1) and decom-
pose the inner product as a sum of products over the
TF-IDF scores. With some simplification (see (Juraf-
sky and Martin, 2021) for details), we obtain:
\mathrm{score}(q, d) = \sum_{t \in q} \frac{\mathrm{TF\text{-}IDF}(d, t)}{\sqrt{\sum_{w \in d} \mathrm{TF\text{-}IDF}(d, w)^2}}    (3)
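The following plain-Python sketch implements Eqs. (2) and (3) over a toy three-document corpus; the corpus and query are invented for illustration, and a production retriever would of course rely on a library implementation rather than this code.

# TF-IDF weighting (Eq. 2) and query-document scoring (Eq. 3), toy corpus
import math
from collections import Counter

corpus = [
    "tuition and fees for the spring semester".split(),
    "spring registration opens in november".split(),
    "the library is open late during finals".split(),
]
N = len(corpus)
doc_freq = Counter(term for doc in corpus for term in set(doc))  # N_t per term

def tf_idf(doc, term):
    # Eq. (2): (n_t / n) * log(N / N_t)
    if doc_freq[term] == 0:
        return 0.0
    return (Counter(doc)[term] / len(doc)) * math.log(N / doc_freq[term])

def score(query, doc):
    # Eq. (3): sum of query-term TF-IDF weights, normalized by the length
    # of the document's TF-IDF vector
    norm = math.sqrt(sum(tf_idf(doc, w) ** 2 for w in set(doc)))
    if norm == 0:
        return 0.0
    return sum(tf_idf(doc, t) for t in query) / norm

query = "when does spring registration open".split()
ranked = sorted(range(N), key=lambda i: score(query, corpus[i]), reverse=True)
print(ranked)  # document indices, most relevant first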
Alternatively, the Best Match 25 ranking metric,
also known as Okapi BM25, or just BM25 (Robertson
and Zaragoza, 2009), improves retrieval efficiency
compared to TF-IDF by considering document length
normalization and term frequency saturation (i.e. lim-
iting through saturation the impact that frequent terms
in a document can have in the score).
BM25 tunes the term frequency (TF) factor in
the TF-IDF formula by adding two parameters: b,
which regulates the relevance of normalizing docu-
ment length; and k, which provides a balance between
term frequency and inverse document frequency. The
simplified scoring formula for BM25 as implemented
in the popular Lucene search engine (Apache Soft-
ware Foundation, 2021) is defined as follows:
\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{TunedTF} \cdot \mathrm{IDF}
\mathrm{TunedTF} = \frac{n_t \cdot (1 + k)}{n_t + k \cdot \left( (1 - b) + b \cdot \frac{n}{\mathrm{avg}(n)} \right)}
\mathrm{IDF} = \log \frac{N}{N_t}    (4)
A large value of parameter b means that the length
of the document matters more (with b = 0, the length
of the document has no influence on the score). High
values of parameter k, in turn, pull the score up when
the term occurs more frequently. When k = 0, the tuned
term frequency becomes 1, which means that the score
is solely driven by IDF.
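The sketch below implements the Lucene-style BM25 score of Eq. (4) in plain Python. The parameter values k = 1.2 and b = 0.75 are the common Lucene defaults rather than values tuned in this work, and the toy corpus is again purely illustrative.

# Lucene-style BM25 scoring (Eq. 4) over a toy corpus
import math
from collections import Counter

corpus = [
    "tuition and fees for the spring semester".split(),
    "spring registration opens in november".split(),
    "the library is open late during finals".split(),
]
N = len(corpus)
avg_len = sum(len(d) for d in corpus) / N
doc_freq = Counter(t for d in corpus for t in set(d))  # N_t per term
k, b = 1.2, 0.75                                       # common Lucene defaults

def bm25(query, doc):
    n = len(doc)                       # document length
    counts = Counter(doc)
    score = 0.0
    for t in query:
        if doc_freq[t] == 0:
            continue                   # term not in the corpus
        n_t = counts[t]                # frequency of term t in the document
        tuned_tf = (n_t * (1 + k)) / (n_t + k * ((1 - b) + b * n / avg_len))
        idf = math.log(N / doc_freq[t])
        score += tuned_tf * idf
    return score

query = "spring registration".split()
print(sorted(range(N), key=lambda i: bm25(query, corpus[i]), reverse=True))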
In order to efficiently identify the most relevant
documents to the input query, the IR system needs a
data structure that can relate terms to documents. A
typical data structure is an inverted index which lists
every unique term that appears in any document, and
for each term, all of the documents the term occurs
in. This is what indexing means in the context of an
IR system. The search is a two-step process: the IR
system uses the inverted index to find the subset of
documents that contain the terms in the query; only
then does it proceed to compute the similarity score
on that subset and rank the documents to complete
the search.
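A minimal sketch of this two-step search follows; the toy corpus is invented and score_fn is a hypothetical placeholder for a TF-IDF or BM25 scorer such as the ones sketched above.

# Inverted index: term -> set of document ids, used to restrict scoring to
# the documents that actually contain query terms
from collections import defaultdict

docs = {
    0: "tuition and fees for the spring semester",
    1: "spring registration opens in november",
    2: "the library is open late during finals",
}

# Indexing: build the inverted index once, at document-ingestion time
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        inverted_index[term].add(doc_id)

def search(query, score_fn):
    # Step 1: use the index to find candidate documents containing query terms
    candidates = set()
    for term in query.split():
        candidates |= inverted_index.get(term, set())
    # Step 2: score and rank only the candidates
    return sorted(candidates, key=lambda i: score_fn(query, docs[i]),
                  reverse=True)

# e.g. search("spring registration", score_fn=some_bm25_scorer)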
Dense Retrievers: Sparse retrievers are powerful and
fast IR engines, but they have one critical flaw: they
lack context. Because the location of terms in the text
is irrelevant, they only work well in those cases in which
there is word-to-word correspondence between the query
and the document. Dense retrievers address this is-
sue by replacing sparse vectors with dense vector em-
beddings. Modern implementations of dense retriev-
ers like Dense Passage Retrieval, also known as DPR
(Karpukhin et al., 2020) use transformer models such
as BERT -more on BERT in the following paragraphs-
to encode the corpus of documents into dense vec-
tors. This can take a long time to process, especially if
the corpus of documents is large, as it requires train-
ing just like any transformer (deep learning) model.
Then, at retrieval time, the query is BERT-encoded
on the fly into a dense vector and used to score the
BERT-encoded documents in the corpus using an ap-
proach similar to that in Eq. (1). Following the notation
in (Jurafsky and Martin, 2021):
\mathbf{q} = \mathrm{BERT}_Q(q), \qquad \mathbf{d} = \mathrm{BERT}_D(d), \qquad \mathrm{score}(q, d) = \mathbf{q} \cdot \mathbf{d}    (5)
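As an illustration of Eq. (5), the sketch below scores a toy query against two toy documents using Facebook's pre-trained DPR encoders from the HuggingFace transformers library; the checkpoints and example texts are illustrative and were not part of our experiments.

# Dense retrieval scoring per Eq. (5): separate question and passage encoders
import torch
from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                          DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

ctx_encoder = DPRContextEncoder.from_pretrained(
    "facebook/dpr-ctx_encoder-single-nq-base")
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained(
    "facebook/dpr-ctx_encoder-single-nq-base")
q_encoder = DPRQuestionEncoder.from_pretrained(
    "facebook/dpr-question_encoder-single-nq-base")
q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained(
    "facebook/dpr-question_encoder-single-nq-base")

documents = ["Admissions applications are due January 15.",
             "The library is open until midnight during finals week."]

with torch.no_grad():
    # d = BERT_D(d): encode every document into a dense vector (pooled output)
    d_vecs = ctx_encoder(**ctx_tokenizer(documents, padding=True,
                                         return_tensors="pt")).pooler_output
    # q = BERT_Q(q): encode the query on the fly
    q_vec = q_encoder(**q_tokenizer("When are applications due?",
                                    return_tensors="pt")).pooler_output

scores = (q_vec @ d_vecs.T).squeeze(0)   # score(q, d) = q . d
print(scores)                            # higher score = more relevant document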
We considered the use of dense retrievers as part of
this work, but the time required to encode the corpus
of documents and the lack of evidence of significant
improvement when compared to the use of sparse
retrievers made us favor the latter approach, at least
in this stage of our research. We therefore settled on
a sparse BM25 retriever.
Reader Component: Pre-training and transfer
learning have been, arguably, the most important
breakthroughs of the decade in deep learning. As
stated in (Erhan et al., 2010) pre-training ”guides the
learning towards basins of attraction of minima that
support better generalization from the training data
set”. In other words, transformer models are trained
on very large amounts of data for a general purpose
task, allowing the researcher to then fine-tune such
models for specific tasks, like question-answering
in this case. The idea of transfer learning branches
out from the aforementioned technique. A model
pre-trained on natural language, for example, can be
reused (i.e. fine-tuned) for a specific task without
losing its previous general knowledge, effectively ob-
taining a powerful model capable of solving complex
tasks. BERT-like encoders have therefore become
the de facto standard for reading-comprehension in
ODQA systems. For a list of the BERT-like readers
considered in this paper check the section on software
platforms.
Software Platforms: It is difficult to create a taxon-
omy of software platforms that assist in implement-
ing ODQA systems, given the variety of available
open-source IR projects, QA frameworks and search
engines, and the various ways in which these plat-
forms handle document indexing and storage of in-
dexed documents - in some cases in a vertically-
integrated, monolithic fashion, in others as compo-
nents of a custom-built QA application. As part of
this work we investigated a number of the most rel-
evant projects. The list that follows should not be
considered as an exhaustive survey of software tools.
Instead, it is a reflection of our exploration and the
choices we made during this research project to im-
plement a sparse retriever-reader solution for ODQA
at our institution:
- Lucene (Apache Software Foundation, 2021) is
the widely used, high-performance, Apache text
search engine project written in Java.
- ElasticSearch (Elastic, 2021) is a software pack-
age developed by Elastic, based on the Lucene
library, that provides a distributed, JSON-based,
text-retrieval engine exposing an HTTP API.
ElasticSearch is offered under both the Server
Side Public License, and the company’s own li-
cense. Elasticsearch can store both sparse (TF-
IDF/ BM-25) and dense representations.
- FAISS (Johnson et al., 2017) is a library developed
by Facebook research team for similarity search
of dense vector representations of text. FAISS is
written in C++, can make use of GPUs to speed
up the similarity search, and has wrappers for
Python.
- Anserini (Yang et al., 2017) is a Java-based infor-
mation retrieval project based on Lucene.
- Pyserini (Lin et al., 2021) is a Python informa-
tion retrieval library that encodes documents in
both sparse and dense representations. Sparse re-
trieval is implemented through integration with
the Anserini library. Pyserini uses Facebook’s
FAISS library to implement dense retrieval.
- Haystack (Pietsch, Soni, Chan, Möller, and
Kostić, 2021) is a recent open-source framework
developed by Deepset.ai for building search sys-
tems and ODQA systems that provides a simple
interface through which a) documents can be en-
coded in both sparse (TF-IDF and BM25) and
dense (DPR) vector representations and indexed
into a variety of data stores; b) a retriever-reader
architecture can be easily setup, leveraging a va-
riety of BERT models; c) QA models can be
fine-tuned with domain-specific data. Deepset.ai
has also partnered with HuggingFace (Wolf et al.,
2020), a leading NLP startup, to provide state-of-
the-art pre-trained models.
As we previously mentioned, the main goal of
our research is to provide educational institutions with a
guideline to implement state-of-the-art ODQA archi-
tectures on their campus in a simple and fast way. Our
architecture has the structure depicted in Figure 1. We
chose Haystack's API, which provides a convenient
pipeline to fine-tune and deploy the ODQA architec-
ture. We employed an Elasticsearch server to store
our indexed documents to optimize retrieval, a BM25
sparse retriever and a BERT-like reader initially fine-
tuned with SQuAD 2.0. We considered three varia-
tions of BERT:
- roBERTa (Liu et al., 2019), a successor of BERT
and trained on a much bigger (10x) dataset, us-
ing larger mini-batches and learning rates among
other enhancements. All of these differences al-
low roBERTa to generalize better to downstream
tasks, and it has been the state-of-the-art model for
QA tasks in recent years.
Figure 1: (Sparse) Retriever - Reader Architecture.
- ELECTRA (Clark et al., 2020) introduces a new
approach to training that produces a model with
similar, and sometimes better, results than top-
performing transformers using only a fraction of
computing resources. ELECTRA replaces the
masked language modeling (MLM) pre-training
methods used by BERT and other transformers
with a more efficient approach called ”replaced
token detection”. Strictly speaking, ELECTRA is
not really a variation of BERT, but we decided to
incorporate it on our experiments due to its novel
training approach and potential for QA tasks.
- Minilm (Wang et al., 2020) "distills" a smaller
(student) model from a much larger (teacher)
model, such as BERT, using half of the teacher
model's parameters and computations.
from haystack.reader.farm import FARMReader  # import path in Haystack 0.x

# Load an ELECTRA reader already fine-tuned on SQuAD 2.0
reader = FARMReader(model_name_or_path="deepset/electra-base-squad2",
                    use_gpu=True)

train_dir = "path/to/custom_train_dataset"

# Continue training on our SQuAD2-formatted, domain-specific dataset
reader.train(data_dir=train_dir,
             train_filename="OurTrainFileName.json",
             use_gpu=True,
             n_epochs=20)

reader.save(directory="OurFineTunedModelName")

Listing 1: Fine-tuning electra-base-squad2.
Step 3: Fine-tune Reader: Seeking to improve our
model's performance, we found that the next logical step
was to fine-tune the reader on our custom QA dataset. For
this we created a SQuAD-formatted training dataset
using an annotation tool that allowed contributing op-
erators - in this case several student workers- to a)
pick a random document from the College website;
b) formulate questions on the given document, iden-
tifying the answer to each question in the document
(marking the first and last word of each answer).
That is what the BERT-based reader does: it finds within
the corresponding reading passage the beginning and
the end of a segment of text, or span, that is most
likely to contain the answer to a given question. List-
ing 1 depicts a sample fragment of Haystack code
used to load an ELECTRA model initially trained
with SQuAD 2.0 and fine-tune it with our SQuAD2-
formatted domain-specific training dataset.
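For reference, the fragment below shows the general shape of a single SQuAD 2.0 training example as produced by such an annotation tool. The context, question and answer are invented; answer_start is the character offset of the answer span within the context, which is precisely what the reader learns to predict.

# Illustration of the SQuAD 2.0 structure (made-up context, question and answer)
context = "Applications for the fall semester are due on January 15."
answer_text = "January 15"

training_example = {
    "version": "v2.0",
    "data": [{
        "title": "Admissions",
        "paragraphs": [{
            "context": context,
            "qas": [{
                "id": "q-0001",
                "question": "When are fall applications due?",
                # answer_start = character offset of the span in the context
                "answers": [{"text": answer_text,
                             "answer_start": context.index(answer_text)}],
                "is_impossible": False,
            }],
        }],
    }],
}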
4 EXPERIMENTAL SETUP
In the experiments, we studied the performance of the
sparse retriever-reader architecture, considering dif-
ferent versions of the BERT-based reader before and
after fine-tuning it using the College website data.
The primary goal was to explore the feasibility of the
architecture for question-answering at our institution.
A second goal was to analyze and measure if, given
domain-specific questions, fine-tuning the reader on
custom examples would improve its predictive perfor-
mance.
4.1 Dataset and Methods
We developed a dataset with 594 questions and
their respective answers in SQuAD 2.0 format us-
ing Haystack’s annotation tool. We then proceeded
to randomly split the dataset into training and test-
ing partitions using an 80-20 ratio: the training data
set was used to fine-tune the BERT-based reader (in
those runs where we fine-tuned the model), and the
testing dataset was used to measure the performance
of the retriever and reader components. Each experi-
ment was performed by varying the random partition
seed. We used 30 randomly chosen seeds, creating
30 dataset pairs (training and testing) to perform ex-
periment runs, using 3 different readers (roBERTa,
ELECTRA and Minilm). Each of the readers was
used in plain-vanilla mode or was fine-tuned using the
training datasets. This amounted to a total of 30 runs
repeated over 3 x 2 readers, for a total of 180 experi-
ments (for details see section 5 and Table 1).
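A sketch of this splitting protocol is shown below. It assumes the full annotated dataset lives in a single SQuAD 2.0 file (the file name is illustrative) and, for brevity, splits at the document level; the fine-tuning and evaluation themselves are done with the Haystack reader calls shown in Listing 1 and with Haystack's reader evaluation utilities (e.g. FARMReader.eval_on_file).

# Create 30 random 80-20 train/test partitions of the annotated QA dataset
import json
import random

with open("marist_qa_squad2.json") as f:      # the 594 annotated QA pairs
    squad = json.load(f)
articles = squad["data"]

seeds = random.sample(range(10_000), k=30)    # 30 randomly chosen seeds
for seed in seeds:
    random.seed(seed)
    random.shuffle(articles)
    cut = int(0.8 * len(articles))            # 80-20 split
    train = {"version": squad["version"], "data": articles[:cut]}
    test = {"version": squad["version"], "data": articles[cut:]}
    with open(f"train_seed{seed}.json", "w") as f:
        json.dump(train, f)
    with open(f"test_seed{seed}.json", "w") as f:
        json.dump(test, f)
    # For each of the three readers (roBERTa, ELECTRA, Minilm):
    #   1. evaluate the plain-vanilla SQuAD2 reader on test_seed{seed}.json
    #   2. fine-tune it on train_seed{seed}.json as in Listing 1, re-evaluate,
    #      and record recall, MAP, accuracy, EM and F1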
4.2 Performance Metrics
We collected the metrics listed below. We chose n=5
(the number of top relevant documents identified by
the retriever).
Retriever Recall: at n=5, reports the fraction of actual rel-
evant documents (true positives, or TP) that were retrieved
out of all actual relevant documents for the question.
Mathematically:
\mathrm{retriever\ recall@}n = \frac{\mathrm{TP@}n}{\mathrm{TP@}n + \mathrm{FN@}n}    (6)
Retriever MAP: for cases where n>1 (the retriever
returns a ranked set of relevant documents), aver-
age precision, or AveP(q), is a metric that evaluates
whether the true positive documents retrieved by
the system for a given question q are ranked highly or
not. The mean average precision, or MAP, is the mean
of AveP over all questions q ∈ Q (Q is the total number
of questions).
\mathrm{retriever\ MAP} = \frac{1}{Q} \sum_{q=1}^{Q} \mathrm{AveP}(q)    (7)
Reader Top-n-Accuracy: measures the number of
times where the (ground truth) answer is among the
top n predicted answers (i.e. the extracted answer
spans).
Reader EM:
the reader exact match metric, or EM,
is an uncompromising, rigid metric; it measures the
proportion of questions where the extracted answer
span exactly matches that of the correct (ground
truth) answer at a per character level.
Reader F1-score: a measure widely used in binary
classification, is in this case computed over the in-
dividual words in the extracted answer span against
those in the correct (ground truth) answer, measuring
the average overlap between prediction and ground
truth. A good F1 score entails high classification ac-
curacy (few false positives and false negatives). The scores
range from 0 to 1, with 1 being a perfect score.
\mathrm{reader\ F1\text{-}score} = 2 \times \frac{\mathrm{recall} \times \mathrm{precision}}{\mathrm{precision} + \mathrm{recall}}    (8)
Precision and recall in this context are, respectively,
the fraction of shared words with respect to the total
number of words in the extracted answer span, and
the fraction of shared words with respect to the total
number of words in the correct (ground truth) answer.
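The sketch below shows how EM and F1 are typically computed for a single predicted span against its ground-truth answer, using simple whitespace tokenization; standard evaluation scripts additionally normalize case and punctuation, and the example strings are made up.

# Exact match and token-level F1 for one predicted answer span
from collections import Counter

def exact_match(prediction, truth):
    # 1 only if the predicted span matches the ground truth exactly
    return int(prediction.strip() == truth.strip())

def f1_score(prediction, truth):
    pred_tokens, truth_tokens = prediction.split(), truth.split()
    shared = Counter(pred_tokens) & Counter(truth_tokens)   # overlapping words
    num_shared = sum(shared.values())
    if num_shared == 0:
        return 0.0
    precision = num_shared / len(pred_tokens)   # shared / words in prediction
    recall = num_shared / len(truth_tokens)     # shared / words in ground truth
    return 2 * precision * recall / (precision + recall)

print(exact_match("the spring semester", "spring semester"))          # 0
print(round(f1_score("the spring semester", "spring semester"), 2))   # 0.8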
Fine-tuning Execution Time: measured in seconds
per epoch. This is of course a relative measure, totally
dependent on the power of the available computing
platform (see below). Still, it is insightful to
compare the time needed to fine-tune each of the
readers considered in this work.
4.3 Computational Platform
We opted to run most of the experiments over the
cloud, using Google Colab(oratory). The Pro+ ver-
sion yielded a virtual machine (VM) with the follow-
ing characteristics:
- 8-core Xeon 2.0 GHz
- 54.8 GB RAM
- 167 GB HDD
- 1 Tesla V100-SXM2 16 GB
Additionally, the VM had the capability of run-
ning in the background without the need for hu-
man intervention, which made it possible to run
multiple experiments non-stop.
Running our experiments in Colab required us to
initialize and run Elasticsearch locally each time the
notebook ran. A production setting would obviously
require a cloud-based or on-premises GPU-enabled
platform with access to a retriever (e.g. Elasticsearch)
server running in distributed or all-in-one-box fash-
ion.
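As an illustration, the sketch below wires the document store, BM25 retriever and fine-tuned reader into a single Haystack pipeline. It assumes a Haystack 1.x-style API (module paths and the BM25Retriever class differ slightly from the 0.5.0 version used in this work), a local Elasticsearch server on port 9200, the clean_docs list from the Step 1 sketch, and the model saved in Listing 1.

# Sparse retriever-reader pipeline, assuming a Haystack 1.x-style API
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import BM25Retriever, FARMReader
from haystack.pipelines import ExtractiveQAPipeline

# Index the pre-processed documents scraped from the College website
document_store = ElasticsearchDocumentStore(host="localhost", port=9200,
                                            index="college_docs")
document_store.write_documents(
    [{"content": d["text"], "meta": {"url": d["url"]}} for d in clean_docs])

retriever = BM25Retriever(document_store=document_store)          # sparse BM25
reader = FARMReader(model_name_or_path="OurFineTunedModelName",   # Listing 1
                    use_gpu=True)

pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)
result = pipeline.run(query="When are fall applications due?",
                      params={"Retriever": {"top_k": 5}, "Reader": {"top_k": 5}})
print(result["answers"][0].answer)   # best extracted answer span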
5 RESULTS AND DISCUSSION
Table 1 displays the assessment of mean predictive
performance of each retriever-reader configuration.
We use the plain-vanilla reader (no fine-tuning) as a
baseline for comparison purposes.
Each experiment was conducted 30 times, each
time recording the previously mentioned metrics. The
mean predictive performance of each configuration
was computed by averaging the 30 outcomes of each
metric, together with its standard error. In each vari-
ation, fine-tuning the reader on our domain-specific
dataset showed significant improvements in the F1
score and accuracy. The following improvements in
the reader F1 score were recorded: roBERTa gained
29.57 percentage points on average (49.20% from
19.63%); Electra showed an improvement of 35.49
points (48.78% from 13.29%); and Minilm gained
29.73 (46.25% from 16.52%). In terms of reader
accuracy, roBERTa gained 14.02 points (86.69%
Table 1: Results of experiments (mean and standard error over 30 runs; reader accuracy, EM and F1 are percentages; fine-tuning execution time is in seconds per epoch; "not ft" = reader not fine-tuned).

                     RoBERTa                     ELECTRA                     Minilm
                     fine-tuned     not ft       fine-tuned     not ft       fine-tuned     not ft
                     Mean   SE      Mean   SE    Mean   SE      Mean   SE    Mean   SE      Mean   SE
retriever recall      0.90  0.0056   0.89 0.0054  0.90  0.0041   0.90 0.0039  0.89  0.0041   0.89 0.0045
retriever MAP         0.68  0.0076   0.68 0.0050  0.69  0.0058   0.69 0.0049  0.67  0.0055   0.70 0.0050
reader accuracy      86.69  0.5736  72.67 0.6669 85.64  0.4246  68.60 0.5300 84.85  0.5037  70.02 0.4591
reader EM            18.56  0.4307   5.99 0.3700 17.68  0.4759   1.87 0.1786 16.46  0.5250   2.87 0.1950
reader F1            49.20  0.6337  19.63 0.3889 48.78  0.5111  13.29 0.4084 46.25  0.6322  16.52 0.2763
ft exec time         76.59  1.2381       -       55.42  0.6500       -       28.79  0.3891       -
from 72.67%), Electra 16.84 points (85.64% from
68.60%), and Minilm improved 14.83 points on av-
erage (84.85% from 70.02%). Overall fine-tuned
roBERTa slightly outperformed the fine-tuned ver-
sions of both ELECTRA and Minilm.
The retriever component performed very well,
with recall and MAP values of 0.90 and 0.68 respec-
tively. We also computed the time in seconds per
epoch it took each model to fine-tune on our 594 ques-
tions dataset, recording the following average execu-
tion times: a) roBERTa - 76.59 s/epoch; b) ELEC-
TRA - 55.42 s/epoch; c) Minilm - 28.79 s/epoch. As
can be seen, ELECTRA and Minilm took consider-
able less time per epoch than roBERTa, a much larger
model, with Minilm’s execution time being close to
one third of roBERTa’s execution time. This could
have an impact if the training dataset is large and
training time is a relevant factor in the models’ de-
ployment.
Interestingly, we found that, despite roBERTa’s
robustness and significantly greater size, it did not
perform that much better than ELECTRA, which
showed the greatest gains in performance from fine-
tuning (35.5 points in F1 and 16.8 points in accu-
racy). This suggests that ELECTRA's novel training
approach may provide a significant advantage over
bigger models such as roBERTa, especially when being
considered for production, given its similar accuracy
and faster training time.
6 CONCLUSIONS AND FUTURE
WORK
There is a plethora of benefits to be garnered by ed-
ucational institutions from applying these technolo-
gies, such as helping students (both incoming and
current) and families learn more about the institution,
and guiding students through common tasks
such as enrolling in classes or deciding on a major,
to name a few. More importantly, it would help de-
crease the number of administrative tasks currently
conducted by staff, opening the door for allocating
those resources more efficiently. As a result of the
automation of the tasks involved in assisting students
and their families there is the added benefit of being
able to track what questions are being asked, how well
the AI replied to the question and the context of the
question. Leveraging these statistics can help an in-
stitution better focus staff on tasks and issues that are
most important and determine the areas of improve-
ment for the supporting systems and the ODQA appli-
cation. Further, it is conceivable that one could create
an evolutionary process that would automatically im-
prove the performance of the ODQA and by extension
the student support services.
Still, we do understand that under-budget and
time-constrained administrators may find such imple-
mentation challenging. In this paper, with the in-
tent of minimizing the resources required, we have
described a methodology for the implementation of
an ODQA system by abstracting the process into gen-
eral components, and provided a detailed descrip-
tion of the extant literature. The goal of this re-
search is not centered on tuning the models for op-
timal performance: a larger dataset with several thou-
sand question-answer pairs would improve the perfor-
mance metrics, and would open a new research av-
enue: tracing performance metrics vs number of train-
ing instances in the fine-tuning process. This is cer-
tainly one aspect of this project which could be im-
proved.
We centered our efforts on sparse retrievers as
we did not find considerable benefit during the initial
tests that would justify the use of dense representa-
tions within the time constraints of this work. Still,
this is currently a hot topic of research, and there-
fore such configurations should probably be revisited
when considering the implementation of the retriever
component.
Moreover, other novel approaches could also
be considered in future work. For example,
recently Facebook researchers introduced a gen-
erative approach to Question-Answering, called
RAG (Retrieval-Augmented Generation) architecture
(Lewis et al., 2021). This alternative architecture im-
plements a dense retriever with a variation of BERT as
a reader. Their innovation lies in the answering mech-
anism: in addition to retrieving the best documents,
RAG generates answers -instead of retrieving the best
span of text- and complements the process by evalu-
ating relevant words within the retrieved documents
related to the given question. Then, all individual an-
swer predictions are aggregated into a final prediction,
a method which they found to enable the backpropa-
gation of the error signals in the output to the retrieval
mechanism, thereby potentially improving the per-
formance of end-to-end systems such as the one de-
scribed in this research. It would therefore be interest-
ing to gauge the performance of a RAG architecture
compared to the architectures depicted in this paper.
Finally, we acknowledge that the methodology de-
picted in this research work could be applied in dif-
ferent industry sectors; nevertheless, it provides a roadmap
for researchers and practitioners in artificial intelli-
gence for higher education to implement open domain
question-answering systems. In this regard, this con-
ference is the right venue to promote a discussion on
these topics.
ACKNOWLEDGEMENTS
The authors would like to thank Marist College IT de-
partment for their sponsorship and support. We would
also like to thank the Data Science and Analytics
team for their collaboration throughout this research
project. A special mention goes to Maria Kapogian-
nis, Ethan Aug, Michael Volchko and Jack Spagna
who worked on the creation of the set of questions
and answers used to fine-tune the models.
REFERENCES
Apache Software Foundation (2021). Lucene. https://
lucene.apache.org.
Chen, D., Fisch, A., Weston, J., and Bordes, A. (2017).
Reading wikipedia to answer open-domain questions.
Clark, K., Luong, M.-T., Le, Q. V., and Manning, C. D.
(2020). Electra: Pre-training text encoders as discrim-
inators rather than generators.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2019). BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding. eprint:
1810.04805.
Elastic (2021). Elasticsearch. https://github.com/elastic/
elasticsearch.
Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vin-
cent, P., and Bengio, S. (2010). Why does unsuper-
vised pre-training help deep learning? Journal of Ma-
chine Learning Research, 11(19):625–660.
Goodfellow, I., Bengio, Y., and Courville, A. (2016).
Deep Learning, page 526. MIT Press. http://www.
deeplearningbook.org.
Green, B. F., Wolf, A. K., Chomsky, C. L., and Laughery,
K. (1961). Baseball: an automatic question-answerer.
In IRE-AIEE-ACM ’61 (Western).
Johnson, J., Douze, M., and Jégou, H. (2017). Billion-
scale similarity search with gpus. arXiv preprint
arXiv:1702.08734.
Jurafsky, D. and Martin, J. H. (2021). Speech and Language
Processing: An Introduction to Natural Language
Processing, Computational Linguistics, and Speech
Recognition. 3rd, draft edition. https://web.stanford.edu/~jurafsky/slp3/.
Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L.,
Edunov, S., Chen, D., and Yih, W.-t. (2020). Dense
passage retrieval for open-domain question answer-
ing. In Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Processing
(EMNLP), pages 6769–6781, Online. Association for
Computational Linguistics.
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin,
V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t.,
Rocktäschel, T., Riedel, S., and Kiela,
D. (2021). Retrieval-augmented generation for
knowledge-intensive nlp tasks.
Lin, J., Ma, X., Lin, S.-C., Yang, J.-H., Pradeep, R., and
Nogueira, R. (2021). Pyserini: An easy-to-use python
toolkit to support replicable ir research with sparse
and dense representations.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D.,
Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov,
V. (2019). Roberta: A robustly optimized bert pre-
training approach.
Pietsch, Soni, Chan, Möller, and Kostić (2021). Haystack
(version 0.5.0). https://github.com/deepset-ai/haystack/.
Rajpurkar, P., Jia, R., and Liang, P. (2018). Know what you
don’t know: Unanswerable questions for squad.
Robertson, S. and Zaragoza, H. (2009). The probabilistic
relevance framework: Bm25 and beyond. Founda-
tions and Trends in Information Retrieval, 3:333–389.
Salton, G. (1971). The SMART Retrieval Sys-
tem—Experiments in Automatic Document Process-
ing. Prentice-Hall, Inc., USA.
Seo, M., Kembhavi, A., Farhadi, A., and Hajishirzi, H.
(2018). Bidirectional attention flow for machine com-
prehension.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J.,
Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin,
I. (2017). Attention Is All You Need. eprint:
1706.03762.
Wang, S. and Jiang, J. (2016). Machine comprehension us-
ing match-lstm and answer pointer.
Wang, W., Wei, F., Dong, L., Bao, H., Yang, N., and Zhou,
M. (2020). Minilm: Deep self-attention distillation for
task-agnostic compression of pre-trained transform-
ers.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C.,
Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz,
M., Davison, J., Shleifer, S., von Platen, P., Ma, C.,
Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger, S.,
Drame, M., Lhoest, Q., and Rush, A. M. (2020). Hug-
gingface’s transformers: State-of-the-art natural lan-
guage processing.
Yang, P., Fang, H., and Lin, J. (2017). Anserini: Enabling
the use of lucene for information retrieval research.
In Proceedings of the 40th International ACM SIGIR
Conference on Research and Development in Infor-
mation Retrieval, SIGIR ’17, page 1253–1256, New
York, NY, USA. Association for Computing Machin-
ery.