Design Principles and a Software Reference Architecture for Big Data Question Answering Systems

Leonardo Mauro Pereira Moraes (1,2,a), Pedro Calciolari Jardim (1,b), and Cristina Dutra Aguiar (1,c)

1 Department of Computer Science, University of São Paulo, São Carlos, Brazil
2 Machine Learning & Artificial Intelligence, Sinch, Stockholm, Sweden

Keywords:

Question Answering, Big Data, Software Reference Architecture, Design Principles.

Abstract:

Companies continuously produce several documents containing valuable information for users. However,

querying these documents is challenging, mainly because of the heterogeneity and volume of documents

available. In this work, we investigate the challenge of developing a Big Data Question Answering system,

i.e., a system that provides a uniﬁed, reliable, and accurate way to query documents through naturally asked

questions. We deﬁne a set of design principles and introduce BigQA, the ﬁrst software reference architecture to

meet these design principles. The architecture consists of high-level layers and is independent of programming

language, technology, querying and answering algorithms. BigQA was validated through a pharmaceutical

case study managing over 18k documents from Wikipedia articles and FAQ about Coronavirus. The results

demonstrated the applicability of BigQA to real-world applications. In addition, we conducted 27 experiments

on three open-domain datasets and compared the recall results of the well-established BM25, TF-IDF, and

Dense Passage Retriever algorithms to ﬁnd the most appropriate generic querying algorithm. According to the

experiments, BM25 provided the highest overall performance.

1 INTRODUCTION

The documents produced by companies have become

a valuable source of information for employees and

the general public. For example, a product manager

may look for information in the technical documenta-

tion to certify that a given ﬁnal product faithfully fol-

lows the original project speciﬁcations. A decision-

making employee may analyze ﬁnancial reports to

positively and quickly react to changes in business

conditions. Furthermore, ordinary users can search

for details in a product manual to identify if the prod-

uct fulﬁlls their needs.

However, querying these documents is challenging for users, mainly because of the diversity and
large number of documents available. As a result, the time spent finding a given piece of
information can cause a delay in their activities. It is also possible that they

access outdated information. Another aspect that bur-

dens the users is obtaining only unhelpful information

for their questions. In this paper, we use the terms
query and question interchangeably.

a https://orcid.org/0000-0002-9553-9978
b https://orcid.org/0000-0001-9475-2526
c https://orcid.org/0000-0002-7618-1405

A suitable solution to this challenge is to use a Big

Data Question Answering system. The system stores

and processes several documents in different formats

(e.g. PDF, Open Document Format) and web based

ﬁles (e.g. ﬁles from the Internet). These ﬁles are usu-

ally spread in knowledge bases like Data Lakes and

Document Stores. It also provides a uniﬁed and ac-

curate way to query documents. Moreover, the sys-

tem supports the characteristics of volume, velocity,

and variety from the Big Data concept (Laney et al.,

2001), thus it is able to scale effectively.

Unlike traditional search engines like Google, the Big Data Question Answering system
supports questions written as natural language sentences. Furthermore, it understands
context, subject, and question intention, among other characteristics
(Athira et al., 2013; Zhang et al., 2013; Karpukhin

et al., 2020). Therefore, it tends to be more efﬁcient

and accurate than traditional search engines (Athira

et al., 2013), introducing advantages for many appli-

cations. For example, the system can be used to de-

velop a more advanced search engine that scans le-

gal documents or a user-friendly chatbot-based FAQ

(Frequently Asked Questions) application that allows
users to ask questions about products and services.

Moraes, L., Jardim, P. and Aguiar, C.
Design Principles and a Software Reference Architecture for Big Data Question Answering Systems.
DOI: 10.5220/0011842700003467
In Proceedings of the 25th International Conference on Enterprise Information Systems (ICEIS 2023) - Volume 1, pages 57-67
ISBN: 978-989-758-648-4; ISSN: 2184-4992
© 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)

The Big Data Question Answering system em-

ploys Question Answering (QA) algorithms to query

documents. These algorithms are composed of two

steps: Document Retriever and Document Reader.

The ﬁrst step takes questions in natural language from

the users as input and then looks for related docu-

ments that may have the answer to the user questions.

The second step produces summarized answers from

the retrieved documents.
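The two steps can be illustrated with a minimal sketch. It is not one of the algorithms evaluated in this paper: the retriever below ranks documents by plain word overlap, and the reader simply returns the best-matching sentence. The function names (`tokenize`, `retrieve`, `read`) and the sample documents are assumptions made for illustration only.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase a text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(question: str, documents: List[str], top_k: int = 2) -> List[str]:
    """Document Retriever step (toy version): rank documents by the number
    of distinct question terms they contain and keep the top_k matches."""
    q_terms = set(tokenize(question))
    scored = [(len(q_terms & set(tokenize(doc))), doc) for doc in documents]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in ranked[:top_k] if score > 0]


def read(question: str, retrieved: List[str]) -> str:
    """Document Reader step (toy version): return the sentence from the
    retrieved documents that shares the most terms with the question."""
    q_terms = set(tokenize(question))
    sentences = [
        s.strip()
        for doc in retrieved
        for s in re.split(r"(?<=[.!?])\s+", doc)
        if s.strip()
    ]
    if not sentences:
        return ""  # nothing was retrieved, so there is nothing to read
    return max(sentences, key=lambda s: len(q_terms & set(tokenize(s))))


# Illustrative documents (hypothetical, not taken from the datasets).
documents = [
    "Pathogens can rapidly evolve and adapt. "
    "This helps them avoid detection by the immune system.",
    "Elasticsearch is an open-source text search engine.",
]
question = "How do pathogens avoid detection?"
answer = read(question, retrieve(question, documents))
```

In a real system, each step would be replaced by a proper algorithm, such as a sparse or dense retriever and a neural reader, while the two-step flow stays the same.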

QA algorithms use Natural Language Processing

(NLP) techniques. According to Petroni et al. (2019), Karpukhin et al. (2020), and
Romualdo et al. (2021), current

NLP techniques recall factual knowledge without any

ﬁne-tuning, demonstrating their potential as unsuper-

vised open-domain QA algorithms.

However, little attention has been devoted to the

problem of investigating how to build a Big Data

Question Answering system. To face this problem,

we must propose design principles and a software

reference architecture. Design principles are guide-

lines, biases, and design considerations that should

be followed to select, create, and organize compo-

nents and features. A reference architecture is a soft-

ware template where the structures, respective com-

ponents, and relations provide a concrete system ar-

chitecture for a particular application or a family of

software systems (Galster and Avgeriou, 2011; Klein

et al., 2016; Derras et al., 2018).

As discussed in Section 2, there are state-of-the-

art solutions in the literature that study the prob-

lem separately, presenting several drawbacks. On

the one hand, general-purpose Big Data architectures

do not focus on QA solutions. On the other hand,

general-purpose Question Answering architectures do

not meet the concepts of Big Data or a software refer-

ence architecture. To the best of our knowledge, there

is still no solution that considers Big Data and Ques-

tion Answering architectures in the same setting and

deﬁnes related design principles. In this paper, we in-

vestigate this gap in the literature.

We propose BigQA, the ﬁrst Big Data Question

Answering architecture. The features of the proposed

architecture are described as follows. The architec-

ture collects structured, semi-structured, and unstruc-

tured data from different sources through the Input

Layer. Data can be available in several formats, such

as multi-documents and web pages. The Big Data
Storage Layer is needed at the bottom to prepare high-quality data for all kinds of
analytical demands required by the upper layers. The Big Querying Layer is
responsible for processing the users' questions that are sent from the Communication
Layer. Also, all the layers are securely connected by the Security Layer and
generate analytical data managed by the Insights Layer.

We highlight the main contributions of this paper:

1. A set of design principles based on Business (B),

Data (D), and Technical (T) requirements to de-

sign Big Data Question Answering systems.

2. A software reference architecture to meet the de-

sign principles, called BigQA.

3. A case study to demonstrate the applicability of

BigQA to real-world applications.

4. A set of experiments to compare the well-

established BM25, TF-IDF, and Dense Passage

Retriever algorithms to indicate the best candidate

to implement the Document Retriever.

This paper is organized as follows. Section 2 re-

views related work. Section 3 introduces the design

principles. Section 4 describes the proposed architec-

ture. Section 5 validates the architecture. Section 6

describes experiments performed considering differ-

ent QA algorithms. Section 7 concludes the paper.

2 RELATED WORK

A few Big Data and Question Answering architectures have been proposed in the literature,
but they differ from our work in purpose and features. In Sections 2.1 and 2.2, we survey,
respectively, research papers and private technologies regarding these architectures.

2.1 Research Proposals

In this section, we analyze Big Data and Question

Answering architectures by dividing them into two

groups. Group (i) includes general-purpose software

reference architectures developed for data analysis

and applications in the context of Big Data. Group

(ii) encompasses Question Answering architectures

for speciﬁc use cases.

In Table 1, we compare the investigated archi-

tectures and our work considering the main features

of a Big Data Question Answering Architecture dis-

cussed in Section 3. We consider the characteristics

described as follows.

(c.1) Fits as a software reference architecture.

(c.2) Meets Big Data requirements.

(c.3) Implements security components.

(c.4) Introduces design principles.

(c.5) Implements a Question Answering solution.

(c.6) Can retrieve documents from multiple domains.

(c.7) Evaluates the architecture in a real-world case.

Table 1: Characteristics of the proposed BigQA architecture and related work.

Columns: Studies; (c.1) Reference Architecture; (c.2) Big Data; (c.3) Security; (c.4) Design Principles; (c.5) QA; (c.6) Open Domain; (c.7) Case Study

Big Data Architecture

(Zhu et al., 2019) ✓ ✓ ✓ ✓

(Li et al., 2020) ✓ ✓ ✓

(Ataei and Litchfield, 2021) ✓ ✓

(Cassavia and Masciari, 2021) ✓ ✓ ✓

(Yousfi et al., 2021) ✓ ✓ ✓

Question Answering Architecture

(Sucunuta and Riofrio, 2010) ✓ ✓ ✓ ✓

(Nielsen et al., 2010) ✓ ✓ ✓ ✓

(Zhang et al., 2013) ✓ ✓ ✓

(Novo-Loures et al., 2020) ✓ ✓ ✓

(Karpukhin et al., 2020) ✓ ✓ ✓

BigQA (our proposal) ✓ ✓ ✓ ✓ ✓ ✓ ✓

Legend: The ✓ symbol indicates the challenges addressed by each study.

Regarding Group (i), Big Data architectures, the

studies detailed in (Zhu et al., 2019; Li et al., 2020;

Ataei and Litchﬁeld, 2021; Cassavia and Masciari,

2021; Yousﬁ et al., 2021) meet Big Data requirements

and fit as software reference architectures. However, the proposed architectures are
difficult to adapt to the needs of Question Answering due to their constraints and
configurations of layers and components. Furthermore, the two studies that introduce
design principles consider only technical aspects.

Considering Group (ii), Question Answering ar-

chitectures, the main objective of the studies de-

scribed in (Sucunuta and Riofrio, 2010; Nielsen et al.,

2010; Zhang et al., 2013; Novo-Loures et al., 2020;

Karpukhin et al., 2020) is to propose QA algorithms.

Based on these algorithms, these studies develop ar-

chitectures and apply their solution to real-world case

studies. The main drawback of the studies in this

group is related to the fact that the proposed architec-

tures are not generic or ﬂexible. The architectures are

highly dependent on the proposed algorithms. Specif-

ically assessing the studies of (Sucunuta and Riofrio,

2010; Nielsen et al., 2010), they focus only on techni-

cal aspects of the design principles.

The studies described in (Sucunuta and Riofrio,

2010; Novo-Loures et al., 2020) are the closest to our

work. Unlike our proposed BigQA, the architecture introduced in (Sucunuta and Riofrio, 2010)
has a fixed schema for query processing and uses an algorithm that is outdated compared
with modern NLP algorithms. Also, it does not involve security features or consider
Big Data traits. In (Novo-Loures et al., 2020), the authors introduce an NLP architecture
based on BDP4J (Big Data Pipelining For Java) to preprocess textual data using data
pipelines. This architecture mainly targets text processing. Thus, it was not designed
to deal with Question Answering. Moreover, the architecture does not consider design
principles or security artifacts.

Our proposed BigQA architecture overcomes the

aforementioned shortcomings and fulﬁlls all the char-

acteristics analyzed in Table 1. BigQA is a software

reference architecture that considers Big Data aspects

and the problem of open-domain Question Answering

in the same setting. It also includes security aspects to

guarantee data protection and to support related gen-

eral laws. Moreover, BigQA consists of high-level

layers with speciﬁc functionalities. Therefore, it is in-

dependent of programming language, QA algorithm,

and technology. Besides, our work deﬁnes a set of

design principles based on business, data, and techni-

cal aspects. We also validate BigQA by considering a

real-world case study.

2.2 Private Technologies

There are some private technologies, like IBM Watson Discovery
(https://www.ibm.com/cloud/watson-discovery), Amazon Kendra
(https://aws.amazon.com/pt/kendra/), and Sinch AskFrank (https://askfrank.ai/home),
that comply with the notion of Big Data Question Answering systems. We do not investigate
these technologies in this section because they are proprietary and, to the best of our
knowledge, are not described in public research papers.

Furthermore, ChatGPT (https://openai.com/blog/chatgpt/) is a technology recently
released by OpenAI. It leverages the information retrieval experience, enabling a high
level of context understanding and answer generation. ChatGPT has a

great capacity of generating natural and human-like

answers for complex questions. Therefore, it can an-

swer any query, regardless of its complexity. Apart

from the common NLP problems of large models, like

hallucination and misleading answers (Ji et al., 2022),

the results provided by ChatGPT are very close to hu-

man intelligence.

Despite these surprising results, we can highlight some drawbacks in the solution:

• Large models are usually trained with static data.

Thus, ChatGPT has a frozen knowledge limitation

when applied to real-world and dynamic applica-

tions.

• It is unknown whether the generated answers are

real or completely hallucinated as ChatGPT does

not provide references to the source information.

• It is not very clear how one would incorporate business rules into ChatGPT's reasoning.
Therefore, it is difficult to filter out or enrich its answers.

In summary, ChatGPT cannot perform queries

that would allow answers based on dynamic data or

live events. It also does not incorporate data from ex-

ternal data sources, although data is constantly being

updated. Currently, ChatGPT knowledge is based on

static Internet data up to 2021. As a result, users can

access outdated information or get unhelpful informa-

tion for their questions. Our proposed BigQA archi-

tecture overcomes the aforementioned challenges by

allowing insertion and updating of data. Furthermore,

there are no public research papers that describe and

detail ChatGPT logic and implementation.

3 DESIGN PRINCIPLES

Despite the vast literature on Big Data and Question

Answering, it remains unclear how to design a suit-

able Big Data Question Answering system. From our

point of view, the system must be designed accord-

ing to accurate principles that consider business, data,

and technical aspects. We highlight that the quality of the principles
influences the quality of the system.

In this section, we introduce a set of design principles for Big Data Question Answering
systems. These principles are inspired by business models (Müller et al., 2019;
Schaffer et al., 2020), the agile manifesto (Misra et al., 2012), and the characteristics
of the Big Data concept (Laney et al., 2001). We define each principle as follows.

Principles related to Business (B):

B1: The user must retrieve a proper answer to a given

question. The answer may be unknown due to a

lack of information in the documents or a misun-

derstanding of the QA algorithm employed. In

this case, the system must inform the user.

B2: The user must access only allowed documents.

Therefore, the system should support the imple-

mentation of data governance policies to ensure

that only authorized users can access portions of

data and documents.

B3: The user may write the question using natural

language. Thus, the system needs to automati-

cally understand the context, subject, and purpose

of the user question.

B4: The answer must summarize the contents of the

documents related to the question. The system

usually provides the answer in two formats: as

a FAQ answer or as intrinsic information within

documents, such as parts of texts.
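Principles B1 and B2 can be made concrete with a short sketch. It assumes, hypothetically, that the QA algorithm returns candidate answers with a confidence score in [0, 1] and that each source document carries a set of roles allowed to read it. The names `Candidate`, `answer_for_user`, `NO_ANSWER_MESSAGE`, and the 0.5 threshold are illustrative assumptions, not part of any specific system.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Candidate:
    answer: str
    score: float              # confidence of the QA algorithm, in [0, 1]
    allowed_roles: Set[str]   # roles allowed to read the source document


NO_ANSWER_MESSAGE = "No answer was found for this question."


def answer_for_user(
    candidates: List[Candidate],
    user_roles: Set[str],
    min_score: float = 0.5,   # assumed threshold, to be tuned per application
) -> str:
    # B2: keep only answers whose source document the user may access.
    visible = [c for c in candidates if c.allowed_roles & user_roles]
    # B1: return the best answer only when it is confident enough;
    # otherwise, explicitly inform the user that no answer is known.
    best: Optional[Candidate] = max(visible, key=lambda c: c.score, default=None)
    if best is None or best.score < min_score:
        return NO_ANSWER_MESSAGE
    return best.answer
```

Filtering by permission before ranking means that a restricted document can never influence the answer shown to an unauthorized user, even when it holds the highest-scoring candidate.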

Principles related to Data (D):

D1: The system must persist the documents. Raw

documents should be stored in a Data Lake or

a similar repository, even if the documents are

processed and the system uses a small portion of

them. The evolution of NLP algorithms and docu-

ment processing techniques motivates storing raw

documents. Therefore, the system can reuse them

in the future if needed.

D2: The system must work with documents from different data sources. Examples of sources
include data systems, databases, website crawlers, and web-based collaborative platforms
like Confluence (https://www.atlassian.com/software/confluence) and SharePoint
(https://www.office.com/sharepoint) pages.

D3: The system must support documents in a variety of formats and with different sizes.
Given the characteristics of a data source, a document can be structured according to a
specific format, such as web page, PDF, Word, and JSON (JavaScript Object Notation).
Furthermore, a document can be small or large, depending on its number of pages. In this
case, the system must be able to deal with structured, semi-structured, and unstructured
data related to Big Data variety.

D4: The amount of produced documents and texts written can easily reach large volumes of
data. Therefore, the system must be able to extend its functionalities to Big Data volume.

D5: After inserting or updating the raw documents, data must be ready for consumption.
Therefore, the system must be able to process the raw documents considering Big Data
velocity. Thus, documents are available at the right time to support the best business
decisions.

Principles related to Technical aspects (T):

T1: Modularity. Each component has a speciﬁc sys-

tem functionality and works as an independent

module that contains everything necessary to ex-

ecute its functionality. But, at the same time, all

components are connected to deliver the proper

value to the system.

T2: Flexibility. The system must easily encompass

new components as needed, each with its com-

plexity and functionality. Furthermore, the system

should be independent of software programming

language and technology.

T3: Analytics. The system must store data, metadata, and usage information for analysis.
That is, the system must keep analytical information available to system managers.

T4: Security. The system must use security artifacts

to ensure the system integrity, such as user au-

thentication systems, credentials for adding doc-

uments, data governance policies, and encrypted

connections between components.
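Principles T1 and T2 suggest hiding each component behind a small interface so that implementations can be swapped without touching the rest of the system. The sketch below shows one possible way to do this in Python with `typing.Protocol`. The names `Retriever`, `KeywordRetriever`, `NewestFirstRetriever`, and `QueryingLayer` are hypothetical and chosen only for illustration.

```python
from typing import List, Protocol


class Retriever(Protocol):
    """Interface that any retriever component is assumed to satisfy."""

    def retrieve(self, question: str, top_k: int) -> List[str]: ...


class KeywordRetriever:
    """Ranks documents by the number of words shared with the question."""

    def __init__(self, documents: List[str]) -> None:
        self.documents = documents

    def retrieve(self, question: str, top_k: int) -> List[str]:
        terms = set(question.lower().split())
        ranked = sorted(
            self.documents,
            key=lambda doc: len(terms & set(doc.lower().split())),
            reverse=True,
        )
        return ranked[:top_k]


class NewestFirstRetriever:
    """Alternative component: returns the most recently added documents."""

    def __init__(self, documents: List[str]) -> None:
        self.documents = documents

    def retrieve(self, question: str, top_k: int) -> List[str]:
        return list(reversed(self.documents))[:top_k]


class QueryingLayer:
    """Depends only on the Retriever interface, not on a concrete class."""

    def __init__(self, retriever: Retriever) -> None:
        self.retriever = retriever

    def ask(self, question: str, top_k: int = 1) -> List[str]:
        return self.retriever.retrieve(question, top_k)
```

Because `QueryingLayer` only relies on the `retrieve` method, a new retriever can be added later without changing the layer itself.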

4 PROPOSED ARCHITECTURE

From the design principles introduced in Section 3,

we propose BigQA, the ﬁrst Big Data Question An-

swering architecture. It comprises six layers, as de-

picted in Figure 1.

There are two types of layers: horizontal and ver-

tical. Horizontal layers are those that have an explicit

connection between them. For instance, there is a

connection between the Communication and the Big

Querying layers. Vertical layers refer to those that

can connect to any other layer of the architecture. For
example, the Security Layer can provide the Communication Layer with secure credentials
for an API and act as an authentication system for a Data Warehouse in the Insights Layer.

In this section, we detail the functionality of each

layer. We ﬁrst describe the horizontal layers, i.e., (i)

Input; (ii) Big Data Storage; (iii) Big Querying; and

(iv) Communication. Then, we discuss the vertical

layers, i.e., (v) Security; and (vi) Insights. Finally, we

discuss general aspects of the architecture.

Before detailing the layers, we introduce in Exam-

ple 1 a business application where the BigQA archi-

tecture is required. We employ this case as a running

example throughout this section.

Example 1. Consider a large pharmaceutical com-

pany that offers a wide range of health care products

and manages many documents, including pharmaceu-

tical leaﬂets, products reports, ﬁnancial and contract

documents. The company needs a single knowledge

base where its employees can quickly look for infor-

mation using natural language. To comply with this

requirement, the company implemented a system us-

ing BigQA as architecture.

4.1 Input Layer

The Input Layer ingests documents into the system.

This layer collects documents from the data sources,

such as company ﬁles, reports, and trusted websites.

It also sends these documents to the Data Lake with-

out (or with minimal) preprocessing. Example 2 il-

lustrates the use of the Input Layer.

Example 2. Data providers use the Input Layer to

add documents to the system. In the context of Exam-

ple 1, data providers are employees who write docu-

ments on web-based collaborative platforms and ex-

ternal sources, such as market research documents.

These data sources generate different formats of doc-

uments, such as Word, JSON, and PDF ﬁles.

4.2 Big Data Storage Layer

The ﬁrst functionality of the Big Data Storage Layer

refers to data storage. The layer receives raw doc-

uments from the Input Layer and stores them in the

Data Lake. For each document, the system performs

several related activities. First, it identiﬁes the doc-

ument type. Then, it applies a set of predeﬁned pro-

cessing techniques to clean and transform the docu-

ment data to fulﬁll the requirements of the Document

Store. Next, the system stores the transformed data in this repository. Finally, the
system updates the Metadata Repository with information about the new or updated
document obtained from the Data Lake and the Document Store.
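The storage activities above can be sketched as a small ingestion routine. This is only an in-memory illustration under stated assumptions: dictionaries stand in for the Data Lake, Document Store, and Metadata Repository, the document type is guessed from the file extension, and JSON documents are assumed to carry their content in a hypothetical `"text"` field.

```python
import json
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Storage:
    """In-memory stand-ins for the three repositories (assumption)."""
    data_lake: Dict[str, bytes] = field(default_factory=dict)
    document_store: Dict[str, str] = field(default_factory=dict)
    metadata: Dict[str, Dict[str, object]] = field(default_factory=dict)


def identify_type(name: str) -> str:
    """Identify the document type from the file extension."""
    return name.rsplit(".", 1)[-1].lower() if "." in name else "unknown"


def transform(raw: bytes, doc_type: str) -> str:
    """Clean and transform raw content into plain text for the Document Store."""
    text = raw.decode("utf-8", errors="replace")
    if doc_type == "json":
        payload = json.loads(text)
        # Hypothetical layout: the content lives under a "text" key.
        if isinstance(payload, dict):
            text = str(payload.get("text", ""))
    return " ".join(text.split())  # collapse repeated whitespace


def ingest(storage: Storage, name: str, raw: bytes) -> None:
    is_update = name in storage.data_lake
    storage.data_lake[name] = raw                        # 1. keep the raw document
    doc_type = identify_type(name)                       # 2. identify its type
    storage.document_store[name] = transform(raw, doc_type)  # 3. transform and store
    storage.metadata[name] = {                           # 4. update the metadata
        "type": doc_type,
        "raw_bytes": len(raw),
        "updated": is_update,
    }
```

Keeping the raw bytes in the data lake alongside the transformed text follows principle D1: the document can be reprocessed later if better techniques become available.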

The Big Data Storage Layer is also responsible for data processing. To implement this
feature, a big data infrastructure is needed, as well as a distributed and parallel
processing framework (e.g., Apache Spark (Zaharia et al., 2010)). The objective is to
manage the volume and size of the documents. In addition, the Document Store should store
files using a distributed file system (e.g., Hadoop Distributed File System (HDFS)
(Shvachko et al., 2010)).

[Figure 1 components. Horizontal layers: Input Layer (Files, Websites, Other); Big Data Storage Layer (Data Lake, Document Store, Metadata Repository); Big Querying Layer (Document Retriever, Document Reader); Communication Layer (API, Front-end). Vertical layers: Security Layer (Authentication System, Credentials & Permissions, Data Governance); Insights Layer (Reporting Tools, Data Warehouse, Artificial Intelligence).]

Figure 1: The BigQA, the first Big Data Question Answering architecture.

4.3 Big Querying Layer

The Big Querying Layer is the core of the architec-

ture. It is the query engine responsible for processing,

interpreting, and producing answers to the user ques-

tions. This layer retrieves documents from the Big

Data Storage Layer, receives queries from the Com-

munication Layer, and produces answers that are sent

back to the Communication Layer. Example 3 shows
how the Big Querying Layer operates.

Example 3. A pharmacist may need information

about the immune system. In the context of Exam-

ple 1, the pharmacist accesses a web page through the

Communication Layer and sends the following ques-

tion “How do pathogens avoid detection?”. The Big

Querying Layer processes the query by retrieving the

immune system-related documents, summarizing the

contents, and producing the answer “pathogens can

rapidly evolve and adapt”. Finally, the Communica-

tion Layer receives the response and presents it to the

pharmacist on the web page.

The Big Querying Layer encompasses two com-

ponents. The Document Retriever retrieves the most

valuable documents that may contain the answer to a

given question. These documents are obtained from

the Document Store using a big data search engine

(e.g., Apache Lucene (Lydia et al., 2020)). The sec-

ond component, Document Reader, examines the re-

trieved documents and produces a suitable answer for

the user. It may execute in parallel on computing
clusters (e.g., Kubernetes (Poniszewska-Marańda and Czechowska, 2021)).

4.4 Communication Layer

The Communication Layer acts as an interface

through which users submit queries and receive their

answers. This layer can encompass several compo-

nents. In Figure 1, we illustrate two of them. Front-

end and API are two components that allow users and

applications to send questions to the system.

The Communication Layer requires connectiv-

ity through data streaming applications (e.g., Apache

Kafka (Lepenioti et al., 2020)). Therefore, it enables

near real-time response retrieval.

4.5 Security Layer

The Security Layer addresses security issues like net-

work connection, credentials, and data governance.

Because it is a vertical layer, it can apply security ar-

tifacts to any other layer. Example 4 illustrates how

to use the Security Layer to provide a secure network

connection between two components.

Example 4. In implementing the system described in

Example 1, the Communication Layer must securely

connect to the Big Querying Layer. Otherwise, unau-

thorized people may access the system and query re-

stricted internal information about the company and

its products. The development team must implement

an internal network and a ﬁrewall system to ensure a

secure connection between the components.

The Security Layer consists of, but is not limited to, the components described as
follows. The Authentication System is responsible for authenticating users' operations.
The Credentials & Permissions component should guarantee the definition of appropriate
credentials and permissions to apply to the network connection between the components.
Finally, Data Governance refers to managing the availability, usability, integrity, and
security of the data in the system according to well-defined policies and constraints.

4.6 Insights Layer

The Insights Layer supports data analysis. This

layer receives and processes data from multiple lay-

ers. Examples of components that are present in the

Insights Layer are: (i) Reporting Tools to provide us-

age reports, data telemetry, and system monitoring;

(ii) Data Warehouse, a core component of business

intelligence activities that organizes data multidimen-

sionally to support reporting and data analysis; and

(iii) Artiﬁcial Intelligence models.

4.7 Architecture Discussion

BigQA presents the high-level functionality needed to

ﬁt the design principles rather than a restricted set

of technology for implementing a Big Data Ques-

tion Answering system. Therefore, each development

team can choose the technology, software program-

ming language, and QA algorithm that better suit their

application requirements. Furthermore, each team

should adapt the architecture to the application re-

quirements by instantiating only the appropriate lay-

ers and components.

BigQA covers the design principles listed in Sec-

tion 3 as follows: (i) B1-B4: Communication, Big

Querying, and Security layers; (ii) D1-D5: Input and

Big Data Storage layers; (iii) T1-T2: all layers; (iv)

T3: Insights Layer; and (v) T4: Security Layer.

Regarding the implementation, we encourage lean

development using iterative approaches and modular

components. Each component has a speciﬁc function

in the system and can function independently. So,
the components can be developed independently and in an evolutionary manner, as stated
by the Agile manifesto (Misra et al., 2012). To have agile teams, it is essential to

simplify the development of the components and to

facilitate collaboration across time, location, and or-

ganizational boundaries (Schaffer et al., 2020).

5 CASE STUDY

In this section, we present a case study to show how

BigQA can be deployed to enable a knowledge base

containing real-world documents. Our goal is not to

perform an extensive analysis of the architecture com-

ponents. Instead, we implement a real-world case to

assess the architecture purpose. Section 5.1 describes

how to instantiate BigQA. Section 5.2 details queries.

5.1 Architecture Instantiation

Figure 2 depicts the BigQA components and layers in-

stantiated in the case study. The Input Layer contains

JSON documents obtained from the training sets from

two real-world datasets: (i) the Stanford Question

Answering Dataset (SQuAD) v1.1 (Rajpurkar et al.,

2016); and (ii) the COVID-QA (Möller et al., 2020).

SQuAD is an open-domain Question Answering

dataset that stores over 18,800 unique documents with

over 87,500 questions and answers about Wikipedia

articles. Its contents refer to several different topics, such as pharmacy, antibiotics,
databases, software testing, TV series, car companies, and geology.

COVID-QA consists of over 2,000 questions and

answers annotated by volunteer biomedical experts

on 147 scientiﬁc articles related to COVID-19. This

dataset is not open-domain. We used it as data aug-

mentation to show that BigQA supports adding data

from different formats to improve answers.

The Big Data Storage Layer does not keep the raw documents since the datasets were
processed before being inserted into the Document Store. That is, data transformations
converted the JSON documents into Document Store records. We used the Elasticsearch
tool as Document Store. According to Kononenko et al. (2014), “Elasticsearch is an
open-source text search engine written in Java that is designed to be distributive,
scalable, and near real-time capable”.

We employed the Haystack tool to build the Big Querying and Communication layers.
Haystack is an open-source framework in Python that supports pipelines for different
search applications and includes several state-of-the-art NLP models. We used the
well-established QA algorithms BM25 (Robertson and Jones, 1976) and RoBERTa
(Liu et al., 2019) as Document Retriever and Document Reader, respectively. BM25 was
the QA algorithm that provided the best performance in the experiments described in
Section 6.
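To make the retriever choice more tangible, the following is a minimal sketch of BM25 scoring in plain Python. It is not the implementation used in the case study, which relies on Elasticsearch and Haystack. It uses a common formulation of BM25 with a non-negative inverse document frequency, and the parameter values `k1 = 1.5` and `b = 0.75` as well as the simple tokenizer are assumptions for illustration.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase a text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(
    query: str,
    documents: List[str],
    k1: float = 1.5,   # term-frequency saturation (assumed default)
    b: float = 0.75,   # document-length normalization (assumed default)
) -> List[float]:
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs if n_docs else 0.0
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if tf[term] == 0:
                continue  # terms absent from the document contribute nothing
            idf = math.log(
                1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5)
            )
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Documents that contain rare query terms many times, relative to their length, receive higher scores. The retriever then keeps the highest-scoring documents and passes them to the Document Reader.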

The code was written in Python using Jupyter

Notebooks. It is publicly accessible from the link pro-

vided in the Conclusion (Section 7).

5.2 Queries

We describe three queries that can execute on top of

the instantiated architecture described in Section 5.1.

We issued distinct types of queries to analyze differ-

ent aspects related to real-world applications. The

queries were deﬁned considering the context of the

pharmaceutical company detailed in Example 1.

Elasticsearch: https://www.elastic.co/

Haystack: https://haystack.deepset.ai/

Design Principles and a Software Reference Architecture for Big Data Question Answering Systems

Figure 2: A real-world case study setup using BigQA. Components shown: JSON documents (Input Layer); Document Store (Big Data Storage Layer); Document Retriever (BM25) and Document Reader (RoBERTa) (Big Querying Layer); API (Communication Layer); User.

For each query, the Document Retriever returned the top-20 documents containing related information, and the Document Reader returned the 3 most likely answers. Therefore, there are three possible answers for each query, each annotated with a probability score provided by the Document Reader. Higher scores indicate more confidence in the prediction. Each query was an unseen question reformulated from the original dataset to avoid bias from the Document Retriever and Document Reader algorithms. All queries returned answers since we did not evaluate the no-answer scenario.
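The two-stage flow described above (retrieve the top-20 documents, then read the 3 most likely answers) can be sketched independently of any framework. The function `answer_question` and the toy `toy_retrieve` and `toy_read` callables below are hypothetical stand-ins written for this illustration; they are not the Haystack API nor the BM25 and RoBERTa components used in the case study.

```python
from typing import Callable, List, Tuple

# (answer text, source document, score in [0, 1])
Answer = Tuple[str, str, float]


def answer_question(
    question: str,
    retrieve: Callable[[str, int], List[str]],
    read: Callable[[str, List[str]], List[Answer]],
    top_k_retriever: int = 20,
    top_k_reader: int = 3,
) -> List[Answer]:
    """Run a retriever, then a reader, and keep the best-scored answers."""
    documents = retrieve(question, top_k_retriever)
    candidates = read(question, documents)
    # Higher scores indicate more confidence, so sort in descending order.
    ranked = sorted(candidates, key=lambda answer: answer[2], reverse=True)
    return ranked[:top_k_reader]


def toy_retrieve(question: str, k: int) -> List[str]:
    """Stand-in retriever: returns the first k names of a fixed corpus."""
    corpus = ["doc-a", "doc-b", "doc-c", "doc-d"]
    return corpus[:k]


def toy_read(question: str, documents: List[str]) -> List[Answer]:
    """Stand-in reader: one candidate per document with a decreasing score."""
    return [
        (f"answer from {name}", name, 1.0 / (position + 2))
        for position, name in enumerate(documents)
    ]


answers = answer_question("example question?", toy_retrieve, toy_read)
for text, document, score in answers:
    print(f"{score:.2%}  {document}  {text}")
```

Keeping the retriever and the reader behind plain callables reflects the agnostic design of the architecture: either component can be swapped without changing the surrounding flow.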

Query 1: What law regulates drug marketing in

the pharmaceutical industry? This query represents

the interest of pharmacists, marketing, and legal em-

ployees in knowing about regulatory laws on drug

marketing. It is a named-entity query since it looks

for a regulatory law name. This type of query aims

at generating an appropriate answer considering that

only one document has the correct answer. We exe-

cuted Query 1 against the SQuAD dataset.

Table 2 shows the results of Query 1. The ﬁrst two

answers refer to documents from the pharmaceutical

industry, while the last indicates a United States legal

penalty document. The answer with the highest probability score is the correct, expected answer. We can conclude that the instantiated architecture can identify the answer to Query 1 with a score of about 76%.

Table 2: Results of the named-entity Query 1.

| Answer | Document | Score |
|---|---|---|
| Prescription Drug Marketing Act of 1987 | Pharmaceutical industry | 76.34% |
| Food and Drug Administration (FDA) | Pharmaceutical industry | 19.77% |
| Torture Regulation | Capital punishment in the United States | 11.01% |

Query 2: When was the Luria–Delbrück? This query represents the interest of microbiologists in extracting information about a bacterial experiment for antibiotics, which occurred in 1943. It is a date-oriented query since it searches for a specific date. This type of query investigates the architecture's ability to identify dates in documents. We executed Query 2 against the SQuAD dataset.

Table 3 depicts the results of Query 2. The ﬁrst

document is the only one related to antibiotics; the

others are associated with Arnold Schwarzenegger.

The answer with the highest probability score is the correct, expected answer. However, its score is below 50%, so the Document Reader has low confidence in it. Usually, when scores are lower than 50%, the algorithm reports that it was not able to find an answer. We can conclude that the instantiated architecture can identify the answer to the date-oriented Query 2, but the Document Reader should be fine-tuned on date-question samples.
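The effect of a 50% confidence threshold can be illustrated with the scores reported in Tables 2 and 3. The helper `accept_answers` is a hypothetical function written for this sketch; it is not part of the case study code, and the threshold value is taken only from the rule of thumb stated above.

```python
def accept_answers(answers, threshold=0.5):
    """Keep answers whose score reaches the threshold.

    An empty result corresponds to the "no answer found" outcome.
    """
    return [(text, score) for text, score in answers if score >= threshold]


# Scores copied from Table 2 (Query 1) and Table 3 (Query 2).
query1 = [
    ("Prescription Drug Marketing Act of 1987", 0.7634),
    ("Food and Drug Administration (FDA)", 0.1977),
    ("Torture Regulation", 0.1101),
]
query2 = [("1943", 0.2989), ("14", 0.0684), ("14", 0.0306)]

accepted_q1 = accept_answers(query1)
accepted_q2 = accept_answers(query2)
print(accepted_q1)
print(accepted_q2)
```

Under this rule, Query 1 keeps only its top answer, while Query 2 yields no accepted answer even though its top candidate (1943) is correct, which is why fine-tuning the reader on date questions is suggested.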

Table 3: Results of the date-oriented Query 2.

| Answer | Document | Score |
|---|---|---|
| 1943 | Antibiotics | 29.89% |
| 14 | Arnold Schwarzenegger | 6.84% |
| 14 | Arnold Schwarzenegger | 3.06% |

Query 3: What is the novel Coronavirus? This

query is informative for any pharmaceutical employee

and the general public. We executed Query 3 against

the SQuAD dataset augmented with the COVID-QA

dataset. This type of query, an augmented query, explores the architecture's ability to extract knowledge from new documents of different formats.

Table 4 depicts the results of Query 3. All the returned documents refer to the Coronavirus, and all answers have a score over 70%. The first and third answers are the correct ones. We can conclude that the instantiated architecture can answer Query 3 by extracting data from new documents once they are processed and inserted into the Document Store.

ICEIS 2023 - 25th International Conference on Enterprise Information Systems

Table 4: Results of the augmented Query 3.

| Answer | Document | Score |
|---|---|---|
| SARS-CoV-2 | COVID-QA | 87.70% |
| Prevention for 2019 | COVID-QA | 76.78% |
| SARS-CoV-2 | COVID-QA | 71.66% |

5.3 Case Study Discussion

In summary, this case study demonstrated the applicability of BigQA to real-world applications, using documents from Wikipedia articles and FAQ questions and answers. As discussed in Section 5.1, we adapted the instantiated architecture to the application requirements by implementing only the appropriate layers and components. Finally, in Section 5.2, we presented distinct types of queries analyzing different aspects related to business applications. The instantiated architecture properly answered the named-entity and augmented queries, but it assigned a low probability score to the correct answer of the date-oriented Query 2.

6 DOCUMENT RETRIEVER ALGORITHMS EVALUATION

Because BigQA is agnostic, the Document Retriever

can implement any algorithm. In this section, we con-

duct 27 experiments to evaluate three well-established

QA algorithms and investigate their recall score. We

are motivated by the fact that the use of a higher re-

call algorithm provides a higher end-to-end querying

and answering performance (Karpukhin et al., 2020).

Section 6.1 describes the experimental setup and Section 6.2 details the experimental results.

6.1 Experiment Setup

We used almost the same instantiation described in

Section 5.1 to evaluate the following well-established

QA algorithms: BM25 (Robertson and Jones, 1976),

TF-IDF (Sammut and Webb, 2010), and Dense Pas-

sage Retriever (DPR) (Karpukhin et al., 2020).

The differences refer to the data characteristics and the employed datasets. We used question-document pairs as input data to carry out this evaluation. The purpose of the experiments was to evaluate how well each algorithm retrieves the correct document for a given question. We employed the validation sets of the following three open-domain, real-world datasets:

• SQuAD v1.1 (Rajpurkar et al., 2016), with 10,570

question-document pair samples.

• AdversarialQA (Bartolo et al., 2020), a QA

dataset in which humans have created adverse and

complex questions, so the models cannot answer

these questions easily. This dataset stores 3,000

question-document pair samples.

• DuoRC (Saha et al., 2018), a dataset of movie plot

questions and answers on articles from Wikipedia

and IMDb, containing 12,845 question-document

pair samples.

We compared the performance of the algorithms using the recall measure. This measure indicates the proportion of questions for which the correct document is among the k retrieved documents. We varied k over the values 3, 10, and 20. The literature usually works with k = 20 or more. However, business applications must provide fast answers and therefore retrieve fewer documents, so we also evaluated smaller values of k.
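The recall measure described above can be computed as in the following minimal sketch. The function name `recall_at_k` and the toy rankings are hypothetical and serve only to illustrate the definition; the values of k used in the toy example are smaller than those of the experiments.

```python
def recall_at_k(retrieved, gold, k):
    """Fraction of questions whose correct document is in the top-k results.

    retrieved: one ranked list of document ids per question.
    gold: the correct document id for each question, in the same order.
    """
    if len(retrieved) != len(gold):
        raise ValueError("retrieved and gold must have the same length")
    if not gold:
        raise ValueError("at least one question is required")
    hits = sum(
        1 for ranking, correct in zip(retrieved, gold) if correct in ranking[:k]
    )
    return hits / len(gold)


# Hypothetical rankings for three questions.
retrieved = [
    ["d1", "d7", "d3", "d9"],
    ["d4", "d2", "d8", "d5"],
    ["d6", "d5", "d1", "d3"],
]
gold = ["d3", "d5", "d2"]

for k in (1, 3, 4):
    print(f"recall@{k} = {recall_at_k(retrieved, gold, k):.2%}")
```

Running the same computation for each of the three algorithms, three datasets, and three values of k gives the 27 recall values of the evaluation.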

6.2 Experiment Results

Table 5 depicts the recall results of the investigated algorithms. The results demonstrate that recall increases as k increases, indicating that retrieving more documents raises the probability of retrieving the correct document.

In most cases, BM25 provided the best performance. This is related to the fact that BM25 extends TF-IDF using a probabilistic information retrieval model, improving recall. Moreover, BM25 is a sparse algorithm, whereas DPR is a dense algorithm, and dense algorithms are computationally expensive in terms of time and secondary memory usage. Based on these results, we employed BM25 as the Document Retriever in the case study detailed in Section 5.
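To make the relation between BM25 and TF-IDF concrete, the sketch below scores documents with a commonly used Okapi BM25 formulation: an inverse document frequency weight multiplied by a term frequency that saturates (parameter `k1`) and is normalized by document length (parameter `b`). The function `bm25_scores`, the whitespace tokenization, the Lucene-style IDF, and the values `k1=1.2` and `b=0.75` are assumptions of this illustration; the source does not state the exact variant or parameters used in the experiments.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a simple BM25 variant.

    Assumes a non-empty corpus with at least one non-empty document.
    Tokenization is a plain lowercase whitespace split.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log(
                1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5)
            )
            tf = term_freq[term]
            # Saturating term frequency with document-length normalization.
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


# Hypothetical three-document corpus.
docs = [
    "antibiotics treat bacterial infections",
    "the film stars an actor and the actor wins an award",
    "bacterial resistance to antibiotics is rising",
]
scores = bm25_scores("bacterial antibiotics", docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

In this toy corpus, the first and third documents contain both query terms, and the shorter one receives the higher score because of the length normalization; plain TF-IDF has neither this normalization nor the saturation of repeated terms.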

Considering the AdversarialQA dataset, DPR provided better results than BM25 and TF-IDF for k equal to 10 and 20. In these cases, DPR was more effective at capturing the subject and the context of the questions because dense algorithms tend to perform better over complex datasets.

7 CONCLUSION

In this paper, we proposed a set of design princi-

ples based on business, data, and technical aspects

to support the development of reliable and secure

systems. Based on these principles, we introduced

BigQA, the ﬁrst Big Data Question Answering archi-

tecture. BigQA is a software reference architecture

composed of the following layers: (i) Input, the in-

gestion of documents; (ii) Big Data Storage, the stor-

age and processing of the data; (iii) Big Querying, the

query engine; (iv) Communication, the user interface;

(v) Security, the security artifacts; and (vi) Insights, the data analysis support. The architecture is agnostic, i.e., it is independent of programming language, QA algorithm, and technology.


Table 5: Recall results of the QA algorithms investigated to implement the Document Retriever.

| k | SQuAD BM25 | SQuAD TF-IDF | SQuAD DPR | AdversarialQA BM25 | AdversarialQA TF-IDF | AdversarialQA DPR | DuoRC BM25 | DuoRC TF-IDF | DuoRC DPR |
|---|---|---|---|---|---|---|---|---|---|
| k = 3 | 88.74% | 81.12% | 69.21% | 71.89% | 69.95% | 69.85% | 86.05% | 77.37% | 23.29% |
| k = 10 | 94.43% | 92.01% | 85.72% | 81.35% | 81.81% | 89.17% | 91.49% | 87.47% | 35.83% |
| k = 20 | 96.29% | 95.83% | 91.38% | 84.89% | 85.56% | 99.43% | 93.76% | 90.80% | 44.78% |

We validated BigQA by implementing a case study

in the context of a pharmaceutical company. We used

two real-world datasets, one consisting of Wikipedia

articles and another storing frequently asked ques-

tions about COVID-19. We issued different queries,

demonstrating the potential of BigQA in the develop-

ment of real-world applications. We implemented the

BM25 algorithm as Document Retriever since it pro-

vided the best results according to our evaluation. In

this evaluation, we conducted 27 experiments over

three open-domain datasets to compare the BM25,

TF-IDF, and Dense Passage Retriever algorithms. All code is available on GitHub (https://github.com/leomaurodesenv/big-qa-architecture).

We are currently conducting experiments to assess

the performance of different algorithms to implement

the Document Reader. We also plan to investigate the

Insights and Security layers in terms of technologies

and algorithms available. Another direction for future work is to analyze new case studies that instantiate the proposed architecture for different real-world applications.

ACKNOWLEDGEMENTS

We thank Sinch, São Paulo Research Foundation (FAPESP), Brazilian Federal Research Agency CNPq, and Coordenação de Aperfeiçoamento de Pessoal de Nível Superior, Brazil (CAPES) [Finance Code 001] for supporting this work. C. D. Aguiar has been supported by the grant #2018/22277-8, FAPESP.

REFERENCES

Ataei, P. and Litchﬁeld, A. (2021). Neomycelia: A soft-

ware reference architecture for big data systems. In

Proceedings of 28th Asia-Paciﬁc Software Engineer-

ing Conference, pages 452–462.

Athira, P., Sreeja, M., and Reghuraj, P. (2013). Architec-

ture of an ontology-based domain-speciﬁc natural lan-

guage question answering system. International Jour-

nal of Web & Semantic Technology, 4(4): article num-

ber 31.

BigQA experiments and implementation codes.

https://github.com/leomaurodesenv/big-qa-architecture

Bartolo, M., Roberts, A., Welbl, J., Riedel, S., and Stene-

torp, P. (2020). Beat the AI: Investigating adversarial

human annotation for reading comprehension. Trans-

actions of the Association for Computational Linguis-

tics, 8:662–678.

Cassavia, N. and Masciari, E. (2021). Sigma: a scalable

high performance big data architecture. In Proceed-

ings of 29th Euromicro International Conference on

Parallel, Distributed and Network-Based Processing,

pages 236–239.

Derras, M., Deruelle, L., Michel Douin, J., Lévy, N.,

Losavio, F., Pollet, Y., and Reiner, V. (2018). Ref-

erence architecture design: A practical approach. In

Proceedings of 13th International Conference on Soft-

ware Technologies, pages 633–640.

Galster, M. and Avgeriou, P. (2011). Empirically-grounded

reference architectures: A proposal. In Proceedings

of the Joint ACM SIGSOFT Conference - QoSA and

Architecting Critical Systems, page 153–158.

Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E.,

Bang, Y., Madotto, A., and Fung, P. (2022). Survey

of hallucination in natural language generation. ACM

Computing Surveys.

Karpukhin, V., Oğuz, B., Min, S., Lewis, P., Wu, L.,

Edunov, S., Chen, D., and Yih, W.-t. (2020). Dense

passage retrieval for open-domain question answer-

ing. arXiv preprint arXiv:2004.04906.

Klein, J., Buglak, R., Blockow, D., Wuttke, T., and Cooper,

B. (2016). A reference architecture for big data sys-

tems in the national security domain. In Proceed-

ings of IEEE/ACM 2nd International Workshop on Big

Data Software Engineering, pages 51–57.

Kononenko, O., Baysal, O., Holmes, R., and Godfrey,

M. W. (2014). Mining modern repositories with elas-

ticsearch. In Proceedings of the 11th Working Confer-

ence on Mining Software Repositories, page 328–331.

Laney, D. et al. (2001). 3D data management: Controlling

data volume, velocity and variety. META group re-

search note, 6(70):1.

Lepenioti, K., Bousdekis, A., Apostolou, D., and Mentzas,

G. (2020). Prescriptive analytics: Literature review

and research challenges. International Journal of In-

formation Management, 50:57–70.

Li, Q., Xu, Z., Wei, H., Yu, C., and Wang, S. (2020).

General big data architecture and methodology: An

analysis focused framework. In Proceedings of On

the Move to Meaningful Internet Systems: OTM 2019

Workshops, pages 33–43.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D.,

Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov,


V. (2019). Roberta: A robustly optimized bert pre-

training approach. arXiv preprint arXiv:1907.11692.

Lydia, E. L., Satyanarayan, S., Kumar, K. V., and Ramya,

D. (2020). Indexing documents with reliable indexing

techniques using Apache Lucene in Hadoop. Interna-

tional Journal of Intelligent Enterprise, 7(1-3):203–

214.

Misra, S., Kumar, V., Kumar, U., Fantazy, K., and

Akhter, M. (2012). Agile software development prac-

tices: evolution, principles, and criticisms. Interna-

tional Journal of Quality & Reliability Management,

29(9):972–980.

Möller, T., Reina, A., Jayakumar, R., and Pietsch, M.

(2020). COVID-QA: A question answering dataset

for covid-19. In Proceedings of the 1st Workshop on

NLP for COVID-19 at Association for Computational

Linguistics, page 1.

Müller, M., Vorraber, W., and Slany, W. (2019). Open prin-

ciples in new business models for information sys-

tems. Journal of Open Innovation: Technology, Mar-

ket, and Complexity, 5(6):1–13.

Nielsen, R. D., Masanz, J., Ogren, P., Ward, W., Martin,

J. H., Savova, G., and Palmer, M. (2010). An archi-

tecture for complex clinical question answering. In

Proceedings of the 1st ACM International Health In-

formatics Symposium, page 395–399.

Novo-Loures, M., Pavon, R., Laza, R., Ruano-Ordas, D.,

and Mendez, J. R. (2020). Using natural language

preprocessing architecture (NLPA) for big data text

sources. Hindawi Scientiﬁc Programming, 2020:1–

13, article id 2390941.

Petroni, F., Rocktäschel, T., Riedel, S., Lewis, P., Bakhtin,

A., Wu, Y., and Miller, A. (2019). Language models

as knowledge bases? In Proceedings of the Confer-

ence on Empirical Methods in Natural Language Pro-

cessing and the 9th International Joint Conference on

Natural Language Processing, pages 2463–2473.

Poniszewska-Marańda, A. and Czechowska, E. (2021). Kubernetes cluster for automating software production environment. Sensors Journal, 21(5): article number 1910.

Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P.

(2016). SQuAD: 100,000+ Questions for Ma-

chine Comprehension of Text. arXiv e-prints, page

arXiv:1606.05250.

Robertson, S. E. and Jones, K. S. (1976). Relevance weight-

ing of search terms. Journal of the American Society

for Information Science, 27(3):129–146.

Romualdo, A., Real, L., and Caseli, H. (2021). Measur-

ing brazilian portuguese product titles similarity using

embeddings. In Proceedings of XIII Brazilian Sym-

posium on Information Technology and Human Lan-

guage, pages 121–132. SBC.

Saha, A., Aralikatte, R., Khapra, M. M., and Sankara-

narayanan, K. (2018). Duorc: Towards complex lan-

guage understanding with paraphrased reading com-

prehension. CoRR, abs/1804.07927.

Sammut, C. and Webb, G. I., editors (2010). TF-IDF, pages

986–987. Springer Science & Business Media.

Schaffer, N., Weking, J., and Stähler, O. (2020). Require-

ments and design principles for business model tools.

In Proceedings of Americas Conference on Informa-

tion Systems Proceedings, pages 1–10.

Shvachko, K., Kuang, H., Radia, S., and Chansler, R.

(2010). The Hadoop distributed ﬁle system. In Pro-

ceedings of IEEE 26th symposium on mass storage

systems and technologies, pages 1–10.

Sucunuta, M. E. and Riofrio, G. E. (2010). Architecture of

a question-answering system for a speciﬁc repository

of documents. In Proceedings of 2nd International

Conference on Software Technology and Engineering,

pages V2–12–V2–16.

Yousﬁ, S., Rhanoui, M., and Chiadmi, D. (2021). To-

wards a generic multimodal architecture for batch

and streaming big data integration. arXiv preprint

arXiv:2108.04343.

Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S.,

and Stoica, I. (2010). Spark: Cluster computing with

working sets. In Proceedings of 2nd USENIX Work-

shop on Hot Topics in Cloud Computing, pages 1–7.

Zhang, G., Jiang, T., Bie, R., Liu, X., Wang, Z., and Rao, J.

(2013). The architecture of ProMe instant question an-

swering system. In Proceedings of International Con-

ference on Cyber-Enabled Distributed Computing and

Knowledge Discovery, pages 237–242.

Zhu, J. Y., Tang, B., and Li, V. O. (2019). A ﬁve-layer ar-

chitecture for big data processing and analytics. Inter-

national Journal of Big Data Intelligence, 6(1):38–49.
