FAQ-Based Question Answering Systems with Query-Question and Query-Answer Similarity

Vijay Kumari¹, Miloni Mittal¹, Yashvardhan Sharma¹ and Lavika Goel²
¹Birla Institute of Technology and Science, Pilani, Rajasthan, India
²Malaviya National Institute of Technology, Jaipur, Rajasthan, India

Keywords: FAQ Retrieval, Question-Answering System, Information Retrieval.
Abstract: A Frequently Asked Question (FAQ) answering system maximizes knowledge access by letting users pose a natural language question against an FAQ database. Retrieving FAQs is challenging because of the linguistic gap between a user query and a question-answer pair. This work explores methods to bridge this linguistic gap in the FAQ retrieval component of a question answering system. The task is to retrieve frequently asked question-answer pairs (FAQ pairs) from the database that are related to the user's query, thus providing answers to the user. We do so by leveraging natural language processing models such as BERT and SBERT and ranking functions such as BM25. The best results are obtained when BERT is trained in a triplet fashion (question, paraphrase, non-matching question) and combined with a BM25 model that compares the query with the concatenation of the FAQ question and answer.
1 INTRODUCTION
Building a Question Answering (QA) system is an important problem in the field of Natural Language Processing (NLP). It involves extracting the most relevant information from an abundance of content that may be relevant, merely useful, or irrelevant to a query. Given this information overload in both quantity and variety, finding the relevant answer to the posed query is of utmost importance. A question-answering system can be built for two domains: 1) Open domain: there is no boundary on the content category from which the answer is extracted; examples are the Google and Yahoo search engines. 2) Closed domain: there is a clear boundary on what queries this type of system can answer. These systems restrict themselves to a particular category of questions, and sometimes answers, by referring to a document; examples include QA services on business websites and solution providers for school textbooks. The speed with which answers to a user query are derived plays a major role in such systems, which demands methods that extract relevant answers as quickly as possible. One way to do this is to look into the database of already asked and answered questions, or FAQs, and provide the most relevant answers from that list.
The Frequently Asked Questions (FAQ) database is made up of manually generated question-answer pairs. Large-scale service providers tend to use FAQ lists to present easily accessible information to their clients. These collections focus on the closed domain of interest of the QA system. With a fast and efficient FAQ retrieval model, users can browse FAQ databases on their own. A FAQ retrieval system offers a natural language interface for querying the FAQ list. The model generates a series of question-answer pairs ranked by their relevance to a user's query. A method for answering frequently asked questions (FAQs) gives users quick access to internal FAQ databases, which improves service quality and efficiency (Karan and Šnajder, 2018).
Many questions have typically already been answered in the FAQ section of a question-answering system. However, the vastness of such a database can make it difficult for a user to locate the relevant entry. Thus, answers to a user query posed to a
closed domain QAS can be derived using the exist-
ing FAQ database. By checking if a similar question
exists in the database, a faster and more efficient QAS
can be built that does not need to access the original
database for every query. In this work, we retrieve
the FAQ pairs from the database that are pertinent to
the user’s query, thus providing answers to the user.
Initially, this problem statement was solved only by
referring to the similarity between the FAQ question
(Q henceforth) and the user query (q henceforth). More recently, researchers have leaned towards more robust modeling that also involves the FAQ answers (referred to as A henceforth) in the similarity comparison (Mass et al., 2020). This tends to give better results because each comparison (q-Q or q-A) can compensate for lexical gaps in the other.
Labeled data is essential for training a model to predict the relationship between user queries and FAQ questions. A dataset of this kind is often created manually or collected through query logs (Mass et al., 2020). The FAQIR dataset, in particular, contains question-answer pairs without any labeling against queries; instead, relevance judgments are provided for each query.
The goal of the proposed system is to train different models for the task of retrieving frequently asked questions (FAQs) and to assess their performance in order to identify the most accurate one. This comprehensive approach aims to enhance the effectiveness and precision of the system in handling user queries and retrieving relevant FAQs. The significant contributions of the paper are as follows:
1. We developed a FAQ retrieval model and experimented with various ranking techniques (weighted measures, re-ranking after initial retrieval) to rank the top FAQ pairs. We also explored and implemented various techniques for generating embeddings in order to compute query-question and query-answer similarity.
2. We trained and evaluated various models in the context of the FAQ retrieval task to identify the most accurate model.
3. We built a website using HTML, CSS, JavaScript, and Flask and integrated our final model (BM25 q(Q+A) + BERT qQ training) with it. The resulting end-to-end website returns the top answer from the FAQIR dataset along with five similar FAQ pairs from the same category.
The rest of the paper is organized as follows: Section 2 reviews work related to FAQ question answering systems, Section 3 describes the datasets used, Section 4 presents the methodology, Section 5 reports the experiments and results, and Section 6 concludes the paper.
2 RELATED WORK
FAQ models simply need to extract the FAQ pairs in-
stead of the complete context-specific answer. These
FAQ pairs are made up of a question and an answer.
The correspondence between the query and the FAQ
pairs is determined by comparing the query to either
the questions, answers, or the concatenation of both.
The appropriate class label must be present for su-
pervised learning to rank the FAQ pairs. Recent ap-
proaches shown in Table 1 utilize both supervised and
unsupervised techniques for the FAQ retrieval task.
Unsupervised methods can be more practical as they require no labeling of the data. (Sakata et al., 2019) proposes a supervised technique for FAQ retrieval that leverages the TSUBAKI model for q-Q similarity and BERT for q-A matching. A novel technique that generates question paraphrases compensates for the lack of query-question matching training data (Mass et al., 2020). For re-ranking, it uses Elasticsearch and passage re-ranking, and finally ranks on the basis of query-answer and query-question similarity. This model uses BERT for training query-question and query-answer similarity.
(Piwowarski et al., 2019) uses an attention mech-
anism for FAQ Retrieval. It compares various aggre-
gation methods to effectively represent query, ques-
tion, and answer information. It is observed that at-
tention mechanisms are consistently the most effec-
tive way to aggregate the inputs for ranking. Atten-
tive matching in FAQ retrieval eliminates the need
for feature engineering to effectively combine query-
answer and query-question embeddings. (Jeon et al.,
2005) assumed that if answers demonstrate seman-
tic resemblance, their associated questions will also
possess a comparable level of similarity. The author
employed different similarity metrics, including co-
sine similarity with TF-IDF weights, LM-Score, and a
symmetric version of the LM-Score. LM-Score mea-
sures semantic similarity by converting answers into
queries and using query likelihood language model-
ing for retrieval. However, its resulting scores are not
symmetric. The measure gauges the semantic simi-
larity between answers, with higher scores indicating
stronger semantic connections. To address the non-symmetry issue, a modification known as the Symmetric LM-Score is introduced, which employs a harmonic mean of ranks for a balanced assessment in question-answering systems. It uses ranks instead of scores: the similarity between answers A and B is determined by the reverse harmonic mean of their respective ranks.
3 DATASETS USED
3.1 FAQIR Dataset
We used the FAQIR dataset (Karan and Šnajder, 2016) for evaluation, which is derived from the “maintenance & repair” domain of the Yahoo! Answers community website.
Table 1: Existing state-of-the-art FAQ retrieval models.

1. (Mass et al., 2020). Dataset: FAQIR.
Method: Two BERT models are trained for query-answer and query-question matching for unsupervised FAQ retrieval and then combined with the BM25 similarity measure for effective re-ranking according to user queries.
Findings: The unsupervised model competes with or outperforms existing supervised approaches. However, results on the entire dataset are omitted; alternative models for query-answer and query-question similarity offer opportunities for capturing semantic detail.

2. (Gupta and Carvalho, 2019). Datasets: SemEval-CQA Task 3, Tax-Domain QA.
Method: Compares three aggregation techniques, Deep Matching Network (DMN), Multihop Attention Network (MAN), and Symmetrical Bilateral Multi-Perspective Matching, ranked using cosine similarity.
Findings: Using an attention technique during the aggregation step improves performance.

3. (Sakata et al., 2019). Datasets: LocalgovFAQ set in Japanese, StackExchange.
Method: The unsupervised technique TSUBAKI scores query-question similarity and BERT scores query-answer relevance; the proposed method picks the top 10 BERT question-answer pairings, followed by questions with the highest TSUBAKI score based on the Okapi BM25 similarity measure.
Findings: Initial retrieval can be combined with both the query-question and query-answer matching scores.

4. (Jeon et al., 2005). Dataset: 5,200 question-answer pairs from NHN Corp.
Method: Assumes that semantic similarity among answers indicates similar questions and employs three metrics: cosine similarity with TF-IDF weights, the (non-symmetric) LM-Score based on query conversion, and a symmetric variation of the LM-Score.
Findings: The approach recovers comparable question pairs through answer matching across varied collections, using the similarity metrics for clustering, with potential applications in automating FAQ development and improving question and answer retrieval performance.
The dataset contains 4313
FAQ pairs and 1233 queries with corresponding man-
ually annotated relevance judgments. The judgments
are described as 1- relevant, 2- useful, 3- useless, and
4- irrelevant. Each query has at least one FAQ pair
annotated as “relevant”. However, it is possible for a
FAQ-pair to be irrelevant for all queries. We utilized
the FAQIR dataset for the FAQ task as it also has an-
swers to the questions.
3.2 Quora Question Pairs (QQP)
Each logically distinct question is represented by
a single question page in the dataset, enhancing
knowledge-sharing efficiency. For instance, queries
such as ”What are the symptoms of influenza?” and
”How can I identify if I have the flu?” should not be
present as separate entities on Quora, as they share
identical intent. The purpose of the dataset is to support the study of whether pairs of question texts are semantically equivalent. It facilitates the development and evaluation of semantic equivalence models using authentic Quora data,
which holds 363,871 question pairs. Each line in-
cludes a binary value (0/1) indicating if the line con-
tains a duplicate pair, the complete text of each ques-
tion, and the IDs for each question in the pair (Chen
et al., 2017). The reason for not utilizing this dataset
is that it doesn’t contain data on the answer aspect of
the FAQ; it only contains question pairs.
3.3 Other Datasets
There are other datasets, such as StackFAQ and
COUGH Dataset (Zhang et al., 2020), which provide
FAQ questions and answers along with the queries.
StackFAQ holds some ambiguity with respect to what
is to be treated as a FAQ pair and what is to be treated
as a query. The COUGH dataset is a multilingual
dataset and can be explored for the multilingual task.
4 METHODOLOGY
This section will elaborate on the model adopted for
the FAQ retrieval system, followed by the details re-
garding the web interface.
The FAQ section of a QA system typically con-
tains answers to many commonly asked questions.
An FAQ comprises sets of questions and their corre-
sponding answers. The FAQ retrieval task entails the
process of ranking question-answer pairs based on a
provided user query. The developed model provides
responses to user queries within a closed domain
question-answering system (QAS) by leveraging the
information stored in the pre-existing FAQ database,
Figure 1: Flowchart for the Developed Question-Answering
Model.
achieved through the process of verifying whether a
similar question is present within the database. Upon
receiving a user query, the model first utilizes a support vector machine (SVM) classifier (Cortes and Vapnik, 1995) to determine the question's context, checking whether it aligns with the domain. If the query relates to the domain, the model conducts a search
in the FAQ database. Conversely, if the query is
not domain-related, the model provides relevant links
from other systems based on the question’s context.
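For illustration, a domain check of this kind could look as follows (a sketch with scikit-learn; the toy training data, labels, and helper name are assumptions, as the paper does not detail the classifier setup):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy in-domain (maintenance & repair) vs out-of-domain examples, for illustration only.
texts = ["how do i fix a leaking faucet", "how to patch drywall",
         "who won the football match", "best pasta recipe"]
labels = [1, 1, 0, 0]   # 1 = in domain, 0 = out of domain

domain_clf = make_pipeline(TfidfVectorizer(), LinearSVC())
domain_clf.fit(texts, labels)


def in_domain(query):
    # True if the SVM predicts the query belongs to the FAQ domain.
    return bool(domain_clf.predict([query])[0])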
The question-answering model, as depicted in Fig-
ure 1, initially conducts a search for the user’s query
within the FAQ database. In the event that the query
is not located in the database, the model proceeds to
derive the answer from the context (paragraph). Fol-
lowing the extraction of the answer from the con-
text, the question and answer pairs are subsequently
stored within the FAQ database. We employed the
transformer-based Bidirectional Encoder Representa-
tions from Transformers (BERT) model (Devlin et al.,
2018) to extract answers from the context. We con-
ducted experiments with various FAQ retrieval tech-
niques to optimize the provision of answers to user
queries. The completed model suggests the most per-
tinent list of FAQ pairs to the user, along with the re-
sponse to their query. The details pertaining to the
techniques employed and the training of FAQ models
are presented as follows:
4.1 Data Preprocessing
Data preprocessing is an important step for preparing the dataset. The major sub-tasks we performed during data preprocessing include:
1. Lowercasing all FAQ pairs and queries
2. Removing punctuation
3. Removing stopwords
4. Removing question numbers
The Sentence-BERT (SBERT) and DistilBERT models used preprocessing steps 1, 2, and 4. The BM25 models used steps 1, 2, 3, and 4. The BERT models used steps 1 and 2; a minimal sketch of this pipeline is given below.
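The following is a hedged sketch of such a preprocessing function (an assumed implementation, not the authors' exact code; it uses NLTK's English stopword list, and the leading-question-number pattern is a guess about the FAQ formatting):

import re
import string

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))


def preprocess(text, remove_stopwords=False):
    text = text.lower()                                   # step 1: lowercase
    text = re.sub(r"^\s*q?\d+[.)]\s*", "", text)          # step 4: drop a leading question number (assumed format)
    text = text.translate(str.maketrans("", "", string.punctuation))  # step 2: strip punctuation
    if remove_stopwords:                                  # step 3: applied only for the BM25 corpora
        text = " ".join(w for w in text.split() if w not in STOPWORDS)
    return text


print(preprocess("3. How do I fix a leaking faucet?", remove_stopwords=True))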
4.1.1 Building the Training Data for the
Query-Answer (q-A) Model
The original FAQIR dataset is composed of question-
answer pairs sourced from the FAQ database. Addi-
tionally, it incorporates queries, each accompanied by
a list of questions from the FAQ database that corre-
sponds to the query. The final q-A model should be able to give a similarity score indicating whether the given answer matches the query. To build it, we need to
fine-tune the model with a dataset that contains:
(Q, A) the matching FAQ pair
(Q, A’) an FAQ question with a non-matching an-
swer
This is done by randomly selecting A' for every question. The (Q, A') pairs were labeled as 0, whereas the (Q, A) pairs were labeled as 1, as in Figure 2. A sketch of this construction is given after the figure.
Figure 2: A sample of the dataset used for training the q-A
model.
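The pair construction might look as follows (a minimal sketch under the assumptions above; the q-Q training set of the next section is built analogously, pairing each question with a paraphrase for label 1 and a randomly chosen non-matching question for label 0):

import random


def build_qa_training_pairs(faq_pairs, seed=0):
    # faq_pairs: list of (question, answer) tuples from the FAQ database.
    rng = random.Random(seed)
    answers = [a for _, a in faq_pairs]
    examples = []
    for q, a in faq_pairs:
        examples.append((q, a, 1))             # (Q, A): matching pair, label 1
        wrong = rng.choice(answers)
        while wrong == a:                      # ensure A' is a non-matching answer
            wrong = rng.choice(answers)
        examples.append((q, wrong, 0))         # (Q, A'): non-matching pair, label 0
    return examples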
4.1.2 Building the Training Dataset for the
Query-Question (q-Q) Model
The model built with this dataset should be able to
give a similarity score of whether an FAQ question
matches the query. The training dataset was derived
from the model proposed in (Mass et al., 2020). La-
bel 1 was assigned to the corresponding question-
paraphrase pairs. The label 0 was assigned to the
question pairs that did not match. The second half of
the dataset was built by random selection of a ques-
tion from the FAQ database.
4.2 FAQ Models
4.2.1 BERTScore
BERTScore was introduced as a metric for evaluating
language generation (Zhang et al., 2019). The central
idea is that, given a candidate sentence and a reference sentence, contextual embeddings are used to represent their tokens and token-level cosine similarities are computed. Greedy matching then pairs every token in one sentence with the most similar token in the other, maximizing the aggregated matching score. We used BERTScore to compute the similarity
score of a given input query with the FAQ pairs pro-
vided in the dataset for FAQ retrieval. The intuition
here is that a relevant FAQ pair to a query will con-
tain words similar to those in the query, thus yielding
a higher similarity score. Using this method, we can
retrieve FAQ pairs in a completely unsupervised and
scalable manner.
We compute similarity scores as shown in equa-
tion 1 for each input query q with every FAQ pair p
provided in the dataset. Here, p represents the contex-
tual embedding of the concatenation of both the ques-
tion and answer for the given FAQ pair, and q rep-
resents the contextual embedding of the input query
generated by the BERT model.
\mathrm{similarity\_score} = \cos(\mathbf{q}_i, \mathbf{p}_j) \quad (1)
The bert-score library provides recall, precision, and F1 scores for each query and FAQ pair. We then take the top-k FAQ pairs ranked by F1 score as the retrieved FAQ pairs and compute the evaluation metrics (P@5, MAP, MRR).
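As a rough illustration (assumed usage of the bert-score package, not the authors' exact script; the FAQ texts are the concatenated question-answer strings):

from bert_score import score


def rank_faq_pairs_bertscore(query, faq_texts, top_k=5):
    # faq_texts: list of "question + answer" strings, one per FAQ pair.
    candidates = faq_texts
    references = [query] * len(faq_texts)                 # compare the query against every FAQ pair
    P, R, F1 = score(candidates, references, lang="en")   # token-level greedy matching over BERT embeddings
    ranked = sorted(zip(faq_texts, F1.tolist()), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]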
4.2.2 SBERT
Sentence-BERT (Reimers and Gurevych, 2019) is an
adapted version of the pre-trained BERT model. It
employs siamese and triplet network structures to
generate semantically meaningful sentence embed-
dings, allowing for comparison using cosine similar-
ity. We trained SBERT for query-answer (qA) comparison in two ways: 1) with a 1:1 ratio of matching (A) to non-matching (A') answers in the training set, and 2) with a 1:5 ratio. We tried two variants of SBERT for query-question (qQ) comparison: the SBERT encoding is either used directly to obtain the similarity score between the query and the FAQ questions, or a fine-tuned SBERT is built by training the SBERT model on the dataset described in the previous section. For the query-answer (qA) comparison, a bert-base-uncased model is used.
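A minimal sketch of the SBERT q-Q scoring (assuming the sentence-transformers library; the checkpoint name is an illustrative placeholder, and the fine-tuned variant would instead load the model trained on the paraphrase data):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed off-the-shelf checkpoint for illustration


def sbert_qq_scores(query, faq_questions):
    q_emb = model.encode(query, convert_to_tensor=True)
    f_emb = model.encode(faq_questions, convert_to_tensor=True)
    return util.cos_sim(q_emb, f_emb)[0]          # cosine similarity of the query to every FAQ question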
4.2.3 DistilBERT
DistilBERT is a rapid, cost-effective, and lightweight
transformer-based model derived from the BERT ar-
chitecture. The model is obtained using knowledge
distillation during the pre-training stage to decrease
its size. The model has fewer parameters than the original BERT model but preserves over 95% of BERT's performance (Sanh et al., 2019).
For the query-answer (qA) model, distilbert-base-uncased is used, and for the query-question (q-Q) comparison model, distilbert-base-uncased is likewise used to obtain the sentence embeddings.
4.2.4 BM25 qQ
The corpus is built using the pre-processing methods
applied to the FAQ questions. Then, the BM25 model
is applied using the rank-bm25 library. The top 100
results are retrieved, and the performance metrics are
calculated for them (Robertson et al., 2009).
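A hedged sketch of this step with the rank-bm25 library (tokenization here is plain whitespace splitting of the preprocessed text; the helper name is illustrative and is reused for the q(Q+A) variant below by passing the tokenized Q+A concatenations instead):

from rank_bm25 import BM25Okapi


def bm25_top_k(query_tokens, corpus_tokens, k=100):
    # corpus_tokens: list of token lists, one per FAQ question (or per Q+A concatenation).
    bm25 = BM25Okapi(corpus_tokens)
    scores = bm25.get_scores(query_tokens)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]                             # indices of the top-k entries in the corpus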
4.2.5 BM25 q(Q+A)
Each corresponding FAQ question and answer is con-
catenated and is represented as Q+A. The corpus is
built using the pre-processing on the FAQ Q+A. Then,
the BM25 model is applied using the rank-bm25 li-
brary. The top 100 results are retrieved, and the per-
formance metrics are calculated for them.
4.2.6 BM25 q(Q+A) + BERT qA
The BERT model is trained on triplets (question,
corresponding answer, non-corresponding answer)
to understand the intricacies of matching and non-
matching answers. The learning rate is 2e-5, and the
number of epochs is 3.
Figure 3: Training example of BERT qA model.
The top 100 FAQ pairs are picked using BM25
Q+A. The encoding of the answers of these 100 FAQ
pairs is found and compared with the query encoding
using cosine similarity. The FAQ pairs are re-ranked
based on these cosine similarity scores. The relevance
of the retrieved FAQ pairs is cross-checked with the
relevance score in the dataset, and the performance
metrics are calculated accordingly. An example for
training BERT qA is given in Figure 3.
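One way to realize this triplet training is sketched below (an assumption: the paper specifies only the triplet structure, the 2e-5 learning rate, and 3 epochs, so the sentence-transformers training loop shown here is illustrative; the qQ model of the next section is trained the same way on (question, paraphrase, non-matching question) triplets):

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Toy triplet: (FAQ question, matching answer, randomly chosen non-matching answer).
triplets = [("how do i unclog a drain",
             "use a plunger or a drain snake",
             "replace the car battery every few years")]

model = SentenceTransformer("bert-base-uncased")   # BERT backbone; the library adds mean pooling

train_examples = [InputExample(texts=[q, pos, neg]) for q, pos, neg in triplets]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)

model.fit(
    train_objectives=[(train_loader, losses.TripletLoss(model))],
    epochs=3,
    optimizer_params={"lr": 2e-5},
)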
Figure 4: Training example of BERT qQ model.
4.2.7 BM25 q(Q+A) + BERT qQ
The BERT (Devlin et al., 2018) model is trained
on triplets (question, paraphrase, non-matching ques-
tion) to understand the intricacies of matching and
non-matching questions. The learning rate is 2e-5,
and the number of epochs is 3. The top 100 FAQ pairs
are picked using BM25 Q+A. The encoding of the
questions of these 100 FAQ pairs is found and com-
pared with the query encoding using cosine similarity.
The FAQ pairs are re-ranked based on these cosine
similarity scores. The relevance of the retrieved FAQ
pairs is cross-checked with the relevance score in the
dataset, and the performance metrics are calculated ac-
cordingly. An example for training BERT qQ is given
in Figure 4.
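Putting the pieces together, the final pipeline can be sketched as follows (assumed glue code; bm25_top_k, preprocess, and model are the illustrative helpers introduced above, and the corpus tokens come from the preprocessed Q+A concatenations):

from sentence_transformers import util


def retrieve(query, faq_pairs, corpus_tokens, k=100, top_n=5):
    # faq_pairs: list of (question, answer); corpus_tokens: tokenized Q+A strings.
    # Stage 1: lexical retrieval of the top-k FAQ pairs with BM25 over the Q+A concatenation.
    candidates = bm25_top_k(preprocess(query, remove_stopwords=True).split(), corpus_tokens, k=k)
    # Stage 2: semantic re-ranking by cosine similarity between the query and the FAQ questions.
    q_emb = model.encode(query, convert_to_tensor=True)
    c_emb = model.encode([faq_pairs[i][0] for i in candidates], convert_to_tensor=True)
    sims = util.cos_sim(q_emb, c_emb)[0]
    order = sims.argsort(descending=True)
    return [faq_pairs[candidates[i]] for i in order[:top_n].tolist()]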
5 EXPERIMENTS AND RESULTS
The experiments have been executed employing
methods for FAQ retrieval, and subsequent compar-
isons of results have been conducted. Various models
and ranking techniques were explored. The following
performance metrics have been used for the retrieval:
1. Mean Precision at 5 (P@5) measures the number of relevant documents within the first five retrieved documents. The more relevant documents appear in the top 5, the better the information retrieval system.
2. Mean Average Precision (MAP) measures whether all of the relevant documents are ranked highly. It is needed because a relevant document that is retrieved but appears low in the list is of little use to the user.
3. Mean Reciprocal Rank (MRR) measures the position at which the first relevant document occurs within the retrieved documents. Simple reference implementations of these metrics are sketched below.
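For concreteness, hedged per-query implementations of the three metrics (assuming, for each query, a ranked list of retrieved FAQ ids and the set of ids judged relevant; the reported numbers are the means over all queries):

def precision_at_5(ranked, relevant):
    return sum(1 for d in ranked[:5] if d in relevant) / 5.0


def average_precision(ranked, relevant):
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / i                     # precision at each relevant hit
    return total / max(len(relevant), 1)


def reciprocal_rank(ranked, relevant):
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / i                        # rank of the first relevant document
    return 0.0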
5.1 Evaluation on FAQIR Dataset
We compared the results with the original full dataset
and the filtered dataset. In our studies, training the
model on the filtered dataset resulted in greater ac-
curacy than training on the complete dataset. Initial
retrieval for the filtered FAQIR dataset is performed
on a subset of 789 FAQ pairings pertinent to at least
one user query (Mass et al., 2020). The metrics ob-
tained for different methods on the filtered dataset
have been displayed in Table 2. Using BERTScore, we retrieved the FAQ pairs in a completely unsupervised way with comparable accuracy. The model
with BM25 q(Q+A)+BERT qQ gives better results
than others. While the filtered dataset led to higher
accuracy, it is essential to acknowledge potential lim-
itations, such as reduced diversity and the risk of in-
troducing biases. Utilizing the complete dataset, we
intend to capture the full range of variances and com-
plexity included in the data, allowing for a more thor-
ough study. The metrics obtained for different meth-
ods on the full dataset have been displayed in Table 3. An example of the top 5 retrieved answers with
relevance scores is shown in Figure 6.
We observed that the BM25 q(Q+A) + BERT qQ
model gives the best results among all other models.
This is because the BM25 model focuses on lexical overlap with the corpus, while the BERT model focuses on semantic meaning; hence, they complement each other.
BM25 q(Q+A) works better than BM25 qQ because
concatenating the answer with the question provides
more scope for matching lexicons. Words present
in the query may be absent in the FAQ question but
present in the FAQ answer. The BM25 q(Q+A) +
BERT qA model does not work well in comparison
to the BM25 q(Q+A) + BERT qQ model. This is be-
cause the semantics of a query and an answer are usually very different, and hence BERT qA does not perform as well. It is also observed that adding the BERT qA model to the BM25 q(Q+A) model worsens the performance.
5.2 Evaluation on QQP Dataset
We rigorously evaluated the top-performing models
on the QQP dataset, adjusting both the number of
examples and the threshold for determining question
similarity based on the cosine similarity score. Our
evaluation methodology involved varying the number of examples and introducing thresholds on the cosine similarity score to decide question similarity. For instance, with a threshold set
at 0.8, we selectively retrieved examples with scores
surpassing this threshold, calculating accuracy based
Table 2: Performance Comparison on FAQIR Dataset (on Filtered Queries).
Model Ranking Method P@5 MAP MRR
BM25 q(Q+A) (top qA, sort qQ) 0.48 0.44 0.74
BM25 q(Q + A) + BERT qQ (All pairs ) 0.423 0.515 0.681
BM25 q(Q + A) + BERT qQ (Only {1, 2} pairs ) 0.499 0.751 0.788
BertScore (all pairs) 0.215 0.552 0.554
Table 3: Performance Comparison on FAQIR Dataset (on the full dataset).
Model Ranking Method P@5 MAP MRR
1:1 qA training SBERT (top qA, sort qQ) 0.14 0.32 0.33
1:5 qA training SBERT (top qA, sort qQ) 0.19 0.35 0.37
1:5 qA + qQ training DistilBERT (0.2*qA score+0.8*qQ score) 0.18 0.30 0.40
BM25 qQ training (top 100 FAQ Q) 0.30 0.38 0.57
BM25 q(Q+A) training (top 100 FAQ Q+A) 0.34 0.39 0.60
BM25 q(Q+A) + BERT qA training (BM25 top 100 + rerank qA) 0.27 0.32 0.52
BM25 q(Q+A) + BERT qQ training (BM25 top 100 + rerank qQ) 0.42 0.51 0.69
on this subset of retrieved examples. This meticu-
lous approach allowed us to comprehensively assess
the models’ performance under different conditions,
providing valuable insights into their effectiveness in
capturing question similarity nuances. The accuracy
metrics for the experiments are shown in Table 4.
Table 4: Performance comparison on QQP dataset.
Threshold 0.8 0.85 0.9 0.95 Example count
BERT 61.6 62.8 63.3 62.4 1000
SBERT 69.3 70.02 71.6 69.5
BERT 63.28 64.12 63.98 62.62 10,000
SBERT 70.33 72.82 72.61 69.60
BERT 65.38 67.12 69.98 65.62 100,000
SBERT 73.33 74.82 75.61 74.60
We noticed that SBERT consistently outperforms
BERT over all the different numbers of training sam-
ples. The accuracy peak occurs at a threshold value of
0.9 for 1000 examples, 0.85 for 10,000 examples, and
0.9 for 100,000 examples. The maximum accuracy
achieved is 75.61 for SBERT on 100,000 examples of
the QQP dataset.
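One reading of this thresholding procedure is sketched below (an assumption, since the paper does not give the exact evaluation code; the SBERT checkpoint and helper name are illustrative placeholders):

from sentence_transformers import SentenceTransformer, util

sbert = SentenceTransformer("all-MiniLM-L6-v2")    # illustrative checkpoint


def accuracy_above_threshold(pairs, labels, threshold=0.9):
    # pairs: list of (question1, question2); labels: 1 if duplicate, 0 otherwise.
    q1 = sbert.encode([p[0] for p in pairs], convert_to_tensor=True)
    q2 = sbert.encode([p[1] for p in pairs], convert_to_tensor=True)
    sims = util.cos_sim(q1, q2).diagonal().tolist()
    selected = [lab for s, lab in zip(sims, labels) if s >= threshold]   # keep pairs scored above the threshold
    return sum(selected) / len(selected) if selected else 0.0            # fraction that are true duplicates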
5.3 Web Interface
The web interface is built using HTML, CSS,
JavaScript, and Flask. The BM25 q(Q+A) + BERT
qQ model is implemented at the front end. The front
end consists of a home page (Figure 5 ), where the
user enters the query to get the responses. The result
has two aspects: an answer and a ”People also asked”
section, which lists five pairs of FAQs that are similar
to the query. Users can directly access these recom-
mended question-answer pairs according to the query.
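A minimal sketch of how such an endpoint could be wired up with Flask (the route, template name, and preloaded globals are assumptions for illustration; the actual site also serves the HTML/CSS/JavaScript front end):

from flask import Flask, render_template, request

app = Flask(__name__)
# faq_pairs and corpus_tokens are assumed to be loaded at startup from the FAQIR data.


@app.route("/", methods=["GET", "POST"])
def home():
    if request.method == "POST":
        query = request.form["query"]
        results = retrieve(query, faq_pairs, corpus_tokens, top_n=5)  # BM25 q(Q+A) + BERT qQ pipeline sketched earlier
        answer = results[0][1]                    # answer of the top-ranked FAQ pair
        return render_template("index.html", answer=answer, people_also_asked=results)
    return render_template("index.html")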
Figure 5: Web Interface.
6 CONCLUSION
FAQ retrieval is a common natural language process-
ing task where the goal is to match a user’s query to
a relevant, frequently asked question (FAQ) from a
dataset. We trained and evaluated different unsuper-
vised techniques for the task of FAQ retrieval using
the FAQIR dataset. Starting from the vanilla SBERT model with an MRR of 0.33, we arrive at the final BM25 q(Q+A) + BERT qQ model with P@5, MAP, and MRR of 0.42, 0.51, and 0.69, respectively, on the full FAQIR dataset. On the QQP question-matching task, SBERT consistently outperforms BERT across all training-sample sizes. Our final model is a fusion of triplet-
trained BERT and BM25 ranking functions. The
current model limitation involves reliance on con-
textual or external sources for unanswerable ques-
tions, prompting a need for innovative solutions. Fu-
ture work may extend model evaluations to diverse
datasets to comprehensively assess generalizability.
Figure 6: List of some generated answers by the FAQ model.
ACKNOWLEDGMENT
The authors extend sincere gratitude to the
Department of Science and Technology
(DST/ICPS/CLUSTER/DataScience/2018/Proposal
16:(T-856)) for financial support at CSIS, BITS
Pilani, India.
REFERENCES
Chen, Z., Zhang, H., Zhang, X., and Zhao, L. (2017). Quora
question pairs.
Cortes, C. and Vapnik, V. (1995). Support-vector networks.
Machine learning, 20:273–297.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2018). Bert: Pre-training of deep bidirectional trans-
formers for language understanding. arXiv preprint
arXiv:1810.04805.
Gupta, S. and Carvalho, V. R. (2019). Faq retrieval using
attentive matching. In Proceedings of the 42nd Inter-
national ACM SIGIR Conference on Research and De-
velopment in Information Retrieval, pages 929–932.
Jeon, J., Croft, W. B., and Lee, J. H. (2005). Finding se-
mantically similar questions based on their answers.
In Proceedings of the 28th annual international ACM
SIGIR conference on Research and development in in-
formation retrieval, pages 617–618.
Karan, M. and Šnajder, J. (2016). FAQIR - a frequently asked questions retrieval test collection. In Text, Speech, and Dialogue: 19th International Conference, TSD 2016, Brno, Czech Republic, September 12-16, 2016, Proceedings 19, pages 74–81. Springer.
Karan, M. and Šnajder, J. (2018). Paraphrase-focused learning to rank for domain-specific frequently asked questions retrieval. Expert Systems with Applications, 91:418–433.
Mass, Y., Carmeli, B., Roitman, H., and Konopnicki, D.
(2020). Unsupervised faq retrieval with question gen-
eration and bert. In Proceedings of the 58th Annual
Meeting of the Association for Computational Lin-
guistics, pages 807–812.
Piwowarski, B., Chevalier, M., and Gaussier, É. (2019). SIGIR'19: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM: Association for Computing Machinery.
Reimers, N. and Gurevych, I. (2019). Sentence-bert: Sen-
tence embeddings using siamese bert-networks. arXiv
preprint arXiv:1908.10084.
Robertson, S., Zaragoza, H., et al. (2009). The probabilis-
tic relevance framework: Bm25 and beyond. Founda-
tions and Trends® in Information Retrieval, 3(4):333–
389.
Sakata, W., Shibata, T., Tanaka, R., and Kurohashi, S.
(2019). Faq retrieval using query-question similar-
ity and bert-based query-answer relevance. In Pro-
ceedings of the 42nd International ACM SIGIR Con-
ference on Research and Development in Information
Retrieval, pages 1113–1116.
Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2019).
Distilbert, a distilled version of bert: smaller, faster,
cheaper and lighter. arXiv preprint arXiv:1910.01108.
Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., and
Artzi, Y. (2019). Bertscore: Evaluating text genera-
tion with bert. arXiv preprint arXiv:1904.09675.
Zhang, X. F., Sun, H., Yue, X., Lin, S., and Sun, H. (2020).
Cough: A challenge dataset and models for covid-19
faq retrieval. arXiv preprint arXiv:2010.12800.