MRNN: A Multi-Resolution Neural Network with Duplex Attention for

Deep Ad-Hoc Retrieval

Tolgahan Cakaloglu

1,2

, Xiaowei Xu

and Roshith Raghavan

Walmart Labs, Dallas, Texas, U.S.A.

University of Arkansas, Little Rock, Arkansas, U.S.A.

CVS Health, Boston, Massachussets, U.S.A.

Keywords:

Deep Learning, Ad-Hoc Retrieval, Learning Representations, Ranking, Text Matching.

Abstract:

The primary goal of ad-hoc retrieval is to ﬁnd relevant documents satisfying the information need posted by

a natural language query. It requires a good understanding of the query and all the documents in a corpus,

which is difﬁcult because the meaning of natural language texts depends on the context, syntax, and semantics.

Recently deep neural networks have been used to rank search results in response to a query. In this paper, we

devise a multi-resolution neural network (MRNN) to leverage the whole hierarchy of representations for ad-hoc

retrieval. The proposed MRNN model is capable of deriving a representation that integrates representations

of different levels of abstraction from all the layers of the learned hierarchical representation. Moreover, a

duplex attention component is designed to reﬁne the multi-resolution representation so that an optimal context

for matching the query and document can be determined. More speciﬁcally the ﬁrst attention mechanism

determines optimal context from the learned multi-resolution representation for the query and document. The

latter attention mechanism aims to ﬁne-tune the representation so that the query and the relevant document are

closer in proximity. The empirical study shows that MRNN with the duplex attention is signiﬁcantly superior to

existing models used for ad-hoc retrieval on benchmark datasets including SQuAD, WikiQA, QUASAR, and

TrecQA.

1 INTRODUCTION

Ad-hoc retrieval (Voorhees and Harman, 2005) allows

a user to specify the information need using a natu-

ral language query, which is instrumental for many

applications including question answering and infor-

mation retrieval. Traditional approach uses simple sta-

tistical features such as term frequency and document

frequency to represent the query and the document

(Salton and McGill, 1986). The query and documents

are matched by using some similarity measure like co-

sine similarity. However, this approach is less effective

because the representation doesn’t consider the rich

context of texts.

Recently deep neural networks have been used to

rank search results in response to a query for ad-hoc

retrieval (Palangi et al., 2016) (McDonald et al., 2018).

The fundamental idea of deep learning is that a hierar-

chical representation is learned automatically, where

each layer is a representation that is a high-level ab-

straction of the representation from the previous layer.

The most abstract representation from the last layer

of the hierarchy is then used for the machine learning

task. However, an abstract representation from a single

layer or only a few layers may not be able to capture

the semantic relationship of concepts across different

levels of an ontology. A text may contain concepts

from different levels of the ontology. For example, ”a

banana is a fruit” where ”fruit” is a high level concept

comparing to ”banana”.

In this paper, we present a new Multi-Resolution

Neural Network for ad-hoc retrieval, called MRNN,

which leverages representations across all levels of

abstractions in the learned hierarchical representation.

The learned representation achieves a multi-resolution

effect that can represent concepts and their relation-

ships across all the levels. As illustrated in Figure 1 the

proposed model consists of the following components:

•

Multi-Resolution Feature Maps, which transforms

the input query and document into a multi-

resolution representation.

•

Duplex Attention, which implements two attention

mechanisms to reﬁne the multi-resolution repre-

466

Cakaloglu, T., Xu, X. and Raghavan, R.

MRNN: A Multi-Resolution Neural Network with Duplex Attention for Deep Ad-Hoc Retrieval.

DOI: 10.5220/0011896300003393

In Proceedings of the 15th International Conference on Agents and Artiﬁcial Intelligence (ICAART 2023) - Volume 2, pages 466-477

ISBN: 978-989-758-623-1; ISSN: 2184-433X

 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)

sentation. The ﬁrst attention mechanism deter-

mines an optimal context based on

-Grams from

the learned multi-resolution representation for the

query and document. The latter attention mecha-

nism aims to ﬁne-tune the representation so that

the query and the relevant document are closer in

proximity.

•

Aggregation, which calculates the similarity be-

tween the pair of query and document through

aggregation of the duplex attention as the input to

the loss function.

•

Distance Metric Loss, which is a loss function used

to train the model by minimizing the loss.

Figure 1: Overall training ﬂow diagram for the proposed

Multi-Resolution Neural Network with a query

, a positive

document D

and a negative document D

−

The main contribution of the paper is as follows:

•

We ﬁrst propose a new deep learning model

MRNN that leverages representations across all

levels of abstraction from the learned hierarchical

representation for ad-hoc retrieval.

•

A duplex attention mechanism is designed to re-

ﬁne the multi-resolution so that an optimal match-

ing of the query and document can be achieved.

More speciﬁcally, the ﬁrst mechanism determines

a proper context, and the latter ﬁne-tunes the rep-

resentation by considering the interplay between

the query and document.

•

The proposed model signiﬁcantly outperforms ex-

isting models for ad-hoc retrieval on major bench-

mark datasets.

The rest of the paper is organized as follows. First,

we review recent advances in ad-hoc retrieval in Sec-

tion 2. In Section 3 we describe the details of the

proposed model. An empirical study to compare the

proposed method to the existing approaches is con-

ducted. The experiment and the result are reported in

Section 4 and Section 5 respectively. Finally, we con-

clude the paper with some future research in Section 6.

2 RELATED WORK

The last few decades has seen increased adoption of

machine learning for information retrieval (IR) tasks.

More recently, deep learning techniques (Lu and

Li, 2013)(Hu et al., 2014)(Palangi et al., 2016)(Guo

et al., 2016)(Hui et al., 2017) have successfully been

applied to IR, or words document relevance ranking

also known as ad-hoc retrieval (Voorhees and Har-

man, 2005). In the ad-hoc retrieval task, the number

of words in documents are generally greater than the

number of words in queries and some might not even

be in natural language form. This prevents aforemen-

tioned methods from other tasks that focus on pairs

of short contents making them unsuitable for the task.

Document ranking methods can be deﬁned under the

two titles: separation-oriented such as (Palangi et al.,

2016) and interaction-oriented such as (McDonald

et al., 2018). In the separation-oriented, a query and

a document representations are generated separately.

At the ﬁnal step, interactions of these documents are

getting calculated through dot-product whereby the

result indicates relevance. In the interaction-oriented

approach, speciﬁc encoding between the query and

document pairs are induced which satisﬁes the exact

matching as well as similarity matching that constitute

the most important conditions for ad-hoc retrieval.

In the area of machine reading-style question an-

swering (Rajpurkar et al., 2016) (Yang et al., 2015)

(Dhingra et al., 2017) (Wang et al., 2007), the system

needs to ﬁnd the answer in the given corresponding

document or context. The models have to combine in-

formation retrieval and machine reading. Note that we

do not benchmark the quality of the extraction phase,

therefore we do not study extracting the answer from

the retrieved document, but compare the quality of

retrieval methods, and the feasibility of learning spe-

cialized neural models for retrieval purposes. DrQA

(Chen et al., 2017a) is built on top of two components:

a Document Retriever and a Document Reader. The

Document Retriever is a TF-IDF (Salton and McGill,

1986) retrieval system built upon Wikipedia corpus.

Whereas, ConvRR (Cakaloglu and Xu, 2019) is a con-

volutional residual retrieval network that focuses on

achieving better retrieval performance by employing

a hard triplet mining. WordCnt, WgtWordCnt, PV,

PV+Cnt, and CNN+Cnt are the models derived from

the following study (Yang et al., 2015).Word Count

method counts the number of non-stopwords in the

question that also occur in the answer sentence, and

Weighted Word Count re-weights the counts by the

IDF values of the question words. PV represents the

paragraph vector (Le and Mikolov, 2014) which is the

similarity score between a question vector and doc-

ument vector. CNN+Cnt is built on top of a bigram

CNN (Yu et al., 2014) model with average pooling.

PV+Cnt and CNN+Cnt are trained using a logistic re-

gression classiﬁer. QA-LSTM (Tan et al., 2016) is a

MRNN: A Multi-Resolution Neural Network with Duplex Attention for Deep Ad-Hoc Retrieval

467

biLSTM based model where the ﬁnal representations

of question and document are taken by max or mean

pooling over all the hidden vectors. ABCNN (Yin

et al., 2016) is an attention-based convolution neu-

ral network model that employs an attention feature

matrix to inﬂuence convolutions to optimize the task.

RNN-POA (Chen et al., 2017b) is positional attention

based RNN model that incorporates the positional con-

tent of the question words into the documents' attentive

representations.

: Simple Ranker-Reader,

Reinforced Ranker-Reader (Wang et al., 2018b) are

proposed to improve the performance of the machine

reading-style question answering tasks. They utilized

the Apache Lucene-based search engine and a deep

neural network ranker that re-ranks the documents re-

trieved by the search engine incorporated by a machine

reader. The latter model is trained using reinforcement

learning. The results derived from the models show

that the neural network ranker can learn to rank the

documents based on semantic similarity with the ques-

tion. InferSent Ranker (Conneau et al., 2017)(Htut

et al., 2018) is used to produce distributed represen-

tations for question and documents and, then, the in-

put feature representation is built by concatenating

the question representation, document representations,

their difference, and their element-wise product. On

receiving that input feature representation, the sim-

ilarity score is calculated using a feed-forward neu-

ral network. Relation-Networks (RN) Ranker (San-

toro et al., 2017)(Htut et al., 2018), further, calculates

the relevance or local interactions between words in

the question and paragraph. Thus, this model is built

to interpret the relation between question-document

pairs. Tree Edit Model (Heilman and Smith, 2010)

is represented as sequences of tree transformations

involving complex reordering phenomena and demon-

strate a method for modeling pairs of semantically

related contexts. They utilize a tree kernel in a greedy

search routine to extract sequences of edits and use

them in a logistic regression model to classify them.

LSTM (Wang and Nyberg, 2015) uses a stacked bidi-

rectional Long-Short Term Memory (BiLSTM) net-

work to consecutively extract words from question

and answer documents and then outputs their rele-

vance scores. CNN (Severyn and Moschitti, 2015) is

based on a convolutional neural network architecture

for re-ranking pairs of short texts, that learns the opti-

mal representation of document pairs and a similarity

function to relate them. AP-LSTM (Tan et al., 2016)

is developed considering hybrid models that handle

the documents using both convolutional and recurrent

neural networks incorporating attention mechanism

to relate question and answer documents. AP-LSTM,

AP-CNN (dos Santos et al., 2016) are based on two-

way attention mechanisms for discriminative model

(CNN, RNN, LSTM) training. AP allows the pool-

ing layer to be aware of the input pair, in a way that

information from the two can impact each other’s rep-

resentations. The model learns a similarity measure

over projected

-Grams of the pair, and generate the

attention representation for each input to lead the pool-

ing. Self-LSTM, Multihop-Sequential-LSTM (Tran

and Nieder

ee, 2018) are developed to ﬁnd the relation-

ship between question and answer documents captured

by attention. These models generate multiple repre-

sentations that target different parts of the question.

Additionally, they utilize sequential attention mecha-

nism which uses context information for computing

context-aware attention weights. The proposed Multi-

Resolution Neural Network (MRNN) model belongs

to this class of models of ad-hoc retrieval, but further

incorporates the context of boosted interaction signals.

3 PROPOSED APPROACH

3.1 Overview

In this section, we introduce the proposed model called

Multi-Resolution Neural Network as demonstrated in

Figure 1. MRNN leverages the document relevance

ranking and is composed of two major components:

the initial component is responsible for generating

-Gram feature maps, the latter one matches those

feature chains.

The model begins with a series of word inputs

, e

, ...., e

, that can establish a phrase, a sentence,

a paragraph or a document. The model, then, builds

Multi-Resolution Feature Maps through densely con-

nected

-Gram blocks. To conscientiously extract most

useful features for the target retrieval and matching

tasks, a Duplex Attention component is introduced.

The duplex attention component consists of two sub at-

tention components that are called Multi-Resolution

Gram Attention and Document Aware Query Attention

respectively. The Multi-Resolution

-Gram Attention

is responsible for reevaluating these Multi-Resolution

feature maps while Document Aware Query Attention

emphasizes matching those features between contex-

tualized document and query to determine if they are

similar or not. The ﬁnal output is, then, used to im-

prove the matching and retrieval performances.

3.2 Multi-Resolution Feature Maps

The ﬁrst component called Multi-Resolution Feature

Maps of the proposed model is shown in Figure 2. The

component begins with a series of word inputs that

ICAART 2023 - 15th International Conference on Agents and Artiﬁcial Intelligence

468

are initialized using one of the pre-trained embedding

models. Let,

∈ R

be the

-dimensional embed-

ding vector of

-th word of the input text. Then, a

matrix that is composed of all words of the input text

is represented as follows:

E = [e

, e

, ··· , e

]

h×w

(1)

where

E ∈ R

h×w

be the

h × w

-dimensional matrix of

the input text and

denotes the number of words in a

given text. Embedded text, then, is fed to ﬁrst

-Gram

block of the component.

3.2.1 n-Gram Blocks

-Gram block is representing ﬁve cascaded opera-

tions: a convolution (CONV), a batch normalization

(BN)(Ioffe and Szegedy, 2015), a parametric rectiﬁed

linear unit (PReLU)(He et al., 2015), pooling opera-

tions (POOLS), and a scale unit SU. As the input text

is represented with the matrix E, the output represen-

tations of each

-Gram block within the component is

denoted as below:

= [g

, g

, ··· , g

]

h×s

(2)

where

denotes the dimension of the transformed fea-

ture representation,

represents the index of an

Gram block, and a total number of

-Gram blocks are

labeled as

. Note that, while upstream

-Gram blocks

are deﬁned to emphasize the blocks that are located

above the current block and close to the input layers of

the network while the downstream blocks are located

below the current block and close to the output layers

of the network. In other words, feature maps derived

from upstream blocks are associated with smaller

Grams (unigram, bigram, and etc), on the other hand,

downstream blocks are associated with larger

-Grams

(6-grams, 7-grams, and etc) by reﬂecting on the feature

maps derived from upstream n-Gram blocks.

Each

-Gram block utilizes

gram

(·, ·, ·)

function to

generate

∈ R

h×s

representations.

gram

(·, ·, ·)

can

be described as the below:

= f

gram

(UB, ws, s) (3)

where

denotes representations generated from up-

stream

-Gram blocks.

represents the window

size and the dimension of the transformed feature rep-

resentation respectively. The deﬁnition of

is con-

ditioned to the index of the current n-Gram block.

UB =



E, if n = 1

, G

, ··· , G

n−1

], otherwise

where

, G

, ··· , G

n−1

]

shows the concatenation

of the transformed representations derived from up-

stream

-Gram blocks (

1 ≤ n − 1

), in other words,

densely connected upstream n-Gram blocks.

3.2.2 Densely Connected Blocks

Conventional models that are designed by placing con-

volution blocks consecutively, aim to extract hierar-

chical feature representations. Speciﬁcally, for text-

oriented task scenarios, convolutions can be evaluated

as to derive

-Gram features over a word sequence. Al-

though traditional connections of convolution blocks

extract hierarchical feature representations, they can

not fulﬁll the requirements of natural languages for the

following reasons:

•

Traditional convolution based networks exploit the

kernels of a constant size where a constant size

window slides across all text to generate feature

representations (Wang et al., 2017). This is called

constant size

-Gram representations and it is not

able to gather ﬂexible size of

-Gram represen-

tations that are needed for better understanding

of the text that depends on context, syntax, and

semantics.

•

In order to handle the extraction of ﬂexible size

Gram representations, one can employ the kernels

with various window sizes, but another issue is

ﬁred up with such settings: What would be the

right architecture of using different kernel sizes? In

other terms, how much expansion does the network

require for different kernel sizes to produce the

best features? It would be a huge search space for

the greedy search due to an exponential number of

parameter combinations.

•

Although the kernels with the variety of window

sizes would still be seen as a better or an advanced

architecture, it actually does not utilize the inter-

play between the representations derived from the

different kernel size, since this type of approach

consists of different independent parallel networks.

Therefore, we propose MRNN for ad-hoc retrieval,

which leverages representations across all levels of

abstractions in the learned hierarchical representation.

The learned representation achieves a multi-resolution

effect that can represent concepts and their relation-

ship across all the levels of deep architecture. In order

to consider all the resolutions (mixture of representa-

tions from adaptive

-Grams such that feature maps

of words or short phrases from the upstream blocks

affect the downstream blocks to compose feature maps

for longer context) of the representations throughout

the proposed network, we ﬁrst employ dense connec-

tions between each

-Gram blocks inspired by ideas

from computer vision (Huang et al., 2017) and text

classiﬁcations (Kim et al., 2018) (Wang et al., 2018a).

gram

(·, ·, ·)

has input parameters of upstream block

representations

, window size

and, dimen-

MRNN: A Multi-Resolution Neural Network with Duplex Attention for Deep Ad-Hoc Retrieval

469

Figure 2: Multi-Resolution Feature Maps Component.

sion of the transformed representations

. Specif-

ically,

gram

(·, ·, ·)

computes the teachable weights

∈ R

s×ws×s

by using the cascading aforementioned

operations: CONV, BN, PReLU, POOLS, and SU. The

teachable weight tensor

is composed of

ﬁlters

where each of them has a matrix

∈ R

ws×s

, convolving

contiguous vectors. It is important to emphasize

that in order to transform embedding dimension

, in the case where

n = 1

or in other words

UB = E

we employ

ﬁlters where each of them has a matrix

∈ R

(ws=1)×w

in the ﬁrst phase. Additionally to prevent

the output size of the feature maps being different after

each

-Gram block, we use padding and pooling oper-

ations. We, further, apply a scalar unit to the feature

map. That scalar

is also teachable and it weights the

feature map of the current

-th

-Gram block to decide

how much the current block contributes to the down-

stream

-Gram blocks. Likewise, for the case where

UB = [G

, G

, ··· , G

n−1

]

or, in other words, where

n > 1

gram

(·, ·, ·)

computes the teachable weights

∈ R

(n−1)×s×ws×s

by using the same cascaded op-

erations in the block. After considering all

-Gram

blocks, the multi-resolution feature maps tensor is rep-

resented as follows:

G = [G

, G

, ··· , G

] ∈ R

N×h×s

and

consists of

matrices where each of them is rep-

resenting the feature map matrix

from each

-Gram

block.

3.3 Duplex Attention Component

Duplex attention component is one of the most im-

portant components of our proposed network. Since

attention (Vaswani et al., 2017) evolves into an effec-

tive component within the neural network for extract-

ing useful information, which achieves a remarkable

result for many machine learning tasks, we compose

this component with two sub attention components

that are called Multi-Resolution

-Gram Attention and

Document Aware Query Attention respectively. The

multi-resolution feature maps tensors of each query(

document(

) pair are denoted as

. Hence, the

Multi-Resolution

-Gram Attention is responsible for

reevaluating

, and

feature maps while Docu-

ment Aware Query Attention emphasizes matching

those features between contextualized document and

query to determine if they are similar or not.

3.3.1 Multi-Resolution n-Gram Attention

The multi-resolution feature maps tensor

(

) contains feature maps from all

-Gram blocks.

More speciﬁcally,

Gq = [Gq

, Gq

, ··· , Gq

] ∈

N×h

×s

, and

Gd = [Gd

, Gd

, ··· , Gd

] ∈ R

N×h

×s

the number of words in the query and the document

are denoted as

and

respectively. Although the

network has rich features at this step, some of those

features still need to be pruned. In order to prune

them conscientiously for the next component, we intro-

duce a multi-resolution

-Gram attention component

as shown in Figure 3. The multi-resolution

-Gram

attention component has two consecutive functions

called transformer

(·)

and conductor

(·, ·)

respec-

tively. For the sake of simplicity, we explain the func-

tions by taking the multi-resolution feature maps of

the query (Gq) into consideration.

The transformer

(·)

is a function that is formal-

ized as below:

Tq = f

(Gq) (4)

where

Tq ∈ R

N×h

is a matrix of scalar adjusters.

Since we know that

= [gq

, gq

, ··· , gq

]

×s

and each

represents the

dimensional feature rep-

resentation in the

-th location of a query at

-th

Gram block. Particularly, for each

feature repre-

sentation,

(·)

calculates the the scalar adjuster vector

= [

∑

j=1

[ j],

∑

j=1

[ j], · · · ,

∑

j=1

[ j]] (5)

where

∈ R

is a

-dimensional vector, thus,

Tq =

[tq

, tq

, ··· ,

]

is a matrix of scalar adjusters. The motivation

behind this procedure is that the sum of all the values

in the

-dimensional vector of

is positioned by

feature importance.

ICAART 2023 - 15th International Conference on Agents and Artiﬁcial Intelligence

470

Figure 3: Multi-Resolution n-Gram Attention.

The scalar adjusters matrix

and the multi-

resolution feature map tensor of query

are, further,

passed to conductor function

(·, ·)

to conduct the

feature maps from variety of

-Gram scales by calcu-

lating attention weights via the softmax block.

(·, ·)

is described as:

MRAq = f

(Gq, Tq) (6)

where

MRAq ∈ R

×s

is a matrix of multi-resolution

gram attention vectors. For each scalar vector

(·, ·)

, ﬁrst, calculate the attention weights via soft-

max block represented as:

= f (tq

) (7)

where

f (·)

deﬁnes the softmax block that is a percep-

tron of the following cascaded operations: fully con-

nected layers (FC), a parametric rectiﬁed linear unit

(PReLU), and a softmax operation (SOFT).

aw ∈ R

is an attention weights vector. Thus, representations

of tq

and aw

can be shown as:

= [tq

,tq

, ··· , tq

]

= [aw

, aw

, ··· , aw

]

As next steps,

(·, ·)

computes the ﬁnal multi-

resolution

gram attention vector using the following

equations:

mraq

∑

j=1

· Gq

(8)

where

mraq

∈ R

is a multi-resolution

gram atten-

tion vector of

-th position of a query. The ﬁnal output

representations of

(·, ·)

is a multi-resolution

gram

attention matrix MRAq that is evaluated as:

MRAq = [mraq

, mraq

, ··· , mraq

] ∈ R

×s

Likewise,

MRAd ∈ R

×s

is the multi-resolution

n-Gram attention matrices for document. Both of ma-

trices, then, are passed to next attention component

called Document Aware Query Attention.

3.3.2 Document Aware Query Attention

In order to compute document aware query encod-

ing using attention mechanism, we present Docu-

ment Aware Query Attention component as illustrated

in Figure 4. The component emphasizes matching

features between the multi-resolution

-Gram atten-

tion matrices of query

MRAq ∈ R

×s

and document

MRAd ∈ R

×s

pair. The document aware query atten-

tion component has a function called encoder f

(·, ·).

(·, ·)

is an encoder function that is formalized as

below:

qa = f

(MRAq, MRAd) (9)

where

qa ∈ R

is a vector of document aware query

encodings generated via attention weights. We, ﬁrst,

calculate a dot-product attention weights

′

for each

position of

MRAd

relative to

mraq

via the softmax

block as follows:

′i

= f (mraq

•

MRAd) (10)

where

f (·)

deﬁnes the softmax block that is a percep-

tron with the same architecture (FCs, PReLU, SOFT).

Note that, dot-products have a larger spectrum than

the similar similarity functions, promising low en-

tropy attention distributions.

′i

∈ R

is an attention

weights vector. Thus, representation

′i

can be shown

as:

′i

= [aw

′i

, aw

′i

, ··· , aw

′i

]

As a next step, we sum the document aware encodings

of the

-locations, scaled by their attention weights,

to create an attention-based representation

sae

of doc-

ument representations MRAd from the query repre-

sentation mraq

formulated as below:

sae

∑

j=1

′i

· mrad

(11)

The element-wise distance (euclidean) between the

attention-based document representation

sae

and

MRNN: A Multi-Resolution Neural Network with Duplex Attention for Deep Ad-Hoc Retrieval

471

Figure 4: Document Aware Query Attention.

mraq

is, further, calculated and adopted as document

aware query encoding

∈ R

that can be deﬁned as

a document aware query encoding of

-th position of

a query. The ﬁnal output representations of encoder

(·, ·)

is a document aware query encoding attention

vector qe that is evaluated as:

qe = [qe

, qe

, ··· , qe

] ∈ R

In other words, if the document includes more posi-

tions of

mrad

that are very much alike to the position

mraq

in the query, the document aware query at-

tention component indicates mostly those positions

and, therefore sae

will be close to mraq

As a ﬁnal step, we aggregate these similarities qe

via the aggregate component to deﬁne the distance of

query-document pair as follows:

dist =

∑

i=1

(12)

where

dist ∈ R

is the ﬁnal distance descriptor between

the query-document pair.

3.4 Distance Metric Loss Function

In order to train the proposed MRNN to perform well

on retrieval and matching tasks, such that it also gen-

eralizes well on unseen data, we speciﬁcally utilized

triplet loss function (Schroff et al., 2015) during the

training period as shown in Figure 1 as distance metric

loss. With this setup, the network is encouraged to

reduce distances between positive pairs so that they

are smaller than negative ones. A particular query

would be a query anchor close in proximity to a doc-

ument

as the positive pair to the same question

than to any document

−

as they are positive pairs

to other questions. The key point of the

triplet

is to

build the correct triplet structure which should meet

the condition of the following equation:

∥Q, D

∥

+ m <∥Q, D

−

∥

For each query, the document

is selected as:

argmax

∥Q, D

∥

and likewise the hardest docu-

ment

−

argmin

−

∥Q, D

−

∥

to form a triplet. This

triplet selection strategy is called hard triplets mining.

Let

T = (Q, D

, D

−

)

be a triplet input. Given

, the proposed approach computes the distances be-

tween the positive and negative pairs via the proposed

MRNN.

triplet

= [∥Q, D

∥

−∥Q, D

−

∥

+ m]

(13)

where

m > 0

is a scalar value called margin, and

∥·, ·∥

represents the distance score between two objects.

4 EXPERIMENTS

4.1 Datasets

In order to evaluate our proposed approach, we con-

ducted extensive experiments on four datasets, includ-

ing SQuAD (Rajpurkar et al., 2016), WikiQA (Yang

et al., 2015), QUASAR (Dhingra et al., 2017), and

TrecQA (Wang et al., 2007).

4.1.1 SQuAD

The Stanford Question Answering Dataset (SQuAD)

(Rajpurkar et al., 2016) is a large reading compre-

hension dataset that is built with

100, 000+

questions.

Each of these questions is composed by crowd workers

on a set of Wikipedia documents where the answer to

each question is a segment of text from the correspond-

ing reading passage. In other words, the consolidation

of retrieval and extraction tasks are aimed at measuring

the success of the proposed systems.

ICAART 2023 - 15th International Conference on Agents and Artiﬁcial Intelligence

472

4.1.2 WikiQA

The Wikipedia open-domain Question Answering

(WikiQA) (Yang et al., 2015) dataset is collected using

Bing query logs. For each question, clicked Wikipedia

pages (issued by at least 5 unique users) and used sen-

tences in the summary section of Wikipedia page as

the candidates, is further marked on a crowdsourcing

platform. Note that we excluded questions that have

no candidates. Based on training, dev and test subsets,

1, 242 questions are used in the experiment.

4.1.3 QUASAR

The Question Answering by Search And Reading

(QUASAR) is a large-scale dataset consisting of

QUASAR-S and QUASAR-T. Each of these datasets

is built to focus on evaluating systems devised to un-

derstand a natural language query, a large corpus of

texts and to extract an answer to the question from

the corpus. Similar to SQuAD QUASAR is primarily

used to measure the success of the proposed systems

for ad-hoc retrieval and extraction tasks. Speciﬁcally,

QUASAR-S comprises

37, 012

ﬁll-in-the-gaps ques-

tions that are collected from the popular website Stack

Overﬂow using entity tags. Since our research is not

to address the ﬁll-in-the-gaps questions, we want to

pay attention to the QUASAR-T dataset that fulﬁlls

the requirements of our focused retrieval task. The

QUASAR-T dataset contains

43, 012

open-domain

questions collected from various internet sources. The

candidate documents for each question in this dataset

are retrieved from an Apache Lucene based search

engine built on top of the ClueWeb09 dataset (Callan

et al., 2009).

4.1.4 TrecQA

Text Retrieval Conference Question Answering

(TrecQA)(Wang et al., 2007) is a popular benchmark

dataset for question answering. TrecQA dataset is

based on QA track (8-13) of TREC. The dataset con-

sists of factoid questions, each of which has a single

sentence as a candidate answer. In order to make the

comparison parallel to the previous works, we pur-

sue the same strategy they applied where all questions

with only positive or negative answers are excluded.

In total, we end up having

1, 295

questions within all

training, dev and test subsets of TrecQA.

The number of queries in each dataset including

their subsets is listed in Table 1.

Table 1: Datasets Statistics: Number of queries in each train,

validation, and test subsets.

DATASET TRAIN VALID. TEST TOTAL

SQUAD 87,599 10,570 HIDDEN 98,169+

WIKIQA 873 126 243 1,242

QUASAR-T 37,012 3,000 3,000 43,012

TRECQA 1,162 65 68 1,295

4.2 Performance Measure

The matching and retrieval tasks aim to improve the

recall@k

, the Mean Reciprocal Rank (

MRR

) and

Mean Average Precision (MAP). The recall@k score

is calculated by selecting the correct pair among all

candidates. Basically,

recall@k

is deﬁned as the num-

ber of correct documents as listed within top-

returns

out all possible documents. Likewise,

MRR

and

MAP

metrics are also commonly used in information re-

trieval and question answering researches (Manning

et al., 2008).

4.3 Implementation

4.3.1 Input

We adopt the multi-resolution word embedding

(Cakaloglu and Xu, 2019) (Cakaloglu et al., 2022)

using Bert(Devlin et al., 2018), ELMo(Peters et al.,

2018), FastText(Mikolov et al., 2018) for each ques-

tion and document in datasets. We conﬁgure the multi-

resolution word embedding stated in (Cakaloglu and

Xu, 2019):

mixture

(·, ·, ·, ·)

and

ensemble

(·, ·)

conﬁgura-

tions are shown in Table 2 and Table 3 respectively.

Table 2:

mixture

(·, ·, ·)

conﬁguration of the multi-resolution

word embedding.

E w

id f

m f

mix

OUT

BERT 0 [

,..,0] conc. X

ELMO 1 [0, 0, 1] sum X

FASTTEXT 1 [1] sum X

Table 3:

ensemble

(·)

conﬁguration of the multi-resolution

word embedding.

X’ u f

ensemble

, X

} [

] concat.

4.3.2 MRNN Training Conﬁguration

The proposed MRNN is implemented with Tensorﬂow

1.8+ by (Abadi et al., 2015) and trained on NVIDIA

Tesla K40c GPUs. Speciﬁcally, the network is trained

MRNN: A Multi-Resolution Neural Network with Duplex Attention for Deep Ad-Hoc Retrieval

473

using ADAM optimizer (Kingma and Ba, 2014) with

a batch size of

512

. The learning rate is set to

−4

Additionally, the weight decay is set to

−3

to tackle

over-ﬁtting. The triplet loss is, then, chosen as an

objective function with different margins for each of

the datasets. (

m = 1

: SQuAD,

m = 0.8

: QUASAR-

m = 0.5

{

WikiQA ,TrecQA

}

). We conﬁgured 6

-Gram blocks (

N = 6

) for SQuAD and QUASAR-T

datasets, and 4

-Gram blocks (

N = 4

) for WikiQA

and TrecQA datasets. Last but not least, window size

and transformed feature representation dimension are

set to ws = 3, s = 1024 respectively.

5 RESULTS

We compare our approach with different models pro-

posed by other researchers for each dataset using their

evaluation measures and test subsets.

5.1 SQuAD

The

recall@5

result is calculated for SQuAD in or-

der to compare with the document retrieval compo-

nent of the multi-layer recurrent neural network (Chen

et al., 2017a) (DrQA) and the convolutional resid-

ual retrieval network (ConvRR) (Cakaloglu and Xu,

2019)(Cakaloglu et al., 2018). The comparisons are

shown in Table 4.

5.2 WikiQA

WordCnt, WgtWordCnt, and CNN-Cnt are the models

derived from the following initial study (Yang et al.,

2015). On top of the baseline models, the Paragraph

Vector (PV) and PV + Cnt models (tau Yih et al., 2013)

are taken into consideration. We, further, consider

even more advanced models: QA-LSTM (Tan et al.,

2016), Self-LSTM, Multihop-Sequential-LSTM (Tran

and Nieder

ee, 2018), ABCNN (Yin et al., 2016), Rank

MP-CNN (Rao et al., 2016), RNN-POA (Chen et al.,

2017b). In order to compare the proposed model with

the aforementioned models for the WikiQA dataset,

we calculate the Mean Reciprocal Rank (

MRR

) and

Mean Average Precision (

MAP

) metrics and all the

results are presented in Table 5.

5.3 QUASAR

BM25 (Robertson and Zaragoza, 2009)(Htut et al.,

2018),

: Simple Ranker-Reader,

: Reinforced

Ranker-Reader (Wang et al., 2018b), InferSent Ranker

(Conneau et al., 2017)(Htut et al., 2018), convolutional

residual retrieval network (ConvRR) (Cakaloglu and

Xu, 2019), and Relation-Networks (RN) Ranker (San-

toro et al., 2017)(Htut et al., 2018) are the models

that are evaluated using the

recall@1

recall@3

, and

recall@5. The comparisons are listed in Table 6.

5.4 TrecQA

We compute the Mean Reciprocal Rank (

MRR

) and

Mean Average Precision (

MAP

) metrics for the pro-

posed model as well as following models: Tree Edit

Model (Heilman and Smith, 2010), LSTM (Wang and

Nyberg, 2015), CNN (Severyn and Moschitti, 2015),

AP-LSTM (Tan et al., 2016), AP-CNN (dos Santos

et al., 2016), RNN-POA (Chen et al., 2017b), and

Self-LSTM, Multihop-Sequential-LSTM (Tran and

Nieder

ee, 2018). The results are stated in Table 7.

5.5 Evaluation

The result on all benchmark datasets shows that the

proposed MRNN clearly outperforms existing mod-

els. Relation-Networks Ranker is the only model

that achieved a slightly better result for recall@3

and recall@5 on QUASAR-T dataset. The pro-

posed MRNN model however outperforms Relation-

Networks Ranker on recall@1.

Table 4: Performances on SQUAD.

recall@k

retrieved doc-

uments using the baseline models and the proposed model.

MODEL @5

CONVRR 75.6

DRQA DOCUMENT-RETRIEVAL 77.8

MRNN 80.4

Table 5: Performances on WikiQA. The baseline models and

the proposed model are listed based on the results derived

from MAP and MRR metrics.

MODEL MAP MRR

WORDCOUNT 0.4891 0.4924

WGTWORDCNT 0.5099 0.5132

PV 0.511 0.516

PV + CNT 0.599 0.609

CNN + CNT 0.652 0.6652

QA-LSTM 0.654 0.665

AP-LSTM 0.670 0.684

AP-CNN 0.689 0.696

SELF-LSTM 0.693 0.704

ABCNN 0.692 0.71

RANK MP-CNN 0.701 0.718

RNN-POA 0.721 0.731

MULTIHOP-SEQUENTIAL-LSTM 0.722 0.738

MRNN 0.731 0.745

ICAART 2023 - 15th International Conference on Agents and Artiﬁcial Intelligence

474

Figure 5: For the sample query and document extracted from SQuAD, the visualization of their attention weights that are ﬁred

in the sub attention components of Duplex Attention.

Table 6: Performances on QUASAR-T.

recall@k

retrieved

documents using the baseline models and the proposed

model.

MODEL @1 @3 @5

BM25 19.7 36.3 44.3

: SIMPLE R-R 28.8 46.4 54.9

INFERSENT RANKER 36.1 52.8 56.7

: REINFORCED R-R 40.3 51.3 54.5

CONVRR 50.67 63.1 67.4

RELATION-NETWORKS R 51.4 68.2 70.3

MRNN 52.8 67.7 69.9

Table 7: Performances on TrecQA. The baseline models and

the proposed model are listed based on the results derived

from MAP and MRR metrics.

MODEL MAP MRR

TREE EDIT MODEL 0.609 0.692

LSTM 0.713 0.791

CNN 0.746 0.808

AP-LSTM 0.753 0.830

AP-CNN 0.753 0.851

SELF-LSTM 0.759 0.830

RNN-POA 0.781 0.851

MULTIHOP-SEQUENTIAL-LSTM 0.813 0.893

MRNN 0.822 0.898

5.6 Visualization and Analysis

The attention visualizations of the duplex attention

component for the sample query and correspond-

ing document extracted from the SQuAD dataset are

shown in Figure 5. When window-size is set to

(

ws = 3

), and the number of

-Gram blocks is set to

(

N = 6

), then each row of Figure 5, a) and b) indi-

cate an attention weight distribution over

as well

and

is the position of each words in the ma-

trices of the query and document. Hence,

and

corresponds to feature maps of (2n-1)-gram, e.g.,

: 1-gram,

:3-gram, ... etc. The proposed

MRNN puts more emphasizes on some segments of

the query and the parts of the document as in the multi-

resolution

-Gram attention. In the document aware

query attention, MRNN gives more attention to addi-

tional segments of the question and also some other

parts of the document that have crucial interactions, as

shown in Figure 5, c).

6 CONCLUSION

Ad-hoc retrieval is an important task for question an-

swering and information retrieval. This paper pro-

poses a new multi-resolution neural network for ad-

hoc retrieval, which is the ﬁrst model that leverages the

strength of representations of different abstract levels

in the learned hierarchical representation. The pro-

posed model incorporated with a new duplex attention

mechanism can signiﬁcantly improve the performance

of ad-hoc retrieval as demonstrated by the superior

results in comparison to other existing methods on ma-

jor benchmark datasets. More speciﬁcally our study

shows that MRNN with the duplex attention improves

gain on existing model performances over different

evaluation metrics and benchmark datasets by a range

of 0.5% to 1.5% .In the future, we want to apply the

proposed model to other areas including pattern recog-

nition and computer vision.

REFERENCES

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z.,

Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin,

M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G.,

Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur,

M., Levenberg, J., Man

e, D., Monga, R., Moore, S.,

Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner,

B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke,

V., Vasudevan, V., Vi

egas, F., Vinyals, O., Warden,

P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X.

(2015). TensorFlow: Large-scale machine learning

on heterogeneous systems. Software available from

tensorﬂow.org.

Cakaloglu, T., Szegedy, C., and Xu, X. (2018). Text embed-

MRNN: A Multi-Resolution Neural Network with Duplex Attention for Deep Ad-Hoc Retrieval

475

dings for retrieval from a large knowledge base. arXiv

preprint arXiv:1810.10176.

Cakaloglu, T. and Xu, X. (2019). A multi-resolution word

embedding for document retrieval from large unstruc-

tured knowledge bases. arXiv preprint.

Cakaloglu, T., Xu, X., and Raghavan, R. (2022). Emboost:

Embedding boosting to learn multilevel abstract text

representation for document retrieval. In Rocha, A. P.,

Steels, L., and van den Herik, H. J., editors, Proceed-

ings of the 14th International Conference on Agents

and Artiﬁcial Intelligence, ICAART 2022, Volume 3,

Online Streaming, February 3-5, 2022, pages 352–360.

SCITEPRESS.

Callan, J., Hoy, M., Yoo, C., and Zhao, L. (2009). Clueweb09

data set.

Chen, D., Fisch, A., Weston, J., and Bordes, A. (2017a).

Reading wikipedia to answer open-domain questions.

In ACL.

Chen, Q., Hu, Q., Huang, X., He, L., and An, W. (2017b).

Enhancing recurrent neural networks with positional

attention for question answering. In SIGIR.

Conneau, A., Kiela, D., Schwenk, H., Barrault, L., and Bor-

des, A. (2017). Supervised learning of universal sen-

tence representations from natural language inference

data. In EMNLP.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.

(2018). Bert: Pre-training of deep bidirectional trans-

formers for language understanding. arXiv preprint

arXiv:1810.04805.

Dhingra, B., Mazaitis, K., and Cohen, W. W. (2017). Quasar:

Datasets for question answering by search and reading.

arXiv preprint arXiv:1707.03904.

dos Santos, C. N., Tan, M., Xiang, B., and Zhou, B. (2016).

Attentive pooling networks. CoRR, abs/1602.03609.

Guo, J., Fan, Y., Ai, Q., and Croft, W. B. (2016). A deep rel-

evance matching model for ad-hoc retrieval. In CIKM.

He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving deep

into rectiﬁers: Surpassing human-level performance

on imagenet classiﬁcation. In 2015 IEEE International

Conference on Computer Vision (ICCV), pages 1026–

1034.

Heilman, M. and Smith, N. A. (2010). Tree edit models

for recognizing textual entailments, paraphrases, and

answers to questions. In HLT-NAACL.

Htut, P. M., Bowman, S. R., and Cho, K. (2018). Training a

ranking function for open-domain question answering.

In NAACL-HLT.

Hu, B., Lu, Z., Li, H., and Chen, Q. (2014). Convolutional

neural network architectures for matching natural lan-

guage sentences. In NIPS.

Huang, G., Liu, Z., v. d. Maaten, L., and Weinberger, K. Q.

(2017). Densely connected convolutional networks.

In 2017 IEEE Conference on Computer Vision and

Pattern Recognition (CVPR), pages 2261–2269.

Hui, K., Yates, A., Berberich, K., and de Melo, G. (2017).

Pacrr: A position-aware neural ir model for relevance

matching. In EMNLP.

Ioffe, S. and Szegedy, C. (2015). Batch normalization: Ac-

celerating deep network training by reducing internal

covariate shift. In Proceedings of the 32Nd Interna-

tional Conference on International Conference on Ma-

chine Learning - Volume 37, ICML’15, pages 448–456.

JMLR.org.

Kim, S., Hong, J.-H., Kang, I., and Kwak, N. (2018).

Semantic sentence matching with densely-connected

recurrent and co-attentive information. CoRR,

abs/1805.11360.

Kingma, D. P. and Ba, J. (2014). Adam: A

method for stochastic optimization. arXiv preprint

arXiv:1412.6980.

Le, Q. V. and Mikolov, T. (2014). Distributed representations

of sentences and documents. In ICML.

Lu, Z. and Li, H. (2013). A deep architecture for matching

short texts. In NIPS 2013.

Manning, C. D., Raghavan, P., and Sch

utze, H. (2008). In-

troduction to Information Retrieval. Cambridge Uni-

versity Press.

McDonald, R., Ding, Y., and Androutsopoulos, I. (2018).

Deep relevance ranking using enhanced document-

query interactions. In EMNLP.

Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., and

Joulin, A. (2018). Advances in pre-training distributed

word representations. In Proceedings of the Interna-

tional Conference on Language Resources and Evalua-

tion (LREC 2018).

Palangi, H., Deng, L., Shen, Y., Gao, J., He, X., Chen,

J., Song, X., and Ward, R. K. (2016). Deep sen-

tence embedding using long short-term memory net-

works: Analysis and application to information re-

trieval. IEEE/ACM Transactions on Audio, Speech,

and Language Processing, 24:694–707.

Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark,

C., Lee, K., and Zettlemoyer, L. (2018). Deep con-

textualized word representations. In Proceedings of

the 2018 Conference of the North American Chapter

of the Association for Computational Linguistics: Hu-

man Language Technologies, Volume 1 (Long Papers),

pages 2227–2237.

Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. (2016).

Squad: 100, 000+ questions for machine comprehen-

sion of text. arXiv preprint arXiv:1606.05250.

Rao, J., He, H., and Lin, J. J. (2016). Noise-contrastive esti-

mation for answer selection with deep neural networks.

In CIKM.

Robertson, S. and Zaragoza, H. (2009). The probabilistic

relevance framework: Bm25 and beyond. Foundations

and Trends in Information Retrieval, 3:333–389.

Salton, G. and McGill, M. J. (1986). Introduction to modern

information retrieval.

Santoro, A., Raposo, D., Barrett, D. G. T., Malinowski,

M., Pascanu, R., Battaglia, P. W., and Lillicrap, T. P.

(2017). A simple neural network module for relational

reasoning. In NIPS.

Schroff, F., Kalenichenko, D., and Philbin, J. (2015).

Facenet: A uniﬁed embedding for face recognition and

clustering. 2015 IEEE Conference on Computer Vision

and Pattern Recognition (CVPR), pages 815–823.

ICAART 2023 - 15th International Conference on Agents and Artiﬁcial Intelligence

476

Severyn, A. and Moschitti, A. (2015). Learning to rank short

text pairs with convolutional deep neural networks. In

SIGIR.

Tan, M., dos Santos, C. N., Xiang, B., and Zhou, B. (2016).

Improved representation learning for question answer

matching. In ACL.

tau Yih, W., Chang, M.-W., Meek, C., and Pastusiak, A.

(2013). Question answering using enhanced lexical

semantic models. In ACL.

Tran, N. K. and Nieder

ee, C. (2018). Multihop attention

networks for question answer matching. In SIGIR.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L.,

Gomez, A. N., Kaiser, L. u., and Polosukhin, I. (2017).

Attention is all you need. In Advances in Neural Infor-

mation Processing Systems 30, pages 5998–6008.

Voorhees, E. M. and Harman, D. K. (2005). TREC: Experi-

ment and Evaluation in Information Retrieval (Digital

Libraries and Electronic Publishing). The MIT Press.

Wang, D. and Nyberg, E. (2015). A long short-term mem-

ory model for answer sentence selection in question

answering. In ACL.

Wang, M., Smith, N. A., and Mitamura, T. (2007). What is

the jeopardy model? a quasi-synchronous grammar for

qa. In EMNLP-CoNLL.

Wang, S., Huang, M., and Deng, Z. (2018a). Densely con-

nected cnn with multi-scale feature attention for text

classiﬁcation. In IJCAI.

Wang, S., Yu, M., Guo, X., Wang, Z., Klinger, T., Zhang, W.,

Chang, S., Tesauro, G., Zhou, B., and Jiang, J. (2018b).

3: Reinforced ranker-reader for open-domain ques-

tion answering. In AAAI 2018.

Wang, Z., Wang, Z., Zhang, D., and Yan, J. (2017). Com-

bining knowledge with deep convolutional neural net-

works for short text classiﬁcation. In IJCAI.

Yang, Y., tau Yih, W., and Meek, C. (2015). Wikiqa: A

challenge dataset for open-domain question answering.

In EMNLP.

Yin, W., Sch

utze, H., Xiang, B., and Zhou, B. (2016). Abcnn:

Attention-based convolutional neural network for mod-

eling sentence pairs. Transactions of the Association

for Computational Linguistics, 4:259–272.

Yu, L., Hermann, K. M., Blunsom, P., and Pulman, S. G.

(2014). Deep learning for answer sentence selection.

CoRR, abs/1412.1632.

MRNN: A Multi-Resolution Neural Network with Duplex Attention for Deep Ad-Hoc Retrieval

477