EmBoost: Embedding Boosting to Learn Multilevel Abstract Text
Representation for Document Retrieval
Tolgahan Cakaloglu 1,2, Xiaowei Xu 2 and Roshith Raghavan 3
1 Walmart Labs, Dallas, Texas, U.S.A.
2 University of Arkansas, Little Rock, Arkansas, U.S.A.
3 Walmart Labs, Bentonville, Arkansas, U.S.A.
Keywords:
Natural Language Processing, Information Retrieval, Deep Learning, Learning Representations, Text Matching.
Abstract:
Learning hierarchical representations has been vital in natural language processing and information retrieval.
With recent advances, the importance of learning the context of words has been underscored. In this paper
we propose EmBoost, i.e. Embedding Boosting of word or document vector representations that have been
learned from multiple embedding models. The advantage of this approach is that the resulting higher-order word
embedding represents documents at multiple levels of abstraction. The performance gain from this approach
is demonstrated by comparison with various existing text embedding strategies on retrieval and semantic
similarity tasks using the Stanford Question Answering Dataset (SQuAD) and the Question Answering by Search
And Reading (QUASAR) dataset. The multilevel abstract word embedding is consistently superior to existing solo
strategies including GloVe, fastText, ELMo and BERT-based models. Our study shows that further gains can
be made when a deep residual neural model is specifically trained for document retrieval.
1 INTRODUCTION
The objective of a question answering (QA) system is to
generate concise answers to arbitrary questions asked
in natural language. Given the recent successes of
increasingly sophisticated neural attention based question
answering models (Yu et al., 2018), the QA task
can be broken into two steps, as suggested by (Chen et al.,
2017) and (Cakaloglu and Xu, 2019):
Document retrieval: Retrieval of the document
most likely to contain all the information to answer
the question correctly.
Answer extraction: Utilizing one of the above
question-answering models to extract the answer
to the question from the retrieved document.
We use a collection of unstructured natural language documents as a knowledge base from which to retrieve
answers to questions. In this study we aim to explore
and compare the quality of various retrieval strategies
and further investigate the feasibility of training specialized neural network models for document retrieval.
Towards this end, embedding strategies play a pivotal
role in retrieval by converting text into vector representations. Traditional word embedding methods learn
hierarchical representations of documents, where each
layer gives a representation that is a higher-level abstraction of the representation from the previous layer. Most
text embedding strategies utilize either the highest
layer, as in Word2Vec (Mikolov et al., 2013), or an aggregated representation from the last few layers, as in
ELMo (Peters et al., 2018), to generate representations
for information retrieval.
In this paper, we present a new text embedding
strategy called EmBoost that consists of two steps
as shown in Figure 1. In the first step, a mixture of
weighted representations across the entire hierarchy
of a text embedding model is formed so as to preserve
all levels of abstraction. In the second step, the
representations from the various models are combined
into an ensemble representation that is used for the
document retrieval task. This strategy takes advantage
of the abstraction capabilities of the individual embedding
models, and the various models complement one another
to create a higher-quality embedding. Taking the
example of "... match ..." in Figure 1, different
levels of representation of "match", including the word
level (word sense) and the concept level (abstract meanings
such as competition, resemblance, and burning wick),
are aggregated to form a mixture of representations.
In the second step, all these mixture representations
from different word embedding models are aggregated
to form an ensemble representation, which takes
advantage of the complementary strength of individual
models and corpora. Consequently, EmBoost delivers
the power of multilevel abstraction with the strength
of individual models.
Further, in this study we introduce a convolutional
residual network (ConvRR) over the embedding vectors
to improve the performance of the document retrieval
task. The retrieval performance is further improved by
employing triplet learning with (semi-)hard negative
mining on the target corpus, i.e. selecting negative
samples that lie farther from the anchor than the
positive sample but still within the margin, thereby
forcing the model to learn to solve such hard cases:
$$\|q_{anchor}, d_{positive}\| < \|q_{anchor}, d_{negative}\| < \|q_{anchor}, d_{positive}\| + margin$$
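As a concrete illustration, the following NumPy sketch selects the candidate negatives that satisfy this semi-hard condition for a single anchor. The function name, the use of Euclidean distance on already-embedded vectors, and the margin default are our own assumptions, not the paper's implementation.

```python
import numpy as np

def semi_hard_negatives(q_anchor, d_positive, d_candidates, margin=1.0):
    """Indices of candidates that are farther from the anchor than the
    positive document, but still closer than the positive distance + margin."""
    pos_dist = np.linalg.norm(q_anchor - d_positive)
    neg_dists = np.linalg.norm(d_candidates - q_anchor, axis=1)
    mask = (pos_dist < neg_dists) & (neg_dists < pos_dist + margin)
    return np.where(mask)[0]

# Example with random 8-dimensional embeddings and 100 candidate documents
rng = np.random.default_rng(0)
q, p, cands = rng.normal(size=8), rng.normal(size=8), rng.normal(size=(100, 8))
print(semi_hard_negatives(q, p, cands, margin=1.0))
```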
This paper is structured as follows: First, we review recent advances in text embedding in Section
2. In Section 3 we dive deeper into the details of our
approach; more specifically, we describe the EmBoost
approach, followed by a formulation of the deep residual
retrieval model that can be used to augment the text
embedding and thereby enhance the quality of document
retrieval. We compare the proposed method to baselines
that utilize existing popular text embedding models.
Experimental details such as datasets, evaluation metrics
and implementation details can be found in Section 4.
The results are reported in Section 5. Section 6 concludes
and discusses potential improvements and spin-off studies.
Figure 1: The illustration of the EmBoost method using an example of "... match ...".
2 RELATED WORK
One of the most widely used measures for ranking the importance
of a word or token in a document is term
frequency-inverse document frequency (TF-IDF), as
proposed by (Salton and McGill, 1986). TF-IDF calculates
a weighting factor for each token and is widely
employed in information retrieval and text mining. Significant
recent advances have been made in word embedding,
which is the conversion of text to numeric
vectors. The literature on embedding strategies is extensively
covered by (Perone et al., 2018). Word2Vec
by (Mikolov et al., 2013), which is built upon the
neural language model for distributed word represen-
tations by (Bengio et al., 2003), has become widely
adopted in natural language processing. It is a shallow
network that preserves semantic relationships
between words and their context, i.e. the
surrounding words. The two approaches proposed in
Word2Vec are the Skip-gram model, which predicts
surrounding words from the target word, and the Continuous
Bag-of-Words (CBOW) model, which predicts the target
word given the surrounding words. Global Vectors
(GloVe) by (Pennington et al., 2014), was proposed
to address some of the limitations of Word2Vec by
focusing on the global context instead of the immedi-
ate surrounding words for learning the representations.
The global context is calculated by utilizing the word
pair co-occurrences in a corpus. During this calcula-
tion, a count-based approach is performed, unlike the
prediction-based method in Word2Vec.
Another very popular embedding approach has
been fastText by (Mikolov et al., 2018). Conceptually,
Word2Vec and fastText work in a similar fashion
to learn vector representations of words. But unlike
Word2Vec, which uses words to predict words, fastText
treats each word as being composed of character
n-grams. The vector for a word is the sum of the vectors of
these character n-grams. To further extract high-quality,
meaningful representations, Embeddings from Language
Models (ELMo) was proposed by (Peters et al., 2018).
ELMo extracts representations from a bi-directional
Long Short-Term Memory (LSTM) network (Hochreiter and
Schmidhuber, 1997) that is trained with a language
model (LM) objective on a very large text corpus.
ELMo representations are a function of the internal
layers of the bi-directional language model (biLM),
which produces rich and diverse representations of each
word/token on top of a convolutional neural network over
characters. ELMo also incorporates character n-grams,
as in fastText, but there are some structural
differences between ELMo and its predecessors.
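The following sketch illustrates the character n-gram idea behind fastText word vectors. The boundary markers and n-gram range follow the general fastText recipe, but the function names and the simplified handling of unseen n-grams (the real model hashes them into a fixed number of buckets) are our own assumptions.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in fastText-style boundary markers."""
    token = f"<{word}>"
    return [token[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(token) - n + 1)]

def subword_vector(word, ngram_vectors, dim=300):
    """Word vector built as the sum of its character n-gram vectors.

    `ngram_vectors` maps an n-gram string to a dim-dimensional array;
    unseen n-grams contribute nothing in this simplified sketch.
    """
    vec = np.zeros(dim)
    for g in char_ngrams(word):
        vec = vec + ngram_vectors.get(g, 0.0)
    return vec
```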
Encoder-decoder approaches, however, function
by compressing the input source sentence into
a fixed-length vector. This has been shown to lead to a decline
in performance when dealing with long sentences.
Additionally, the sequential nature of the model architecture
prevents parallelization. To overcome these
challenges, the attention-based transformer architecture
was proposed (Bahdanau et al., 2016; Vaswani et al.,
2017).
One of the earliest Transformer-based models was
BERT (Bidirectional Encoder Representations from
Transformers) (Devlin et al., 2019),
which was pre-trained on a large corpus of unlabeled
text including the entire English Wikipedia (2.5 billion words) and
the BookCorpus (800 million words). A key takeaway
about BERT is that it is a deeply bidirectional model,
allowing it to learn from both the left and the right side
of a token's context during the training phase. This
bidirectionality is important for truly understanding
the meaning of language. BERT has been optimized
further with XLNet (Yang et al., 2020) and RoBERTa
(Liu et al., 2019) whereas DistilBERT (Sanh et al.,
2020) improves on the inference speed. This was then
followed by sentence-BERT (Reimers and Gurevych,
2019) which adapted the BERT architecture by using
siamese and triplet network structures to derive seman-
tically meaningful sentence embeddings that can be
compared using cosine-similarity.
Last, but not least, distance metric learning is a
technique for learning a distance metric for invariant data
representations such that related vectors are kept close
to each other while dissimilar ones are separated
in the vector space, as stated by (Lowe, 1995), (Cao
et al., 2013), and (Xing et al., 2002). However, instead
of standard distance metric learning, using
deep networks to infer a non-linear embedding
of the data has shown significant improvements when it
comes to learning representations with various loss
functions, including triplet loss (Hadsell et al.,
2006; Chopra et al., 2005), contrastive loss
(Weinberger and Saul, 2009; Chechik et al., 2010),
angular loss (Wang et al., 2017), and n-pair
loss (Sohn, 2016), as demonstrated in influential studies
such as (Taigman et al., 2014), (Sun et al., 2014), (Schroff
et al., 2015), and (Wang et al., 2014).
After providing a brief review of the latest trends
in the field, we describe the details of our approach
and experimental results in the following sections.
3 PROPOSED APPROACH
3.1 Overview
The proposed approach for document retrieval begins
with devising a new text embedding approach
called EmBoost, which is an ensemble of multilevel
abstract representations learned from multiple distinct
pre-trained text embedding models. Secondly, a neural
network model called ConvRR, short for Convolutional
Residual Retrieval Network (alternatively
FCRR, short for Fully-Connected Retrieval Network,
by (Cakaloglu et al., 2018)), is trained using a triplet
loss. The general architecture of the proposed ConvRR
model is shown in Figure 2.
The model begins with a series of word inputs
$w_1, w_2, w_3, \ldots, w_k$ that could represent a phrase, sentence,
or paragraph. These inputs are then initialized
with different resolutions of pre-trained embedding
models, which may be context-free, contextual, or statistical.
ConvRR then further improves the multilevel
abstract representation by applying convolutional blocks
with a residual connection to the initialized original
embedding. The residual connection enables the
model to retain the meaning and the knowledge
derived from the pre-trained multilevel abstract embedding
while making adjustments to that knowledge
using limited additional training data. The
final representation is then passed to the retrieval task.
Figure 2: An overview of the proposed approach, consisting of an
ensemble of multilevel abstract representations learned from
multiple distinct pre-trained text embedding models and the
Convolutional Residual Retrieval Network (ConvRR).
3.2 EmBoost
Existing powerful pre-trained text embedding strategies
are trained using different data sources
(Wikipedia, Common Crawl, etc.) as well as different
techniques (supervised, unsupervised, or variations
thereof). These pre-trained representations broadly
belong to three types: context-free (GloVe, fastText,
etc.), contextual (ELMo, BERT, etc.), and statistical
(term frequency-inverse document frequency). The
contextual strategies can further be unidirectional or
bidirectional. Generally, context-free and statistical
text embeddings are represented as a vector, whereas
contextual text embedding strategies are represented
as a matrix.
Traditionally, one of the pre-trained embedding models
is selected to initialize a network for a defined
downstream task. Hence, a series of word inputs to the
network is initialized using the selected embedding
model. If the selected embedding model generates a
matrix instead of a $d$-dimensional vector, then the matrix
for each word is represented as follows:

$$E_i^j = [e_1, e_2, \cdots, e_l]_{l \times d} \quad (1)$$
where $E_i^j \in \mathbb{R}^{l \times d}$ is the $l \times d$-dimensional pre-trained
word matrix of the $i$-th word input, $j$ denotes the given
embedding model, and $l$ represents the number of layers
in the embedding model. Averaging all the layers
(ELMo), $\frac{1}{l}\sum_{i=1}^{l} e_i$, or concatenating each of the last
4 layers (BERT), $e_{l-3} \oplus e_{l-2} \oplus e_{l-1} \oplus e_l$, of the matrix
are the best practices to create a $d'$-dimensional
vector, where $d' = d$ if averaging over all layers is used
and $d' = L \times d$ in case concatenation is used, with $L$ being the
number of layers (we use the notation $L$ to refer
to the layers selected from an $l$-layer embedding model).
Current practice is to use only the last layer or the top
few layers, while we propose to consider all layers for a
multilevel abstract representation.
The proposed multilevel abstract word embedding
has two cascaded operations: $f_{mixture}(\cdot,\cdot,\cdot)$
and $f_{ensemble}(\cdot)$.
Forming a mixture of the representations from an
embedding model, $f_{mixture}(\cdot,\cdot,\cdot)$, can be formulated as
below:

$$x_i^j = f_{mixture}(E_i^j, w_{idf}, m^j) \quad (2)$$
where $m^j \in \mathbb{R}^l$ is a coefficient vector with $\sum_{i=1}^{l} m_i^j = 1$.
Each coordinate of $m^j$ represents a magnitude
that weights the corresponding layer of the model $E_i^j$.
$w_{idf}$ denotes the IDF weight of the $i$-th word input.
$f_{mixture}(\cdot,\cdot,\cdot)$ is an aggregate function, which aggregates
the input using an operation such as sum,
average, or concatenate; the weighted layers of the
model $E_i^j$ are then computed by that aggregate function.
$x_i^j$ is a $d'$-dimensional vector where $d' = d$ if
$f_{mixture}(\cdot,\cdot,\cdot)$ is defined by sum or average, and $d' = d \times l$
in case $f_{mixture}(\cdot,\cdot,\cdot)$ is defined by concatenate.
The obtained mixtures of representations from multiple
word embedding models can then form an ensemble
representation as follows:
$$X_i' = \{x_i^1, x_i^2, \cdots, x_i^n\} \quad (3)$$
where $X_i'$ is the set of representations obtained with
$f_{mixture}(\cdot,\cdot,\cdot)$ from the different embedding models for the $i$-th
word input and $n$ is the number of embedding models.
$f_{ensemble}(\cdot)$ is a function that aggregates all representations
in $X_i'$ and can be defined as follows:
$$x_i = f_{ensemble}(X_i', u) \quad (4)$$
where $f_{ensemble}(\cdot)$ is also an aggregate function defined
by an operation such as sum, average, or concatenate.
Note that the representations are coerced to a common
length if $f_{ensemble}(\cdot)$ is defined by sum or average.
Additionally, $u \in \mathbb{R}^n$ is a coefficient vector with
$\sum_{j=1}^{n} \frac{u_j}{\|u\|} = 1$. Each coordinate of $u$ represents
a magnitude that weights the corresponding embedding
model in the ensemble of text embedding models.
Hence, $x_i$ is the $d''$-dimensional multilevel abstract word
embedding of the $i$-th word input. The pseudo-code of
the proposed approach is shown in Algorithm 1.
Algorithm 1: EmBoost for i-th word input.
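The following NumPy sketch restates the two cascaded operations of Equations (2)-(4) for a single word input. The function signatures, the default operations, and the point at which the IDF weight is applied are assumptions on our part rather than the exact steps of Algorithm 1.

```python
import numpy as np

def f_mixture(E, w_idf, m, op="sum"):
    """Mixture over the layers of one embedding model (Eq. 2).

    E     : (l, d) layer matrix of the i-th word for one embedding model
    w_idf : scalar IDF weight of the word (use 1.0 to disable)
    m     : (l,) layer coefficients summing to 1
    """
    weighted = m[:, None] * E                     # weight each layer
    if op == "sum":
        mixed = weighted.sum(axis=0)
    elif op == "average":
        mixed = weighted.mean(axis=0)
    else:                                         # "concatenate"
        mixed = weighted.reshape(-1)
    return w_idf * mixed

def f_ensemble(X_mixtures, u, op="concatenate"):
    """Ensemble over the mixtures from n embedding models (Eq. 4).

    X_mixtures : list of n mixture vectors (coerced to a common length
                 beforehand if op is "sum" or "average")
    u          : (n,) model coefficients
    """
    scaled = [u_j * x for u_j, x in zip(u, X_mixtures)]
    if op in ("sum", "average"):
        stacked = np.stack(scaled)                # requires a common length
        return stacked.sum(axis=0) if op == "sum" else stacked.mean(axis=0)
    return np.concatenate(scaled)
```

With the configuration used later in the paper (Tables 2 and 3), the BERT mixture would use a concatenation over its last four layers while the ELMo and fastText mixtures use a sum, and the ensemble concatenates the three resulting vectors.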
With the ensemble text embedding approach, we
generate embeddings by taking the following aspects
into consideration:
Multi-sources: Instead of relying on one pre-trained
embedding model, we want to utilize the
power of multiple pre-trained embedding models,
since they are trained using different data sources as
well as different techniques. Therefore, integrating
different word embedding models can harness the
complementary power of individual models.
Different layers: We take the embeddings from different
layers of each embedding model $E$ instead
of just the last layer or a few top layers.
Weighted embedding: Incorporating word embeddings
with a weighting factor such as inverse document
frequency (IDF) or BM25 produces better results
for information retrieval and text classification, as
presented by (Boom et al., 2015).
The IDF is formulated as $\log_e\left(\frac{\#\,documents}{df_w}\right)$, where
the document frequency $df_w$ is the number of documents
in the considered corpus that contain the
particular word $w$.
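A minimal sketch of how such IDF weights can be computed, assuming the corpus is available as an iterable of tokenized documents (the function name is ours):

```python
import math
from collections import Counter

def idf_weights(tokenized_docs):
    """IDF of every word: log_e(number of documents / document frequency)."""
    n_docs = len(tokenized_docs)
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))            # count each word at most once per document
    return {word: math.log(n_docs / df_w) for word, df_w in df.items()}

# Example
docs = [["the", "match", "was", "won"], ["a", "burning", "match"], ["the", "game"]]
print(idf_weights(docs)["match"])      # log(3 / 2)
```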
3.3 ConvRR
To further improve the performance of document re-
trieval, a convolutional residual retrieval (ConvRR)
model is trained on top of the proposed ensemble of
text embeddings. The model is presented in Figure 2.
Let $x_i \in \mathbb{R}^{d''}$ be the $d''$-dimensional proposed multilevel
abstract word embedding of the $i$-th word input
in a text; the word inputs can therefore be denoted as
a matrix:

$$X = [x_1, x_2, x_3, \cdots, x_k]_{k \times d''} \quad (5)$$

where $k$ is the number of word inputs in the text. The
ConvRR generates feature representations, which can
be expressed as the following:

$$X'' = f(W, X, sf) \quad (6)$$

$$o = X'' + \frac{1}{k}\sum_{i=1}^{k} x_i \quad (7)$$
where $f(\cdot,\cdot,\cdot)$ is the convolutional residual retrieval
network that executes a series of convolutional components
(a convolution and a rectified linear unit (ReLU)
(Nair and Hinton, 2010)), a pooling, and a scaling operation.
$X''$ is produced from the multilevel abstract word
embedding $X$ with trainable weights $W \in \mathbb{R}^{d'' \times ws \times d''}$.
The weight matrix $W$ contains $d''$ kernels, each of
size $ws \times d''$, convolving $ws$ contiguous vectors;
$ws$ and $d''$ represent the window size and the number of kernels,
respectively. An average pooling operation is added
after the final convolutional component, which consolidates
unnecessary features and boosts computational
efficiency. $sf$ is a scaling factor that weights the
output with a constant factor. Hence, $X''$ is trained on
how much contribution it adds to $X$ through the residual
connection in order to improve the retrieval task. The final output
$o = [o_1, o_2, o_3, \cdots, o_{d''}] \in \mathbb{R}^{d''}$ is generated and
fed into the next component. Note that each
feature vector $o$ is normalized to unit $\ell_2$ norm before
being passed to the next step.
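The following single-block NumPy sketch illustrates Equations (6) and (7); the number of convolutional blocks, the padding scheme, and the exact placement of the scaling factor are not fully specified in the text, so those choices here are assumptions, and the number of kernels is taken to equal the embedding dimension $d''$ as in the paper.

```python
import numpy as np

def convrr_forward(X, W, sf=0.05):
    """Single-block sketch of ConvRR (Eqs. 6-7).

    X  : (k, d) matrix of multilevel abstract word embeddings
    W  : (d, ws, d) weights, i.e. d kernels over windows of ws embedding vectors
    sf : scaling factor applied to the convolutional branch
    """
    k, d = X.shape
    n_kernels, ws, _ = W.shape
    # Convolution over the word axis followed by ReLU
    conv = np.zeros((k - ws + 1, n_kernels))
    for t in range(k - ws + 1):
        window = X[t:t + ws]                                  # (ws, d)
        conv[t] = np.maximum(0.0, np.einsum("jwd,wd->j", W, window))
    # Average pooling over positions, then scaling: this plays the role of X'' in Eq. (6)
    x_conv = sf * conv.mean(axis=0)                           # (d,)
    # Residual connection with the mean of the original embeddings (Eq. 7)
    o = x_conv + X.mean(axis=0)
    # l2-normalize before passing the representation to retrieval
    return o / np.linalg.norm(o)
```

In the configuration reported in Section 4.3.2, ws = 5, the number of kernels equals d'' = 4,372, and sf = 0.05.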
Figure 3: Overall flow diagram for the proposed approach.
3.4 Loss Function
In order to train the ConvRR network to perform well
on the retrieval task and generalize well to unseen data,
we utilize a Siamese architecture with the triplet loss
by (Hadsell et al., 2006) and (Chopra et al., 2005) during
the training period, as shown in Figure 3. With this
setup, the network is encouraged to make the distances
between positive pairs smaller than the distances
between negative ones. A particular question
$q_{anchor}$ should be closer in proximity to its positive
document $d_{positive}$ than to any document $d_{negative}$,
where the negatives are the positive documents of other
questions. The key point of $L_{triplet}$ is
to build correct triplet structures, which should meet
the condition of the following equation:
$$\|q_{anchor}, d_{positive}\| + m < \|q_{anchor}, d_{negative}\|$$
For each anchor, the hardest positive $d_{positive}$ is selected as
$\arg\max_{d_{positive}} \|q_{anchor}, d_{positive}\|$ and, likewise,
the hardest negative $d_{negative}$ as
$\arg\min_{d_{negative}} \|q_{anchor}, d_{negative}\|$ to form a triplet. This
triplet selection strategy is called hard triplet mining.
Let $T = (d_{positive}, q_{anchor}, d_{negative})$ be a triplet input.
Given $T$, the proposed approach computes the
distances between the positive and negative pairs via a
two-branch Siamese subnet through the ensemble text
embedding and ConvRR.
$$L_{triplet} = \big[\,\|q_{anchor}, d_{positive}\| - \|q_{anchor}, d_{negative}\| + m\,\big]_{+} \quad (8)$$
where $m > 0$ is a scalar value, namely the margin, and
$\|\cdot,\cdot\|$ represents the Euclidean distance between two
vectors.
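A NumPy sketch of this hard-mined triplet loss for a single anchor question is given below; operating directly on pre-computed embedding vectors (rather than through the two-branch Siamese subnet) is a simplification for illustration.

```python
import numpy as np

def triplet_loss(q_anchor, d_positives, d_negatives, margin=1.0):
    """Hard-mined triplet loss for one anchor question (Eq. 8).

    d_positives : (p, dim) documents paired with this question
    d_negatives : (n, dim) documents paired with other questions
    """
    pos_dists = np.linalg.norm(d_positives - q_anchor, axis=1)
    neg_dists = np.linalg.norm(d_negatives - q_anchor, axis=1)
    hardest_pos = pos_dists.max()      # argmax over positives
    hardest_neg = neg_dists.min()      # argmin over negatives
    return max(0.0, hardest_pos - hardest_neg + margin)
```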
4 EXPERIMENTS
4.1 Datasets
In order to evaluate our proposed approach, we conducted
extensive experiments on two large question-answering
datasets: SQuAD (Rajpurkar et al., 2016) and QUASAR (Dhingra et al., 2017).
4.1.1 SQuAD
The Stanford Question Answering Dataset (SQuAD)
(Rajpurkar et al., 2016) is a large reading comprehension
dataset built from 100,000+ questions.
Each of these questions is composed through crowdsourcing
on a set of Wikipedia documents, where the
answer to each question is a segment of text from
the corresponding reading passage. In other words,
the consolidation of the retrieval and extraction tasks is
aimed at measuring the success of the proposed systems.
4.1.2 QUASAR
The Question Answering by Search And Reading
(QUASAR) benchmark is a large-scale dataset consisting of
QUASAR-S and QUASAR-T. Each of these datasets
is built to evaluate systems devised to understand
a natural language query and a large corpus of text,
and to extract the answer to the question from that corpus.
Similar to SQuAD, the consolidation of the retrieval and
extraction tasks is aimed at measuring the success
of the proposed systems. Specifically, QUASAR-S
comprises 37,012 fill-in-the-gap questions
collected from the popular website Stack Overflow
using entity tags. Since our research is not about addressing
fill-in-the-gap questions, we focus on the QUASAR-T dataset, which fulfills the requirements
of our retrieval task. The QUASAR-T
dataset contains 43,012 open-domain questions collected
from various internet sources. The candidate
documents for each question in this dataset are retrieved
from an Apache Lucene based search engine
built on the ClueWeb09 dataset (Callan et al., 2009).
The number of queries in each dataset, including
their subsets, is listed in Table 1.
Table 1: Datasets Statistics: Number of queries in each train,
validation, and test subsets.
DATASET TRAIN VALID. TEST TOTAL
SQUAD 87,599 10,570 HIDDEN 98,169+
QUASAR-T 37,012 3,000 3,000 43,012
4.2 Evaluation
The retrieval model aims to improve the recall@k
score by ranking the correct pair among all candidates.
Basically, recall@k is defined as the fraction of
relevant documents that are listed within the top-k
retrieved documents (Manning et al.,
2008). Additionally, embedding representations are
visualized using t-distributed stochastic neighbor embedding
(t-SNE) (van der Maaten and Hinton, 2008) in order
to project the clustered distributions of the questions
that are assigned to the same documents.
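A sketch of recall@k under the assumption that each query has exactly one relevant document and that retrieval ranks documents by Euclidean distance between the (normalized) query and document embeddings:

```python
import numpy as np

def recall_at_k(query_vecs, doc_vecs, true_doc_ids, k=5):
    """Fraction of queries whose correct document is ranked in the top k."""
    hits = 0
    for q, gold in zip(query_vecs, true_doc_ids):
        dists = np.linalg.norm(doc_vecs - q, axis=1)   # distance to every document
        top_k = np.argsort(dists)[:k]                  # indices of the k nearest
        hits += int(gold in top_k)
    return hits / len(true_doc_ids)
```

The tables in Section 5 report this score at k = 1, 3, and 5.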
4.3 Implementation
4.3.1 Input
Word embeddings were initialized using the proposed
ensemble text embedding, EmBoost. The $f_{mixture}(\cdot,\cdot,\cdot)$
and $f_{ensemble}(\cdot)$ settings that represent the best configuration
are shown in Table 2 and Table 3, respectively.
Table 2: $f_{mixture}(\cdot,\cdot,\cdot)$ configuration of EmBoost.

E          w_idf   m                                f_mix     OUT
BERT       FALSE   [1/4, 1/4, 1/4, 1/4, 0, .., 0]   concat.   X^1
ELMO       TRUE    [0, 0, 1]                        sum       X^2
FASTTEXT   TRUE    [1]                              sum       X^3
Table 3: $f_{ensemble}(\cdot)$ configuration of the multilevel abstract word embedding.

X                  u                  f_ensemble
{X^1, X^2, X^3}    [1/3, 1/3, 1/3]    concat.
The short form of this multilevel abstract word
embedding is denoted BERT ⊕ ETwI ⊕ FTwI,
where (.)wI denotes "with IDF" and ⊕ represents the
concatenation operation. The dimension of the embedding
is 4,372.
4.3.2 ConvRR Configuration
ConvRR is trained using the ADAM optimizer (Kingma
and Ba, 2014) with a learning rate of $10^{-3}$. For the
sake of fair comparison, we fixed the random seed.
We also observed that a weight decay of
$10^{-3}$ tackles over-fitting. We choose a window size
$ws = 5$, a number of kernels $d'' = 4,372$, and a scaling
factor $sf = 0.05$. We trained the network for 400
iterations with a batch size of 2,000, using a triplet loss
with a margin $m = 1$. Note that the best performance
is achieved using a relatively large batch size. All experiments
are implemented with TensorFlow 1.8+
(Abadi et al., 2015) on 2 × NVIDIA Tesla K80 GPUs.
5 RESULTS
We study different embedding models by initializing
the text inputs of the datasets with different traditional
embedding models. We first compared our model with
the following baselines: TF-IDF, BERT, ELMo-AVG
(averaging all layers of ELMo), GloVe, and fastText.
Additionally, we also initialized the text inputs using the
proposed EmBoost approach. To demonstrate the contribution
of the different components of EmBoost, we first
configured it without the ensemble function, which
includes ELMO-LSTM1 (first layer of ELMo), BERT
w/IDF (BERT with IDF weight), ELMO-LSTM2 (second
layer of ELMo), ELMO-TOKEN (token layer of
ELMo), ELMO-TOKEN w/IDF (ELMo token layer with
IDF weight), and FASTTEXT w/IDF (fastText with IDF
weight). We then configured EmBoost with the ensemble
function as a concatenation of the ELMo token layer with
IDF weight and fastText with IDF weight (ETwI ⊕ FTwI),
as well as a concatenation of BERT (concatenation
of the last 4 layer representations), the ELMo token layer
with IDF weight, and fastText with IDF weight (BERT
⊕ ETwI ⊕ FTwI). Last but not least, we also compared
the performance gain from downstream models, namely
a fully connected residual network (FCRR) and the
convolutional residual network (ConvRR).
The recall@k results calculated for the SQuAD
and QUASAR-T datasets are listed in Table 4 and
Table 5. Our ConvRR model initialized with the
proposed EmBoost approach outperforms all the
baseline models on these datasets. More specifically,
the results show that the proposed EmBoost approach
without the ensemble function significantly improves
over all baseline methods. The improvement
is further increased when EmBoost uses the ensemble
function that combines embeddings trained using
different models and corpora. The only exception is
recall@1 on QUASAR-T (Table 5).
We believe the reason is that QUASAR-T is a
relatively small dataset, for which a single word
embedding model trained using a very large corpus
should be sufficient, and the ensemble function is not
necessary in this case. The best performance is
achieved when the proposed EmBoost is applied with
a downstream IR model, which is much better in
comparison with the baseline word embedding models
applied with the same downstream IR model.
Table 4: Experimental results on SQuAD: recall@k of retrieved documents, using different models and the proposed approach.

EMBEDDING/MODEL                      @1     @3     @5
BASE EMBEDDINGS
TF-IDF                               8.77   15.46  19.47
BERT                                 18.89  32.31  39.52
ELMO-AVG                             21.24  36.24  43.88
GLOVE                                30.84  47.14  54.01
FASTTEXT                             42.23  59.86  67.12
EMBOOST (W/O ENSEMBLE)
ELMO-LSTM1                           19.65  34.34  42.52
BERT W/ IDF                          21.81  36.35  43.56
ELMO-LSTM2                           23.68  39.39  47.23
ELMO-TOKEN                           41.62  57.79  64.36
ELMO-TOKEN W/ IDF                    44.85  61.55  68.07
FASTTEXT W/ IDF                      45.13  62.80  69.85
EMBOOST (W/ ENSEMBLE)
ETWI ⊕ FTWI                          46.33  63.13  69.70
BERT ⊕ ETWI ⊕ FTWI                   48.49  64.96  71.05
BASE EMBEDDING + DOWNSTREAM MODELS
FASTTEXT + FCRR                      45.70  63.15  70.02
FASTTEXT + CONVRR                    47.14  64.16  70.87
EMBOOST + DOWNSTREAM MODELS
BERT ⊕ ETWI ⊕ FTWI + FCRR            50.64  66.16  73.44
BERT ⊕ ETWI ⊕ FTWI + CONVRR          52.32  68.26  75.68
The t-SNE visualizations of question embeddings
derived using different embedding models,
including BERT, the ELMo token layer, fastText, the multilevel
abstract word embedding (a concatenation
of BERT's last 4 layer representations, the ELMo token layer with IDF weight, and
fastText with IDF weight), and ConvRR, are shown in
Figure 4. Note that these questions match the 4
sampled contexts/documents (labeled 57, 253, 531, and 984)
extracted from the SQuAD validation
dataset. The visualization shows that the proposed
EmBoost significantly improves the clustering of the
questions and their corresponding contexts/documents. The
result is further improved by using ConvRR, the proposed
retrieval model.
6 CONCLUSION
We developed a new multilevel abstract word embedding
approach called EmBoost, which harnesses the
individual strengths of diverse word embedding
methods. The performance of the proposed approach
is further improved by a convolutional
residual retrieval model optimized with a triplet loss
function for the task of document retrieval.
Table 5: Experimental results on QUASAR-T: recall@k of retrieved documents, using different models and the proposed approach.

EMBEDDING/MODEL                      @1     @3     @5
BASE EMBEDDINGS
TF-IDF                               13.86  20.20  23.13
BERT                                 25.50  34.20  37.86
ELMO-AVG                             27.93  37.86  42.33
GLOVE                                32.63  40.73  44.03
FASTTEXT                             46.13  56.00  59.46
EMBOOST (W/O ENSEMBLE)
ELMO-LSTM1                           24.60  33.01  36.90
ELMO-LSTM2                           27.03  36.33  40.56
BERT W/ IDF                          27.33  38.43  40.11
ELMO-TOKEN                           44.46  54.86  59.36
ELMO-TOKEN W/ IDF                    48.86  60.56  65.03
FASTTEXT W/ IDF                      49.66  58.70  61.96
EMBOOST (W/ ENSEMBLE)
ETWI ⊕ FTWI                          48.78  60.05  64.10
BERT ⊕ ETWI ⊕ FTWI                   49.46  60.93  65.66
BASE EMBEDDING + DOWNSTREAM MODELS
FASTTEXT + FCRR                      47.11  58.25  62.12
FASTTEXT + CONVRR                    48.17  59.06  63.07
EMBOOST + DOWNSTREAM MODELS
BERT ⊕ ETWI ⊕ FTWI + FCRR            49.55  61.58  64.53
BERT ⊕ ETWI ⊕ FTWI + CONVRR          50.67  63.09  67.38
Figure 4: t-SNE map visualizations of various embedding
models for all question representations of the 4 sampled
contexts/documents (57, 253, 531, 984) extracted from the
SQuAD validation dataset.
Document retrieval is a crucial step for many natural
language processing and information retrieval tasks.
We further evaluated the proposed method for document
retrieval from an unstructured knowledge base. The
empirical study on the large SQuAD and QUASAR
benchmark datasets shows a significant performance
gain in terms of recall. In the future, we plan to
apply the proposed framework to other information
retrieval and ranking tasks. We also want to improve
the performance of the retrieval task by applying and
developing new loss functions and retrieval models.
REFERENCES
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z.,
Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin,
M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G.,
Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur,
M., Levenberg, J., Mané, D., Monga, R., Moore, S.,
Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner,
B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke,
V., Vasudevan, V., Viégas, F., Vinyals, O., Warden,
P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X.
(2015). TensorFlow: Large-scale machine learning
on heterogeneous systems. Software available from
tensorflow.org.
Bahdanau, D., Cho, K., and Bengio, Y. (2016). Neural
machine translation by jointly learning to align and
translate.
Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. (2003).
A neural probabilistic language model. Journal of
Machine Learning Research, 3:1137–1155.
Boom, C. D., Canneyt, S. V., Bohez, S., Demeester, T.,
and Dhoedt, B. (2015). Learning semantic similarity
for very short texts. In 2015 IEEE International Con-
ference on Data Mining Workshop (ICDMW), pages
1229–1234.
Cakaloglu, T., Szegedy, C., and Xu, X. (2018). Text embed-
dings for retrieval from a large knowledge base. arXiv
preprint arXiv:1810.10176.
Cakaloglu, T. and Xu, X. (2019). MRNN: A multi-resolution
neural network with duplex attention for document
retrieval in the context of question answering. CoRR,
abs/1911.00964.
Callan, J., Hoy, M., Yoo, C., and Zhao, L. (2009). Clueweb09
data set.
Cao, Q., Ying, Y., and Li, P. (2013). Similarity metric learn-
ing for face recognition. In 2013 IEEE International
Conference on Computer Vision, pages 2408–2415.
Chechik, G., Sharma, V., Shalit, U., and Bengio, S. (2010).
Large scale online learning of image similarity through
ranking. J. Mach. Learn. Res., 11:1109–1135.
Chen, D., Fisch, A., Weston, J., and Bordes, A. (2017).
Reading wikipedia to answer open-domain questions.
arXiv preprint arXiv:1704.00051.
Chopra, S., Hadsell, R., and LeCun, Y. (2005). Learning a
similarity metric discriminatively, with application to
face verification. volume 1, pages 539–546 vol. 1.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019).
Bert: Pre-training of deep bidirectional transformers
for language understanding.
Dhingra, B., Mazaitis, K., and Cohen, W. W. (2017). Quasar:
Datasets for question answering by search and reading.
arXiv preprint arXiv:1707.03904.
Hadsell, R., Chopra, S., and LeCun, Y. (2006). Dimensional-
ity reduction by learning an invariant mapping. CVPR
’06, pages 1735–1742.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term
memory. Neural Comput., 9(8):1735–1780.
Kingma, D. P. and Ba, J. (2014). Adam: A
method for stochastic optimization. arXiv preprint
arXiv:1412.6980.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D.,
Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V.
(2019). Roberta: A robustly optimized bert pretraining
approach.
Lowe, D. G. (1995). Similarity metric learning for a variable-
kernel classifier. Neural Computation, 7(1):72–85.
Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction
to Information Retrieval. Cambridge University Press.
Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., and
Joulin, A. (2018). Advances in pre-training distributed
word representations. In Proceedings of the Interna-
tional Conference on Language Resources and Evalua-
tion (LREC 2018).
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and
Dean, J. (2013). Distributed representations of words
and phrases and their compositionality. In Advances
in Neural Information Processing Systems 26, pages
3111–3119.
Nair, V. and Hinton, G. E. (2010). Rectified linear units im-
prove restricted boltzmann machines. In Proceedings
of the 27th International Conference on International
Conference on Machine Learning, ICML’10, pages
807–814, USA.
Pennington, J., Socher, R., and Manning, C. D. (2014).
Glove: Global vectors for word representation. In
Empirical Methods in Natural Language Processing
(EMNLP), pages 1532–1543.
Perone, C. S., Silveira, R., and Paula, T. S. (2018).
Evaluation of sentence embeddings in downstream
and linguistic probing tasks. arXiv preprint
arXiv:1806.06259.
Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark,
C., Lee, K., and Zettlemoyer, L. (2018). Deep con-
textualized word representations. In Proceedings of
the 2018 Conference of the North American Chapter
of the Association for Computational Linguistics: Hu-
man Language Technologies, Volume 1 (Long Papers),
pages 2227–2237.
Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. (2016).
SQuAD: 100,000+ questions for machine comprehension
of text. arXiv preprint arXiv:1606.05250.
Reimers, N. and Gurevych, I. (2019). Sentence-bert: Sen-
tence embeddings using siamese bert-networks.
Salton, G. and McGill, M. J. (1986). Introduction to modern
information retrieval.
Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2020).
Distilbert, a distilled version of bert: smaller, faster,
cheaper and lighter.
Schroff, F., Kalenichenko, D., and Philbin, J. (2015).
Facenet: A unified embedding for face recognition
and clustering. volume 00, pages 815–823.
Sohn, K. (2016). Improved deep metric learning with multi-
class n-pair loss objective. NIPS’16, pages 1857–1865.
Sun, Y., Chen, Y., Wang, X., and Tang, X. (2014). Deep
learning face representation by joint identification-
verification. pages 1988–1996.
Taigman, Y., Yang, M., Ranzato, M., and Wolf, L. (2014).
Deepface: Closing the gap to human-level performance
in face verification. CVPR ’14, pages 1701–1708.
van der Maaten, L. and Hinton, G. E. (2008). Visualizing
data using t-sne.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017).
Attention is all you need.
Wang, J., Song, Y., Leung, T., Rosenberg, C., Wang, J.,
Philbin, J., Chen, B., and Wu, Y. (2014). Learning
fine-grained image similarity with deep ranking. pages
1386–1393.
Wang, J., Zhou, F., Wen, S., Liu, X., and Lin, Y. (2017).
Deep metric learning with angular loss.
Weinberger, K. Q. and Saul, L. K. (2009). Distance metric
learning for large margin nearest neighbor classifica-
tion. 10:207–244.
Xing, E. P., Ng, A. Y., Jordan, M. I., and Russell, S. (2002).
Distance metric learning, with application to clustering
with side-information. NIPS’02, pages 521–528.
Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.,
and Le, Q. V. (2020). Xlnet: Generalized autoregres-
sive pretraining for language understanding.
Yu, A. W., Dohan, D., Luong, M.-T., Zhao, R., Chen, K.,
Norouzi, M., and Le, Q. V. (2018). Qanet: Combining
local convolution with global self-attention for reading
comprehension. arXiv preprint arXiv:1804.09541.