Embeddings Might Be all You Need: Domain-Specific Sentence Encoders
for Latin American E-Commerce Questions
Rodrigo Caus¹,²,ᵃ, Victor Sotelo¹,²,ᵇ, Victor Hochgreb²,ᶜ and Julio Cesar dos Reis¹,ᵈ
¹ Institute of Computing, University of Campinas, Campinas, São Paulo, Brazil
² GoBots Company, Campinas, São Paulo, Brazil
Keywords:
Sentence Embeddings, Question Retrieval, E-Commerce.
Abstract:
In Latin American e-commerce, customer inquiries often exhibit unique linguistic patterns that require special-
ized handling for accurate responses. Traditional sentence encoders may struggle with these regional nuances,
leading to less effective answers. This study investigates the application of fine-tuned transformer models
to generate domain-specific sentence embeddings, focusing on Portuguese and Spanish retrieval tasks. Our
findings demonstrate that these specialized embeddings significantly outperform general-purpose pre-trained
models and traditional techniques, such as BM-25, thereby eliminating the need for additional re-ranking
steps in retrieval processes. We further investigate the impact of multi-objective training combined with Matryoshka Representation Learning, demonstrating its effectiveness in maintaining retrieval performance across various embedding dimensions. Our approach offers a scalable and efficient solution for multilingual retrieval in e-commerce, reducing computational costs while ensuring high accuracy.
1 INTRODUCTION
In the rapidly growing e-commerce landscape, ef-
fective customer service through accurate question-
answering systems is crucial to user satisfaction
and conversions. Sentence encoders (Reimers and
Gurevych, 2019) play a central role in these systems,
capturing semantic meaning, context, and relation-
ships in numerical embeddings. Such embeddings
can be used to select the most appropriate answer to
the customer inquiry.
General-purpose sentence encoders often prove
less effective in specialized domains due to their diffi-
culty in capturing unique vocabulary, phrasing, and
contextual nuances (Tang and Yang, 2025). As a result, generic models frequently require high-dimensional embeddings and separate re-ranking
models to achieve acceptable domain-specific effec-
tiveness, especially when resource minimization is a
key objective.
ᵃ https://orcid.org/0000-0002-0904-4865
ᵇ https://orcid.org/0000-0001-9245-8753
ᶜ https://orcid.org/0000-0002-0529-7312
ᵈ https://orcid.org/0000-0002-9545-2098
The GoBots¹ company addresses a high volume of customer inquiries from e-commerce platforms in Spanish and Portuguese. We have implemented
an end-to-end question-answering solution based on
embeddings to manage customer queries. Existing
pretrained solutions assist in retrieving suitable text
(questions) to provide answers. This context requires
performing a re-ranking process to ensure the quality
of the retrieved text (Chico et al., 2023).
However, this multi-component approach inher-
ently increases complexity and can compromise the
overall quality and efficiency of the retrieval pipeline.
Employing a distinct retriever and a subsequent re-
ranker directly escalates computational resource de-
mands, which is prohibitive for small business sce-
narios. Such an architecture typically requires signif-
icantly more memory and CPU processing per query,
leading to higher operational costs and potentially im-
pacting end-user response latency. In contrast, fine-
tuning domain-specific sentence encoders may offer
a more direct path to optimize cost, processing, and
storage.
¹ Leading company of artificial intelligence (AI) solutions for the e-commerce sector in Latin America. Official website: https://gobots.ai.

This study investigates resource optimization strategies for e-commerce question paraphrase re-
trieval pipelines that integrate vector-based retrieval
(potentially utilizing dense or sparse vectors) with a
subsequent re-ranking phase. This research aims to
attain two specific goals:
1. Assess the feasibility and effectiveness of utiliz-
ing a single, unified embedding model to gener-
ate representations for retrieval pipelines, compar-
ing their performance against conventional two-
model architectures (i.e., separate models for re-
trieval and re-ranking).
2. Analyze the trade-off between the reduction in
embedding dimensionality from such a unified
model and the consequent impact on retrieval ef-
fectiveness and computational efficiency.
Our findings demonstrate that a single, domain-
fine-tuned embedding model, trained efficiently on
a single, commonly available GPU, outperforms the
multi-model encoder-re-ranker pipeline and BM-25
retrieval in a real-world e-commerce setting. This
study, conducted in collaboration with a company,
highlights the practical benefits of this streamlined ap-
proach.
As key contributions, we release our test and cali-
bration datasets. Notably, these datasets are in Por-
tuguese and Spanish, which are often underrepre-
sented in natural language processing research, offer-
ing valuable resources for extending existing embed-
ding model benchmarks, such as MTEB (Enevold-
sen et al., 2025). Furthermore, we are open-sourcing
our training and validation code, enabling other re-
searchers and practitioners to adapt and apply these
methods to their domains².
The remainder of this article is organized as fol-
lows: Section 2 presents a synthesis and analysis of
key related studies. Section 3 summarizes the E-FAQ,
a dataset generated in our research. Section 4 de-
scribes the training details and approaches used in this
research. Section 5 outlines our experimental evalua-
tion, which includes the dataset, baselines, and eval-
uation metrics. Section 6 reports on our results ob-
tained. Section 7 discusses our findings. Finally, Sec-
tion 8 summarizes the conclusions and suggests direc-
tions for future research.
2 RELATED WORK
Related work, in the context of our research, concerns
training domain-specific and language-specific em-
bedding models, particularly for information retrieval
tasks.
² Available at https://github.com/rodrigocaus/embedding-training.
On domain-specific embedding models, Feng et al. (Feng et al., 2020) introduced CodeBERT, a transformer-based model trained on open-source GitHub repositories, which currently supports six programming languages. It follows multilingual BERT approaches, using masked language modeling during training. The model focuses on bimodal data, aligning text (code documentation) with the corresponding code during pre-training. After this initial training, the base model is fine-tuned, thereby improving the alignment between text and code representations. The authors tested the effectiveness of code retrieval based on natural language queries, and CodeBERT outperformed other pre-trained models, such as RoBERTa, achieving a higher Mean Reciprocal Rank on the CodeSearchNet benchmark.
Clinical BERT (Alsentzer et al., 2019) models
were developed to meet the need for domain-specific
embeddings in clinical contexts. The authors initial-
ized Clinical BERT using two primary models: Base
BERT and BioBERT. They followed the same train-
ing procedures used for BERT, utilizing a corpus of
clinical texts. Their findings showed that specialized
domain models performed better in domain classifi-
cation tasks for clinical benchmarks. However, a lim-
itation of these models is their limited generalization
to datasets that differ from the training data.
Regarding language-specific embedding models,
Huang et al. (Huang et al., 2024) introduced Pic-
colo 2, a state-of-the-art model on Chinese embed-
ding benchmarks. It relies on an efficient multi-task hybrid loss training approach that effectively exploits textual data and labels for various downstream tasks, combined with Matryoshka Representation Learning (MRL) to support more flexible vector dimensions. It was evaluated over six tasks on the CMTEB benchmark, including text retrieval, pair classification, and semantic similarity.
Industrial application models (Bednář et al., 2024) focused on creating lower-dimensional embeddings to improve computational efficiency. The authors applied the study to Seznam, a Czech search engine, and explored techniques suitable for non-English languages, utilizing datasets from non-public sources. Their study examined three methods that do not require large datasets, making them practical for real-world use: auto-encoder training, unsupervised contrastive fine-tuning, and multilingual distillation. The models were evaluated on semantic textual similarity (STS) and COSTRA, a benchmark for assessing embedding quality, as well as on search engine ranking effectiveness using precision at 10. Their findings showed that pretrained versions and multi-
lingual distillation provide the best encoder models,
highlighting their effectiveness in enhancing search
result quality.
DeepFAQ (Chico et al., 2023) refers to a Por-
tuguese automatic question-answering system that
uses semantic search to find similar questions from
a database of FAQs. Its solution applies a general
domain embedding to represent the data (questions
and answers). It retrieves candidate questions and ap-
plies a domain-specific re-ranking model to identify
the most relevant one, ultimately providing the corre-
sponding answer.
Our current approach makes a novel and original
contribution by utilizing domain-specific embeddings
for the e-commerce sector, specifically tailored for
Brazilian Portuguese and Spanish, two low-resource
languages in NLP. We take advantage of the approach
of language-specific embedding presented by Huang
et al. (Huang et al., 2024) to fine-tune sentence en-
coding models. These embeddings effectively cap-
ture the nuances of informal language used on online
platforms, thereby enhancing results in e-commerce-
related NLP tasks and addressing gaps identified in
previous methods, particularly the encoder-re-ranker pipeline presented by Chico et al. (2023).
3 E-FAQ: GROUPED FREQUENTLY ASKED QUESTIONS FROM E-COMMERCE
Real-world data are fundamental for generating
domain-specific sentence embeddings. This section
presents the E-FAQ, a weakly-supervised dataset of
e-commerce frequently asked questions (FAQs), with
sentences uttered in Brazilian Portuguese or Spanish.
Each entry i of the dataset is the tuple (q_i, S_i, A_i, D_i), in which:

- q_i is an anchor question sentence.
- S_i is a set of sentences that are similar to q_i; the sentences convey the same meaning and are interchangeable with q_i.
- A_i is a set of sentences that are almost similar to q_i; the sentences are closely related to q_i, but differ in meaningful detail.
- D_i is a set of sentences that are dissimilar to q_i; the sentences discuss different topics or contain information unrelated to q_i.

Any of the S_i, A_i, or D_i sets can be empty for a given i; however, at least one of the sets is not empty.
Figure 1: Examples of entries in E-FAQ. The figure illustrates our classification scheme, where candidate questions are labeled as ‘Similar’, ‘Almost Similar’, or ‘Dissimilar’ with respect to an ‘Anchor’ question. Note the subtle distinction between ‘Similar’ (same intent, e.g., “backpack” vs. “bag”) and ‘Almost Similar’ (related topic, different intent, e.g., backpack capacity vs. fitting a laptop), which allows for a more nuanced understanding of semantic relevance. All sentences are uttered in Brazilian Portuguese or Spanish. English translations are presented below the original sentences.

Figure 1 illustrates data examples from E-FAQ. For in-
stance, relative to the anchor question about a back-
pack’s volume, a query about a “bag” is considered
‘Similar’, while a related but distinct question about
fitting a “laptop” is labeled ‘Almost Similar’. This
highlights the semantic nuances our methodology is
designed to capture, distinguishing between identical
intent, related topics, and entirely dissimilar queries.
We originally created this dataset to address a
resource-scarce gap for Portuguese and Spanish, par-
ticularly within the e-commerce domain. We gathered
questions from Latin American e-commerce websites
sourced from the GoBots database (cf. Figure 2). Ini-
tially, we collected a larger set of questions; after
removing duplicates and questions containing fewer
than four words, we were left with one million ques-
tions, evenly split between Brazilian Portuguese and
Spanish.
Our primary goal with E-FAQ was to construct
a dataset composed of thematically disjoint question
groups. To achieve this, we followed a structured
three-step pipeline: (1) a Natural Language Under-
standing (NLU) analysis for feature extraction; (2) a
clustering phase to group related questions and filter
noise; and (3) an intra-cluster classification stage to
assign fine-grained similarity labels. Each of these
steps is detailed in the subsequent sections.
3.1 Natural Language Understanding
In the first stage of our pipeline, we employed nat-
ural language understanding (NLU) models to ex-
tract intents and named entities from each question.
For this, we leveraged a proprietary machine learn-
ing model previously trained on a large corpus of
sentences within the GoBots data environment. The
model performed two key tasks.

Figure 2: Overview of data collection process in generating the E-FAQ dataset.

First, it identified the user’s intent — the overall purpose of the question —
which allowed us to classify each query into one of
64 distinct thematic categories. Second, it recognized
named entities, i.e., terms or expressions with a known meaning relevant to the sentence’s comprehension, which were used to normalize the text by mapping synonymous terms to a canonical form and correcting potential typos.
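Because the underlying NLU model is proprietary, the snippet below only illustrates the normalization idea with a hypothetical synonym-and-typo map; the entity inventory and intent labels shown here are illustrative assumptions, not the production resources.

```python
import re

# Hypothetical canonical-form map; the real entity inventory is proprietary.
CANONICAL_ENTITIES = {
    "note book": "notebook",
    "notbook": "notebook",
    "laptop": "notebook",
    "bolsa": "mochila",
}

def normalize_question(text: str) -> str:
    """Lowercase the question and map known entity variants to a canonical form."""
    normalized = text.lower()
    for surface, canonical in CANONICAL_ENTITIES.items():
        normalized = re.sub(rf"\b{re.escape(surface)}\b", canonical, normalized)
    return normalized

print(normalize_question("Esse note book vem com bolsa?"))
# -> "esse notebook vem com mochila?"
```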
3.2 Clustering
Within each intent category, we employed the HDB-
SCAN clustering algorithm to group similar ques-
tions. We used the extracted entities as the main fea-
tures, as they correspond to a normalized sequence
of relevant terms of the sentences. We used the 95%
most frequent entities to create a TF-IDF sparse rep-
resentation, and then applied a singular value decom-
position (SVD) to reduce the features dimension.
We specifically chose HDBSCAN for its ability to
group semantically similar questions into dense clus-
ters while simultaneously identifying and filtering out
noisy data—i.e., questions that do not belong to any
coherent group, ensuring that the resulting clusters
are thematically distinct from one another. We con-
sider clusters with at least two sentences. This pro-
cess yielded more than 142, 000 clusters, encompass-
ing over 445,000 examples, with the cluster medoid
serving as the anchor sentence.
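To make this step concrete, the sketch below strings together TF-IDF features over the extracted entities, truncated SVD, and HDBSCAN; the vectorizer settings, SVD size, and medoid computation are assumptions for illustration rather than the exact production configuration.

```python
import hdbscan
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances

def cluster_questions(entity_sequences: list[str], n_components: int = 50):
    """entity_sequences: one space-joined string of normalized entities per question."""
    # Sparse TF-IDF representation over entity tokens (frequency cutoff not shown here).
    tfidf = TfidfVectorizer(token_pattern=r"\S+")
    X = tfidf.fit_transform(entity_sequences)

    # Dense, low-dimensional features via truncated SVD.
    X_red = TruncatedSVD(n_components=n_components).fit_transform(X)

    # Density-based clustering; label -1 marks noise that is discarded.
    labels = hdbscan.HDBSCAN(min_cluster_size=2).fit_predict(X_red)

    anchors = {}
    for c in set(labels) - {-1}:
        idx = np.where(labels == c)[0]
        # Medoid = cluster member with the smallest total distance to the others.
        dists = pairwise_distances(X_red[idx]).sum(axis=1)
        anchors[c] = int(idx[dists.argmin()])
    return labels, anchors
```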
3.3 Classification
In the final step, we analyze the contents of each clus-
ter to label every question relative to its cluster’s an-
chor sentence, to ensure high-quality semantic simi-
larity data. This in-cluster classification step was con-
ducted with a synthetic labeling process, using large
language models as annotators.
To ensure the quality of the annotation process, we curated a calibration dataset, which began
with an initial pool of 150 real-world question pairs
sourced from e-commerce platforms. Each pair was
independently classified as “similar”, “almost simi-
lar”, or “dissimilar” by three human annotators: two
computer science graduate students with expertise in
AI for e-commerce (both co-authors) and one under-
graduate student without prior experience in the do-
main. Each annotator was presented with the following instruction:
You’ll see pairs of product questions extracted
from e-commerce platforms. Your task is to
label these pairs according to their semantic
similarity. The label will be one of:
similar: The sentences convey the same
meaning or idea, even if phrased differently.
For e-commerce, these questions could be
answered with the same answer.
almost similar: The sentences share a
significant amount of information and are
strongly related, but there are subtle differ-
ences in meaning or scope. They are simi-
lar, but cannot be answered with the same
answer.
dissimilar: The sentences contain distinct
information or completely different mean-
ings, and are not correlated.
We established the final label for a pair based on
majority vote. Of the initial 150 pairs, 144 reached a
majority consensus (96%), meaning at least two an-
notators agreed on a label. The remaining 6 pairs, for
which each annotator assigned a different label, were
therefore discarded. The resulting calibration dataset, which we named GoSim3, is available on the Hugging Face Hub³.

³ Available at https://huggingface.co/datasets/GoBotsAI/GoSim-3.
We then leveraged this calibration dataset to op-
timize a classification LLM prompt. Specifically, we
used the Gemma 3 language model (Kamath et al.,
2025) to identify the prompt that yielded the high-
est accuracy against the GoSim3 ground-truth labels.
The prompt contained the same instructions given to the human annotators, as well as additional formatting and reasoning instructions. This optimized prompt and model were subsequently used to classify each question pair within our 142,000 clusters, ensuring a reliable, large-scale assessment of semantic similarity.
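The prompt selection step can be summarized as a simple accuracy-driven search over candidate prompts, as sketched below; `classify_pair` is a hypothetical placeholder for the call to the Gemma 3 model, which must return one of the three labels.

```python
from typing import Callable

def select_best_prompt(
    candidate_prompts: list[str],
    calibration: list[tuple[str, str, str]],  # (question_a, question_b, gold_label)
    classify_pair: Callable[[str, str, str], str],
) -> tuple[str, float]:
    """Return the candidate prompt with the highest accuracy on the calibration pairs."""
    best_prompt, best_acc = "", 0.0
    for prompt in candidate_prompts:
        hits = sum(
            classify_pair(prompt, qa, qb) == gold
            for qa, qb, gold in calibration
        )
        acc = hits / len(calibration)
        if acc > best_acc:
            best_prompt, best_acc = prompt, acc
    return best_prompt, best_acc
```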
3.4 Dataset Split
The dataset was further divided into training, valida-
tion, and test sets. The training set comprised most of
the data, with 121,248 entries, followed by the val-
idation set, with 13,472 entries. The test sets were
organized by language (Portuguese and Spanish) and
stratified by intent class, resulting in two sets with
4,000 entries each. For commercial reasons, the train-
ing dataset used in this study cannot be publicly re-
leased. Nevertheless, our test dataset is available at the Hugging Face Hub⁴.

⁴ Available at https://huggingface.co/datasets/GoBotsAI/e-faq.
4 TRAINING METHODS
The primary application of our proposed models is
retrieving similar questions given an input query. Re-
cent research has increasingly focused on bi-encoder
architectures for generating sentence embeddings.
These models independently encode the query and
the questions, allowing for efficient similarity scoring
(Izacard et al., 2022). Formally, given two sentences x and y, their embeddings are generated independently by the models f_θ and f_γ, respectively. The embedding-space similarity φ of the two sentences can be defined as:

\phi(x, y) = \cos(f_\theta(x), f_\gamma(y)) / \tau    (1)

in which τ is a temperature parameter. Two transformer models can be used to embed sentences as f_θ and f_γ, as in DPR (Karpukhin et al., 2020), which employs two BERT encoders to map questions and passages into a shared semantic space. Recent studies used a single transformer model f_θ in a siamese bi-
encoder architecture to embed the sentences. Figure 3
presents this architecture. Models that use this archi-
tecture, like SBERT (Reimers and Gurevych, 2019),
LaBSE (Feng et al., 2022), and E5 (Wang et al.,
2024a; Wang et al., 2024b), proved to be effective
in many zero-shot natural language tasks. As ques-
tions and queries share the same domain, we employ
the Siamese architecture. For the pooling strategy, we
use the mean of the token representations.
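As an illustration of this siamese setup, the snippet below encodes a query and candidate questions with a single multilingual encoder and ranks them by cosine similarity; the public Multilingual E5 checkpoint and the example sentences are illustrative choices, not the production model.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Multilingual E5 uses mean pooling and expects a "query: " prefix for symmetric tasks.
model = SentenceTransformer("intfloat/multilingual-e5-base")

query = "query: ya viene con el cargador?"
candidates = [
    "query: vem com carregador?",
    "query: qual a voltagem do aparelho?",
]

# The same encoder f_theta embeds both sides; cosine similarity corresponds to Eq. (1),
# with the temperature term only used during training.
emb = model.encode([query] + candidates, normalize_embeddings=True)
scores = emb[1:] @ emb[0]
print(candidates[int(np.argmax(scores))])
```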
We assume that E-FAQ contains disjoint groups
of similar sentences, so each dataset entry contains
a unique group of questions. Leveraging the “sim-
ilar”, “almost similar”, and “dissimilar” labels, we designed a training regimen incorporating two dis-
tinct objectives: a retrieval objective and a semantic
similarity objective. This multi-task learning strat-
egy allowed the model to simultaneously learn effec-
tive representations for retrieving relevant questions
and accurately assessing the degree of semantic re-
latedness between question pairs within our refined
dataset. This method follows the approach of Huang et al. (2024).

Figure 3: Siamese Dual Encoder model for sentence embeddings generation.
For the retrieval objective, we used the InfoNCE
loss (van den Oord et al., 2019), in which an anchor question q_i, associated with a similar question s_i, is compared against N − 1 dissimilar questions in a cross-entropy function. The loss is defined by:

\mathcal{L}_{ce} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{\phi(q_i, s_i)}}{e^{\phi(q_i, s_i)} + \sum_{j=1,\, j \neq i}^{N} e^{\phi(q_i, s_j)}}    (2)
This loss encourages similar question pairs to have
higher similarity scores, and dissimilar questions to
have lower scores (Izacard et al., 2022).
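The sketch below shows a minimal PyTorch implementation of this in-batch InfoNCE objective and of its symmetric combination used in Eq. (3); it is an illustration of the loss, not the exact training code.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor_emb: torch.Tensor, positive_emb: torch.Tensor, tau: float = 0.05):
    """anchor_emb, positive_emb: (N, d) embeddings of anchors q_i and positives s_i."""
    # Cosine similarity scaled by the temperature (phi in Eq. 1).
    a = F.normalize(anchor_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    scores = a @ p.T / tau                              # scores[i, j] = phi(q_i, s_j)
    targets = torch.arange(scores.size(0), device=scores.device)
    # Cross-entropy per row: the diagonal entry is the positive pair and every
    # other sentence in the batch acts as an in-batch negative.
    return F.cross_entropy(scores, targets)

def symmetric_info_nce(anchor_emb, positive_emb, tau: float = 0.05):
    # L_r in Eq. (3): the anchor-side loss plus its symmetric counterpart.
    return info_nce(anchor_emb, positive_emb, tau) + info_nce(positive_emb, anchor_emb, tau)
```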
We define s_{ij} ∈ S_i as a question extracted from the set of questions similar to q_i. The training data consisted of entries in the form (q_i, s_{ij}), with 0 ≤ j ≤ |S_i|,
augmented from each cluster from E-FAQ. Addition-
ally, we incorporated challenging negative examples
by selecting K “hard-negatives” through a combina-
tion process from the union of A_i and D_i. These K hard negatives were then combined with the in-batch negative samples, such that the total number of negative examples considered for each positive sample was N − 1, where N is the batch size. For entries
that yielded fewer than K hard negatives, this set was
supplemented by sampling from the hard negatives of
other topically disjoint entries from the entire dataset.
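The following sketch illustrates how such training examples could be assembled; the field names and random sampling strategy are assumptions for illustration, not the exact data pipeline.

```python
import random

def build_example(entry: dict, all_entries: list[dict], k: int = 3):
    """entry = {"anchor": str, "similar": [...], "almost": [...], "dissimilar": [...]}."""
    positive = random.choice(entry["similar"])
    # Hard negatives come from the union of "almost similar" and "dissimilar" sets.
    hard_pool = entry["almost"] + entry["dissimilar"]
    hard_negatives = random.sample(hard_pool, min(k, len(hard_pool)))
    # Top up from other, topically disjoint entries when the local pool is too small.
    if len(hard_negatives) < k:
        extra_pool = [s for e in all_entries if e is not entry
                      for s in e["almost"] + e["dissimilar"]]
        random.shuffle(extra_pool)
        hard_negatives.extend(extra_pool[: k - len(hard_negatives)])
    return entry["anchor"], positive, hard_negatives
```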
The final contrastive loss combines the original loss function L_ce, computed as the cross-entropy over anchor sentences, and its symmetric version L'_ce, computed as the cross-entropy over similar sentences:

\mathcal{L}_r = \mathcal{L}_{ce} + \mathcal{L}'_{ce}    (3)
For the semantic similarity objective, we con-
verted the “similar”, “almost similar”, and “dissimilar” labels into score values. The training data consisted of triples in the form (q_i, p_ij, z_ij), in which q_i is the anchor question, p_ij is a sentence in q_i's cluster, and z_ij is their labeled similarity score, with values:

z_{ij} = \begin{cases} 1, & \text{if } p_{ij} \in S_i \\ 0, & \text{if } p_{ij} \in A_i \\ -1, & \text{if } p_{ij} \in D_i \end{cases}    (4)
We used the Cosine Sentence Loss (CoSENT)
(Su, 2022) in this task, a ranking loss function specifi-
cally designed for the score-labeled text pairs (Huang
et al., 2024). The loss is defined by:
\mathcal{L}_s = \log\left(1 + \sum_{z_{ij} > z_{kl}} e^{\phi(q_k, p_{kl}) - \phi(q_i, p_{ij})}\right)    (5)
Equation (6) defines the final multi-task loss:

\mathcal{L} = \begin{cases} \mathcal{L}_r, & \text{if the task is retrieval} \\ \mathcal{L}_s, & \text{if the task is semantic similarity} \end{cases}    (6)
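A compact sketch of the CoSENT objective in Eq. (5) is given below, assuming PyTorch; `scores` holds the φ values of Eq. (1) for a batch of (anchor, pair) examples and `labels` holds the corresponding z scores of Eq. (4).

```python
import torch

def cosent_loss(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """scores: (M,) temperature-scaled cosine similarities; labels: (M,) scores z_ij."""
    # diff[i, j] = scores[j] - scores[i]; keep entries where labels[i] > labels[j],
    # i.e. pairs that should be ranked above pairs with a lower label.
    diff = scores.unsqueeze(0) - scores.unsqueeze(1)
    mask = labels.unsqueeze(1) > labels.unsqueeze(0)
    # log(1 + sum exp(...)) computed stably via logsumexp with an appended zero.
    zero = torch.zeros(1, dtype=scores.dtype, device=scores.device)
    return torch.logsumexp(torch.cat([zero, diff[mask]]), dim=0)
```

With the z values of Eq. (4), this pushes “similar” pairs above “almost similar” pairs, which in turn are pushed above “dissimilar” pairs, without requiring absolute score targets.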
To achieve our objective of reducing embedding
dimensionality, we employed Matryoshka Represen-
tation Learning (MRL) (Kusupati et al., 2024) during
model training. This technique compels the model
to produce hierarchical, coarse-to-fine embeddings,
ensuring that these lower-dimensional representations
are at least as accurate as independently trained low-
dimensional representations.
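The sketch below illustrates the idea by averaging the contrastive loss over nested prefixes of the embedding; equal weighting of the dimensions is an assumption, and in practice a library wrapper such as MatryoshkaLoss in sentence-transformers can be used instead.

```python
import torch
import torch.nn.functional as F

MATRYOSHKA_DIMS = (64, 128, 256, 384, 512, 768)

def matryoshka_contrastive_loss(anchor_emb: torch.Tensor,
                                positive_emb: torch.Tensor,
                                tau: float = 0.05,
                                dims=MATRYOSHKA_DIMS) -> torch.Tensor:
    """Average the in-batch contrastive loss over nested prefixes of the embeddings."""
    total = anchor_emb.new_zeros(())
    targets = torch.arange(anchor_emb.size(0), device=anchor_emb.device)
    for d in dims:
        # Each truncated prefix is re-normalized so cosine similarity stays meaningful.
        a = F.normalize(anchor_emb[:, :d], dim=-1)
        p = F.normalize(positive_emb[:, :d], dim=-1)
        total = total + F.cross_entropy(a @ p.T / tau, targets)
    return total / len(dims)
```

At inference time, the same truncate-then-renormalize step is applied to obtain the 64- to 512-dimensional embeddings evaluated in Section 6.3.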
5 EXPERIMENTS
For our experiments, we fine-tuned two multilingual
transformer models. The first, XLM-RoBERTa (Con-
neau et al., 2020), serves as a strong baseline due to its
extensive pre-training on multilingual text. The sec-
ond, Multilingual E5-Base (Wang et al., 2024b), was
selected for its state-of-the-art performance in dense
retrieval tasks, as evidenced by its high ranking on
the MTEB leaderboard⁵. Both models produce embeddings with a native dimensionality of 768.

⁵ Available online at https://huggingface.co/spaces/mteb/leaderboard.
All models were trained on a single NVIDIA RTX
4090 GPU using the AdamW optimizer. We set
the learning rate to 2 × 10⁻⁵ with a linear warmup
for the first 10% of training steps, followed by a
stable learning rate. We trained for a maximum
of 5,000 steps using a batch size of 256 sentence
pairs. The temperature parameter τ for the contrastive
loss was fixed at 0.05 to facilitate the discrimina-
tion of negative samples. For MRL, we trained the models to produce nested embeddings at dimensions
of {64, 128, 256, 384, 512, 768}. We evaluated the
model on a held-out validation set every 200 steps and
saved the checkpoint with the highest retrieval accu-
racy.
For our assessments, we evaluated our trained
models primarily on a symmetric retrieval task,
specifically sentence paraphrase mining, using the
test partition of our domain-specific E-FAQ dataset
(cf. Section 3). This dataset, comprising 8,000 e-
commerce queries in Portuguese and Spanish, enables
us to directly measure the model’s effectiveness in
identifying semantically equivalent questions, a core
function for customer service applications. This pri-
mary task serves as the main benchmark for retrieval
performance.
To ensure the embeddings offer in-domain gener-
ality and clear similarity separability, we conducted
a secondary evaluation on a Semantic Textual Simi-
larity (STS) task. For this, we utilized the GoSim3
dataset, a domain-specific benchmark that was inten-
tionally excluded from our model’s training distribu-
tion. This test validates the correlation between hu-
man annotations and results obtained by computing
the similarity between the vector representations of
both questions. It assesses whether the model can robustly
generalize to new, unseen data within the e-commerce
domain and accurately distinguish between varying
degrees of semantic relatedness.
5.1 Evaluation Metrics
Accuracy@1 is a metric used in IR to evaluate a sys-
tem’s ability to retrieve a relevant item at the top of
the ranking. It measures the proportion of queries for
which the most pertinent item appears in the first po-
sition. The score ranges from 0 to 1, where 1 indicates
perfect retrieval (i.e., the relevant item is consistently
ranked first), and 0 means the system never places the
appropriate item at the top. This metric is handy when
only the top result matters, such as in FAQ matching,
question answering, or single-result search scenarios.
While Accuracy@1 is a crucial metric for our pri-
mary use case, it only evaluates the top-ranked result.
To gain a more comprehensive understanding of re-
trieval quality, we employed Mean Average Precision
at 10 (mAP@10). This metric evaluates the quality
of the entire ranked list up to the 10th position, tak-
ing into account both the precision and the ranking of
relevant items. mAP@10 provides a more nuanced
evaluation by rewarding models that place multiple
correct items near the top of the list, which is valuable
in scenarios where multiple results are used, such as
in retrieval-augmented generation pipelines.
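For clarity, a minimal sketch of how these two metrics can be computed is shown below; the mAP@10 normalization (dividing by min(|relevant|, 10)) is one common convention and is an assumption about the exact formulation used.

```python
def accuracy_at_1(ranked: list[list[str]], relevant: list[set[str]]) -> float:
    """Fraction of queries whose top-ranked candidate is relevant."""
    hits = sum(1 for r, rel in zip(ranked, relevant) if r and r[0] in rel)
    return hits / len(ranked)

def map_at_10(ranked: list[list[str]], relevant: list[set[str]]) -> float:
    """Mean Average Precision over the top 10 ranked candidates per query."""
    ap_values = []
    for r, rel in zip(ranked, relevant):
        hits, precisions = 0, []
        for k, doc_id in enumerate(r[:10], start=1):
            if doc_id in rel:
                hits += 1
                precisions.append(hits / k)  # precision at each relevant hit
        ap_values.append(sum(precisions) / min(len(rel), 10) if rel else 0.0)
    return sum(ap_values) / len(ap_values)
```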
For the Semantic Textual Similarity (STS) evalua-
tion on the GoSim3 dataset, we used Pearson’s corre-
lation coefficient (r). This metric measures the linear
correlation between our model’s predicted similarity
scores (i.e., the cosine similarity of the sentence em-
beddings) and the ground-truth human judgments. A
higher correlation, approaching 1, indicates that the
semantic relationships captured by our embeddings
strongly align with human perception of similarity,
thereby validating the model’s ability to discern subtle
semantic nuances.
5.2 Baselines
To evaluate the effectiveness of our domain-specific
embeddings, we selected pretrained models from the
existing literature that have demonstrated superior re-
sults in retrieval tasks and sentence representation
as baselines. This includes various pretrained mod-
els trained using different techniques, encompassing
open-source encoders. Additionally, we incorporated
a traditional BM-25 model for comparison against the
pretrained models. In the following, we summarize
these models.
Embeddings from Bidirectional Encoder Repre-
sentations (E5-models): E5 is a family of advanced
text embeddings trained using weakly supervised
contrastive pre-training and a large dataset of text
pairs. Our study used the Multilingual E5-base, which
is initialized from XLM-RoBERTa weights. The
model employs an encoder architecture with average
pooling to generate fixed-size embeddings, utilizing
cosine similarity for comparison.
BGE M3 is an encoder model designed for multi-
lingual processing and multifunctional tasks. It sup-
ports over 100 languages, aiming to streamline text
embedding and retrieval for greater efficiency. The
model employs self-knowledge distillation, efficient
batching, and high-quality data generation to enhance
embedding quality. It leverages unsupervised, super-
vised, and synthesized data through a structured pre-
training and fine-tuning approach focused on retrieval
tasks.
GTE (Zhang et al., 2024): It refers to a state-of-
the-art multilingual encoder specifically designed for
retrieval tasks. It was trained using large-scale con-
trastive learning on a combination of unsupervised,
supervised, and synthesized data. This encoder pro-
duces dense text embeddings for over 70 languages,
ensuring high-quality representations even in long-
context scenarios, which is advantageous for indus-
trial applications. Our decision to utilize GTE is motivated by the fact that it was proposed by an e-commerce company (Alibaba) and that it outperforms other models with a similar number of parameters.
Best Matching 25 (BM-25): It is a probabilistic
model for IR. It builds on the term frequency (TF) and
inverse document frequency (IDF) concepts, such as
TF-IDF, but refines term weighting with a non-linear
function. This allows BM-25 to rank documents more
effectively by considering term frequency and distri-
bution across the corpus, making it better suited for
longer documents than TF-IDF.
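As an illustration, the snippet below shows how a BM-25 baseline of this kind can be built with the rank_bm25 package; the simple regex tokenization is an assumption, not necessarily the preprocessing used in our experiments.

```python
import re
from rank_bm25 import BM25Okapi

def tokenize(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())

corpus = [
    "vem com carregador?",
    "qual a capacidade da mochila?",
    "tiene garantía el producto?",
]
bm25 = BM25Okapi([tokenize(doc) for doc in corpus])

query = "a mochila tem qual capacidade?"
scores = bm25.get_scores(tokenize(query))
best = max(range(len(corpus)), key=lambda i: scores[i])
print(corpus[best])   # expected: "qual a capacidade da mochila?"
```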
5.3 Re-Ranking
In addition to the baseline evaluations, we designed
an experimental setup where each baseline model is
first used to perform semantic search and retrieve the
top k candidates most similar to the query. These k
candidates, along with the query, are then passed to a
re-ranking stage, where a separate model, trained to
score semantic similarity, re-evaluates and ranks the
candidates to identify the most relevant one. For all
experiments, we set k = 20. This setup aims to assess
the impact of re-ranking within an IR pipeline and de-
termine whether strong encoders alone can eliminate
the need for re-ranking.
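A sketch of this two-stage setup is shown below, using publicly available sentence-transformers components; the retriever and cross-encoder checkpoints named here are illustrative stand-ins rather than the exact models used in the experiments.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

retriever = SentenceTransformer("intfloat/multilingual-e5-base")
reranker = CrossEncoder("cross-encoder/mmarco-mMiniLMv2-L12-H384-v1")

def retrieve_and_rerank(query: str, corpus: list[str], k: int = 20) -> str:
    # Stage 1: bi-encoder retrieval of the top-k candidates.
    corpus_emb = retriever.encode(["query: " + q for q in corpus], normalize_embeddings=True)
    query_emb = retriever.encode("query: " + query, normalize_embeddings=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=k)[0]
    candidates = [corpus[h["corpus_id"]] for h in hits]
    # Stage 2: cross-encoder re-scores each (query, candidate) pair jointly.
    scores = reranker.predict([(query, c) for c in candidates])
    return candidates[int(scores.argmax())]
```

This is exactly the multi-model pipeline whose cost our single fine-tuned encoder is intended to avoid.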
6 RESULTS
We present the overall results (Subsection 6.1) and
then report on our analysis using the Re-ranking ap-
proach (Subsection 6.2). Subsection 6.3 presents our
dimension effects analysis. Subsection 6.4 provides qualitative examples, and Subsection 6.5 presents the ablation study results.
6.1 Overall Findings
Table 1 presents the effectiveness of various models
on retrieval datasets evaluated using Accuracy at one.
The results include both original and fine-tuned mul-
tilingual models, assessed on two datasets: E-FAQ (in
Portuguese and Spanish) and GoSim3. For the fine-
tuned models, we conducted multiple configurations
and report those that showed the highest effectiveness
on the E-FAQ retrieval task.
Among the domain fine-tuned models, the Mul-
tilingual E5 base achieved the highest Accuracy@1
score on the E-FAQ dataset, scoring 90.48% in Por-
tuguese and 90.12% in Spanish. This model per-
formed well on the GoSim3 dataset, achieving a Pear-
son Correlation of 43.45%. The fine-tuned XLM
model achieved competitive results, with scores of
88.60% in Portuguese and 87.58% in Spanish, yield-
ing the highest Pearson correlation of 48.45% among
all models. Similarly, MAP@10 results achieve a
Embeddings Might Be all You Need: Domain-Specific Sentence Encoders for Latin American E-Commerce Questions
141
Table 1: Best configuration of fine-tuned and baseline models' results on the retrieval (E-FAQ) and STS (GoSim3) datasets. The E-FAQ scores denote Accuracy@1 (%) and Mean Average Precision (MAP@10), and the E-FAQ columns correspond to the test partitions in each considered language, while the GoSim3 column presents the Pearson correlation for Portuguese only.

| Group     | Model                | Embedding Dimension | Parameters (Millions) | E-FAQ pt ACC@1 | E-FAQ pt MAP@10 | E-FAQ es ACC@1 | E-FAQ es MAP@10 | GoSim3 pt Pearson |
|-----------|----------------------|---------------------|-----------------------|----------------|-----------------|----------------|-----------------|-------------------|
| Baseline  | Multilingual E5 Base | 768                 | 279.0                 | 68.98          | 71.36           | 70.14          | 71.36           | 35.45             |
| Baseline  | GTE Multilingual     | 768                 | 305.0                 | 71.56          | 74.68           | 73.90          | 75.78           | 35.93             |
| Baseline  | BGE M3               | 1024                | 567.8                 | 73.97          | 77.27           | 69.92          | 73.15           | 41.05             |
| Baseline  | BM-25                | -                   | -                     | 76.14          | 80.27           | 70.86          | 73.23           | -                 |
| Finetuned | XLM RoBERTa Base     | 768                 | 278.0                 | 88.60          | 90.22           | 87.58          | 90.99           | 48.45             |
| Finetuned | Multilingual E5 Base | 768                 | 279.0                 | 90.48          | 92.30           | 90.12          | 92.51           | 43.45             |
Similarly, the fine-tuned Multilingual E5 base achieved the highest MAP@10 across all tested configurations, yielding 92.30% and 92.51% for Portuguese and Spanish, respectively.
BGE M3 achieved the highest scores over pre-
trained models in the E-FAQ evaluation for Por-
tuguese, obtaining 73.97% in Portuguese and 69.92%
in Spanish. It also performed best on the STS dataset,
obtaining 41.05%. In contrast, the multilingual E5
base model and GTE revealed lower retrieval and STS
effectiveness on E-FAQ for Portuguese and GoSim3,
with accuracy scores of 68.98% and 70.14%, and
Pearson coefficients of 35.45% and 35.93%, respec-
tively. Yet, the GTE model surpassed the other pretrained models on the Spanish partition of E-FAQ, achieving 73.90%, followed by the Multilingual E5 base model, which registered 70.14%. Regarding MAP@10, BGE M3 achieved the highest value for Portuguese with 77.27%, and 76.15% in Spanish.
GTE Multilingual followed with results slightly be-
low, while the original Multilingual E5 Base reached
71.36% in both languages.
The BM-25 baseline outperformed all original
pre-trained models on E-FAQ, achieving scores of
76.14% in Portuguese and 70.86% in Spanish. The
BM25 baseline yielded MAP@10 scores of 80.27%
for Portuguese and 73.23% for Spanish, outperform-
ing some of the pretrained models in Portuguese, but
still well below the fine-tuned configurations.
6.2 Reranker Analysis
Figure 4 presents the retrieval effectiveness measured
by the Accuracy@1 result for Portuguese across vari-
ous retrieval models, comparing their results with and
without the reranker. For the baseline models (mE5,
bge-m3, and gte), applying the reranker generally re-
sults in slight improvements or maintains similar ac-
curacy levels. Nevertheless, we observed a minor de-
crease in performance for BM25 when reranking is
applied. The fine-tuned models (F-mE5 and F-XLM)
achieved the highest overall accuracy, with both mod-
els performing better without reranking—F-mE5 ex-
ceeds 90%, while F-XLM reaches nearly 89% Accu-
racy@1 in the no-reranker setting.
Figure 4: Accuracy at one comparison for Portuguese with and without reranker application for BM25, baseline models, and our best fine-tuned models (F-mE5 and F-xlm).
Figure 5 presents the Accuracy@1 results for
Spanish across various retrieval models, comparing
configurations with and without reranking. For most
baseline models (BM25, mE5, and BGE-M3), apply-
ing the reranker yielded slight improvements. We
observed a performance drop for GTE when rerank-
ing is implemented. The fine-tuned models (F-
mE5 and F-XLM) achieved the highest overall ac-
curacy, performing better without reranking. Specif-
ically, F-mE5 achieved approximately 90%, while F-
XLM achieved nearly 88% Accuracy@1 without the
reranker.

Figure 5: Accuracy at one comparison for Spanish with and without reranker application for BM25, baseline models, and our best fine-tuned models (F-mE5 and F-xlm).

6.3 Dimension Analysis

Figure 6 presents the results of the models trained with MRL as the cropped embedding dimension varies from 64 to 768, showing its effect on retrieval effectiveness (Accuracy@1) for the Portuguese test partition of the E-FAQ dataset. All the fine-tuned models (F-mE5 and
F-xlm) configurations outperformed the best base-
line, BM25, which achieved 76.14%. F-mE5 consis-
tently outperformed F-xlm, with accuracy increasing
from 88.07% at dimension 64 to 90.48% at dimen-
sion 768. In contrast, F-xlm maintained stable perfor-
mance, starting at 88.60% and fluctuating to 87.72%.
These results indicate that higher dimensions benefit
F-mE5 more significantly, while F-xlm is less sensi-
tive to dimensional changes.
Figure 6: Accuracy at one of the trained models on the Portuguese test partition of E-FAQ across cropped embedding dimensions; the black dashed line represents BM25, the best baseline retriever.
We observed similar trends for the Spanish test
partition in Figure 7, in which all configurations of
the fine-tuned models outperformed the best baseline,
GTE multilingual (73.90%). F-mE5 showed a grad-
ual increase in performance with higher embedding
dimensions, ranging from 86.87% at dimension 64
to 90.12% at 768. In comparison, F-xlm remained
relatively stable, with scores fluctuating slightly be-
tween 86.87% and 87.36%. This pattern indicates that
higher dimensions might benefit F-mE5 more clearly,
while F-xlm seems less influenced by the embedding
size.
Figure 7: Accuracy at one of the trained models on the Spanish test partition of E-FAQ across cropped embedding dimensions; the black dashed line represents GTE Multilingual, the best baseline retriever for Spanish.
6.4 Qualitative Results
Table 2 provides qualitative examples of cosine simi-
larity scores for question pairs labeled as similar. The
success (S) cases show how the embedding space ef-
fectively captures semantic equivalence, even when
there are differences in surface forms. In contrast, the
failure (F) cases reveal limitations where the cosine
score does not match the gold label. These examples
underscore both the model’s strengths in identifying
paraphrases and its weaknesses in addressing nuanced
semantic variations.
6.5 Ablation Studies
We investigate the impact of different pretraining
methods, the number of hard negatives in contrastive
learning, and the combination of loss functions over
retrieval and STS benchmarks.
Pretraining Methods. The distinct pretraining ap-
proaches of the base models appear to have a direct
impact on downstream task outcomes. While XLM-RoBERTa relies on a Masked Language Modeling (MLM) objective, the E5 model was pretrained using weakly supervised contrastive learning. Table 1 shows that the E5 model's contrastive foundation provided a significant advantage in our retrieval experiments. This performance gap is statistically significant, confirmed by a comparison of the Average Precision at 10 distributions on the E-FAQ test partition, which yielded a p-value of 7.3 × 10⁻⁵ in the Wilcoxon signed-rank test.
Table 2: Qualitative results of cosine similarity scores for question pairs labeled as similar, shown in their original form with English translations, illustrating representative Success (S) and Failure (F) cases. The English translations were done by the authors of this work.

| Sentence pair | Cosine |
|---|---|
| “cadê a de 25x30” (Where is the 25x30 one?) / “Tem como o 25×30” (Is the 25×30 available?) | 0.8605 (S) |
| “Vem com módulo ?” (Does it come with an amplifier?) / “já vem com módulo??” (Does it already include an amplifier?) | 0.9689 (S) |
| “Qual a potência de cada saída desse aparelho” (What is the output power of each channel of this device?) / “Qual a potência do som?” (What is the sound power?) | 0.475 (F) |
| “olá, tem em outras cores?” (Hello, do you have it in other colors?) / “Tem outras cores?” (Are there other colors available?) | 0.4793 (F) |
Table 3: Qualitative results of cosine similarity scores for question pairs labeled as dissimilar, with their respective English translations, illustrating representative Success (S) and Failure (F) cases. The English translations were done by the authors of this work.

| Sentence pair | Cosine |
|---|---|
| “Quantos decibéis ele emite?” (How many decibels does it emit?) / “Qual consumo dele?” (What is its power consumption?) | 0.3911 (S) |
| “Cabe no golf mk3 97/98?” (Does it fit the Golf MK3 97/98?) / “Boa tarde, tem para Golf 1995.” (Good afternoon, is it available for the Golf 1995?) | 0.3646 (S) |
| “Bom dia, vocês tem do A51?” (Good morning, do you have the A51?) / “Bom dia, serve no a51?” (Good morning, does it work on the A51?) | 0.6848 (F) |
| “Ja vem com o cooler pro procesador?” (Does it come with a CPU cooler?) / “Boa noite ja vem com processador?” (Good evening, does it come with a processor?) | 0.6773 (F) |
Multi-Task Loss. Table 4 shows how the multi-task approach contributed to better retrieval results. The similarity task alone was not
sufficient to improve the model’s retrieval capacity, as
it was unable to determine greater separability on its
own. The combination of both tasks yielded the best
results for both trained models, as presented in Table
1.
Table 4: Effect of the retrieval objective and the semantic similarity objective for Multilingual E5 Base on both the retrieval (E-FAQ) and STS (GoSim3) datasets. The model was fine-tuned with in-batch negatives only. All columns present the result metrics for Portuguese data only.

| Objective | E-FAQ ACC@1 | GoSim3 Pearson |
|---|---|---|
| Similarity only | 82.20 | 54.47 |
| Retrieval only | 88.95 | 39.49 |
| Retrieval & Similarity | 90.48 | 44.57 |
Hard Negatives. The number of hard negatives extracted from the “almost similar” and “dissimilar” labels affected the retrieval and STS tasks differently. Table 5 presents this finding. For the E5 model trained on the retrieval task only, the number of hard negatives contributes to greater separability and, therefore, a better result on STS, and also increases the quality of the retrieval results. With the E5 model trained on both the retrieval and similarity tasks, we found an optimal value of the STS Pearson correlation using a single hard negative. However, in this scenario, the addition of hard negatives led to a degradation in retrieval accuracy.
Table 5: Effect of hard negatives on the InfoNCE loss for Multilingual E5 Base on the retrieval (E-FAQ) and STS (GoSim3) datasets. Zero hard negatives indicates in-batch negatives only. All columns present the metrics for Portuguese.

| Tasks | Hard Negatives | E-FAQ ACC@1 | GoSim3 Pearson |
|---|---|---|---|
| Retrieval only | 0 | 88.95 | 39.49 |
| Retrieval only | 1 | 89.54 | 46.12 |
| Retrieval only | 3 | 89.13 | 46.27 |
| Retrieval & Similarity | 0 | 90.48 | 44.57 |
| Retrieval & Similarity | 1 | 89.60 | 48.03 |
| Retrieval & Similarity | 3 | 89.60 | 46.96 |
7 DISCUSSION
Table 1 revealed that our fine-tuned, domain-specific
models outperformed general sentence encoders on
the E-FAQ test set for both Portuguese and Spanish.
Even with a domain-specific re-ranking baseline (cf.
Figures 4 and 5), our results confirmed the feasibil-
ity and effectiveness of using a single, unified embed-
ding model in retrieval pipelines. This key finding
corroborates the significant resource optimization po-
tential—reducing memory, CPU processing, and la-
tency—by employing one model instead of two.
Notably, the BM-25 baseline outperformed all
original pre-trained models on the E-FAQ dataset.
We attribute this to the inherent characteristics of
the e-commerce domain, where related questions fre-
quently contain a significant overlap of specific key-
words such as product names, brands, or units of mea-
surement. The effectiveness of our trained sentence
encoders suggests that while they grasp the seman-
tic nuances between questions, they also successfully
capture this crucial “term-wise” similarity.
Figure 6 demonstrated a favorable trade-off be-
tween embedding dimensionality and retrieval effec-
tiveness, underscoring the benefits of MRL training.
Our trained models exhibited remarkable effective-
ness and stability across various cropped embedding
dimensions. Specifically, our top-performing model,
F-mE5, achieved a 91.6% reduction in sentence repre-
sentation size (from 768 to 64 dimensions) while pre-
serving 97.3% of its original retrieval effectiveness.
This dimensionality reduction yields significant practical advantages. Given that most retrieval algorithms scale in memory and time complexity with both the indexed corpus size and the embedding dimension, a 91.6% decrease in embedding size directly correlates to substantial reductions in memory footprint and processing time. As a rough illustration, assuming float32 storage, indexing one million questions requires about 3.1 GB at 768 dimensions versus roughly 0.26 GB at 64 dimensions, before any index overhead. Ultimately, this translates to considerably lower demands on computational resources and a more cost-efficient implementation for large-scale retrieval pipelines.
Table 4 and Table 5 underscore the value of our
hybrid training methodology. The presented results
confirm that a multi-task learning approach achieves
a superior balance among retrieval, ranking capabil-
ities, and representation separability. The inclusion
of a similarity training task demonstrably enhances
both retrieval and semantic textual similarity (STS)
results, but only when applied in conjunction with the
retrieval task. This improved separability offers prac-
tical advantages for semantic retrieval, facilitating the
explainability of retrieved elements and enabling the
application of similarity score thresholds for result fil-
tering.
While our current investigation focused explic-
itly on retrieving relevant information within the
Portuguese and Spanish e-commerce question para-
phrases domain, we are confident that the strengths
of our designed multi-objective training methodol-
ogy offer significant potential for broader generaliza-
tion. Furthermore, while our study addressed sym-
metric retrieval for question paraphrases, the adapt-
ability of our models suggests their applicability to a
wider range of retrieval tasks, including asymmetric
retrieval scenarios, by simply adjusting the training
data to a structure similar to, but not restricted to, the
E-FAQ.
8 CONCLUSION
Real-world customer inquiries often feature linguistic
patterns that challenge traditional sentence encoders
and hinder response accuracy. Our study highlighted
the effectiveness of domain-specific fine-tuned mod-
els for retrieval tasks in Portuguese and Spanish,
outperforming the general-purpose pretrained embed-
dings commonly found in the existing literature. The
results demonstrated that our models eliminate the
need for additional re-ranking, a process often re-
quired when using general embeddings. This makes
retrieval more efficient for real-world applications,
particularly in E-commerce. Our findings revealed the
success of multi-task objective training in Matryoshka
Representation Learning, underscoring its relevance
in maintaining strong retrieval effectiveness across
various embedding dimensions. This is especially ad-
vantageous for Portuguese and Spanish, where high-
quality retrieval models remain underexplored. Fu-
ture work will focus on implementing these models in
real-world E-commerce environments, with a specific
emphasis on the Portuguese and Spanish markets. We
will assess their impact on practical real-world ap-
plications and refine them for even greater quality in
multilingual retrieval. We plan future studies to ex-
plore data from other domains or retrieval tasks in a
format similar to that proposed for our E-FAQ dataset.
ACKNOWLEDGMENT
This work was supported by GoBots company.
REFERENCES
Alsentzer, E., Murphy, J., Boag, W., Weng, W.-H., Jindi, D.,
Naumann, T., and McDermott, M. (2019). Publicly
available clinical BERT embeddings. In Rumshisky,
A., Roberts, K., Bethard, S., and Naumann, T., ed-
itors, Proceedings of the 2nd Clinical Natural Lan-
guage Processing Workshop, pages 72–78, Minneapo-
lis, Minnesota, USA. Association for Computational
Linguistics.
Bednář, J., Náplava, J., Barančíková, P., and Lisický, O. (2024). Some like it small: Czech semantic embedding models for industry applications. Proceedings of the AAAI Conference on Artificial Intelligence, 38(21):22734–22742.
Chico, V., Zucchi, L., Ferragut, D., Caus, R., de Freitas,
V., and dos Reis, J. C. (2023). Automated question
answering via natural language sentence similarity:
Achievements for brazilian e-commerce platforms. In
Anais do XIV Simpósio Brasileiro de Tecnologia da
Informação e da Linguagem Humana, pages 74–83,
Porto Alegre, RS, Brasil. SBC.
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V.,
Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettle-
moyer, L., and Stoyanov, V. (2020). Unsupervised
cross-lingual representation learning at scale.
Enevoldsen, K., Chung, I., Kerboua, I., Kardos, M., Mathur,
A., Stap, D., Gala, J., Siblini, W., Krzemiński, D.,
Winata, G. I., Sturua, S., Utpala, S., Ciancone, M.,
Schaeffer, M., Sequeira, G., Misra, D., Dhakal, S.,
Rystrøm, J., Solomatin, R., . . . Muennighoff, N.
(2025). Mmteb: Massive multilingual text embedding
benchmark. arXiv preprint arXiv:2502.13595.
Feng, F., Yang, Y., Cer, D., Arivazhagan, N., and Wang, W.
(2022). Language-agnostic BERT sentence embed-
ding. In Muresan, S., Nakov, P., and Villavicencio,
A., editors, Proceedings of the 60th Annual Meeting
of the Association for Computational Linguistics (Vol-
ume 1: Long Papers), pages 878–891, Dublin, Ireland.
Association for Computational Linguistics.
Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong,
M., Shou, L., Qin, B., Liu, T., Jiang, D., and Zhou, M.
(2020). CodeBERT: A pre-trained model for program-
ming and natural languages. In Cohn, T., He, Y., and
Liu, Y., editors, Findings of the Association for Com-
putational Linguistics: EMNLP 2020, pages 1536–
1547, Online. Association for Computational Linguis-
tics.
Huang, J., Hu, Z., Jing, Z., Gao, M., and Wu, Y. (2024). Pic-
colo2: General text embedding with multi-task hybrid
loss training.
Izacard, G., Caron, M., Hosseini, L., Riedel, S., Bo-
janowski, P., Joulin, A., and Grave, E. (2022). Unsu-
pervised dense information retrieval with contrastive
learning. Transactions on Machine Learning Re-
search.
Kamath, A. et al. (2025). Gemma 3 technical report.
Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L.,
Edunov, S., Chen, D., and Yih, W.-t. (2020). Dense
passage retrieval for open-domain question answer-
ing. In Webber, B., Cohn, T., He, Y., and Liu,
Y., editors, Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Processing
(EMNLP), pages 6769–6781, Online. Association for
Computational Linguistics.
Kusupati, A., Bhatt, G., Rege, A., Wallingford, M., Sinha,
A., Ramanujan, V., Howard-Snyder, W., Chen, K.,
Kakade, S., Jain, P., and Farhadi, A. (2024). Ma-
tryoshka representation learning.
Reimers, N. and Gurevych, I. (2019). Sentence-BERT: Sen-
tence embeddings using Siamese BERT-networks. In
Inui, K., Jiang, J., Ng, V., and Wan, X., editors, Pro-
ceedings of the 2019 Conference on Empirical Meth-
ods in Natural Language Processing and the 9th Inter-
national Joint Conference on Natural Language Pro-
cessing (EMNLP-IJCNLP), pages 3982–3992, Hong
Kong, China. Association for Computational Linguis-
tics.
Su, J. (2022). Cosent (i): A more effective sentence em-
bedding scheme than sentence-bert. https://kexue.fm/
archives/8847. [Online; accessed 12-May-2025].
Tang, Y. and Yang, Y. (2025). Do we need domain-specific
embedding models? an empirical investigation.
van den Oord, A., Li, Y., and Vinyals, O. (2019). Represen-
tation learning with contrastive predictive coding.
Wang, L., Yang, N., Huang, X., Jiao, B., Yang, L., Jiang, D.,
Majumder, R., and Wei, F. (2024a). Text embeddings
by weakly-supervised contrastive pre-training.
Wang, L., Yang, N., Huang, X., Yang, L., Majumder, R.,
and Wei, F. (2024b). Multilingual e5 text embeddings:
A technical report.
Zhang, X., Zhang, Y., Long, D., Xie, W., Dai, Z., Tang, J., Lin, H., Yang, B., Xie, P., Huang, F., Zhang, M., Li, W., and Zhang, M. (2024). mGTE: Generalized long-context text representation and reranking models for multilingual text retrieval. In Dernoncourt, F., Preoţiuc-Pietro, D., and Shimorina, A., editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1393–1412, Miami, Florida, USA. Association for Computational Linguistics.