
Informação e da Linguagem Humana, pages 74–83, Porto Alegre, RS, Brasil. SBC.
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., and Stoyanov, V. (2020). Unsupervised cross-lingual representation learning at scale.
Enevoldsen, K., Chung, I., Kerboua, I., Kardos, M., Mathur, A., Stap, D., Gala, J., Siblini, W., Krzemiński, D., Winata, G. I., Sturua, S., Utpala, S., Ciancone, M., Schaeffer, M., Sequeira, G., Misra, D., Dhakal, S., Rystrøm, J., Solomatin, R., . . . Muennighoff, N. (2025). MMTEB: Massive multilingual text embedding benchmark. arXiv preprint arXiv:2502.13595.
Feng, F., Yang, Y., Cer, D., Arivazhagan, N., and Wang, W. (2022). Language-agnostic BERT sentence embedding. In Muresan, S., Nakov, P., and Villavicencio, A., editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 878–891, Dublin, Ireland. Association for Computational Linguistics.
Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M., Shou, L., Qin, B., Liu, T., Jiang, D., and Zhou, M. (2020). CodeBERT: A pre-trained model for programming and natural languages. In Cohn, T., He, Y., and Liu, Y., editors, Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1536–1547, Online. Association for Computational Linguistics.
Huang, J., Hu, Z., Jing, Z., Gao, M., and Wu, Y. (2024). Piccolo2: General text embedding with multi-task hybrid loss training.
Izacard, G., Caron, M., Hosseini, L., Riedel, S., Bojanowski, P., Joulin, A., and Grave, E. (2022). Unsupervised dense information retrieval with contrastive learning. Transactions on Machine Learning Research.
Gemma Team (2025). Gemma 3 technical report.
Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., and Yih, W.-t. (2020). Dense passage retrieval for open-domain question answering. In Webber, B., Cohn, T., He, Y., and Liu, Y., editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online. Association for Computational Linguistics.
Kusupati, A., Bhatt, G., Rege, A., Wallingford, M., Sinha, A., Ramanujan, V., Howard-Snyder, W., Chen, K., Kakade, S., Jain, P., and Farhadi, A. (2024). Matryoshka representation learning.
Reimers, N. and Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Inui, K., Jiang, J., Ng, V., and Wan, X., editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.
Su, J. (2022). CoSENT (I): A more effective sentence embedding scheme than Sentence-BERT. https://kexue.fm/archives/8847. [Online; accessed 12-May-2025].
Tang, Y. and Yang, Y. (2025). Do we need domain-specific embedding models? An empirical investigation.
van den Oord, A., Li, Y., and Vinyals, O. (2019). Representation learning with contrastive predictive coding.
Wang, L., Yang, N., Huang, X., Jiao, B., Yang, L., Jiang, D., Majumder, R., and Wei, F. (2024a). Text embeddings by weakly-supervised contrastive pre-training.
Wang, L., Yang, N., Huang, X., Yang, L., Majumder, R., and Wei, F. (2024b). Multilingual E5 text embeddings: A technical report.
Zhang, X., Zhang, Y., Long, D., Xie, W., Dai, Z., Tang, J., Lin, H., Yang, B., Xie, P., Huang, F., Zhang, M., Li, W., and Zhang, M. (2024). mGTE: Generalized long-context text representation and reranking models for multilingual text retrieval. In Dernoncourt, F., Preoțiuc-Pietro, D., and Shimorina, A., editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1393–1412, Miami, Florida, USA. Association for Computational Linguistics.
KDIR 2025 - 17th International Conference on Knowledge Discovery and Information Retrieval