
Agirre, E., Cer, D., Diab, M., and Gonzalez-Agirre, A. (2012). SemEval-2012 task 6: A pilot on semantic textual similarity. In *SEM 2012, pages 385–393. ACL.
Angelov, D. (2020). Top2Vec: Distributed representations of topics.
Carbonell, J. and Goldstein, J. (1998). The use of MMR, diversity-based reranking for reordering documents and producing summaries. In SIGIR ’98, pages 335–336. ACM.
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., and Stoyanov, V. (2020). Unsupervised cross-lingual representation learning at scale. In ACL 2020, pages 8440–8451. ACL.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL ’19, pages 4171–4186. ACL.
Farahani, A., Voghoei, S., Rasheed, K., and Arabnia, H. R. (2021). A brief review of domain adaptation. In Advances in Data Science and Information Engineering, pages 877–894. Springer International Publishing.
Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., and Leahy, C. (2020). The Pile: An 800GB dataset of diverse text for language modeling.
Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure.
Hinton, G., Vinyals, O., and Dean, J. (2015). Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop.
Howard, J. and Ruder, S. (2018). Universal language model fine-tuning for text classification. In ACL 2018, pages 328–339. ACL.
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. (2023). Mistral 7B.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach.
Loshchilov, I. and Hutter, F. (2019). Decoupled weight decay regularization.
McInnes, L., Healy, J., and Astels, S. (2017). hdbscan: Hierarchical density based clustering. JOSS, 2(11):205.
McInnes, L., Healy, J., and Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. ArXiv e-prints.
McInnes, L., Healy, J., Saul, N., and Grossberger, L. (2018). UMAP: Uniform manifold approximation and projection. JOSS, 3(29):861.
Mikolov, T., Chen, K., Corrado, G. S., and Dean, J. (2013). Efficient estimation of word representations in vector space.
Ni, J., Hernandez Abrego, G., Constant, N., Ma, J., Hall, K., Cer, D., and Yang, Y. (2022). Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models. In ACL 2022, pages 1864–1874.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. JMLR, 12:2825–2830.
Radford, A. and Narasimhan, K. (2018). Improving language understanding by generative pre-training.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. (2019). Language models are unsupervised multitask learners.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 21.
Reimers, N., Beyer, P., and Gurevych, I. (2016). Task-oriented intrinsic evaluation of semantic textual similarity. In COLING 2016, pages 87–96.
Reimers, N. and Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In EMNLP 2019. ACL.
Reimers, N. and Gurevych, I. (2020). Making monolingual sentence embeddings multilingual using knowledge distillation. In EMNLP 2020, pages 4512–4525. ACL.
Robertson, S., Walker, S., Jones, S., Hancock-Beaulieu, M. M., and Gatford, M. (1994). Okapi at TREC-3. In TREC-3, pages 109–126. NIST.
Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2020). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.
Sun, B., Feng, J., and Saenko, K. (2016). Return of frustratingly easy domain adaptation. In AAAI’16, pages 2058–2065.
Taleb, I., Serhani, M. A., and Dssouli, R. (2018). Big data quality assessment model for unstructured data. In IIT 2018, pages 69–74.
Thakur, N., Reimers, N., Daxenberger, J., and Gurevych, I. (2021). Augmented SBERT: Data augmentation method for improving bi-encoders for pairwise sentence scoring tasks. In NAACL 2021, pages 296–310. ACL.
Tiedemann, J. and Thottingal, S. (2020). OPUS-MT - Building open translation services for the World. In EAMT.
Tunstall, L., Beeching, E., Lambert, N., Rajani, N., Rasul, K., Belkada, Y., Huang, S., von Werra, L., Fourrier, C., Habib, N., Sarrazin, N., Sanseviero, O., Rush, A. M., and Wolf, T. (2023). Zephyr: Direct distillation of LM alignment.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. In Proceedings of the 31st NIPS, NIPS’17, pages 6000–6010.
Wang, W., Wei, F., Dong, L., Bao, H., Yang, N., and Zhou, M. (2020). MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. In Proceedings of the 34th NIPS, NIPS’20.
Wilson, G. and Cook, D. J. (2020). A survey of unsupervised deep domain adaptation. TIST 2020.