Domain Adaption of a Heterogeneous Textual Dataset for Semantic Similarity Clustering
Erik Nikulski, Julius Gonsior, Claudio Hartmann, Wolfgang Lehner
2025
Abstract
Industrial textual datasets can be highly domain-specific, containing abbreviations, terms, and identifiers that are understandable only with in-domain knowledge. In this work, we introduce guidelines for developing a domain-specific topic modeling approach that combines an extensive domain-specific preprocessing pipeline with the domain adaption of a semantic document embedding model. While preprocessing is generally assumed to be a trivial step, for real-world datasets it is often a cumbersome and complex task that requires considerable human effort. In the presented approach, preprocessing is an essential step that represents domain-specific information more explicitly. To further enhance the domain adaption process, we introduce a partially automated labeling scheme to create a set of in-domain labeled data. On an industrial, domain-specific dataset, we demonstrate a 22% performance increase of the semantic embedding model compared to its zero-shot performance. As a result, the topic model improves its ability to generate relevant topics and to extract representative keywords and documents.
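As a rough illustration of what "representing domain-specific information more explicitly" could mean in practice, the following minimal sketch expands abbreviations and replaces opaque identifiers with a placeholder token. It is not the authors' pipeline: the abbreviation dictionary, the identifier pattern, the `<part_id>` placeholder, and the function name `preprocess` are all hypothetical and chosen only for this example.

```python
import re

# Hypothetical in-domain abbreviation dictionary (an assumption, not from the paper).
ABBREVIATIONS = {
    "temp": "temperature",
    "mtr": "motor",
    "repl": "replaced",
}

# Hypothetical identifier pattern: two or three capital letters, a hyphen,
# then at least three digits (for example "AB-1234").
PART_ID = re.compile(r"\b[A-Z]{2,3}-\d{3,}\b")


def preprocess(text: str) -> str:
    """Make domain-specific information explicit in plain text."""
    # Replace opaque identifiers with a generic placeholder token.
    text = PART_ID.sub("<part_id>", text)
    # Expand known abbreviations token by token, ignoring case and trailing dots.
    tokens = []
    for token in text.split():
        key = token.lower().rstrip(".")
        tokens.append(ABBREVIATIONS.get(key, token))
    return " ".join(tokens)


print(preprocess("Mtr AB-1234 repl. due to high temp"))
```

In a real setting, the dictionary and patterns would have to be curated with domain experts, which is where the human effort mentioned above comes in.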
Paper Citation
in Harvard Style
Nikulski E., Gonsior J., Hartmann C. and Lehner W. (2025). Domain Adaption of a Heterogeneous Textual Dataset for Semantic Similarity Clustering. In Proceedings of the 14th International Conference on Data Science, Technology and Applications - Volume 1: DATA; ISBN 978-989-758-758-0, SciTePress, pages 31-42. DOI: 10.5220/0013460200003967
in Bibtex Style
@conference{data25,
author={Erik Nikulski and Julius Gonsior and Claudio Hartmann and Wolfgang Lehner},
title={Domain Adaption of a Heterogeneous Textual Dataset for Semantic Similarity Clustering},
booktitle={Proceedings of the 14th International Conference on Data Science, Technology and Applications - Volume 1: DATA},
year={2025},
pages={31-42},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013460200003967},
isbn={978-989-758-758-0},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 14th International Conference on Data Science, Technology and Applications - Volume 1: DATA
TI - Domain Adaption of a Heterogeneous Textual Dataset for Semantic Similarity Clustering
SN - 978-989-758-758-0
AU - Nikulski E.
AU - Gonsior J.
AU - Hartmann C.
AU - Lehner W.
PY - 2025
SP - 31
EP - 42
DO - 10.5220/0013460200003967
PB - SciTePress