Domain Adaption of a Heterogeneous Textual Dataset for Semantic Similarity Clustering

Erik Nikulski; Julius Gonsior; Claudio Hartmann; Wolfgang Lehner

doi:10.5220/0013460200003967

2025

Abstract

Industrial textual datasets can be very domain-specific, containing abbreviations, terms, and identifiers that are only understandable with in-domain knowledge. In this work, we introduce guidelines for developing a domain-specific topic modeling approach that includes an extensive domain-specific preprocessing pipeline along with the domain adaption of a semantic document embedding model. While preprocessing is generally assumed to be a trivial step, for real-world datasets, it is often a cumbersome and complex task requiring lots of human effort. In the presented approach, preprocessing is an essential step in representing domain-specific information more explicitly. To further enhance the domain adaption process, we introduce a partially automated labeling scheme to create a set of in-domain labeled data. We demonstrate a 22% performance increase in the semantic embedding model compared to zero-shot performance on an industrial, domain-specific dataset. As a result, the topic model improves its ability to generate relevant topics and extract representative keywords and documents.

Paper Citation

in Harvard Style

Nikulski E., Gonsior J., Hartmann C. and Lehner W. (2025). Domain Adaption of a Heterogeneous Textual Dataset for Semantic Similarity Clustering. In Proceedings of the 14th International Conference on Data Science, Technology and Applications - Volume 1: DATA; ISBN 978-989-758-758-0, SciTePress, pages 31-42. DOI: 10.5220/0013460200003967

in Bibtex Style

@conference{data25,
author={Erik Nikulski and Julius Gonsior and Claudio Hartmann and Wolfgang Lehner},
title={Domain Adaption of a Heterogeneous Textual Dataset for Semantic Similarity Clustering},
booktitle={Proceedings of the 14th International Conference on Data Science, Technology and Applications - Volume 1: DATA},
year={2025},
pages={31--42},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013460200003967},
isbn={978-989-758-758-0},
}

in EndNote Style

TY - CONF
JO - Proceedings of the 14th International Conference on Data Science, Technology and Applications - Volume 1: DATA
TI - Domain Adaption of a Heterogeneous Textual Dataset for Semantic Similarity Clustering
SN - 978-989-758-758-0
AU - Nikulski E.
AU - Gonsior J.
AU - Hartmann C.
AU - Lehner W.
PY - 2025
SP - 31
EP - 42
DO - 10.5220/0013460200003967
PB - SciTePress