Paper

Authors: Erik Nikulski (1), Julius Gonsior (2), Claudio Hartmann (2), and Wolfgang Lehner (2)

Affiliations: (1) School of Computing and Augmented Intelligence, Arizona State University, Tempe, AZ, U.S.A.; (2) Database Research Group, Dresden University of Technology, Dresden, Germany

Keyword(s): Natural Language Processing, Domain Adaption, Semantic Textual Similarity, Semantic Embedding Model, Topic Model.

Abstract: Industrial textual datasets can be very domain-specific, containing abbreviations, terms, and identifiers that are only understandable with in-domain knowledge. In this work, we introduce guidelines for developing a domain-specific topic modeling approach that includes an extensive domain-specific preprocessing pipeline along with the domain adaption of a semantic document embedding model. While preprocessing is generally assumed to be a trivial step, for real-world datasets, it is often a cumbersome and complex task requiring lots of human effort. In the presented approach, preprocessing is an essential step in representing domain-specific information more explicitly. To further enhance the domain adaption process, we introduce a partially automated labeling scheme to create a set of in-domain labeled data. We demonstrate a 22% performance increase in the semantic embedding model compared to zero-shot performance on an industrial, domain-specific dataset. As a result, the topic model improves its ability to generate relevant topics and extract representative keywords and documents.

CC BY-NC-ND 4.0

Paper citation in several formats:
Nikulski, E., Gonsior, J., Hartmann, C., and Lehner, W. (2025). Domain Adaption of a Heterogeneous Textual Dataset for Semantic Similarity Clustering. In Proceedings of the 14th International Conference on Data Science, Technology and Applications - Volume 1: DATA; ISBN 978-989-758-758-0; ISSN 2184-285X, SciTePress, pages 31-42. DOI: 10.5220/0013460200003967

@conference{data25,
author={Erik Nikulski and Julius Gonsior and Claudio Hartmann and Wolfgang Lehner},
title={Domain Adaption of a Heterogeneous Textual Dataset for Semantic Similarity Clustering},
booktitle={Proceedings of the 14th International Conference on Data Science, Technology and Applications - Volume 1: DATA},
year={2025},
pages={31-42},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013460200003967},
isbn={978-989-758-758-0},
issn={2184-285X},
}

TY - CONF
JO - Proceedings of the 14th International Conference on Data Science, Technology and Applications - Volume 1: DATA
TI - Domain Adaption of a Heterogeneous Textual Dataset for Semantic Similarity Clustering
SN - 978-989-758-758-0
IS - 2184-285X
AU - Nikulski, E.
AU - Gonsior, J.
AU - Hartmann, C.
AU - Lehner, W.
PY - 2025
SP - 31
EP - 42
DO - 10.5220/0013460200003967
PB - SciTePress