Authors:
Erik Nikulski 1; Julius Gonsior 2; Claudio Hartmann 2 and Wolfgang Lehner 2
Affiliations:
1 School of Computing and Augmented Intelligence, Arizona State University, Tempe, AZ, U.S.A.
2 Database Research Group, Dresden University of Technology, Dresden, Germany
Keyword(s):
Natural Language Processing, Domain Adaptation, Semantic Textual Similarity, Semantic Embedding Model, Topic Model.
Abstract:
Industrial textual datasets can be highly domain-specific, containing abbreviations, terms, and identifiers that are only understandable with in-domain knowledge. In this work, we introduce guidelines for developing a domain-specific topic modeling approach that comprises an extensive domain-specific preprocessing pipeline along with the domain adaptation of a semantic document embedding model. While preprocessing is generally assumed to be a trivial step, for real-world datasets it is often a cumbersome and complex task requiring substantial human effort. In the presented approach, preprocessing is an essential step in representing domain-specific information more explicitly. To further enhance the domain adaptation process, we introduce a partially automated labeling scheme to create a set of in-domain labeled data. We demonstrate a 22% performance increase in the semantic embedding model compared to zero-shot performance on an industrial, domain-specific dataset. As a result, the topic model improves its ability to generate relevant topics and extract representative keywords and documents.