Authors:
Erik Nikulski 1; Julius Gonsior 2; Claudio Hartmann 2 and Wolfgang Lehner 2
Affiliations:
1 School of Computing and Augmented Intelligence, Arizona State University, Tempe, AZ, U.S.A.
2 Database Research Group, Dresden University of Technology, Dresden, Germany
Keyword(s):
Natural Language Processing, Domain Adaptation, Semantic Textual Similarity, Semantic Embedding Model, Topic Model.
Abstract:
Industrial textual datasets can be highly domain-specific, containing abbreviations, terms, and identifiers that are only understandable with in-domain knowledge. In this work, we introduce guidelines for developing a domain-specific topic modeling approach that comprises an extensive domain-specific preprocessing pipeline along with the domain adaptation of a semantic document embedding model. While preprocessing is generally assumed to be a trivial step, for real-world datasets it is often a cumbersome and complex task requiring substantial human effort. In the presented approach, preprocessing is an essential step in representing domain-specific information more explicitly. To further enhance the domain adaptation process, we introduce a partially automated labeling scheme to create a set of in-domain labeled data. We demonstrate a 22% performance increase in the semantic embedding model compared to zero-shot performance on an industrial, domain-specific dataset. As a result, the topic model improves its ability to generate relevant topics and extract representative keywords and documents.