leverages large-scale pre-trained language models
such as Bidirectional and Auto-Regressive
Transformers (BART) and Decoding-enhanced
BERT with Disentangled Attention (DeBERTa) fine-
tuned on NLI tasks to infer the semantic relationship
between input text and target labels. This enables the
model to generalize even to unseen categories. ZSC
is particularly effective in scenarios involving a large
number of classes or severe class imbalance, offering
a scalable and robust alternative to traditional
supervised methods. In contrast, similarity-based
classification methods represent texts as numerical
vectors (e.g., using TF-IDF or Word2Vec) and assign
labels based on the conceptual proximity between
document and class representations. These methods
are often favored for their low computational cost,
interpretability, and adaptability. Although traditional
vectorization techniques such as TF-IDF can deliver
strong discriminative power at scale, embedding-
based models such as Word2Vec provide richer
semantic context and often enhance classification
accuracy.
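To make the contrast concrete, the sketch below pairs an
NLI-based zero-shot classifier with a simple TF-IDF
cosine-similarity matcher. It is illustrative only: the model
name, the candidate labels, and the keyword descriptions are
assumptions for demonstration rather than the configuration
evaluated in this study, and embedding models such as
Word2Vec or SBERT could replace the TF-IDF vectors.

# Minimal sketch contrasting the two paradigms; model name, labels, and
# keyword descriptions are illustrative assumptions, not this study's setup.
from transformers import pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

doc = "The central bank raised interest rates to curb inflation."
label_descriptions = {
    "economy": "economy inflation interest rates markets finance",
    "sports": "sports match team player tournament score",
    "health": "health disease hospital treatment vaccine",
}
label_names = list(label_descriptions)

# 1) Zero-shot classification: an NLI-fine-tuned model scores how strongly
#    the document (premise) entails each candidate label (hypothesis).
zsc = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = zsc(doc, candidate_labels=label_names)
print("ZSC:", result["labels"][0], round(result["scores"][0], 3))

# 2) Similarity-based classification: place the document and the label
#    descriptions in one TF-IDF space and assign the closest label.
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(list(label_descriptions.values()) + [doc])
sims = cosine_similarity(matrix[-1], matrix[:-1])[0]
print("Similarity:", label_names[sims.argmax()], round(float(sims.max()), 3))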
In summary, LDA, ZSC, and similarity-based
classification represent three complementary
approaches to the topic classification problem, each
grounded in a different theoretical foundation:
unsupervised topic discovery, inference-based
generalization, and semantic similarity, respectively.
A systematic examination of these methods provides
valuable insights into designing effective
classification strategies, particularly in large-scale,
unlabeled data environments. The remainder of the
paper is organized as follows: Section 2 reviews the
related literature, Section 3 outlines the objectives of
the study, Section 4 presents the dataset and
methodology, Section 5 reports the experimental
results, and Sections 6 and 7 provide the discussion
and conclusion, respectively.
2 LITERATURE REVIEW
Li et al. (2024) proposed a classification model
designed to address multi-label text classification
tasks in scenarios involving unlabeled or weakly
labeled data. Their approach is grounded in ZSC and
relies on vector representations of class labels
generated through Sentence-BERT (SBERT)
embeddings. The semantic similarity between
documents and labels is computed using cosine
similarity, allowing the model to infer associations
even when class instances have never been seen
during training. The model also features a flexible
architecture capable of handling previously unseen
classes, making it particularly suitable for real-world
applications with limited annotated data. Although
the study does not specify the dataset used, its
contributions to weakly supervised, zero-shot multi-
label classification are significant. Conceptually, their
work aligns with the ZSC and similarity-based
classification paradigms explored in our study.
However, our work
distinguishes itself by offering a comparative analysis
of these methods alongside LDA, providing a more
holistic evaluation of topic-labeling strategies.
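The sketch below illustrates the label-embedding idea in
its simplest form: encode documents and label names with
SBERT and assign every label whose cosine similarity
exceeds a threshold. It is not Li et al.'s exact pipeline;
the checkpoint name, labels, and threshold are assumptions
made for illustration.

# Minimal sketch of label-embedding similarity (not Li et al.'s exact pipeline);
# the SBERT checkpoint, labels, and threshold below are assumed.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed SBERT checkpoint
labels = ["economy", "sports", "health", "technology"]
docs = ["The central bank raised interest rates to curb inflation."]

doc_emb = model.encode(docs, convert_to_tensor=True)
label_emb = model.encode(labels, convert_to_tensor=True)
scores = util.cos_sim(doc_emb, label_emb)          # shape: (n_docs, n_labels)

THRESHOLD = 0.3                                    # illustrative cut-off
for i, doc in enumerate(docs):
    assigned = [labels[j] for j in range(len(labels)) if scores[i, j] >= THRESHOLD]
    print(doc[:40], "->", assigned or ["<no label above threshold>"])

Because more than one label can clear the threshold, the same
mechanism naturally supports multi-label assignment and unseen
classes added at inference time.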
Similarly, Schopf et al. (2022) conducted a
comparative performance evaluation of ZSC and
similarity-based classification techniques, utilizing
state-of-the-art embedding models such as Simple
Contrastive Sentence Embedding (SimCSE) and
SBERT. Their analysis focused on
accuracy-driven metrics, without employing
threshold-based decision mechanisms. Unlike our
study, their evaluation did not incorporate LDA or
other topic modeling techniques, and the dataset used
was not disclosed. In contrast, our study integrates both
threshold-based decision strategies and traditional
topic modeling, enabling a more comprehensive
assessment of modern and classical approaches in a
unified framework. In a related study, Lakshmi and
Baskar (2021) introduced novel similarity metrics to
improve the performance of clustering algorithms in
text document grouping tasks. Their work highlights
the critical role of similarity functions in
unsupervised learning, particularly for semantic
grouping. While the dataset used was not specified,
the methodological foundation laid by their research
contributes theoretical grounding to the similarity-
based classification component of this study.
Finally, Yadav et al. (2025) developed a hybrid
topic modeling framework that integrates traditional
LDA with the contextual word embedding
capabilities of BERT to address the semantic
limitations of conventional topic models. Their
approach is further enhanced through the use of
clustering and dimensionality reduction techniques
and has been validated on multiple text datasets. The
integration of statistical and contextual
representations in this hybrid model enables the
generation of more coherent and interpretable topic
clusters. Our study similarly explores the
complementary strengths of LDA and embedding-
based methods, making this work a relevant and
influential reference within the broader literature.
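To ground the hybridization idea, the sketch below
concatenates per-document LDA topic proportions with
contextual sentence embeddings and clusters the combined
representation. It is a minimal illustration of the general
approach rather than Yadav et al.'s framework: the libraries
(gensim, sentence-transformers, scikit-learn), the checkpoint,
and the toy documents are assumptions, and a dimensionality
reduction step could be added before clustering.

# Minimal sketch of the general LDA + contextual-embedding hybridization idea
# (not Yadav et al.'s framework); library choices and parameters are assumed.
import numpy as np
from gensim import corpora
from gensim.models import LdaModel
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

docs = [
    "The central bank raised interest rates to curb inflation.",
    "The striker scored twice in the championship final.",
    "A new vaccine trial showed promising results.",
]
tokens = [d.lower().split() for d in docs]

# 1) Statistical view: per-document topic proportions from LDA.
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(t) for t in tokens]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=3, random_state=42)
topic_vecs = np.zeros((len(corpus), lda.num_topics))
for i, bow in enumerate(corpus):
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        topic_vecs[i, topic_id] = prob

# 2) Contextual view: dense sentence embeddings.
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)

# 3) Concatenate both views and cluster to obtain topic groups.
features = np.hstack([topic_vecs, embeddings])
clusters = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(features)
print(clusters)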