Lbl2Vec: An Embedding-based Approach for Unsupervised Document

Retrieval on Predeﬁned Topics

Tim Schopf, Daniel Braun and Florian Matthes

Department of Informatics, Technical University of Munich, Boltzmannstrasse 3, Garching, Germany

Keywords:

Natural Language Processing, Document Retrieval, Unsupervised Document Classiﬁcation.

Abstract:

In this paper, we consider the task of retrieving documents with predeﬁned topics from an unlabeled document

dataset using an unsupervised approach. The proposed unsupervised approach requires only a small number of

keywords describing the respective topics and no labeled documents. Existing approaches relied heavily either

on a large amount of additionally encoded world knowledge or on term-document frequencies. In contrast,

we introduce a method that learns jointly embedded document and word vectors solely from the unlabeled

document dataset in order to ﬁnd documents that are semantically similar to the topics described by the

keywords. The proposed method requires almost no text preprocessing but is simultaneously effective at

retrieving relevant documents with high probability. When successively retrieving documents on different

predeﬁned topics from publicly available and commonly used datasets, we achieved an average area under the

receiver operating characteristic curve value of 0.95 on one dataset and 0.92 on another. Further, our method

can be used for multiclass document classiﬁcation, without the need to assign labels to the dataset in advance.

Compared with an unsupervised classiﬁcation baseline, we increased F1 scores from 76.6 to 82.7 and from

61.0 to 75.1 on the respective datasets. For easy replication of our approach, we make the developed Lbl2Vec code publicly available as a ready-to-use tool under the 3-Clause BSD license.∗

1 INTRODUCTION

In this paper, we combine the advantage of an unsuper-

vised approach with the possibility to predeﬁne topics.

Precisely, given a large number of unlabeled docu-

ments, we would like to retrieve documents related to

certain topics that we already know are present in the

corpus. This is becoming a common task, considering

not only the simplicity of retrieving documents by, e.g.,

scraping web pages, mails or other sources, but also

the labeling cost. For illustration purposes, we imagine

the following scenario: we possess a large number of

news articles extracted from sports sections of differ-

ent newspapers and would like to retrieve articles that

are related to certain sports, such as hockey, soccer

or basketball. Unfortunately, we can only rely on the

article texts for this task, as the metadata of the articles

contain no information about their content. Initially,

this appears like a common text classiﬁcation task.

However, there arise two issues that make the use of

https://orcid.org/0000-0003-3849-0394

https://orcid.org/0000-0001-8120-3368

https://orcid.org/0000-0002-6667-5452

∗ https://github.com/sebischair/Lbl2Vec

conventional classiﬁcation methods unsuitable. First,

we would have to annotate our articles at a high cost,

as conventional supervised text classiﬁcation methods

need a large amount of labeled training data (Zhang

et al., 2020). Second, we might not be interested in

any sports apart from the previously speciﬁed ones.

However, our dataset of sports articles most likely also

includes articles on other sports, such as swimming or

running. If we want to apply a supervised classiﬁca-

tion method, we would either have to annotate even

those articles that are of no interest to us or think about

suitable previous cleaning steps, to remove unwanted

articles from our dataset. Both options would require

signiﬁcant additional expense.

In this paper, we present the Lbl2Vec approach, which retrieves documents on predefined topics from a large corpus based on unsupervised learning. This enables us to retrieve the wanted sports articles related to hockey, soccer and basketball only, without having to annotate any data. The proposed Lbl2Vec approach solely relies on semantic similarities between documents and keywords describing a certain topic. Using semantic meanings intuitively

matches the approach of a human being and has previ-


Schopf, T., Braun, D. and Matthes, F.

Lbl2Vec: An Embedding-based Approach for Unsupervised Document Retrieval on Predeﬁned Topics.

DOI: 10.5220/0010710300003058

In Proceedings of the 17th International Conference on Web Information Systems and Technologies (WEBIST 2021), pages 124-132

ISBN: 978-989-758-536-4; ISSN: 2184-3252

ously been proven to be capable of categorizing unla-

beled texts (Chang et al., 2008). With this approach,

we signiﬁcantly decrease the cost of annotating data,

as we only need a small number of keywords instead

of a large number of labeled documents.

Lbl2Vec works by creating jointly embedded word, document, and label vectors. The label vectors are derived from predefined keywords of each topic. Since label and document vectors are embedded in the same feature space, we can subsequently measure their semantic relationship by calculating their cosine similarity. Based on this semantic similarity, we can decide whether to assign a document to a certain topic or not.

We show that our approach produces reliable results while saving annotation costs and requiring almost no text preprocessing steps. To this end, we apply our approach to two publicly available and commonly used document classification datasets. Moreover, we make our Lbl2Vec code publicly available as a ready-to-use tool.

2 RELATED WORK

Most related research can be summarized under

the notion of dataless classiﬁcation, introduced

by Chang et al. (2008). Broadly, this includes any

approach that aims to classify unlabeled texts based

on label descriptions only. Our approach differs

slightly from these, as we primarily attempt to retrieve

documents on predeﬁned topics from an unlabeled

document dataset without the need to consider

documents belonging to different topics of no interest.

Nevertheless, some similarities, such as the ability of

multiclass document classiﬁcation emerge, allowing a

rough comparison of our approach with those from

the dataless classiﬁcation, which can further be

divided along two dimensions: 1) semi-supervised

vs. unsupervised approaches and 2) approaches that

use a large amount of additional world knowledge

vs. ones that mainly rely on the plain document corpus.

Semi-supervised approaches seek to annotate a small subset of the document corpus without supervision and subsequently leverage the labeled subset to train a supervised classifier for the rest of the corpus. In one of the earliest approaches that fit into this category, Ko and Seo (2000) derived training sentences from manually defined category keywords without supervision. Then, they used the derived sentences to train a supervised Naïve Bayes classifier with minor modifications. Similarly, Liu et al. (2004) extracted a subset of documents with keywords and then applied a supervised Naïve Bayes-based expectation–maximization algorithm (Dempster et al., 1977) for classification.

Unsupervised approaches, by contrast, use similarity scores between documents and target categories to classify the entire unlabeled dataset. Haj-Yahia et al. (2019) proposed keyword enrichment (KE) and subsequent unsupervised classification based on latent semantic analysis (LSA) (Deerwester et al., 1990) vector cosine similarities. Another approach worth mentioning in this context is the pure dataless hierarchical classification used by Song and Roth (2014) to evaluate different semantic representations. Our approach also fits into this unsupervised dimension, as we do not employ document labels and retrieve documents from the entire corpus based on cosine similarities only.

A large amount of additional world knowledge from different data sources has been widely exploited in many previous approaches to incorporate more context into the semantic relationship between documents and target categories. Chang et al. (2008) used Wikipedia as a source of world knowledge to compute explicit semantic analysis embeddings (Gabrilovich and Markovitch, 2007) of labels and documents. Afterward, they applied nearest neighbor classification to assign the most likely label to each document. In this regard, their early work had a major impact on further research, which subsequently focused heavily on adding a lot of world knowledge for dataless classification. Yin et al. (2019) used various public entailment datasets to train a bidirectional encoder representations from transformers (BERT) model (Devlin et al., 2019) and used the pretrained BERT entailment model to directly classify texts from different datasets.

Using mainly the plain document corpus for this task, however, has been less researched so far. In one of the earlier approaches, Rao et al. (2006) derived and assigned document labels based on a k-means word clustering. Besides, Chen et al. (2015) introduced descriptive latent Dirichlet allocation, which could perform classification with only category description words and unlabeled documents, thereby eradicating the need for a large amount of world knowledge from external sources. Since our approach only needs some predefined topic keywords besides the unlabeled document corpus, it also belongs to this category. However, unlike previous approaches that mainly used the plain document corpus, we do not rely on term-document frequency scores but learn new semantic embeddings from scratch, which was inspired by the topic modeling approach of Angelov (2020).

A different related research area addresses ad hoc document retrieval. Approaches related to this area attempt to rank documents based on a relevance score to a specific user query (Baeza-Yates and Ribeiro-Neto, 1999). For instance, Gysel et al. (2018) proposed a neural vector space model that learns document representations without supervision, and Ai et al. (2016) introduced a modified paragraph vector model for ad hoc document retrieval. However, our approach differs from these, as we do not want to retrieve documents based on user queries but on topics. Further, we are not particularly interested in ranking within the retrieved documents.

3 Lbl2Vec METHOD

3.1 General Approach

In the first step, our Lbl2Vec model learns jointly embedded word vectors $W$ and document vectors $D$ from an unlabeled document corpus. Afterward, we use the embeddings $K \subset W$ of manually defined keywords that describe topics $T$ to learn label embeddings $L$ within the same feature space. Since all learned embeddings $(W, D, L)$ share the same feature space, their distance can be considered their semantic similarity. To learn a label embedding $\vec{l}_i$, we find document embeddings $\vec{d}_1^i, \ldots, \vec{d}_m^i$ that are close to the descriptive keyword embeddings $\vec{k}_1^i, \ldots, \vec{k}_n^i$ of topic $t_i$. Afterward, we compute the centroid of the outlier-cleaned document embeddings as the label embedding $\vec{l}_i$ of topic $t_i$. We compute document rather than keyword centroids since our experiments showed that it is more difficult to retrieve documents based on similarities to keywords only, even if they share the same feature space. Moreover, we clean outliers to remove documents that may be related to some of the descriptive keywords but do not properly match the intended topic. As a result, our experiments showed a more accurate label embedding and slightly improved document retrieval performance. Figure 1 provides an exemplary illustration of the different learned embeddings. After learning, we can consider the distance of label embedding $\vec{l}_i$ to an arbitrary document embedding $\vec{d}$ as their semantic similarity. Since we argue that the learned label embeddings are mappings of topics in the semantic feature space, this also represents the semantic similarity between $t_i$ and $d$. Hence, we use these semantic similarities to finally retrieve those documents related to our predefined topics.

[Figure 1: scatter plot of a feature space around the topic Basketball. Keyword points shown: Basketball, NBA, LeBron, Jordan, Game, Ball, Player, Hoop, Court, Points, Score, Sports, Team, Throw. Document points are labeled with numeric IDs.]

Figure 1: Example illustration of a semantic feature space related to Basketball. Blue: Descriptive keyword embeddings. Black: Document embeddings that are semantically similar to the keywords and each other. Red: Outlier document embeddings. Green: Label embedding.

3.2 Learning Jointly Embedded Semantic Representations

To train our jointly embedded word and document vectors, we use the paragraph vector framework introduced by Le and Mikolov (2014). Since the distributed bag of words version of paragraph vector (PV-DBOW) has been shown to perform better than its alternative (Lau and Baldwin, 2016), we use this architecture. However, PV-DBOW only trains document embeddings but not word embeddings in its original version. Therefore, we employ a slightly modified implementation that concurrently learns word embeddings and was first mentioned by Dai et al. (2015). In this modified version, we interleave the PV-DBOW training with Skip-gram (Mikolov et al., 2013) word embedding training on the same corpus. As the Skip-gram architecture is very similar to the PV-DBOW architecture, we simply need to exchange the predicting paragraph vector with a predicting word vector for this purpose. Then, iterative training on the interleaved PV-DBOW and Skip-gram architectures enables us to simultaneously learn word and document embeddings that share the same feature space.

After learning all document and word embeddings, we use the topic keywords for label embedding training. For each topic of interest, we need to manually define at least one keyword that can describe the topic properly. Once all keywords are defined, we perform the following procedure for each topic of interest. By applying

$$\vec{\mu}_e = \frac{1}{n} \sum_{x=1}^{n} \vec{e}_x \qquad (1)$$

to calculate a centroid $\vec{\mu}_e$ of embeddings $\vec{e}_1, \ldots, \vec{e}_n$, we

WEBIST 2021 - 17th International Conference on Web Information Systems and Technologies


obtain the centroid $\vec{\mu}_{k_i}$ of keyword embeddings for a topic $t_i$. Afterward, we calculate the cosine similarity of $\vec{\mu}_{k_i}$ to each $\vec{d} \in D$ and sort the document embeddings in descending order. Beginning at the document embedding with the highest cosine similarity, we now successively add each document embedding to a set of candidate document embeddings $D_{c_i} \subset D$ that has a high semantic similarity to the descriptive keywords of topic $t_i$. To include only document embeddings with high cosine similarities in $D_{c_i}$, we additionally need to set values for the three following parameters.

• $s \in \{s \in \mathbb{R} \mid -1 \le s \le 1\}$ as the similarity threshold. Add only document embeddings to $D_{c_i}$ successively while $\cos(\vec{\mu}_{k_i}, \vec{d}) > s$ is true.

• $d_{min} \in \{d_{min} \in \mathbb{N} \mid 1 \le d_{min} \le d_{max} \le |D|\}$ as the minimum number of document embeddings that have to be added to $D_{c_i}$ successively. This parameter prevents the selection of an insufficient number of documents in case we set $s$ too restrictively.

• $d_{max} \in \{d_{max} \in \mathbb{N} \mid 1 \le d_{min} \le d_{max} \le |D|\}$ as the maximum number of document embeddings that may be added to $D_{c_i}$ successively.
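The candidate selection described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function names `centroid`, `cosine`, and `select_candidates` are hypothetical, the toy vectors are made up, and the interpretation that documents are added until $d_{min}$ is reached even when the similarity falls below $s$ is an assumption based on the parameter descriptions.

```python
import math

def centroid(vectors):
    """Equation (1): component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if one of them is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_candidates(keyword_vecs, doc_vecs, s=0.3, d_min=1, d_max=None):
    """Return indices of candidate documents for one topic, most similar first."""
    d_max = len(doc_vecs) if d_max is None else d_max
    mu_k = centroid(keyword_vecs)
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda j: cosine(mu_k, doc_vecs[j]), reverse=True)
    candidates = []
    for j in ranked:
        if len(candidates) >= d_max:
            break  # never exceed d_max candidates
        # Add while the similarity exceeds s, or until d_min is reached.
        if cosine(mu_k, doc_vecs[j]) > s or len(candidates) < d_min:
            candidates.append(j)
        else:
            break
    return candidates

# Toy 2-dimensional embeddings (made-up values).
keywords = [[1.0, 0.0], [0.8, 0.2]]
docs = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.3], [-1.0, 0.0]]
print(select_candidates(keywords, docs, s=0.5, d_min=1, d_max=3))
```

With these toy values, documents 0 and 2 pass the threshold; raising `d_min` to 3 would additionally force the next most similar document into the candidate set.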

To ensure a more accurate label embedding later, we now clean outliers from the resulting set of candidate document embeddings $D_{c_i}$. For this purpose, we apply local outlier factor (LOF) (Breunig et al., 2000) cleaning. If the LOF algorithm identifies document embeddings $\vec{d}_{outlier}$ with significantly lower local density than that of their neighbors, we remove these document embeddings from $D_{c_i}$. Hence, we receive the set of relevant document embeddings $D_{r_i} \subseteq D$ for topic $t_i$. Finally, we compute the centroid of all document embeddings in $D_{r_i}$ and define this as our label embedding $\vec{l}_i$ of topic $t_i$. Consequently, we obtain jointly embedded semantic representations of words, documents, and topic labels.
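To make the outlier cleaning step more concrete, the following sketch computes a simplified local outlier factor in plain Python and then takes the centroid of the remaining embeddings as the label embedding. It is an illustration under stated assumptions: the neighbourhood size `k`, the cut-off `max_lof`, and all function names are hypothetical choices that the paper does not specify, and the simplification uses exactly `k` neighbours with ties broken by index. In practice, a library implementation such as scikit-learn's `LocalOutlierFactor` would be the more robust choice.

```python
import math

def euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def centroid(vectors):
    return [sum(c) / len(vectors) for c in zip(*vectors)]

def lof_scores(points, k=2):
    """Simplified LOF: exactly k neighbours per point, ties broken by index."""
    n = len(points)
    dist = [[euclid(p, q) for q in points] for p in points]
    nbrs = [sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
            for i in range(n)]
    k_dist = [dist[i][nbrs[i][-1]] for i in range(n)]

    def lrd(i):
        # Local reachability density: inverse mean reachability distance.
        reach = [max(k_dist[j], dist[i][j]) for j in nbrs[i]]
        mean_reach = sum(reach) / len(reach)
        return 1.0 / mean_reach if mean_reach > 0 else float("inf")

    lrds = [lrd(i) for i in range(n)]
    scores = []
    for i in range(n):
        if math.isinf(lrds[i]):
            scores.append(1.0)  # exact duplicates are treated as inliers
        else:
            scores.append(sum(lrds[j] for j in nbrs[i]) / (len(nbrs[i]) * lrds[i]))
    return scores

def label_embedding(candidate_vecs, k=2, max_lof=1.5):
    """Drop candidates whose LOF exceeds max_lof, then average the rest."""
    scores = lof_scores(candidate_vecs, k)
    relevant = [v for v, score in zip(candidate_vecs, scores) if score <= max_lof]
    return centroid(relevant)

# Four tightly grouped candidates and one far-away outlier (made-up values).
candidates = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [10.0, 10.0]]
print(lof_scores(candidates))
print(label_embedding(candidates))
```

The far-away point receives a score well above 1 and is dropped, so the label embedding is the centroid of the four grouped points only.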

3.3 Receiving Documents on Predefined Topics

To decide whether the content of document $d$ is semantically similar to a single topic $t_i$, we need to calculate the cosine similarity between document embedding $\vec{d}$ and label embedding $\vec{l}_i$. Subsequently, the affiliation of $d$ to $t_i$ is indicated if $\cos(\vec{l}_i, \vec{d})$ exceeds a previously manually defined threshold value $\alpha_i \in \{\alpha_i \in \mathbb{R} \mid -1 \le \alpha_i \le 1\}$. Moreover, we can use the cosine similarities for classifying $d$ between multiple different predefined topics $t_1, \ldots, t_n$. To achieve this, we assign the label of the topic $t_i$ for which $\cos(\vec{l}_i, \vec{d}) = \max(\{\cos(\vec{l}_x, \vec{d}) : x = 1, \ldots, n\})$. Finally, we can also decide whether a document $d$ does not fit into one of our predefined topics. For this purpose, we define threshold values $\alpha_1, \ldots, \alpha_n$. In case that $d$ is classified as most similar to topic $t_i$, we discard the label assignment if $\cos(\vec{l}_i, \vec{d}) \le \alpha_i$. As a result, $d$ remains unlabeled, and we assume that the content of this document is unrelated to any of our predefined topics.
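The decision rule of this subsection can be illustrated with a short sketch. The function `classify`, the topic names, and the toy vectors and thresholds are hypothetical and only serve to show the argmax assignment combined with the per-topic threshold; this is not the authors' code.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def classify(doc_vec, label_vecs, alphas):
    """Return the most similar topic, or None if its threshold is not exceeded.

    label_vecs: dict mapping topic name -> label embedding
    alphas:     dict mapping topic name -> similarity threshold alpha_i
    """
    sims = {topic: cosine(vec, doc_vec) for topic, vec in label_vecs.items()}
    best = max(sims, key=sims.get)
    # Discard the assignment if the similarity does not exceed alpha_i.
    return best if sims[best] > alphas[best] else None

# Made-up 2-dimensional label embeddings and thresholds.
labels = {"hockey": [1.0, 0.0], "soccer": [0.0, 1.0]}
alphas = {"hockey": 0.6, "soccer": 0.6}

print(classify([0.9, 0.2], labels, alphas))    # close to the hockey label
print(classify([0.1, 0.9], labels, alphas))    # close to the soccer label
print(classify([-1.0, -0.5], labels, alphas))  # similar to neither topic
```

Retrieving documents for a single topic works the same way without the argmax: a document is returned for topic $t_i$ whenever its similarity to $\vec{l}_i$ exceeds $\alpha_i$.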

4 EXPERIMENTS

4.1 Dataset

We use the two publicly available classification datasets, 20Newsgroups and AG's Corpus, described in Table 1. In the following, we consider each class as an independent topic and use the provided class labels solely for evaluation. The 20Newsgroups dataset consists of almost 20,000 documents heterogeneously split across 20 different newsgroup classes. The original AG's Corpus is a collection of over 1 million news articles. We use the version of Zhang et al. (2015), who constructed four evenly distributed classes from the original dataset, resulting in more than 120,000 labeled documents.

Table 1: Summary of the used classification datasets.

Dataset        #Training documents   #Test documents   #Classes
20Newsgroups   11314                 7532              20
AG's Corpus    120000                7600              4

4.2 Keywords Deﬁnition

To determine suitable keywords for each topic rep-

resented by a class, we adopt the expert knowledge

approach of Haj-Yahia et al. (2019). Hence, we emu-

late human experts ourselves, that deﬁne some initial

keywords based on the class descriptions only. Then,

we randomly select some documents from each class

to further derive some salient keywords. In the case of

a strict unsupervised setting with completely unlabeled

datasets, human experts might describe a topic with

keywords based on their speciﬁc domain knowledge

alone and without necessarily being familiar with the

document contents.

4.3 Model Training

For model training, we need to convert all document words and topic keywords to lowercase. To finish our short preprocessing, we only have to tokenize the documents and assign IDs to them. For each dataset, we train an individual model. Accordingly, we pass the corresponding preprocessed documents and defined keywords to its own model. For our models to learn suitable embeddings, we need to set the hyperparameter values prior to training. Therefore, we conduct a short manual hyperparameter optimization by training Lbl2Vec models on the respective training datasets and evaluating the performance on the test datasets, which allows us to learn more precise embeddings while simultaneously avoiding overfitting. In the case of completely unlabeled datasets, the given standard hyperparameters can be used. The only significant hyperparameter setting difference between the two models, resulting from our hyperparameter optimization, is that we set a similarity threshold of $s = 0.30$ and $s = 0.43$ for the AG's Corpus and 20Newsgroups models, respectively. For both models, we choose $d_{min} = 100$, $d_{max} = |D|$, and 10 as the number of epochs for PV-DBOW training. As we use an unsupervised approach, we train our final models, similar to Haj-Yahia et al. (2019), on the entire corpora of the respective aggregated training and test datasets.

Dataset sources: 20Newsgroups: qwone.com/~jason/20Newsgroups; AG's Corpus: groups.di.unipi.it/~gulli/AG_corpus_of_news_articles
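The short preprocessing described in this subsection (lowercasing, tokenizing, assigning IDs) could look roughly as follows. This is a minimal sketch: the paper does not state which tokenizer is used, so the regular-expression tokenizer and the function name `preprocess` are assumptions made for illustration.

```python
import re

def preprocess(raw_documents, keywords_per_topic):
    """Lowercase and tokenize documents, assign integer IDs, lowercase keywords."""
    documents = {
        doc_id: re.findall(r"\w+", text.lower())  # assumed simple word tokenizer
        for doc_id, text in enumerate(raw_documents)
    }
    keywords = {
        topic: [keyword.lower() for keyword in topic_keywords]
        for topic, topic_keywords in keywords_per_topic.items()
    }
    return documents, keywords

docs, kws = preprocess(
    ["The NBA season starts.", "Oil prices rise again"],
    {"Sports": ["NBA", "Basketball"]},
)
print(docs)
print(kws)
```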

4.4 Topic Representation Analysis

We want to evaluate whether our Lbl2Vec approach is capable of adequately modeling predefined topics and thereby can return documents related to them. For that, we classify all documents in the AG's Corpus using our pretrained Lbl2Vec model. Afterward, we define the documents assigned to the same class by our model as one topic and analyze these topics using LDAvis (Sievert and Shirley, 2014). In addition, we compare the modeling capabilities on predefined topics of our Lbl2Vec approach to a common topic modeling approach. To this end, we apply latent Dirichlet allocation (LDA) (Blei et al., 2003) with $K = 4$ topics to the same dataset and visualize the modeled topics. Figure 2 shows that the LDA model finds two similar and two dissimilar topics. However, the topic sizes are distributed very heterogeneously, which contrasts with the uniform distribution of documents across all classes in the AG's Corpus. As opposed to this, our Lbl2Vec model finds topics that are equally sized, which is aligned with the underlying AG's Corpus. Further, the topics Science/Technology and Business are similar, whereas Sports and World are highly dissimilar to all other topics. Table 2 indicates that a standard topic modeling approach like LDA cannot model predefined topics such as the AG's

Corpus classes. The most relevant terms of the LDA topics mainly consist of different entities and do not allow us to relate the modeled topics to the AG's Corpus classes. However, from Table 3 we can conclude that our Lbl2Vec model can capture the semantic meaning of each predefined topic very well. In addition, the occurrence of technology companies such as Microsoft and Apple in the Science/Technology topic explains the similarity to the Business topic, as such companies are also highly relevant in a business context.

[Figure 2: intertopic distance maps (via multidimensional scaling) with axes PC1 and PC2; one panel for Lbl2Vec and one for LDA. Topic labels shown: Business, Science/Technology, World, Sports.]

Figure 2: Visualization of Lbl2Vec and LDA topic representation capabilities based on AG's Corpus. Each circle represents a topic, whereas each topic, in turn, consists of several documents classified as related by the respective models. The size of the circles is proportional to the relative occurrence of the respective topic in the corpus. Distances between circles represent semantic inter-topic similarities.

Table 2: Top 10 most relevant terms for each topic of the LDA model; we use the LDAvis relevance with λ = 0.1.

LDA topic   Most relevant terms
Topic 1     oil; crude; prices; microsoft; windows; dollar; reuters; barrel; stocks; yukos
Topic 2     ccia; thunderbird; generali; macau; cheetham; backman; hauritz; pizarro; rituxan; abdicate
Topic 3     orton; mashburn; bender; kwame; pippen; attanasio; elliss; icelandair; lefors; stottlemyre
Topic 4     wiltord; perrigo; quetta; dione; mattick; olympiad; panis; agis; bago; cracknell

Table 3: Top 10 most relevant terms for each topic of the Lbl2Vec model; we use the LDAvis relevance with λ = 0.1.

Lbl2Vec topic        Most relevant terms
World                iraq; killed; minister; prime; military; palestinian; minister; israeli; troops; darfur
Sports               cup; coach; sox; league; championship; yankees; champions; win; season; scored
Business             stocks; fullquote; profit; prices; aspx; quickinfo; shares; earnings; investor; oil
Science/Technology   microsoft; windows; users; desktop; music; linux; version; apple; search; browser

4.5 Multiclass Document Classiﬁcation

Results

When using our trained models to classify the entire document corpus of each dataset, we achieve the results stated in Table 4. We compared our models with a recent fully unsupervised text classification approach and a supervised baseline classifier. First, we observed that our Lbl2Vec models significantly outperformed the recent KE + LSA approach for each metric. This success indicated that using our jointly created embeddings for unsupervised classification is more suitable than using term-document frequencies, on which LSA is heavily reliant. Moreover, the results showed that our Lbl2Vec approach is a viable option for unsupervised classification when the labeling effort is estimated to be more expensive than the benefit of a more accurate classification. However, comparing our approach to the supervised baseline results, we observed that providing labels for each document is paramount if highly accurate classification results are required.

Table 4: Performance of our Lbl2Vec models when classifying all documents in the respective corpus. KE + LSA refers to the best possible fully unsupervised classification results of Haj-Yahia et al. (2019) on the datasets. The last row states their baseline classification results of a supervised multinomial Naïve Bayes approach. As we used micro-averaging to calculate our classification metrics, we obtained equal F1, Precision, and Recall scores within each model.

Method                   AG's Corpus             20Newsgroups
                         F1     Prec.   Rec.     F1     Prec.   Rec.
KE + LSA                 76.6   76.8    76.6     61.0   71.1    57.8
Lbl2Vec                  82.7   82.7    82.7     75.1   75.1    75.1
Supervised Naïve Bayes   89.8   89.8    89.9     85.0   87.1    85.4
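The remark in the caption of Table 4 that micro-averaging yields equal F1, Precision, and Recall can be checked with a small sketch. When every document receives exactly one predicted label, each misclassification counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so the pooled counts coincide. The function name and the toy labels below are hypothetical.

```python
def micro_scores(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 for single-label classification."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            tp += (p == label and t == label)
            fp += (p == label and t != label)
            fn += (p != label and t == label)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up gold labels and predictions for five documents.
y_true = ["World", "Sports", "Business", "Sci/Tech", "Sports"]
y_pred = ["World", "Sports", "Sci/Tech", "Sci/Tech", "Business"]
print(micro_scores(y_true, y_pred))
```

In this toy example three of five documents are classified correctly, so all three scores equal the accuracy of 0.6 (up to floating-point rounding). The sketch assumes at least one correct prediction; otherwise the F1 computation would divide by zero.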

4.6 Document Retrieval Evaluation

One of the main features of our Lbl2Vec approach is retrieving related documents on a single or multiple predefined topics without actually having to consider any further topics contained in the dataset that may not be of interest. For both datasets, we see each class as an independent topic. Therefore, we can use our trained Lbl2Vec models to retrieve topic-related documents for each class independently. When adjusting the topic similarity thresholds $\alpha_1, \ldots, \alpha_n$ for each topic $t_1, \ldots, t_n$ in the respective datasets, we can observe the receiver operating characteristic (ROC) curves in Figures 3 and 4. By adjusting the topic similarity parameter $\alpha$ to be closer to 1, we can reduce the false positive rate and retrieve proportionally more documents that are truly related to a topic. Figure 3 shows that the topics Business and Science/Technology have the lowest area under the ROC curve (AUC) values of all topics within the AG's Corpus. Further, we know from Figure 2 that these topics are similar. Hence, we infer that it is hard for our Lbl2Vec approach to distinguish between related topics. However, the better AUC values for the Sports and World topics in Figure 3 and their distance to other topics in Figure 2 show that our Lbl2Vec approach can create suitable topic representations given the absence of other similar topics in the dataset. The micro-average ROC curves of Figures 3 and 4 indicate that, if we want to achieve a false positive rate of less than 1% on average, we retrieve ≈ 20% of documents that are truly relevant for a topic. Therefore, we argue that our Lbl2Vec approach can sample a small dataset with high precision from a large corpus of documents. This smaller dataset can then be used, for example, as a starting point for a subsequent semi-supervised classification approach.

Figure 3: ROC curves of the Lbl2Vec model trained on the AG's Corpus.

Figure 4: ROC curves of the Lbl2Vec model trained on the 20Newsgroups.
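How a ROC curve and its AUC arise from sweeping the topic similarity threshold can be sketched as follows. This is an illustration, not the evaluation code used for Figures 3 and 4: the function names and the toy similarities are hypothetical, and for simplicity the sweep treats a document as retrieved when its similarity is greater than or equal to the current threshold.

```python
def roc_points(similarities, relevant):
    """Sweep the threshold over all observed similarities; return (FPR, TPR) pairs."""
    positives = sum(relevant)
    negatives = len(relevant) - positives
    points = [(0.0, 0.0)]
    for alpha in sorted(set(similarities), reverse=True):
        tp = sum(1 for s, r in zip(similarities, relevant) if s >= alpha and r)
        fp = sum(1 for s, r in zip(similarities, relevant) if s >= alpha and not r)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Made-up cosine similarities to one label embedding and gold relevance flags.
sims = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
rel = [1, 1, 0, 1, 0, 0]

curve = roc_points(sims, rel)
print(curve)
print(auc(curve))
```

A high threshold keeps the false positive rate at zero while already retrieving two of the three relevant toy documents, which mirrors the high-precision sampling use case described above.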

4.7 Keywords Analysis

We are additionally interested in how the choice of keywords affects our Lbl2Vec results. Since the keywords also directly affect the predefined topics, this simultaneously involves the analysis of topic distributions. We conduct some hypothesis tests to address the question of what characterizes good keywords and topics. For all our tests, we use the defined keywords of each topic from the concatenation of the two datasets to compute correlation coefficients and set a significance level of 0.05. We choose Kendall's $\tau$ as our correlation coefficient to measure monotonic relationships. It is robust against outliers and small datasets.
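As a reminder of how Kendall's $\tau$ measures monotonic relationships, the sketch below counts concordant and discordant pairs. The paper does not state which variant of the coefficient is used, so the tie-corrected $\tau_b$ variant is an assumption here; the function name and the toy values are hypothetical, and the p-value computation is omitted.

```python
import math

def kendall_tau_b(x, y):
    """Kendall's tau-b: (C - D) / sqrt((C + D + Tx) * (C + D + Ty))."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: ignored by tau-b
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x)
                      * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom

# Made-up example: number of keywords per topic vs. AUC value of that topic.
num_keywords = [10, 20, 30, 44]
auc_values = [0.90, 0.95, 0.93, 0.99]
print(kendall_tau_b(num_keywords, auc_values))
```

Five of the six toy pairs are concordant and one is discordant, giving $\tau = 4/6 \approx 0.67$.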

First, we test whether the trained Lbl2Vec model is subsequently better able to distinguish topic-related documents from unrelated ones the more topic-related keywords are used to describe a topic. This test assumes that more accurate descriptions of topics also require more topic-related keywords. Accordingly, we define our null hypothesis $H_0^{(1)}$ as "the AUC values of topics modeled by Lbl2Vec are unrelated to the number of topic-related predefined keywords" and our alternative hypothesis $H_1^{(1)}$ as "the AUC values of topics modeled by Lbl2Vec are positively related to the number of topic-related predefined keywords". At first glance, the correlation coefficient in Table 5 suggested a tendency toward a slightly positive correlation. However, the p-value exceeded our defined significance level. Therefore, our test results were statistically insignificant, and we cannot reject $H_0^{(1)}$. Consequently, we found no support for the assumption that Lbl2Vec can yield better topic models if we use more topic-related keywords, as there is insufficient evidence to infer a relationship between $X$ and $Y$.

Table 5: Correlation values that measure the relationship between $X$ = number of defined topic keywords and $Y$ = AUC value of a topic. $X_{min} = 10$ and $X_{max} = 44$.

Correlation coefficient   p-value
Kendall's τ = 0.19        0.20

Second, we assess whether using many similar keywords to describe a topic provides a better distinction from other topics than using many dissimilar keywords. We anticipate that Lbl2Vec topic models are better at distinguishing topic-related documents from unrelated ones if we define mostly similar keywords for a single topic. To test this, we initially define the average intratopic similarity of keyword embeddings $K_i$ of a topic $t_i$ as follows:

$$\Delta(i) = \frac{\sum_{\vec{k}_x, \vec{k}_y \in K_i,\; x \neq y} \cos(\vec{k}_x, \vec{k}_y)}{|K_i| \cdot (|K_i| - 1)} \qquad (2)$$

Subsequently, we define our null hypothesis $H_0^{(2)}$ as "the AUC values of topics modeled by Lbl2Vec are unrelated to the average intratopic similarity of topic keywords" and our alternative hypothesis $H_1^{(2)}$ as "the AUC values of topics modeled by Lbl2Vec are positively related to the average intratopic similarity of topic keywords". Based on the p-value in Table 6, we rejected $H_0^{(2)}$, and from the correlation coefficient, we concluded a statistically significant medium positive correlation between $X$ and $Y$. From this evidence, we found support for our original assumption that using similar keywords to describe a topic yields better Lbl2Vec models.

Table 6: Correlation values that measure the relationship between $X$ = average intratopic similarity of topic keywords and $Y$ = AUC value of a topic. $X_{min} = 0.15$ and $X_{max} = 0.37$.

Correlation coefficient   p-value
Kendall's τ = 0.33        0.02
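The average intratopic similarity can be computed directly from the keyword embeddings of one topic. The sketch below sums the cosine similarity over all ordered pairs of distinct keywords and divides by $|K_i| \cdot (|K_i| - 1)$, which is how we read Equation (2); the function names and the toy keyword vectors are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def intratopic_similarity(keyword_vecs):
    """Mean cosine similarity over all ordered pairs of distinct keywords."""
    n = len(keyword_vecs)
    total = sum(cosine(keyword_vecs[x], keyword_vecs[y])
                for x in range(n) for y in range(n) if x != y)
    return total / (n * (n - 1))

# Made-up 2-dimensional keyword embeddings of a single topic.
K_i = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(intratopic_similarity(K_i))
```

A topic needs at least two keywords for this quantity to be defined; with a single keyword the denominator is zero.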

The third test is based on our observation from Subsection 4.6 that Lbl2Vec models more accurate representations of topics dissimilar to all other topics within a dataset. We further investigate this aspect by examining whether topic keywords highly dissimilar to all other topic keywords allow Lbl2Vec to model more precise topic representations. For this test, we define the average intertopic similarity of keyword embeddings $K_i$ of a topic $t_i$ as

$$\delta(i) = \frac{1}{|T| - 1} \sum_{n \neq i} \frac{\sum_{\vec{k}_x \in K_i,\, \vec{k}_y \in K_n} \cos(\vec{k}_x, \vec{k}_y)}{|K_i| \cdot |K_n|} \qquad (3)$$

Afterward, we define our null hypothesis $H_0^{(3)}$ as "the AUC values of topics modeled by Lbl2Vec are unrelated to the average intertopic similarity of topic keywords" and our alternative hypothesis $H_1^{(3)}$ as "the AUC values of topics modeled by Lbl2Vec are negatively related to the average intertopic similarity of topic keywords". From Table 7, we concluded a moderate negative monotonic relationship between $X$ and $Y$. Moreover, from the p-value, we infer that our third hypothesis test is statistically significant and we can reject $H_0^{(3)}$. The defined topic keywords provide the foundation for the subsequent Lbl2Vec feature space embedding of a topic. The feature space location, in turn, determines the similarity of topics to each other. Accordingly, the dissimilarity of topic keywords transfers to the resulting Lbl2Vec topic representations and vice versa. Hence, in this statistically significant intertopic keywords similarity test, we found further support for our earlier observation that topics dissimilar to all other topics may be modeled more precisely by Lbl2Vec. Consequently, to obtain a more precise topic representation by Lbl2Vec, we need to define topic keywords that are as dissimilar as possible to the keywords of other topics.

Table 7: Correlation values that measure the relationship between $X$ = average intertopic similarity of topic keywords and $Y$ = AUC value of a topic. $X_{min} = 0.07$ and $X_{max} = 0.11$.

Correlation coefficient   p-value
Kendall's τ = -0.35       0.02

5 CONCLUSION

In this work, we introduced Lbl2Vec, an unsupervised approach to retrieving documents on predefined topics. It is based on jointly embedded word, document, and label vectors learned solely from an unlabeled document corpus. We showed that Lbl2Vec yields better-fitting models of predefined topics than conventional topic modeling approaches, such as LDA. Further, we demonstrated that Lbl2Vec allowed for unsupervised document classification and could retrieve documents on predefined topics with high precision by adjusting the topic similarity parameter. Finally, we analyzed how to define keywords that yield good Lbl2Vec models and concluded that we need to aim for high intratopic similarities and high intertopic dissimilarities of keywords. Lbl2Vec facilitates the retrieval of documents on predefined topics from an unlabeled document corpus, avoiding costly labeling work. We made our Lbl2Vec code as well as the data publicly available.

6 ETHICAL CONSIDERATIONS

We provide our work in good faith and in accordance with the ACL Code of Ethics (https://www.aclweb.org/portal/content/acl-code-ethics). However, our approach depends heavily on the underlying data. Therefore, users should preprocess the targeted datasets according to these ethical guidelines to prevent discrimination in the modeled topics. Further, our approach is heavily prone to bias introduced by the human expert defining the keywords and is unprotected against intentional misuse, allowing malicious users to abuse the retrieved topics. Another concern, as with many models, is the environmental and financial cost incurred in the training process. Although such costs are naturally involved in our case, they are quite low compared with those of current state-of-the-art language models. Thus, our approach is comparatively environmentally friendly and enables financially disadvantaged users to conduct further research.

ACKNOWLEDGEMENTS

The authors would like to thank Thomas Kinkeldei of ROKIN for his contributions to this paper.

This work has been supported by funds from the Bavarian Ministry of Economic Affairs, Regional Development and Energy as part of the program “Bayerischen Verbundförderprogramms (BayVFP) – Förderlinie Digitalisierung – Förderbereich Informations- und Kommunikationstechnik”.


WEBIST 2021 - 17th International Conference on Web Information Systems and Technologies
