Data Scarcity: Methods to Improve the Quality of Text Classification

Ingo Glaser, Shabnam Sadegharmaki, Basil Komboz and Florian Matthes

Chair of Software Engineering for Business Information Systems, Technical University of Munich, Boltzmannstrasse 3, 85748 Garching bei München, Germany

Allianz SE, Munich, Germany

Keywords:

Data Scarcity, Natural Language Processing, Text Classiﬁcation, Legal Text Analytics.

Abstract:

Legal document analysis is an important research area. The classiﬁcation of clauses or sentences enables

valuable insights such as the extraction of rights and obligations. However, datasets consisting of contracts

or other legal documents are quite rare, particularly regarding the German language. The exorbitant cost of

manually labeled data, especially in regard to text classiﬁcation, is the motivation of many studies that suggest

alternative methods to overcome the lack of labeled data.

This paper experimentally examines the effects of text data augmentation on the quality of classification tasks. While a large number of techniques exist, this work examines a selected subset including semi-supervised learning methods and thesaurus-based data augmentation. We show not only that thesaurus-based data augmentation as well as text augmentation with synonyms and hypernyms can improve the classification results, but also that the effect of such methods depends on the underlying data structure.

1 INTRODUCTION

With the burst of available textual data, automation

of certain processes in various areas, such as adver-

tisement, risk evaluation, or translation, is becoming

more and more attractive. As a result, text classiﬁ-

cation became one of the essential tasks in natural

language processing (NLP) and knowledge discov-

ery. Classiﬁcation as a supervised learning (SL) tech-

nique has been applied widely in different areas such

as language modeling, sentiment analysis, topic mod-

eling, and named entity recognition (Allahyari et al.,

2017). However, the other side of the coin is the availability of training data in certain domains. Supervised techniques need a sufficient amount of labeled data to fit a model that generalizes. Hence, data is not equal to training data. The dominant source of annotated training data is human experts. However, creating annotated corpora is not easy: the annotation process is time-consuming, expensive, and, more importantly, error-prone. This challenge is even more relevant for deep learning (DL) techniques, which require labeled data on a massive scale.

ORCID: https://orcid.org/0000-0002-5280-6431

Many studies address this challenge. For instance, the lack of training data was the primary motivation behind the advent of semi-supervised learning (SSL) methods. The most basic approach, self-training (ST), was introduced back in the 1960s (Chapelle et al., 2006). Recently, graph-based SSL approaches have gained popularity due to their flexibility and ease of interpretation (Sawant and Prabukumar, 2018). Also, state-of-the-art literature considers

multi-instance or transfer learning (Cheplygina et al.,

2019). However, domains such as the legal domain

rely on a vast amount of domain knowledge and as a

result, require extensive feature engineering. Further-

more, for these domains an explainable classiﬁcation

result is crucial. The goal of this paper is to not just

support the traditional domains suited for NLP, but

also these highly speciﬁc domains. Hence, the scope

of this paper is limited to ST, label propagation (LP),

and thesaurus-based data augmentation as traditional

machine learning (ML) techniques.

As mentioned, many studies address the problem of data scarcity, yet there is no single state-of-the-art way to overcome it. This leads to the hypothesis behind this paper: The effect of methods to improve classification tasks despite data scarcity depends on the characteristics of the underlying dataset.

For that reason, three different scenarios in German text classification have been considered: (1) classification of economic news, (2) classification of legal norms and regulations, and (3) classification of tweets.

Glaser, I., Sadegharmaki, S., Komboz, B. and Matthes, F.
Data Scarcity: Methods to Improve the Quality of Text Classification.
DOI: 10.5220/0010268005560564
In Proceedings of the 10th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2021), pages 556-564
ISBN: 978-989-758-486-2

The remainder of the paper is structured as follows: Section 2 provides a short overview of the related work. The experimental setup, along with the datasets used, is discussed in Section 3. The approaches and their performance are evaluated in Section 4, before Section 5 closes with a conclusion and outlook.

2 RELATED WORK

Data scarcity is one of the most important obstacles in many research areas involving SL, and in particular in real-world problems. A vast number of approaches to overcome this hurdle exist. These can be divided into the categories (1) SSL, (2) data augmentation, (3) multi-instance learning, and (4) transfer learning.

Each one of them addresses a speciﬁc problem.

SSL techniques affect the algorithm directly by en-

abling it to consume unlabeled data as well as labeled

data. Data augmentation, on the other hand, trans-

forms and expands the data even before feeding it

to the algorithm. Multi-instance learning enables the

utilization of labels for a bag of instances instead of

each one separately. Transfer learning can apply the

knowledge in another domain with enough samples

to process the domain with less training data. As already briefly touched upon in the introduction, this paper focuses on traditional ML approaches and thus only investigates SSL and data augmentation.

The following sections describe relevant related

work.

2.1 Semi-supervised Learning

Through SSL, both labeled and unlabeled data are fed to the learning algorithm. The main idea behind it is that, despite scarce labeled data, a large amount of unlabeled data is available for many applications (Zhu and Goldberg, 2009). In the following, the most popular techniques in the SSL paradigm and their applications in NLP are discussed.

2.1.1 Self-training

ST is the most common technique in SSL. In this approach, a prediction model is first learned on the available labeled data. The model is then used to predict labels for the unlabeled data. These pseudo-labeled data, alongside the original labeled ones, are later fed to a new model for retraining. If the second model is different from the base one, the approach is also called co-training (Zhu, 2005). The various self-training approaches differ in how they select these pseudo-labeled data. In text classification, for instance, Pavlinek and Podgorelec (2017) applied a threshold on the results to keep only the more confident labels for the next round of training.
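One round of self-training with a confidence threshold can be sketched as follows. This is a minimal illustration and not the implementation used in this paper: the toy one-dimensional nearest-centroid classifier, its confidence measure, and the names `fit_centroids`, `predict_with_confidence`, and `self_train` are assumptions made for the example.

```python
def fit_centroids(samples, labels):
    """Toy base model: one centroid (mean value) per class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_with_confidence(centroids, x):
    """Return (label, confidence); for two classes the confidence is in [0.5, 1]."""
    dists = {y: abs(x - c) for y, c in centroids.items()}
    best = min(dists, key=dists.get)
    total = sum(dists.values())
    confidence = 1.0 if total == 0 else 1.0 - dists[best] / total
    return best, confidence


def self_train(samples, labels, unlabeled, threshold=0.8):
    """Pseudo-label the unlabeled data with a base model and retrain."""
    base = fit_centroids(samples, labels)
    new_samples, new_labels = list(samples), list(labels)
    for x in unlabeled:
        label, confidence = predict_with_confidence(base, x)
        if confidence >= threshold:  # keep only confident pseudo-labels
            new_samples.append(x)
            new_labels.append(label)
    retrained = fit_centroids(new_samples, new_labels)
    return retrained, new_samples, new_labels
```

Calling `self_train([0.0, 10.0], ["neg", "pos"], [1.0, 9.0, 5.0])` keeps the two confident points and discards the ambiguous point 5.0, which lies exactly between the two centroids.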

2.1.2 Label Propagation

Among SSL methods, graph-based approaches have recently gained popularity because of their scalability, albeit at the cost of higher complexity (Sawant and Prabukumar, 2018). In graph-based SSL, labeled and unlabeled data are represented as vertices in a weighted graph, with edge weights encoding the similarity between instances (Zhu et al., 2003). Labeling is done by smooth regularization of these weights in a process called LP.

The graph is constructed in two steps: (1) the adjacency matrix is constructed based on the k-nearest neighbors with radius ε, and (2) the weight of each edge is calculated by a similarity function such as the Gaussian kernel or the inverse Euclidean distance. In the next step, the classification problem can be represented as an optimization over the normalized graph Laplacian (Zhou et al., 2004).
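The iteration of Zhou et al. (2004) can be sketched in a few lines of plain Python: Gaussian edge weights W without self-loops, the symmetric normalization S = D^(-1/2) W D^(-1/2), and the update F ← αSF + (1 − α)Y. As a simplifying assumption, the sketch uses a fully connected graph instead of a k-nearest-neighbor graph; the function name and default values are chosen for illustration only.

```python
import math


def label_spreading(points, labels, n_classes, gamma=20.0, alpha=0.2, n_iter=50):
    """Spread labels over a fully connected similarity graph.

    points: list of feature vectors; labels: class index, or -1 if unlabeled.
    Assumes every point has at least one neighbor with non-zero weight.
    """
    n = len(points)
    # Gaussian (RBF) edge weights, no self-loops.
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                w[i][j] = math.exp(-gamma * sq_dist)
    # Symmetric normalization S = D^(-1/2) W D^(-1/2).
    degree = [sum(row) for row in w]
    s = [[w[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
         for i in range(n)]
    # Y holds the one-hot initial labels; rows of unlabeled points are zero.
    y = [[1.0 if labels[i] == c else 0.0 for c in range(n_classes)]
         for i in range(n)]
    f = [row[:] for row in y]
    for _ in range(n_iter):
        f = [[alpha * sum(s[i][k] * f[k][c] for k in range(n))
              + (1 - alpha) * y[i][c]
              for c in range(n_classes)]
             for i in range(n)]
    # Each point receives the class with the highest score.
    return [max(range(n_classes), key=lambda c: f[i][c]) for i in range(n)]
```

With two well-separated groups of points and one labeled point per group, every unlabeled point adopts the label of its own group.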

2.1.3 Semi-supervised Learning in Text

Classiﬁcation

Text classiﬁcation is the task of assigning a cat-

egory to a sentence or document. These cate-

gories vary over many applications such as automatic

email reply (Kannan et al., 2016), news classiﬁca-

tion (Howard and Ruder, 2018), question answer-

ing (Cer et al., 2018), or sequence modeling (Clark

et al., 2018) among others.

SSL has been the subject of many novel pieces of research in NLP. ST, for instance, has been applied widely in language modeling techniques such as part-of-speech tagging and parsing (McClosky et al., 2006). Besides, Pavlinek and Podgorelec (2017) applied ST to increase the training data size, which improved the performance of text classification. However, some papers questioned whether self-training is helpful, as errors are amplified in each iteration (Clark et al., 2003).

On the other hand, SSL has been part of state-of-the-art classifiers in different applications. Johnson and Zhang (2016) exploited unlabeled data to categorize texts by deriving region embeddings from an LSTM network. LP has also been shown to be effective in sentiment analysis (Yang and Shafiq, 2018). Moreover, Google's smart reply project takes advantage of LP for automatic email replies (Kannan et al., 2016). Another application is the classification of legal data. Waltl et al. (2017) applied active machine learning (AML) to legal norm classification, and Savelka et al. (2015) utilized AML for the analysis of statutes.

2.2 Data Augmentation

Data augmentation techniques have addressed the

problem of a lack of labeled data as well. The ter-

minology comes originally from image processing,

where more data can be crafted by adding noise

or transforming existing images (Perez and Wang,

2017).

To adapt this definition to text: given a text or sentence, a variation of the text is created without affecting its meaning. The first hurdle is that the meaning of a text is rather subjective and therefore hard to model. Hence, this technique has not been applied in NLP as extensively as in image or signal processing. However, there have been some breakthroughs recently, such as Wang and Yang (2015), who proposed a novel data augmentation approach.

The ideal way of varying a text can be paraphras-

ing, but it is a labor-intensive task. One alternative is

replacing the words with synonyms or similar words,

either using a thesaurus (Zhang and LeCun, 2015) or

embeddings (Miyato et al., 2016).

Sun and He (2018) introduced multi-granular data augmentation for sentiment analysis by incorporating synonyms and word vectors at the word level as well as some transformations at the phrase and sentence level. The synonyms are often derived from a thesaurus such as WordNet (Fellbaum, 2010). Unlike WordNet, the German version, GermaNet (Hamp and Feldweg, 1997; Henrich and Hinrichs, 2010), is not open-source and requires a license. Zhang and LeCun (2015) introduced a random selection algorithm for replacing words with synonyms. Another discussion of data augmentation for NLP was provided most recently by Wei and Zou (2019).

3 EXPERIMENTAL SETUP

3.1 Objective

As already briefly touched upon in the introduction, we assume that the effect of methods to overcome data scarcity depends on the respective dataset. Therefore, we utilize three different datasets with varying characteristics. However, when talking about data scarcity, two different problems can be distinguished: (1) the label problem, and (2) the data problem. In the former case, enough data exists and only the labels are missing, whereas in the latter case there are not even sufficient data instances. This paper investigates methods to overcome both problems on different datasets.

Table 1: Distribution of labels in the LN dataset.

Semantic type   Occurrences   Rel. occurrence (%)
Duty            117           19
Indemnity       8             1
Permission      148           25
Prohibition     18            3
Objection       98            16
Continuation    21            3
Consequence     117           19
Definition      18            3
Reference       56            9

3.2 Data

As mentioned, three diverse datasets were used to show that the effects of the examined methods cannot be generalized. The remainder of this subsection describes the data used.

3.2.1 Legal Norms (LN)

This dataset was introduced in (Glaser et al., 2018). It contains 601 sentences of German tenancy law, which were manually labeled according to a taxonomy of 9 semantic types. Table 1 shows the distribution of the different semantic types. For more information about the legal definition of these semantic types, see (Waltl et al., 2019).

This dataset serves as a representative of formal and technical German sentences. In other words, exactness in the meaning of technical words plays an essential role in this case. Moreover, the classification of semantic types in the LN dataset is a multi-class problem, while the next two datasets represent binary classification.

Pre-processing of LN. In terms of pre-processing, words were lemmatized after their part-of-speech tags had been extracted using the spaCy library (Honnibal and Montani, 2017). For example, the word "booked" is converted to the token "book_v". In the next step, the documents were transformed into numeric vectors. There are many techniques in this regard, such as TF-IDF, term frequency, binary frequency, or different embeddings. The model achieved the best result with a binary vectorizer, so this setup was used for further experiments.

Table 2: Distribution of labels in the NB dataset.

Label          Occurrences   Rel. occurrence (%)
Critical       282           12
Non-critical   1996          88

3.2.2 GermEval18: Offensive Tweets (GE18)

To analyze the effect of different augmentation methods in the social network context, we employed the GermEval-2018 dataset (Wiegand et al., 2018). It is a publicly available dataset of German tweets with a binary label indicating whether the tweet contains offensive content. The authors offer two label sets for this purpose, a coarse-grained and a more fine-grained one. In this paper, the coarse-grained label set was chosen. The dataset includes 5,009 tweets, of which 1,688 are labeled as offensive. Due to its nature, this dataset contains informal short texts, in contrast to the formal content of the LN dataset.

Pre-processing of GE18. Rule-based approaches were used to remove or replace superfluous elements such as special characters, hashtags, so-called mentions, and links. Afterward, the same pre-processing steps as described above were applied.

3.2.3 News Bulletin (NB)

This is a private dataset provided by a big German insurance company, which contains 2,278 news items regarding the German economy and industry. The dataset was manually labeled by experts according to whether an item contains critically important information for the company or not. Importance is subject to different criteria such as target company, industry, and any other signals affecting the market of the companies insured by the insurance company. The frequency of labels is shown in Table 2.

Creating a model to extract critical news saves experts in the insurance industry cost and time and reduces the risk of missing crucial pieces of information. Moreover, this dataset is particularly interesting for the present research, because news involves long texts which are mostly edited and formalized in a standardized way. Hence, this dataset provides the opportunity to evaluate the methods of this paper on longer texts, too.

Preprocessing of NB. After the removal of special

characters as well as links, the methods from the LN

dataset were applied.

3.3 Experiments

We implemented all the experiments in this research in Python with scikit-learn (Pedregosa et al., 2011). The code will be published on GitHub.

3.3.1 Effect of Graph-based SSL

The first experiment aims to investigate how well graph-based SSL performs compared to classic SL methods on textual data. The experiment was designed with different training sizes and a constant test size.

LP with regularization is implemented in scikit-learn by a class named LabelSpreading. This implementation follows the work of (Zhou et al., 2004), which suggested an affinity matrix based on the normalized graph Laplacian and soft clamping across the labels. We tuned two parameters for each dataset: (1) the parameter of the RBF kernel, which defines how spread out the decision region is (gamma), and (2) the clamping factor, which configures the label propagation and is the relative amount of information that an instance should adopt from its neighbors as opposed to its original label (alpha).
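The two parameters map directly onto the constructor of scikit-learn's `LabelSpreading`. The following sketch runs it on invented toy data, not on one of the datasets of this paper; the values of `gamma` and `alpha` are taken from the GermEval18 column of Table 3 purely for illustration.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

# Toy data (assumption): two well-separated groups; -1 marks unlabeled instances.
X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y = np.array([0, -1, -1, 1, -1, -1])

# gamma: width parameter of the RBF kernel.
# alpha: share of information an instance adopts from its neighbors
#        as opposed to keeping its original label.
model = LabelSpreading(kernel="rbf", gamma=20, alpha=0.2)
model.fit(X, y)

# transduction_ holds the label assigned to every instance, labeled or not.
print(model.transduction_)
```

In a real experiment, X would be the vectorized documents, and gamma and alpha would be tuned per dataset, for example with a grid search.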

3.3.2 Effect of Self-training SSL

For the second experiment, we investigated another

approach in SSL, called ST. The goal of this exper-

iment is to examine how much ST can compensate

for the lack of training data in the textual context.

To achieve this goal, each dataset was divided into

three parts: (1) constant-size test set, (2) constant-

size training set, and (3) variable-size augmented set

(pseudo-labeled).

After the pre-processing steps described in Section 3.2, we implemented the self-training framework by means of scikit-learn models. A custom k-fold validator was required to fit the implemented framework. Therefore, instead of employing a built-in cross validator, the evaluation is repeated five times, and the average $F_1$ score is reported as the performance of the model. Moreover, a threshold was introduced so that only pseudo-labels with a confidence above it are kept. The value of the threshold was tuned during the training process.

As a basis for comparison, we first evaluated the model without any unlabeled data. Then we evaluated the highest possible cut-off: we fit the model on both the augmented set and the training set with their correct labels, so that the error of the first model is excluded. In other words, an ST model cannot achieve a better result than this cut-off.

In the next step, we incrementally increased the number of unlabeled data fed to the next model. We investigated the hypothesis that a larger number of unlabeled data increases the classification performance. Moreover, the effect of the threshold on the performance is reported.

3.3.3 Effect of Thesaurus-based Data

Augmentation: Synonyms and Hypernyms

As mentioned in the previous section, data augmentation originates from the area of image processing, where, to create a more generalized model, different variations of an image, e.g. pictures of an object from different angles, are added to the training set.

During text augmentation, for each training sam-

ple, the different variations are created by the replace-

ment of words with their synonyms or hypernyms.

The goal is to expand the training dataset to catch sim-

ilar words around a topic. To achieve this goal, XML-

based German synsets provided by GermaNet (Hamp

and Feldweg, 1997) are employed. However, due to

the licensing situation, it could not be applied to the

NB dataset. Therefore, this experiment is tested and

reported only on the two remaining datasets.

To better understand the effect of thesaurus-based

data augmentation, the following example is consid-

ered:

The weather is nice, labeled as +.

The weather is awful, labeled as -.

Assuming a classiﬁcation model has been trained

with the sentences above, the polarity of the following

sentence shall be predicted:

The weather is decent.

For simplification, let us assume the binary classification is determined by the cosine similarity between the binary vectors. In that case, the word "decent", which is not part of the training vocabulary, is ignored by the binary vectorizer. As a result, the similarities computed by the model are uninformative:

cos_similarity(sent1, unseen) = 0.86
cos_similarity(sent2, unseen) = 0.86
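The tie can be reproduced with a few lines of plain Python. The helper names `binary_vector` and `cos_similarity` are chosen for this sketch and do not refer to the code used in the experiments; tokenization by whitespace is a simplifying assumption.

```python
import math


def binary_vector(sentence, vocabulary):
    """1 if the vocabulary word occurs in the sentence, else 0."""
    tokens = set(sentence.lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]


def cos_similarity(u, v):
    """Cosine similarity of two equally long numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


train = ["The weather is nice", "The weather is awful"]
unseen = "The weather is decent"

# The vocabulary is built from the training sentences only,
# so "decent" has no dimension and is dropped from the unseen sentence.
vocabulary = sorted({w for s in train for w in s.lower().split()})

sent1, sent2 = (binary_vector(s, vocabulary) for s in train)
query = binary_vector(unseen, vocabulary)

print(cos_similarity(sent1, query))  # about 0.866
print(cos_similarity(sent2, query))  # about 0.866
```

Both similarities evaluate to 3 / (2 · √3) ≈ 0.866, so the unseen sentence is equally close to the positive and the negative training sentence.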

Using a synonym thesaurus, the training set can be augmented to include "decent" as a synonym of "nice". Most studies replace words with synonyms randomly and add the new documents to the training set. However, in the following experiment, all alternatives were compared. For better understanding, the simple example is expanded: if "decent" is a synonym of "nice" and the adjective "bad" is a synonym of "awful", we get different possibilities for data augmentation. The remainder of this section describes these.

Horizontal Augmentation by Synonyms. In this

case, the number of training data is consistent while

the extra words are concatenated to the sentences. No-

tably, in this case, the unseen or test data should also

be transformed. For example:

The weather is nice decent, +

The weather is awful bad, -

The weather is decent nice, ?

Consequently, the unlabeled sentence moves to-

ward the correct label:

cos_similarity(sent1, unseen) = 1
cos_similarity(sent2, unseen) = 0.6
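A minimal sketch of horizontal augmentation reproduces these two values. The toy thesaurus `SYNONYMS` and the function names are assumptions for illustration; in the experiments, the synonyms come from GermaNet.

```python
import math

# Hypothetical toy thesaurus standing in for GermaNet.
SYNONYMS = {"nice": ["decent"], "decent": ["nice"],
            "awful": ["bad"], "bad": ["awful"]}


def augment_horizontally(sentence, synonyms):
    """Append the synonyms of each word directly after the word."""
    out = []
    for word in sentence.split():
        out.append(word)
        out.extend(synonyms.get(word.lower(), []))
    return " ".join(out)


def cos_similarity(a, b):
    """Cosine similarity of the binary bag-of-words vectors of two sentences.

    For binary vectors this equals |A ∩ B| / sqrt(|A| * |B|), assuming that
    all words of both sentences are part of the vocabulary.
    """
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    return len(set_a & set_b) / math.sqrt(len(set_a) * len(set_b))


sent1 = augment_horizontally("The weather is nice", SYNONYMS)
sent2 = augment_horizontally("The weather is awful", SYNONYMS)
unseen = augment_horizontally("The weather is decent", SYNONYMS)

print(sent1)   # The weather is nice decent
print(unseen)  # The weather is decent nice
print(cos_similarity(sent1, unseen), cos_similarity(sent2, unseen))
```

Because the unseen sentence is transformed with the same thesaurus, it shares all five words with the first training sentence and only three with the second.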

It is essential to consider that the feature space is increased. In reality, words have more than one synonym. This fact can degrade the similarity of close sentences. Moreover, synonym relations are not transitive. As an example, in WordNet, the word "nice" is a synonym of "decent" and "decent" is a synonym of "clean". However, "nice" does not count as a synonym of "clean". This fact can lead to irrelevant words being concatenated. The following sentences are created by synonyms extracted from WordNet:

The weather is nice decent good pleasant, +

The weather is awful bad, -

The weather is decent nice adequate modest, ?

This example shows that adding too many irrelevant words can have side effects on the similarity function:

cos_similarity(sent1, unseen) = 0.72
cos_similarity(sent2, unseen) = 0.54

Besides, the increase in dimensionality can affect the assumptions made in ML algorithms, which should therefore be revisited. In the next section, we show how this method performs on the mentioned datasets.

Vertical Augmentation by Synonyms. As the second alternative, the original document remains intact, and combinations of synonyms for the words in the document are added to the training data as new documents. This approach is closer to the original concept of data augmentation.

The weather is nice, +

The weather is decent, +

The weather is awful, -

The weather is bad, -

The weather is decent, ?

Mainly, it generalizes the training data and is able to catch similar words. However, this approach has a significant downside when it comes to text processing. TF-IDF is a common technique for vectorizing text. The document frequency (DF) in the denominator down-weights words that are repeated across different documents. Vertical augmentation therefore affects TF-IDF dramatically. In the above sample, for instance, "The weather is" is more likely to be degraded in the final vector. This is one of the reasons we utilized a binary vectorizer instead of TF-IDF.

Another important observation is that, in this case, the unlabeled data is not transformed. Still, the most critical challenge of this approach is that combining all synonyms of the words implies a vast number of variations. Following the work of (Zhang and LeCun, 2015), we introduced a parameter n_random, which selects a specific number of variations.

Let us assume each document has $|W|$ words and each word has on average $N_{syn}$ synonyms. Hence, there are $|W| \cdot N_{syn}$ varieties for each document. Therefore, n_random of these combinations are selected. In the next step, the label of the original sentence is assigned to this augmented set.

The last consequence is that we cannot easily apply cross validators to the augmented training dataset, because an augmented version of a training document should not appear in the test set. Otherwise, it results in over-fitting.
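Vertical augmentation with a fixed number of sampled variants can be sketched as follows. The sketch rests on two assumptions that are not taken from the original implementation: each variant replaces exactly one word, and the data is split into training and test set before augmentation so that no variant of a training document can reach the test set. The thesaurus and function names are hypothetical.

```python
import random

# Hypothetical toy thesaurus standing in for GermaNet.
SYNONYMS = {"nice": ["decent", "pleasant"], "awful": ["bad"]}


def augment_vertically(sentence, label, synonyms, n_random, rng):
    """Return up to n_random labeled variants, each with one word replaced."""
    words = sentence.split()
    variants = []
    for i, word in enumerate(words):
        for syn in synonyms.get(word.lower(), []):
            variants.append(" ".join(words[:i] + [syn] + words[i + 1:]))
    chosen = rng.sample(variants, min(n_random, len(variants)))
    # Every variant inherits the label of the original sentence.
    return [(variant, label) for variant in chosen]


def build_training_set(train_docs, synonyms, n_random, seed=0):
    """Augment only the training documents; test documents stay untouched."""
    rng = random.Random(seed)
    augmented = list(train_docs)
    for sentence, label in train_docs:
        augmented.extend(
            augment_vertically(sentence, label, synonyms, n_random, rng))
    return augmented
```

For the two training sentences of the running example and `n_random=1`, the training set grows from two to four documents, while the original documents remain unchanged at the front of the list.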

Generalization with Hypernyms. Another possibility to transform the text data is the utilization of hypernyms. GermaNet offers a structure for hypernyms similar to that for synonyms. Instead of sharing a meaning, hypernyms represent the generalization or abstract version of a word. As an example, the word "color" is the hypernym of the words "red", "blue", and "green". In the previous example, assuming pos_adj is the hypernym of "nice" and neg_adj the hypernym of "awful", the transformed data will look like:

The weather is pos_adj, +

The weather is neg_adj, -

The weather is pos_adj, ?
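The transformation in this example amounts to a dictionary lookup per word. In the following sketch, `HYPERNYMS` is a hypothetical lookup table with the placeholder categories from the example above, not a set of real GermaNet entries.

```python
# Hypothetical hypernym lookup; pos_adj and neg_adj are placeholder categories.
HYPERNYMS = {"nice": "pos_adj", "decent": "pos_adj",
             "awful": "neg_adj", "bad": "neg_adj"}


def generalize(sentence, hypernyms):
    """Replace every word that has a known hypernym with that hypernym."""
    return " ".join(hypernyms.get(w.lower(), w) for w in sentence.split())


print(generalize("The weather is nice", HYPERNYMS))    # The weather is pos_adj
print(generalize("The weather is decent", HYPERNYMS))  # The weather is pos_adj
```

Training and unseen sentences pass through the same function, so "nice" and "decent" end up in the same feature dimension.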

The disadvantages of the previous approaches are less relevant here. Yet, hypernyms are not transitive and therefore can increase the chance of adding irrelevant words. Compared to synonyms, however, fewer words are added to the feature space.

For the implementation, we developed a transformer that adds the respective words to the original data. The transformer searches the synset structure in GermaNet and finds the most probable synset of the word, from which synonyms and hypernyms are extracted. Moreover, in the case of vertical augmentation, it selects n variations of sentences by randomly combining synonyms and hypernyms. For the extraction of synonyms, we repeated this process to find synonyms of synonyms and thereby extend the possible alternatives. This increases the chance that two similar words have enough common synonyms. However, there is a trade-off between introducing irrelevant words and increasing the number of common synonyms between two words. Finally, as in the previous approaches, the text documents are transformed into binary vectors.

4 EVALUATION

4.1 Results of Graph-based SSL

To evaluate the performance of classiﬁers, a 5-fold

cross validator was employed, while the data was

shufﬂed and then split by a stratiﬁed method to en-

sure the ratio of the labels is intact.
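The idea of a shuffled, stratified split can be illustrated in plain Python. This is a simplified sketch and not the scikit-learn cross validator; the function name and the round-robin assignment are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=5, seed=0):
    """Shuffle the indices and deal each class round-robin into n_splits folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds
```

Each fold serves once as the test set while the remaining folds form the training set. For ten instances of one class and five of another, every fold receives two instances of the first and one of the second class, so the label ratio is preserved.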

With the experiment designed as described in Section 3, the results are shown in Figure 1. For each dataset, the results of LP are compared to a supervised method, either linear support vector classification or logistic regression, as implemented in scikit-learn. The tuned parameters for each dataset are shown in Table 3. Logistic regression was chosen for the NB dataset to be comparable with previous results that were achieved internally on the NB dataset. Nevertheless, logistic regression and linear support vector classification share a similar optimization function, which yields similar hyperplanes as the solution, and therefore our results are comparable.

The results show that both SSL and SL performance improve as the training size increases. Moreover, at some point, additional data does not add any information to the classification problem, and therefore the performance reaches a ceiling. Comparing SSL and SL, the results do not show a clear superiority of LP over the linear models. Only on NB does it show a marginal increase in performance. Moreover, we observed that the LP technique is sensitive to the configuration of its parameters, unlike the linear models.

Table 3: Comparison of $F_1$ between SSL and SL.

Labeled (%)    NB                  LN                  GermEval18
               LogReg    LP        L-SVC     LP        L-SVC     LP
100            0.75      0.7       0.81      0.73      0.71      0.59
50             0.67      0.63      0.75      0.63      0.7       0.58
25             0.56      0.58      0.69      0.51      0.66      0.59
12.5           0.4       0.58      0.58      0.49      0.64      0.56
Tuned params   C=35.38   γ=30      C=1       γ=10      C=1       γ=20
                         α=0.7               α=0.2               α=0.2

Figure 1: Effect of SSL LP by increasing the training size.

4.2 Results of Self-training

Figure 2 compares the performance of ST on the different datasets for varying numbers of unlabeled data. The solid yellow line shows the performance of the model without considering any unlabeled data. The dashed yellow lines show the maximum performance that the model can reach assuming all data is labeled. The right axis and the lines show the performance of self-training, while the left axis and the bars show the gradual increase of unlabeled data while the labeled base set remains unchanged.

The results show that, in the presence of a threshold, ST boosts the performance. This is a very positive result, as a large number of unlabeled data is usually available in different applications, which can now be used for training as well. Interestingly, ST improves the performance on each dataset.

4.3 Results of Data Augmentation

Table 4 shows the performance of text classification on two datasets, LN and GE18, for the different augmentation techniques. The horizontal technique using synonyms performed poorly in comparison to the other techniques and decreased the performance. This was expected, as discussed in the previous section. It must be noted that in the horizontal synonym method, unlike the horizontal hypernym method, new data is not transformed. For hypernyms, however, we have to transform the new sets, as the categories are not necessarily meaningful word units.

The other techniques, on the other hand, could slightly increase the performance on the LN dataset. However, their results are close to each other, and they do not show any improvement for GE18. One explanation is that GE18 is a larger dataset. Furthermore, the data in social networks is rather diverse and informal, whereas the news data as well as legal norms constitute more formal data.

Table 4: Effect of different methods of data augmentation on the $F_1$ in text classification.

DA                     LN      GermEval18
total # data           601     5009
training fraction      0.9     0.8
original               0.819   0.736
syn. horizontal        0.799   0.734
hypernym horizontal    0.822   0.734
syn. vert. random 5    0.841   0.728
syn. vert. random 10   0.837   0.726

5 CONCLUSION & OUTLOOK

This work examined the effects of methods to improve the quality of text classification despite a lack of data. We divided the issue into two distinct problems: (1) the label problem, and (2) the data problem. For the former problem, we investigated the application of SSL on different datasets. The latter problem was tackled by means of data augmentation.

We could show that LP, although promising, cannot clearly improve the performance on any of the datasets. This method is very sensitive to its parameters and to noise compared to classical linear models. Besides, we showed that ST with a threshold can increase the performance and enables the model to take advantage of a vast number of unlabeled data. On the other hand, a self- or co-training method without a threshold undoubtedly has a negative impact.

Figure 2: Effect of ST on the three datasets.

Thesaurus-based data augmentation creates new variants of documents by replacing words with their synonyms or hypernyms. Our experiments revealed that data augmentation is useful only in formal contexts. Furthermore, the experiment with horizontal data augmentation shows that it introduced more irrelevant data, which had a negative impact on the classification. Finally, we could show that text augmentation with both synonyms and hypernyms can slightly improve the classification performance. However, the parameters must be tuned specifically for each application and dataset. It is also essential to note that vertical data augmentation affects the vectorization technique. TF-IDF, for instance, has an adverse effect on words that do not have a synonym. Hence, the augmentation techniques should be applied with binary vectorization.
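The interaction between augmentation and TF-IDF can be sketched as follows. This is one plausible reading, not the experimental pipeline: it assumes that vertical augmentation adds one synonym-replaced copy of every document as a new training row, it uses a hypothetical toy thesaurus in place of a resource such as GermaNet, and it uses the plain IDF formula log(N / df).

```python
import math
import random

# Hypothetical toy thesaurus (illustration only).
THESAURUS = {
    "contract": ["agreement"],
    "tenant": ["lessee"],
    "terminate": ["end", "cancel"],
}

def augment(tokens, thesaurus, rng):
    """Replace every token that has synonyms with a randomly chosen synonym."""
    return [rng.choice(thesaurus[t]) if t in thesaurus else t for t in tokens]

def idf(corpus):
    """Plain inverse document frequency: log(N / df)."""
    n = len(corpus)
    vocab = {t for doc in corpus for t in doc}
    return {t: math.log(n / sum(t in doc for doc in corpus)) for t in vocab}

def binary_vector(tokens, vocab):
    """1 if the word occurs in the document, else 0; independent of corpus statistics."""
    present = set(tokens)
    return [1 if t in present else 0 for t in vocab]

rng = random.Random(0)
docs = [
    "the tenant may terminate the contract".split(),
    "the contract binds the tenant".split(),
    "the landlord may refuse".split(),
]

# Vertical augmentation (as assumed here): add one variant per document as a new row.
augmented = docs + [augment(d, THESAURUS, rng) for d in docs]

idf_before = idf(docs)
idf_after = idf(augmented)

# "contract" has a synonym: it is absent from the variants, so its IDF rises.
# "may" has no synonym: it is copied into the variants, so its IDF stays the same.
print(round(idf_before["contract"], 3), round(idf_after["contract"], 3))
print(round(idf_before["may"], 3), round(idf_after["may"], 3))

vocab = sorted({t for doc in augmented for t in doc})
vec = binary_vector(docs[0], vocab)
```

Under these assumptions, words without a synonym lose weight relative to words that were replaced, although their own usage did not change. A binary representation records only presence and is therefore not shifted in this way.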

Last but not least, the varying results observed during this work confirm the initial hypotheses: the effectiveness of methods to overcome data scarcity depends strongly on the characteristics of the dataset used.


ICPRAM 2021 - 10th International Conference on Pattern Recognition Applications and Methods
