Improving Mental Health using Machine Learning to Assist Humans in
the Moderation of Forum Posts
Dong Wang¹, Julie Weeds¹ and Ian Comley²
¹Department of Informatics, University of Sussex, Brighton, BN1 9RH, U.K.
²MeeTwo Education Ltd, 17 Princelet Street, London, E1 6QH, U.K.
Keywords:
Machine Learning, Natural Language Processing, Mental Health, Online Forum Moderation,
Data Augmentation, BERT, LSTM.
Abstract:
This work investigates the potential for the application of machine learning and natural language processing
technology in an online application designed to help teenagers talk about their mental health issues. Specifi-
cally, we investigate whether automatic classification methods can be applied with sufficient accuracy to assist
humans in the moderation of posts and replies to an online forum. Using real data from an existing application,
we outline the specific problems of lack of data, class imbalance and multiple rejection reasons. We inves-
tigate a number of machine learning architectures including a state-of-the-art transfer learning architecture,
BERT, which has performed well elsewhere despite limited training data, due to its use of pre-training on a
very large general corpus. Evaluating on real data, we demonstrate that further large performance gains can be
made through the use of automatic data augmentation techniques (synonym replacement, synonym insertion,
random swap and random deletion). Using a combination of data augmentation and transfer learning, perfor-
mance of the automatic classification rivals human performance at the task, thus demonstrating the feasibility
of deploying these techniques in a live system.
1 INTRODUCTION
Mental health problems are now very common in the
UK. Approximately 1 in 4 people in the UK will expe-
rience a mental health problem each year (McManus
et al., 2009). Further, in England alone, 1 in 6 people
report experiencing a common mental health problem
(such as anxiety and depression) in any given week
(McManus et al., 2016). This pressing need to help
people with mental health problems has given rise to
the growing number of initiatives and organizations
working in the area. MeeTwo Education is a social
enterprise which has, since 2016, been working to re-
duce the number of mental health issues experienced
by young people. Their strategy is to provide an on-
line app where young people can safely share prob-
lems and receive advice and support from both pro-
fessionals and peers.
Users in the MeeTwo scenario can be, due to their age and the personal nature of their posts, very vulnerable. Thus it is essential that they are protected from
the potential negative impacts of un-moderated posts
to this online forum. For example, the forum pro-
hibits posts containing offensive language or cyber-
bullying as well as posts containing personal infor-
mation, since these could let other users identify an
individual, posing dangers to them in the non-virtual
world. Furthermore, posts which indicate that a user
is in danger, for example potentially suicidal, are redi-
rected to a trained professional rather than being left
to peers. Currently, all posts to the MeeTwo forum
are moderated by trained human moderators. But, as
the number of users and associated posts increases,
so does the workload for moderators. This increases
the cost to run the service and potentially, the delay
between a post being made by a user and it appear-
ing online, which has a negative impact on the ex-
perience of users. We posit that whilst a fully auto-
mated moderation system is unlikely to be able to deal
with complex edge cases, there is scope for employ-
ing a semi-automated moderation system which pre-
labels posts before they are presented to human mod-
erators. Such a semi-automated system could greatly
reduce the workload of human moderators, accelerate
the moderation process, avoid some low-level human
errors and ultimately enhance the experience of users.
The potential benefits of semi-automated moder-
ation are not limited to the specific use-case consid-
ered so far. More generally, posts and comments on any website, online forum or social media platform may contain improper content. Following on from earlier research with similar results, a survey of over 4000 Americans in 2017 found that roughly 40% of respondents had personally experienced some form of online harassment (from offensive name-calling to stalking and sexual harassment) and over 60% of them considered it to be
a serious problem (Duggan, 2017). Many companies
and non-profit organisations only have a limited bud-
get for hiring human moderators and, thus, a semi-
automated moderation model may be useful. With an
automated system pre-reviewing the content, a human
is only required to check cases which the system can-
not place with high probability in either “accepted” or
“rejected” categories. In the long term, the accuracy of such a system may become equal to or even better than that of a human moderator, who may be prone to lapses of concentration or inconsistency. In this case, the system might be viewed as an automated moderation model which is able to directly ‘accept’ or ‘reject’ posts. In either case, the workload of human moderators is decreased and the user experience is improved.
Here, specifically, we explore the feasibility of
building a semi-automated moderation model that is
able to reduce the moderation workload and accel-
erate the moderation speed in the MeeTwo scenario.
To ensure high accuracy, the aim is to assist human
moderators, rather than replace them, in moderat-
ing the posts and replies on the MeeTwo platform.
Working in a real-world scenario means that there
are challenges not always present in academic stud-
ies using large datasets which are carefully controlled
and balanced. The dataset provided by MeeTwo has
been collected over two years of operation and is of
medium-size, containing around 22K labelled posts,
where the average length of post is 45 words. The
dataset is also very unbalanced with regard to the
numbers of accepted and rejected posts. Only a mi-
nority of posts, less than 10%, have the label “reject”.
The lack of “reject” data makes it more challenging
to reliably train machine learning models which can
generalise to unseen data.
Another challenge in studying negative online be-
haviour is the myriad of forms it can take and the lack
of a clear, common definition (Saleem et al., 2017).
Some previous work has explored different forms
of online behaviour including hate speech (Gagliar-
done et al., 2015), online harassment (Cheng et al.,
2015), and cyber-bullying (Pieschl et al., 2015). In
the real-world dataset explored here, there are mul-
tiple reasons for posts being labelled with “reject”.
This makes it much harder to build a general model
which can successfully make a binary classification
decision. Here, we investigate building multiple clas-
sifiers for the most commonly occurring rejection cat-
egories. However, breaking down the categories in
this way further exacerbates the problems caused by
a lack of data.
Various machine learning (ML) researchers (Shri-
vastava et al., 2017; Park et al., 2019) have demon-
strated the effectiveness of data augmentation tech-
niques in increasing the amount of training data avail-
able and, thus, the generalizability of the models
learnt. As will be discussed further in Section 2,
researchers in the area of Natural Language Pro-
cessing (NLP) have also recently started looking for
ways to augment linguistic datasets. Most recently,
transfer learning models, e.g., BERT (Devlin et al.,
2018), have been developed where a general language
model can be trained on a very large unlabelled cor-
pus and then leveraged in domain-specific scenarios
with small amounts of labelled training data. Here,
we investigate the applicability of both data augmen-
tation methods and transfer learning methods in a
real-world scenario where accuracy and user trust are
paramount. We evaluate the effectiveness of these
techniques, comparing with more conventional NLP
technology: 1) Logistic Regression (LR) classifier
applied to Term Frequency - Inverse Document Fre-
quency (TF-IDF) document representations, and 2)
Long Short Term Memory (LSTM) networks applied
to general-purpose word embeddings.
The contributions of this research are as follows.
We demonstrate that NLP and ML technology has
come of age and can now be successfully deployed in
challenging real-world scenarios in the area of Health
Informatics. More specifically, we show that a com-
bination of training data augmentation and transfer
learning methods can yield highly accurate classifi-
cation models despite small and unbalanced datasets.
Furthermore, we show that data augmentation tech-
niques that insert and replace synonyms which have
been discovered automatically from corpora outper-
form dictionary-based techniques. Accuracy of our
models in classifying previously unseen posts, across
4 different rejection reason categories, is close to hu-
man performance. Thus, a live system, which uses
these models to pre-label posts, will be effective in
increasing the consistency of moderation and in re-
ducing human-time to moderate.
2 RELATED WORK
Wulczyn et al. (2017) demonstrate that the perfor-
mance of a machine learning model can be close to
humans on comment moderation. They use an N-
gram word representation to represent a large dataset
of 115K Wikipedia comments which have been la-
belled as personal attack (“reject”) or not (“accept”).
The classification architectures compared are logistic regression (LR) and a simple feed-forward neural network or multi-layer perceptron (MLP). They show that the MLP performs better at detecting personal attack comments and that its accuracy is within 1 absolute point of human performance.
Other researchers (Pavlopoulos et al., 2017) have
investigated applying more complex neural networks
or deep learning models to the moderation of com-
ments. Here, experiments based on the 115K
Wikipedia comments and a Gazzetta dataset (1.45M
training comments) show that Recurrent Neural Net-
work (RNN) models outperform Convolutional Neu-
ral Network (CNN) models as well as a word-list
baseline. Furthermore, more complex RNN models
using an attention mechanism lead to further perfor-
mance increases. Despite near human performance,
they also conclude that it is more realistic to build
a semi-automated system to assist human moderators
rather than replace them.
However, the complex neural network models
investigated in the aforementioned research require
large labelled datasets in order to achieve such im-
pressive results. These datasets also contain a reason-
able balance of examples for both classes and only a
single rejection reason. Thus, we cannot necessarily
expect such good results on a real world dataset which
is small, unbalanced and containing multiple rejection
reasons.
Data augmentation as a method to increase the
number of training examples and thus boost model
performance has been extensively researched in com-
puter vision (Shrivastava et al., 2017) and more re-
cently in NLP (Park et al., 2019). In computer vision,
it is now standard to rotate, reflect and crop images
as these transformations are unlikely to affect a clas-
sification decision as to whether an image contains,
say, a face or not. In NLP, synonym replacement (Kobayashi, 2018) has been shown to be an effective method for data augmentation. Other possible data augmentation methods for NLP, explored by Wei and Zou (2019), include synonym insertion, random swap and random deletion. They combined synonym replacement, synonym insertion, random swap and random deletion and found the combination effective on 5 NLP tasks.
Many of these aforementioned data augmentation
techniques rely on a source of linguistic knowledge
about the semantics of words. For example, both
Kobayashi (2018) and Wei and Zou (2019) use
synonyms which are randomly selected from Word-
Net (Fellbaum, 1998). However, there is an exten-
sive literature in NLP on the automatic discovery of
semantically related words, stemming from the distri-
butional hypothesis (Harris, 1954) which states that
words which have similar meanings are used in sim-
ilar contexts. Automatic methods for discovering
synonyms have an advantage over dictionary-based
methods in that they can be tailored to a specific do-
main. They can therefore be expected to have bet-
ter coverage of the vocabulary and domain-specific
meanings (McCarthy et al., 2004). Currently, two
of the most popular methods for constructing general
purpose or domain-specific word representations are
Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). These methods owe their popularity to their ability to represent word meaning in
a low-dimensional space (typically around 300 di-
mensions). However, like all word representation
methods based on the distributional hypothesis, it is
well known (Weeds et al., 2004) that they conflate
different semantic relationships (e.g., synonymy, hy-
ponymy, antonymy, meronymy and topicality) and
whether they can be used successfully in data aug-
mentation remains an empirical question, which we
explore here.
Another potential data augmentation technique,
not explored here, exploits machine translation tech-
nology (Yu et al., 2018): alternative training examples
can be generated by translating a text from English to
some other language (e.g., French) and then translat-
ing it back into English. However, this method relies
on an external service (e.g., the Google Translate API)
or a fully implemented machine translation model,
making it considerably more time-consuming to pro-
duce a similar sized augmented dataset than when us-
ing the simpler methods described above.
Most recently in NLP, there has been a lot of in-
terest in transfer learning models (Howard and Ruder,
2018; Devlin et al., 2018; Peters et al., 2018; Radford et al., 2018). Rather than augmenting a small domain-
specific training set, these models rely on building a
large general-purpose language model (through pre-
training on unlabelled data) and then transferring this
knowledge to a domain-specific task (through fine-
tuning on small amounts of labelled data). Univer-
sal Language Model Fine-tuning (ULMFiT) (Howard
and Ruder, 2018) was shown to outperform the state-of-the-art on six classification tasks. Furthermore, perfor-
mance with 100 labelled examples matched the per-
formance of training on 10K from scratch (Howard
and Ruder, 2018), thus making it particularly useful
in scenarios with limited amounts of labelled data.
Subsequently, Bidirectional Encoder Representations
from Transformers (BERT) (Devlin et al., 2018) have
been shown to beat ULMFiT and achieve a new state
of art in many NLP tasks. The architecture of BERT
differs from other recent deep learning architectures
(Peters et al., 2018; Radford et al., 2018) in its use of transformers (rather than LSTMs) and in its inherent bidirectionality. The underlying transformer architecture (Vaswani et al., 2017) uses stacked self-attention and point-wise, fully connected layers in both its encoder and decoder. In the original transformer, the encoder has 6 identical layers which take the input embeddings and produce a 512-dimensional output; the decoder takes that output and finally produces the output probabilities. An attention function maps a query and a set of key-value pairs to an output vector. The network linearly projects the queries, keys and values several times to form multi-head attention. Thus,
the embedding of each word in a sequence captures
contextual information about words in other positions
in the sequence.
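For reference, the scaled dot-product attention computed within each head of the transformer (Vaswani et al., 2017) can be written as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $Q$, $K$ and $V$ are the linearly projected query, key and value matrices and $d_k$ is the dimensionality of the keys.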
Pre-training of the BERT model also differs from
pre-existing models, which have tended to use a
continuous bag-of-words (CBOW) model in training,
through its use of masked language models (MLM).
The MLM objective randomly selects some percentage (15%) of the input tokens; each selected token is replaced by [MASK] with 80% probability, by a random word with 10% probability, or left unchanged with 10% probability, and the model must then predict the original tokens at those positions (Devlin et al., 2018). The difference between MLM and CBOW is that the MLM model randomly masks tokens anywhere in the input while the CBOW model predicts tokens from a fixed context window. This means that in every round of training, an MLM model is able to consider the information of the whole input while a CBOW model is only able to consider the information of that fixed window.
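As an illustration of this masking scheme (a minimal sketch of our own, operating on a plain list of tokens rather than BERT's WordPiece pieces; tokens and vocab are assumed inputs):

import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    # Select roughly 15% of positions; of those, replace 80% with [MASK],
    # 10% with a random vocabulary word, and leave 10% unchanged.
    masked = list(tokens)
    targets = {}  # position -> original token that the model must predict
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok
            r = random.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = random.choice(vocab)
    return masked, targets

The model is then trained to recover the tokens stored in targets from the corrupted sequence.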
After pre-training, a BERT model can be fine-
tuned for a variety of NLP tasks with the simple ad-
dition of a single output layer. Recently, pre-trained
BERT has been released making it simple and cheap
to deploy in real-world scenarios. However, its per-
formance on a real-world problem such as ours, rather
than standard NLP tasks from the academic literature,
has yet to be seen.
3 DATASET
The data for this research was supplied free-of-charge
by MeeTwo Education Ltd. At the time of the study,
the dataset contained 22,487 labelled posts
made by over 1000 users. Each post is associated
with a user profile, which includes a user’s id, gender,
birth month, birth year and general location. User ids
were anonymised by MeeTwo in the dataset so that
they are meaningless strings and cannot be used to
identify individuals. Further, personal details from
rejected posts were also removed at source and data
encryption was used to further safeguard the dataset.
Any rejected posts are labelled with “reject” together
with the reasons for rejection.
The dataset is heavily imbalanced; 1,654 out of the 22,487 posts are labelled “reject” and the remainder are labelled “accept”. There are also 37 dif-
ferent categories or reasons for rejection observed in
the dataset. Most of the categories of rejected posts
contain very few posts. Here we focus on the 4 cat-
egories which contain more than 100 posts, which
are Suicidal Ideation, Not Right for MeeTwo, Unclear
and Offensive posts. The corresponding frequencies
of each of these categories are shown in Table 1.
Table 1: Rejection reasons occurring more than 100 times
in the MeeTwo dataset.
Rejection Reasons Frequency
Suicidal Ideation 453
Not right for MeeTwo 353
Unclear 225
Offensive 109
One of 33 other reasons 514
Total Rejected posts 1654
Figure 1: WordClouds for different rejection reasons. Top
left: Suicidal, top right: Unclear, bottom left: NotRight,
bottom right: Offensive.
The posts in the dataset range in length from 1
word to 140 words, with an average length of 45
words. Figure 1 shows the most frequently occur-
ring words (ignoring stop words) in each of the 4
chosen categories of rejected posts. We see that two
of the rejection categories (Suicidal Ideation and Of-
fensive) appear to have clearly related words associ-
ated with them. For example, in Suicidal Ideation
posts, the most frequent content words include ‘life’,
‘kill’, ‘end’, ‘die’ and ‘hate’. Looking at Offensive
posts, many of the most frequently occurring words
are well-known swear words. In contrast, the other
two rejection categories (Unclear and NotRight) do
not have such obvious clearly related words. For ex-
ample, the most frequent words in both of these cate-
gories include ‘still’, ‘help’, ‘girl’ and ‘good’, which
are likely to also occur in other categories and in ac-
cepted posts. Consequently, these categories of re-
jected posts are likely to be harder to identify.
4 METHOD
There are two main parts to our method. First, the
training dataset is automatically augmented. Second,
binary classification models are trained for each of
the rejection categories. Due to the small number
of posts in each category, it is not possible to create
separate training, validation and testing sets for each
category. Therefore, hyper-parameter optimisation is
carried out on a single category, Suicidal Ideation, and
the other three categories are reserved for testing.
In Section 4.1, we discuss augmentation tech-
niques in more detail: introducing 4 potential aug-
mentation techniques which we use in our experi-
ments. In Section 4.2, we discuss machine learning
architectures for classification in more detail: intro-
ducing 3 potential models of increasing complexity
(Logistic Regression, LSTM, BERT).
4.1 Data Augmentation
The point of data augmentation is to improve a
model’s performance on unseen data. In general, the
number of training examples is increased by taking
existing examples and carrying out simple transfor-
mations which we do not expect to affect the label.
All of the techniques described below have two pa-
rameters α and n: α controls how similar a trans-
formed example will be to the original example, and
n controls how many times the transformation is ap-
plied to a single example and, thus, also the size of
the resulting augmented dataset, which will be n + 1
times the size of the original training set. We now describe each technique in detail; a short code sketch is given after the list below. Table 2 shows examples of using these 4 methods.
Synonym Replacement (SR): Randomly select
a proportion (α) of the words in each sentence and
replace them by their closest synonyms i.e., most
similar words. For example, if α = 0.2 we will
replace 20% of the words in each post. If n = 2,
then we will do this twice to each post resulting in
2 new posts for each of the original posts.
Synonym Insertion (SI): Randomly select a pro-
portion (α) of the words in each sentence and in-
sert their most similar words at a random position
in the same sentence.
Random Deletion (RD): Randomly delete a pro-
portion (α) of the words in each sentence.
Random Swap (RS): Randomly select a propor-
tion (α) of words in each sentence and swap their
position with another randomly selected word in
the sentence.
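The following sketch (our own illustration rather than the authors' code) shows one way to implement the four transformations; it assumes a most_similar function that maps a word to a related word, such as the WordNet or word2vec look-ups described next.

import random

def synonym_replacement(words, alpha, most_similar):
    # SR: replace a proportion alpha of the words with their closest synonym.
    out = list(words)
    for i in random.sample(range(len(words)), max(1, int(alpha * len(words)))):
        out[i] = most_similar(words[i])
    return out

def synonym_insertion(words, alpha, most_similar):
    # SI: insert the synonyms of alpha*len(words) random words at random positions.
    out = list(words)
    for w in random.sample(words, max(1, int(alpha * len(words)))):
        out.insert(random.randrange(len(out) + 1), most_similar(w))
    return out

def random_deletion(words, alpha):
    # RD: delete each word with probability alpha (always keep at least one word).
    out = [w for w in words if random.random() > alpha]
    return out or [random.choice(words)]

def random_swap(words, alpha):
    # RS: swap the positions of alpha*len(words) randomly chosen pairs of words.
    out = list(words)
    for _ in range(max(1, int(alpha * len(words)))):
        i, j = random.randrange(len(out)), random.randrange(len(out))
        out[i], out[j] = out[j], out[i]
    return out

def augment(words, transform, n, **kwargs):
    # Apply one transformation n times, giving n new posts per original post.
    return [transform(words, **kwargs) for _ in range(n)]

For example, augment(post_tokens, synonym_replacement, n=2, alpha=0.2, most_similar=most_similar_word) produces two new posts in which roughly 20% of the words have been replaced by their nearest neighbours.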
We experiment with two ways of generating syn-
onyms for words: WordNet and word2vec. WordNet
(Fellbaum, 1998) is a lexical database which groups
words into sets of synonyms called synsets. It is
straightforward to lookup the synonyms of a word
in WordNet, but these will be based on lexicogra-
phers’ knowledge of general usage and will not re-
flect the dominant sense within the domain. If a word
has multiple synonyms, one of them is chosen at ran-
dom. In the first example in Table 2, we see that the
word ‘school’ is replaced by the WordNet synonym
‘civilise’. This is unlikely to be the intended sense
of ‘school’ in the MeeTwo dataset. In fact, the word
‘civilise’ is not used in any of the MeeTwo posts and
therefore this training example is of very limited use.
Word2vec (Mikolov et al., 2013) is a continu-
ous space model based on neural networks which
generates distributed word embeddings that can be
used in downstream NLP tasks. In order to use
word2vec to generate synonyms we first prepare the
in-domain training corpus (using the nltk library to
carry out case normalisation, tokenisation and stop-
word & punctuation removal). We then use the
gensim library to build a word2vec model with de-
fault parameters (sg=0, window=5, size=100) and
also to find the most similar word to a given word,
according to cosine similarity between embeddings.
Crucially, the word2vec model is trained on the
dataset, so all the words it generates must be in that
vocabulary. In the second example in Table 2, we
see that ‘go’ is replaced by ‘bring’, which occurs 546
times in the MeeTwo dataset.
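A minimal sketch of this synonym-generation step (our own illustration; tokenised_posts stands for the pre-processed posts described above, and the size parameter name reflects the gensim 3.x API in use at the time):

from gensim.models import Word2Vec

# tokenised_posts: list of token lists produced by the nltk pre-processing
# (case normalisation, tokenisation, stopword and punctuation removal).
w2v = Word2Vec(tokenised_posts, sg=0, window=5, size=100)

def most_similar_word(word):
    # Return the in-vocabulary word with the highest cosine similarity,
    # falling back to the word itself if it was not seen in training.
    if word in w2v.wv:
        return w2v.wv.most_similar(word, topn=1)[0][0]
    return word

Because the embeddings are trained only on the MeeTwo posts, every suggested replacement is guaranteed to be a word that actually occurs in the domain.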
4.2 Classification
Machine learning classifiers for document classifica-
tion take a numerical representation of the text as
input and output the most likely label for the docu-
ment. In its simplest form, the numerical represen-
tation of a post might be a vector which associates
each word in the vocabulary with a weight according
to its perceived importance in the post (e.g., based on
frequency). For more sophisticated machine learning
approaches, the numerical representation might be a sequence of word embeddings.
Table 2: Examples of different data augmentation techniques applied to the text “I cant push myself to go to school”.
Method alpha num synonyms text
Original Text “I cant push myself to go to school”
SR 0.15 1 word2vec “i cant push myself to bring to school”
SR 0.15 1 WordNet “i cant push myself to bring to civilise”
SI 0.15 1 word2vec “i cant push myself bring to go to school”
SI 0.15 1 WordNet “i cant tug push myself to go to school”
RD 0.15 1 “i cant myself to go to school”
RS 0.15 1 “school cant push myself to go to i”
Here, we investigate three different classifiers: Logistic Regression (LR),
LSTM and BERT. Specifically, the input to the LR is
a TF-IDF document representation; the input to the
LSTM is a sequence of GloVe word embeddings; and
the input to BERT is a sequence of words, since this
model handles both the representation and the classifi-
cation internally. We now describe each classification
technique in more detail.
4.2.1 Architecture 1: TF-IDF and Logistic
Regression
In this architecture, a post is represented as a vector of
its TF-IDF values. For a given post, term frequency
(TF) is the frequency of a word in that post. Inverse document frequency (IDF) is the logarithm of the total number of posts divided by the number of posts containing that word. Thus, TF-IDF, which is the product of TF and IDF, assigns higher weights to words which occur more frequently in an individual post and less frequently in other posts.
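Concretely, for a term $t$ in a post $d$, with $N$ posts in total and $\mathrm{df}(t)$ posts containing $t$:

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log\frac{N}{\mathrm{df}(t)}$$

(scikit-learn's implementation adds smoothing terms to the IDF, but the idea is the same).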
Logistic regression (LR) is a simple classification
method, widely used in statistics and machine learn-
ing. It uses a logistic function to model a binary vari-
able. Having fewer parameters, it is not as sensitive to the amount of training data as more complex machine learning methods. Here, we use the TF-IDF representation as the input to the LR classifier. The output of the classifier is the probability of each class label.
In our implementation, we first use python’s nltk
library to pre-process the data, carrying out tokeni-
sation, case normalisation, stopword and punctua-
tion removal and lemmatisation. These standard pre-
processing steps reduce the size of the vocabulary
and remove tokens / distinctions which are unlikely to
have an effect on classification. We then use python’s
scikit-learn library to construct the TF-IDF repre-
sentation of each post and to realise the LR classifier.
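A minimal sketch of this architecture with scikit-learn (our own illustration; train_texts, train_labels and test_texts stand for the pre-processed posts and their labels):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# TF-IDF document vectors feeding a logistic regression classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_labels)
probabilities = clf.predict_proba(test_texts)  # probability of each class label
predictions = clf.predict(test_texts)

Wrapping the vectoriser and the classifier in a single pipeline ensures that the IDF weights are estimated on the training posts only.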
4.2.2 Archictecture 2: General Purpose Word
Embeddings and LSTM
Recurrent neural networks (RNNs) are typically used
to model sequences because the hidden state of an
RNN at any given time depends both on the current
input and the previous state of the network. Typically,
in language modelling, the input to an RNN is an em-
bedding of a word (a high dimensional representation
which captures similarities between words) and the
network is trained to predict the next word in the ob-
served sequence, given the current word and state of
the network (which represents the context). Vanilla
RNNs, however, have been shown to struggle with
long range dependencies between words (Hochreiter
et al., 2001). LSTMs attempt to overcome this problem by using 4 interacting components in each repeated unit: a cell state for long-term memory; a forget gate that discards information; an input gate that decides which values to update and with what; and an output gate that controls what is output. A classification layer can be
put on top of any RNN, to make a prediction for a
document label based on the hidden state of the net-
work: either after the last token has been read or by
pooling hidden states after each token is read.
We use the pytorch library to realise an LSTM
which takes a sequence of general purpose (pre-
trained) word embeddings as input. Specifically, we
use GloVe (Pennington et al., 2014) embeddings with a di-
mensionality of 300 and a context window size of 8,
trained on a large, general corpus of English text. The
model has 2 LSTM and 2 linear hidden layers. The fi-
nal classification output is decided using the final lin-
ear layer.
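A minimal sketch of such a model in pytorch (our own illustration rather than the exact configuration used in the experiments; the hidden size is arbitrary, and loading of the GloVe vectors and padding of posts are omitted):

import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, embedding_dim=300, hidden_dim=128, num_classes=2):
        super().__init__()
        # Two stacked LSTM layers over sequences of pre-trained GloVe embeddings.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=2, batch_first=True)
        # Two linear layers; the final one produces the class scores.
        self.fc1 = nn.Linear(hidden_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, num_classes)

    def forward(self, embedded_posts):
        # embedded_posts: (batch, sequence_length, embedding_dim) GloVe vectors.
        _, (hidden, _) = self.lstm(embedded_posts)
        x = torch.relu(self.fc1(hidden[-1]))  # hidden state after the last token
        return self.fc2(x)                    # unnormalised class scores

The hidden state of the top LSTM layer after the last token has been read is passed through the linear layers to produce the classification output.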
4.2.3 Architecture 3: BERT Pre-trained
Embedding, BERT Fine-tuning and
Training
BERT (Devlin et al., 2018) is a deep neural network
architecture which can be used to generate contextu-
alised word embeddings and carry out classification
tasks. Using BERT typically has two steps. The first
is pre-training on a very large general corpus; the sec-
ond is fine-tuning on the specific task. Pre-training
BERT has a huge computational cost. Fortunately,
a number of pre-trained BERT models have been re-
HEALTHINF 2020 - 13th International Conference on Health Informatics
192
leased as open source by the developers (https://github.com/google-research/bert). The mod-
els for English have been pre-trained on the concate-
nation of BooksCorpus (800M words) and English
Wikipedia (2,500M words). There are different ver-
sions for cased and uncased text as well as two dif-
ferent sizes of model: BERT-base and BERT-large.
BERT-base has 12 layers, 768 hidden units, 12 heads
and a total of 110M parameters. BERT-large has 24
layers, 1024 hidden units, 16 heads and a total of
340M parameters.
In our implementation, the BERT-base uncased
pre-trained model is employed. We then directly build
a downstream model by fine-tuning this pre-trained
BERT using our own labelled training data.
The fine-tuning part of BERT for sequence-level
classification tasks is straightforward. To get an embedding of the input sequence, the final hidden state corresponding to the first input token, the special [CLS] token, is taken as a vector $C \in \mathbb{R}^{H}$. This vector is the input to a classification layer $W \in \mathbb{R}^{K \times H}$, where $K$ is the number of
classes. The final label probability is computed with a
softmax function. The parameters W are then trained
in order to maximise the probability of the correct la-
bel. The hyper-parameters for fine-tuning are batch
size, learning rate and number of epochs. We use a
training batch size of 24 and a learning rate of 2e-5.
Convergence was achieved after a single epoch of
training, with no benefit seen from continued train-
ing.
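A minimal fine-tuning sketch using the Hugging Face transformers library (an assumption of ours, version 4.x API, rather than the original BERT release), with train_texts and train_labels as placeholders for the prepared training data:

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for start in range(0, len(train_texts), 24):  # training batch size of 24
    batch_texts = train_texts[start:start + 24]
    batch_labels = torch.tensor(train_labels[start:start + 24])
    enc = tokenizer(batch_texts, padding=True, truncation=True, return_tensors="pt")
    outputs = model(**enc, labels=batch_labels)  # loss over the [CLS] classification head
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

A single pass over the training data with these settings corresponds to the single epoch of fine-tuning described above.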
5 EXPERIMENTS
We carried out two sets of experiments. First, we op-
timised the hyper-parameters for data augmentation
(see Section 5.1). Second, we compared the three dif-
ferent machine learning architectures, with and with-
out data augmentation, on the four different categories
of rejection reason (see Section 5.2).
5.1 Data Augmentation
Hyperparameter Tuning
As discussed in Section 4.1, each data augmentation
technique has 2 parameters: α and n. We experi-
ment with these hyperparameters on just the Suicidal
Ideation category of posts. We prepare the training
and testing data in the following way.
Select all of the Suicidal Ideation posts labelled
“reject” (453 posts).
Randomly select the same number of posts (453)
from posts not rejected for Suicidal Ideation.
Merge the posts selected in the above 2 steps to make
a new dataset: the Suicidal Ideation dataset.
Split the dataset into 75% training and 25% test-
ing data.
After testing each data augmentation technique
individually to find the best hyperparameters, all of
the techniques were used together with their best hy-
perparameters to augment the training dataset for the
subsequent experiments.
5.2 Post Classification Experiments
Here, we investigate the effect of data augmentation
on each of the machine learning architectures intro-
duced previously (LR, GloVe + LSTM, BERT) on
each of the categories for possible post rejection (Sui-
cidal Ideation, Offensive, Not Right for MeeTwo and
Unclear). The process of carrying out these experi-
ments is as follows.
Prepare the data for each category of rejec-
tion reason. First, select those rejected posts in
that category. Second, randomly select the same
number of other posts (accepted or other rejection
category). Third, merge the above 2 datasets to-
gether to form the original dataset for that cate-
gory, which is split into training (75%) and testing
(25%) sets (see the code sketch after this list).
Augment the prepared training dataset using the best parameters for the data augmentation methods.
Use the original prepared dataset and augmented
dataset separately to train each model and test on
the same held-out testing data.
Compare the accuracy of the different models on
the test data.
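The data preparation step can be sketched as follows (our own illustration; posts stands for a list of (text, rejection_reason) pairs, with a reason of None for accepted posts):

import random
from sklearn.model_selection import train_test_split

def build_category_dataset(posts, category, seed=0):
    # Positive class: posts rejected for this category; negative class: an
    # equally sized random sample of all other posts.
    positives = [text for text, reason in posts if reason == category]
    others = [text for text, reason in posts if reason != category]
    random.seed(seed)
    negatives = random.sample(others, len(positives))
    texts = positives + negatives
    labels = [1] * len(positives) + [0] * len(negatives)
    # 75% training / 25% held-out testing, as in the experiments.
    return train_test_split(texts, labels, test_size=0.25, random_state=seed)

train_texts, test_texts, train_labels, test_labels = build_category_dataset(
    posts, "Suicidal Ideation")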
The results of the model comparison are the accuracy of each model on each of the four categories of posts. The average accuracy over the four categories is also reported.
6 RESULTS
In this section, we present our results on each set of
experiments.
6.1 Data Augmentation
Hyperparameter Tuning
We conducted experiments using a range of α
and n values (α ∈ {0, 0.1, 0.2, 0.3, 0.4, 0.5} and n ∈
{0, 4, 8, 12, 16, 20}).
When testing on Suicidal Ideation data, all 3 ma-
chine learning architectures make gains in perfor-
mance through the use of the data augmentation tech-
niques. Figure 2 shows the effect of tuning α in data
augmentation methods combined with an LSTM (on
Suicidal Ideation posts). Here n is kept constant (n =
3). We see that optimal performance is achieved with
all of the data augmentation methods with a relatively
low value of α. This means that the best augmented
posts are relatively similar to the original posts (con-
taining, say, 80-90% of the same tokens). In general,
we observed the same patterns for the other classifi-
cation architectures. The only significant difference is that random swap has no effect on the LR clas-
sifier (since the document representation is based on
a bag-of-words rather than a sequence).
Figure 2: Tuning α for LSTM on Suicidal Ideation posts.
Top left: SR, top right: SI, bottom left: RS, bottom right:
RD.
Figure 3 shows the effect of tuning n in the
data augmentation methods for the LSTM and for
BERT (on Suicidal Ideation posts). Both architec-
tures show substantial improvements with data aug-
mentation (over 20%). We observe that the LSTM re-
quires a greater amount of augmentation (n > 20) to
achieve results in the same ballpark as BERT (n = 4).
The LR classifier, on the other hand, benefited less
from data augmentation, with performance only im-
proving by around 4%. This peak performance oc-
curred at around n = 8.
6.2 Post Classification Experiments
Here, we present the performance of each model on
four categories of posts and the average performance
over the four categories.
Figure 3: Tuning n for LSTM and BERT on Suicidal
Ideation Posts. left: LSTM, right: BERT.
Figure 4 shows our results for the different cate-
gories. The blue bars represent the accuracy of mod-
els which are trained on original training data, while
orange bars represent the accuracy of models which
are trained on data augmented using the data augmentation methods with their best parameters.
Figure 4: Testing on different categories. Top left: Suici-
dal Ideation, top right: Not Right for MeeTwo, bottom left:
Unclear, bottom right: Offensive.
From the results we make the following obser-
vations. For the Suicidal Ideation category, the TF-
IDF+LR architecture gains 3 absolute points with
data augmentation; GloVe+LSTM gains 21 absolute
points; BERT gains 5 absolute points and reaches the
highest accuracy among the 3 models. The best accuracy is 0.916, which is close to the human accuracy of 0.946 (human accuracy was measured on the same test data by MeeTwo's professional moderation staff).
For the Not Right category, TF-IDF+LR gains 1 absolute point; GloVe+LSTM gains 22 absolute points; BERT gains 3 absolute points and reaches the highest accuracy among the 3 models. The best accuracy is 0.82, which is much lower than the accuracy on the Suicidal Ideation category. One likely reason for this is that Suicidal Ideation posts normally contain some obvious words such as kill, life and sad. Not
Right posts are those posts that are not in line with the
MeeTwo rules. These are more difficult to classify.
For the Unclear category, TF-IDF+LR decreases by 1 absolute point; GloVe+LSTM gains 3 absolute points; BERT gains 3 absolute points and reaches the highest accuracy among the 3 models. The best accuracy is 0.81, which is quite close to the accuracy on the Not Right category. Unclear posts may include posts that are essentially meaningless. These kinds of posts are also more difficult for the machine learning models to classify.
The LSTM achieves the highest accuracy on the Offensive category of posts. Its accuracy is 0.93, which is much higher than that of LR. BERT also achieves a good accuracy of 0.91. Offensive posts are quite similar to Suicidal Ideation posts. Both of them have some ob-
vious words, which makes classification easier for all
of the models.
More generally we see that data augmentation is
much more beneficial to the more complex, data hun-
gry architectures such as LSTM and BERT. These
architectures need a large amount of data to achieve
peak performance. The per-category results, together with the averages in Table 3, show that when using augmented training data the test accuracy of LSTM and BERT is greatly improved. For the TF-IDF+LR model, the increase in the amount of training data does not have the same impact. In fact, in some categories, the accuracy of LR decreases with the augmented data.
The BERT model achieves the highest accuracy on most of the categories, as well as the highest average accuracy. The average accuracies of BERT and LSTM are very close and around 5 absolute points higher than that of LR.
6.3 Error Analysis
The incorrect predictions made by the different mod-
els and the human are not all the same. Table 4 shows
some examples of errors made by each on the Suicidal
Ideation test set. 0 represents an “accepted” decision
for the post while 1 represents a “rejected” decision
for the post. The first post shows a case where the human is correct while all of the models are wrong. The post is quite short, and words such as ‘break’ and ‘anymore’ may lead the models to a wrong prediction. The second post is a case where the human is wrong while all of the models are correct. The third and fourth posts show cases where only BERT is correct or only BERT is wrong. We note that the gold-standard label was also assigned by a human moderator when the post was originally made. Therefore, the gold-standard labels may not be one hundred percent correct.
Figure 5 shows the proportion of the same pre-
dictions made by each pair of models on the Suici-
dal Ideation dataset. We note that LSTM and BERT
have the biggest overlap in errors. Therefore, typi-
cally, the mistakes made by these two models are very
similar. We also see that the smallest overlaps in er-
rors are between LSTM and human and between LR
and human. In other words, a greater proportion of
the BERT model’s errors are the same as the human
errors. Consequently, when the results of the LSTM
model and the BERT model differ, BERT is more
likely to have made a human-like error.
Figure 5: Proportion of predicting the same result for any
pair of models or human.
7 CONCLUSION AND FUTURE
WORK
In this work, we have explored the feasibility of build-
ing a semi-automated moderation model that is able
to reduce the moderation workload and accelerate the
moderation speed in the MeeTwo scenario. We have
demonstrated a practical pipeline which can be used
in this and other small dataset classification scenarios.
We recommend the use of simple data augmentation
methods to build up a much larger training dataset and
then use this dataset to train a data hungry model such
as BERT.
There are in total 37 different possible reasons for
posts being rejected from MeeTwo. Here, we have
focused on the top 4 rejection reasons which account
for more than half of the rejected posts. These are
Suicidal Ideation, Not Right for MeeTwo, Offensive
and Unclear. Using a combination of dataset aug-
mentation and modern machine learning architectures
for classification, we have demonstrated that, despite
Table 3: The average performance of 3 models using original data and augmented data.
Model 1(LR) Model 2(LSTM) Model 3(BERT)
Original Data Average Accuracy 0.82 0.70 0.75
Augmented Data Average Accuracy 0.83 0.86 0.87
Table 4: Example errors made by the different architectures on the Suicidal Ideation dataset (labels given as Real / Human / LR / LSTM / BERT; 0 = accept, 1 = reject).
1. “my friend doesn’t want to be with me at break or lunch anymore” - 0 / 0 / 1 / 1 / 1
2. “Sometimes I wonder what it’s like to not be alive. Does anybody else wonder whether death is better than life? I think about it a little too much I think.” - 1 / 0 / 1 / 1 / 1
3. “I need to stay happy. I have a bottle of anxiety pills in the bathroom. One pill a day it says. I could just swallow them all if I get upset and suicidal. I need to stay happy.” - 1 / 0 / 0 / 0 / 1
4. “I’m falling for someone I can’t fall for what do I do?” - 0 / 1 / 1 / 1 / 0
limited and unbalanced training data, it is possible
to build classifiers with near-human accuracy for a
number of different rejection reasons. On the Sui-
cidal Ideation category, we achieve an accuracy of
91.2% on a balanced dataset (compared to human per-
formance of 94.5%). In the Offensive category, we
achieve 92% accuracy. Accuracy is lower (around
80%) on less well-defined categories such as Unclear
and Not Right for MeeTwo. However, in the real-
world scenario, where a much smaller proportion of
posts should actually be rejected than in our testing
scenario, these methods will be very valuable in be-
ing able to identify and flag posts of potential concern,
which a human moderator can then check.
We have demonstrated 4 different augmentation
techniques which can increase the amount of training
data for machine learning by 20 times or more. Our
results show that these methods can massively boost
model performance when we only have a small labelled
dataset. We compared three machine learning archi-
tectures (TF-IDF+LR, GloVe+LSTM and BERT) in a
practical text classification task. In general, the per-
formance of BERT was the highest. However, with-
out data augmentation, the performance of the LSTM is poor and the performance of BERT is on a par with, or worse than, the simple baseline of TF-IDF representations and an LR classifier. Thus, we conclude that, whilst
some data sparsity issues may be overcome through
the use of a pre-trained BERT model, it is of fur-
ther benefit to augment the training set used for fine-
tuning.
On the Not Right for MeeTwo and the Unclear
categories, the performance of BERT is higher than
LSTM while the performance of LSTM is close to
BERT on the Suicidal Ideation category and even
higher than BERT on the Offensive category. These
latter categories can be more easily defined in terms
of individual words (see Figure 1). This suggests that
the BERT model can work well even when the text has
no obvious words as indicators of a particular class.
There are a number of potential avenues for fur-
ther work. First, the data augmentation techniques
presented herein might be improved upon or ex-
tended. For example, other models or datasets could
be used to train the synonyms generation model. A
different or larger corpus for discovering semanti-
cally similar words might create a more generalisable model. Techniques for ensuring that only syn-
onyms (rather than antonyms or other related words)
are inserted or substituted could also be investigated.
Further, part-of-speech tagging or dependency analy-
sis might be used to ensure that the new training ex-
amples are linguistically plausible. Finally, we have
looked at ways in which to transform posts at the word
level. However, we could consider transforming posts
at the sentence or discourse level. For example, it
might be possible to create new sentences or posts
by combining different sentences, or parts of sentences, taken from posts with a given label.
The second avenue for future work relates to ma-
chine learning classification model performance and
integrating these models into a live system. The 3
architectures investigated here could be blended (e.g., through the use of a simple voting system), which could
increase the performance. In a practical system which
values recall over precision, if any of the models pre-
dict that a post should be rejected, then this should
be flagged to the human moderators. There are also a
number of other hyper-parameters and potential fea-
tures which could be explored. For example, we have
noticed that posts made by boys have a higher rejec-
tion rate. Consequently, future work could explore
whether and how to incorporate extra-linguistic fea-
tures of the posts including gender and age of user.
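As a sketch of the recall-oriented combination suggested above (hypothetical; it assumes each trained model exposes a scikit-learn-style predict method and that label 1 means “reject”):

def flag_for_human_review(post_text, models):
    # Flag the post if any individual model predicts "reject" (label 1),
    # trading precision for recall so that fewer risky posts slip through.
    return any(model.predict([post_text])[0] == 1 for model in models)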
REFERENCES
Cheng, J., Danescu-Niculescu-Mizil, C., and Leskovec, J.
(2015). Antisocial behavior in online discussion com-
munities. In Ninth International AAAI Conference on
Web and Social Media.
Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2018).
BERT: pre-training of deep bidirectional transformers
for language understanding. CoRR, abs/1810.04805.
Duggan, M. (2017). Online harassment. Technical report,
Pew Research Center.
Fellbaum, C. (1998). WordNet: An Electronic Lexical
Database. Bradford Books.
Gagliardone, I., Gal, D., Alves, T., and Martinez, G. (2015).
Countering online hate speech. Unesco Publishing.
Harris, Z. (1954). Distributional structure. Word,
10(2-3):146–162.
Hochreiter, S., Bengio, Y., Frasconi, P., Schmidhuber, J.,
et al. (2001). Gradient flow in recurrent nets: the dif-
ficulty of learning long-term dependencies.
Howard, J. and Ruder, S. (2018). Universal language model
fine-tuning for text classification. arXiv preprint.
Kobayashi, S. (2018). Contextual augmentation: Data
augmentation by words with paradigmatic relations.
CoRR, abs/1805.06201.
McCarthy, D., Koeling, R., Weeds, J., and Carroll, J.
(2004). Finding predominant word senses in untagged
text. In Proceedings of the 42nd Meeting of the Asso-
ciation for Computational Linguistics (ACL’04), Main
Volume, pages 279–286, Barcelona, Spain.
McManus, S., Bebbington, P., Jenkins, R., and Brugha,
T. (2016). Mental health and wellbeing in England:
Adult psychiatric morbidity survey 2014.
McManus, S., Meltzer, H., Brugha, T. S., Bebbington, P. E.,
and Jenkins, R. (2009). Adult psychiatric morbidity
in England, 2007: results of a household survey.
Mikolov, T., Yih, W. T., and Zweig, G. (2013). Linguis-
tic regularities in continuous space word representa-
tions. In Proceedings of the 2013 Conference of the
North American Chapter of the Association for Com-
putational Linguistics: Human Language Technolo-
gies, pages 746–751.
Park, D. S., Chan, W., Zhang, Y., Chiu, C., Zoph, B., Cubuk,
E. D., and Le, Q. V. (2019). Specaugment: A sim-
ple data augmentation method for automatic speech
recognition. CoRR, abs/1904.08779.
Pavlopoulos, J., Malakasiotis, P., and Androutsopoulos, I.
(2017). Deep learning for user comment moderation.
CoRR, abs/1705.09993.
Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M.,
Clark, C., Lee, K., and Zettlemoyer, L. (2018).
Deep contextualized word representations. CoRR,
abs/1802.05365.
Pieschl, S., Kuhlmann, C., and Porsch, T. (2015). Beware
of publicity! perceived distress of negative cyber in-
cidents and implications for defining cyberbullying.
Journal of School Violence, 14(1):111–132.
Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018). Improving language understanding by generative pre-training.
Saleem, H. M., Dillon, K. P., Benesch, S., and Ruths, D.
(2017). A web of hate: Tackling hateful speech in
online social spaces. CoRR, abs/1709.10159.
Shrivastava, A., Pfister, T., Tuzel, O., Susskind, J., Wang,
W., and Webb, R. (2017). Learning from simulated
and unsupervised images through adversarial training.
In 2017 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pages 2242–2251.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
Weeds, J., Weir, D., and McCarthy, D. (2004). Character-
ising measures of lexical distributional similarity. In
Proceedings of the International Conference on Com-
putational Linguistics (COLING), pages 1015–1021,
Geneva, Switzerland.
Wei, J. W. and Zou, K. (2019). EDA: easy data augmenta-
tion techniques for boosting performance on text clas-
sification tasks. CoRR, abs/1901.11196.
Wulczyn, E., Thain, N., and Dixon, L. (2017). Ex machina: Personal attacks seen at scale. In Proceedings of the 26th International Conference on World Wide Web, pages 1391–1399.
Yu, A. W., Dohan, D., Luong, M., Zhao, R., Chen, K.,
Norouzi, M., and Le, Q. V. (2018). Qanet: Combining
local convolution with global self-attention for read-
ing comprehension. CoRR, abs/1804.09541.