many people may intentionally or unintentionally
conform to social expectations or their own internal
defense mechanisms when filling out questionnaires,
which can seriously affect the accuracy of evaluation
results. In addition, self-report questionnaires are
usually administered at specific time points and cannot
continuously track changes in an individual's
psychological state. By contrast, posts on social media
often better reflect a person's real-time psychological
state. If potential psychological problems can be
flagged and addressed in a timely manner through
social media posts, many mental illnesses may be
kept from developing to an irreversible stage. With the
development of natural language processing
technology, detecting and intervening in potential
psychological issues through social media has become
feasible. To further
investigate the manifestations of these mental health
issues on social media, this work used the Reddit
Mental Health Dataset published by Low et al. in
2020 (Low, 2020). This dataset contains posts
collected from 28 different subreddits in the Reddit
community between 2018 and 2020. This work used
approximately 75000 posts from three subreddits
(depression, anxiety, and SuicideWatch) in 2020
to train Bidirectional Encoder Representations from
Transformers (BERT) models for specific
psychological states. A separate BERT model was
fine-tuned for each condition to learn the language
patterns associated with that mental health issue, and
ensemble learning techniques were employed to combine
the outputs of these models for a more robust and
comprehensive assessment.
2 METHOD
2.1 Dataset
The dataset is the Reddit Mental Health Dataset from
Zenodo (Low, 2020). It contains 15896 posts from the
anxiety subreddit, 38033 posts from the depression
subreddit, and 21410 posts from the subreddit on
suicidal tendencies, for a total of approximately
75000 data points. Each subreddit's data includes
several groups of attributes: Metadata Attributes,
which record the subreddit, author, and date of
publication; Text Content Attributes, which contain
the text of the post; Readability Indices, which
measure how difficult the text is to read; Sentiment
Analysis Attributes, which characterize the emotional
tone of the text; and Mental Health-related Metrics,
which give the frequency of vocabulary related to
different types of psychological states.
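The per-subreddit counts above imply a noticeable class imbalance, which
the following short sketch makes explicit. The dictionary keys are
illustrative labels chosen here, not column names from the dataset.

```python
# Post counts per subreddit as reported in the text above.
# The keys are illustrative labels, not identifiers from the dataset.
counts = {"anxiety": 15896, "depression": 38033, "suicide": 21410}

total = sum(counts.values())

# Share of each subreddit in the combined data.
shares = {name: n / total for name, n in counts.items()}

print(total)  # 75339
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The exact total is 75339, consistent with the approximate figure of
75000 quoted in the text, and the depression subreddit contributes
about half of all posts.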
2.2 Data Preprocessing
To adapt the dataset to the training requirements of
the BERT model, the data must first be preprocessed
(Boukhlif, 2024). First, this work
uses the chardet library to detect the file encoding so
that the data can be read and processed accurately.
Then, the dataset is loaded into a Pandas DataFrame
and a preliminary missing-value check is conducted to
ensure the integrity of the data. Next, in order to
enable the BERT model to process and extract
meaningful features more accurately, this work used
regular expressions to remove noise from the original
text data, such as HTML tags, redundant whitespace,
and other irrelevant symbols, completing text
cleaning. To facilitate further processing and
classification, various sentiment scores and
readability indicators stored in string form in the
dataset are converted into appropriate numerical
types. This work then performs feature selection.
Based on their relevance to mental health status
detection, the selected features include
emotional scores, language usage indicators, and
readability indicators. To create classification labels,
this work developed custom functions that classify
each record based on predefined thresholds. These
label generation functions consider multiple features,
allowing anxiety-related text to be classified as
"anxious", "potentially anxious" or "non anxious".
The label categories for depression-related text are
analogous: "depression", "potential depression", or
"non depression". Suicidal tendency is
divided into "suicidal tendency" and "non suicidal
tendency". Finally, this work exported the processed
data (including labels) as a CSV file. These annotated
datasets can be used for training and fine-tuning
BERT models (Chowdhary, 2020).
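The text-cleaning, type-conversion, and threshold-labeling steps
described above can be illustrated with a minimal sketch using only the
Python standard library. The regular expressions, the feature names
anxiety_word_freq and sent_neg, and all threshold values are assumptions
made here for illustration; the paper does not report the exact patterns
or thresholds it used. Encoding detection with chardet and loading into
a Pandas DataFrame are omitted.

```python
import re

# Hypothetical patterns; the exact regular expressions are not given in the text.
HTML_TAG = re.compile(r"<[^>]+>")
NON_TEXT = re.compile(r"[^A-Za-z0-9\s.,!?'\-]")
WHITESPACE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Remove HTML tags, stray symbols, and redundant whitespace."""
    text = HTML_TAG.sub(" ", text)
    text = NON_TEXT.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()


def to_float(value, default: float = 0.0) -> float:
    """Convert a score stored as a string to float; fall back to a default."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return default


def label_anxiety(row: dict, freq_hi: float = 0.05, freq_lo: float = 0.02,
                  neg_hi: float = 0.5) -> str:
    """Three-way label from two features.

    Feature names and thresholds are illustrative assumptions only.
    """
    freq = to_float(row.get("anxiety_word_freq"))
    neg = to_float(row.get("sent_neg"))
    if freq >= freq_hi and neg >= neg_hi:
        return "anxious"
    if freq >= freq_lo or neg >= neg_hi:
        return "potentially anxious"
    return "non anxious"


# Example usage on a single hypothetical record.
record = {"post": "<p>I feel   so\n anxious  today!</p>",
          "anxiety_word_freq": "0.06", "sent_neg": "0.7"}
print(clean_text(record["post"]))
print(label_anxiety(record))
```

Combining two features in the rule mirrors the statement that the label
functions consider multiple features; a labeler for depression or
suicidal tendency would follow the same shape with its own features and
label names.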
2.3 Model
This work adopted the BERT architecture and added
a classification layer for outputting
predictions of the corresponding mental health
condition. The base BERT model, BERT-base-truncated,
is initialized with pre-trained weights, and
the classification layer is fine-tuned on the training
data for each scenario. For each task,
the classification layer is adjusted to match the
number of output labels (Devlin, 2018). Each dataset
of mental health status is divided into a training set, a
validation set, and a testing set, using a hierarchical