Leveraging DistilBERT for Predictive Analytics and Insights in

Mental Health Disorder Classification

G. Angelpriya, J. Nirmala Gandhi and V. Venkatesh Guru

Department of Computer Science and Engineering, K.S.R. College of Engineering, Tiruchengode - 637215, Tamil Nadu,

India

Keywords: DistilBERT, Mental Health Classification, Predictive Analytics, Natural Language Processing (NLP),

Depression Detection, Real‑Time Prediction, Ethical AI for Mental Health.

Abstract: Mental health disorders, especially depression, are a major global challenge and therefore early detection is

required for timely intervention. Conventional assessment forms can be inaccessible, subjective or delayed;

hence the demand for more scalable and automated solutions. In this work, we utilize recent developments

within the field of natural language processing (NLP) to create a reliable system for classifying depressive

and non-depressive language with DistilBERT, a distilled transformer model. In response to these issues, a

synthetic dataset was carefully designed to capture a range of emotional expressions that are sensitive and

diverse while ensuring ethical considerations. From the dataset generation and pre-processing methods

through model training and evaluation, it reaches a 100% accuracy on the test set. A user-friendly, accessible

and responsive-interface deployed prediction framework for real-time inference of inputs using the trained

model. It is created to serve as a mental health monitor rather than a diagnostic replacement, placing ethics at

the forefront. As the field of NLP matures, this research seeks to connect state‐of‐the‐art models and real

physical world problems when it comes to early detection and intervention for mental health conditions with

potential of affecting a large set of people in an ethically sound way.

1 INTRODUCTION

Mental health disorders, especially depression, have

become a prominent global health threat to the

millions of people worldwide who experience it

regardless of age or geography. The rising prevalence

of depression and its devastating toll on both the

individual and society-at-large underscore the need

for improved access to scalable, effective, and timely

diagnostic tools that enable population-level

screening for early intervention. Long-standing

practices like clinical assessments and self-reporting

are often resource-heavy, subjective, and unavailable

to a large population – presenting an opportunity for

technology-driven solutions.

Recent advancements in natural language

processing (NLP) and machine learning are creating

new paradigms in mental health analytics. Models

based on the Transformer architecture, such as BERT

and its optimized variants, deliver high-performance

capabilities to understand contextualized patterns of

language; thus, these models are suitable for

analyzing sensitive emotional expressions that denote

mental health states (J. Devlin, M.-W. Chang, K. Lee,

and K. Toutanova). BERT models can be expensive,

however, and of the many lighter versions that have

been released since then (M. Amin and M. Rasheed),

the DistilBERT model has proven to be a highly

efficient alternative with 97% of BERT's accuracy

just lighter and faster (V. Sanh, L. Debut, J.

Chaumond, and T. Wolf).

There are many research studies conducted on the

NLP for mental health, some of these studies give us

glimpse that how in near future this technology will

decode our emotions and signs of despair. Sentiment

analysis models, for example, have been employed to

analyze language associated with depression, anxiety

and other disorders (M. Shatte, D. Hutchinson, and P.

Teague), highlighting the potential significance of

language as an indicator of mental health. Machine

learning techniques have also been employed to

analyze social media data, focusing on text-based

features for predicting mental health states including

suicidal ideation and addiction (E. Choudhury and S.

Angelpriya, G., Gandhi, J. N. and Guru, V. V.

Leveraging DistilBERT for Predictive Analytics and Insights in Mental Health Disorder Classiﬁcation.

DOI: 10.5220/0013906100004919

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 1st International Conference on Research and Development in Information, Communication, and Computing Technologies (ICRDICCT‘25 2025) - Volume 3, pages

811-816

ISBN: 978-989-758-777-1

811

Asan; H. Rasheed et al.). Nevertheless, these methods

also encounter various challenges including dataset

diversity, ethical issues and computational efficiency

and highlights the need for more fine-tuned

methodologies.

The focus of this work is to fill these gaps through

a novel framework designed for the classification of

depressive and non-depressive language utilizing

DistilBERT, making it efficient, ethical, and robust.

This ensures a diverse and balanced dataset as well as

addressing ethical issues with the use of real-world

data related to mental health (N. Milne, J. Moyle, S.

Kellett, and T. W. Rogerson). By exposing the model

to diverse linguistic characteristics like suicidal

ideation and addiction, the dataset broadeners its

potential for identifying critical mental health

contexts.

Our proposed system uses a systematic pipeline

that passes through data pre-processing, model

training and real-time deployment. It utilizes results

from prior research on emotion detection,

transformer-based classification, and ethical

considerations in AI to synthesize best practices used

ensuring the both accuracy and accessibility (H. A. S.

Jelodar et al.; J. Li, R. Luo, S. Zhang, Z. Wang, and

Y. Xue; H. Salimi, N. Alavi, S. Farhadpoor, and A.

Ghorbani). The system was evaluated applying

comprehensive metrics that showed how effective it

could be with an accuracy of 100%, illustrating that

this application has potential to be used in real-life

applications for mental health monitoring and early

intervention (L. Wang et al.).

With the computational advantage of DistilBERT,

ethical approach through synthetic datasets, and real-

time deployment aspects, this work aims to contribute

towards mental health analytics. In doing so, it seeks

to offer a scalable, user-focused solution that

connects the phenomenal advancement of technology

with the acute demand for mental health services

around the world.

2 RELATED WORKS

It de-duplicates recent NLP applications in mental

health research, particularly after breakthroughs of

machine learning and transformers-based models.

Initial insights have been provided in previous studies

regarding the intersection of mental health and

technology.

Several works have been done in the area of

mental health prediction using machine learning

models, a systematic review covering these aspects

we find deep learning approaches to be effective

when it comes to capturing complex emotional

behaviours from text (L. Wang et al.). They have been

especially effective in the classification of depression

and other mental health diseases from data derived

from digital communication, including posts or

conversations on social media.

In contrast (Z. Huang et al.), performed a

systematic overview of NLP applications in clinical

psychiatry but highlighted the use of transformer

models to study linguistic data in mental health

settings. They showed that integrating domain-

specific datasets with cutting-edge NLP techniques

could yield high classification accuracy. This is

similar to the methodology used in this study, where

a synthetic dataset was created aiming at covering a

range of emotional representations while considering

ethical issues.

For real-time prediction systems (Y. Zhang, Y.

Ma, and H. Liu), used transformer models to sense

sentiments by collecting the emotional cues present in

textual data. These results highlight the significance

of lightweight and efficient models like DistilBERT

in scenarios where computational overhead is a

limiting factor, such as real-time applications. Such

systems are particularly relevant in the context of

mental health monitoring, where timely predictions

need to be accurate for them to work effectively.

Machine learning models for predicting suicidal

ideation, placing special importance on ethical

safeguards and diverse features in datasets given the

sensitive nature of mental health situations (P. Y.

Chen, M. Bowring, and D. M. Fatima). Their work

motivated the current study to include severe

depressive contexts, including suicidal ideation and

addiction, to improve model validity and

completeness.

(J. D. Power and T. Mitchell) also extended the

work by applying linguistic data to depression

diagnosis systems (i.e., they focus on interpretable,

user-friendly tools that can be used in practice). This

is in keeping with the study deployment strategy,

which includes a Gradio-based user interface for

interactive real-time accessibility.

All of these works together approach the field

where NLP and machine playing an essential role for

mental health analysis, which is a novel way. Our

approach builds upon these lessons, using

DistilBERT's architecture to establish a lower

complexity model, careful considerations in data

design for ethical structures and an extensive

evaluation to create a system better suited to

ICRDICCT‘25 2025 - INTERNATIONAL CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION,

COMMUNICATION, AND COMPUTING TECHNOLOGIES

812

classification of depressive versus one which is non-

depressive text.

3 PROBLEM STATEMENT

Mental health disorder head is one of the major

challenges to global health, and trauma-related

condition such as depression cannot be defined by

gender, class or cultural structure. Detecting

depression early on is essential for early intervention,

but conventional approaches often depend solely on

self-reporting or clinical evaluations details that can

be difficult to access, subjective, or chronologically

distant. Given the unprecedented growth of digital

communication, textual data provide a unique

opportunity for gaining insights into emotional states

at an individual level, but accurate classification

remains a very challenging problem due to the subtle

and context-dependent nature of language. Current

platforms are either not scalable, don’t support many

local languages, or are non-ethical in their approach.

This work fills an urgent gap in the literature for a

scalable, efficient, and ethically engineered solution

to automatically classify depressive and non-

depressive language using recent developments in

natural language processing (NLP). By incorporating

a lightweight transformer architecture, synthetic data

generation techniques, and real-time deployment

capabilities, this research aims to contribute towards

closing the gap between technology-driven

innovation for user-centric mental health monitoring

systems.

4 METHODOLOGY

The use of transfer-learning techniques for text-based

mental health disorder prediction has made

significant recent advances, including notably by

using NLP and predictive models such as

DistilBERT. The methodology is included

sequentially from data set generation to deployment

of the trained model in real-time. The subsections

below cover each step of the methodology in Figure

Figure 1: Working flow of model.

4.1 Data Generation

To ensure that the dataset used for training and

evaluating the model’s ability to discern between

depressive and non-depressive language was

balanced and varied, considerable thought went into

simulating the data generation process. They began

with a curated list of emotional expressions, including

depressive terms like “sad” and “hopeless,” and non-

depressive terms like “happy” and “hopeful.”

WordNet was used to create synonyms which aided

in diversifying the dataset with subtle variations.

Randomized sentence templates (e.g., “I feel [word]”)

and context-specific statements, such as addressing

suicidal thoughts or addiction, were utilized to

capture at-risk mental health scenarios. Uplifting

states were also embodied using positive

affirmations. To balance the dataset, equal

representation of both depressive and non-depressive

sentences was maintained, with shuffling to avoid

bias, reflecting variability in emotional language to

maintain sensitivity for a strong model evaluation/

training.

4.2 Data Preparation and Pre-

Processing

Data Preparation and Pre-Processing: The synthetic

dataset was then transformed into a suitable structure

that could be used to train the DistilBERT based

classification model. To ensure an unbiased

evaluation of our model, we divide the dataset into

training (80%) and testing (20%) subsets. The text

data files were tokenized using the DistilBERT

tokenizer, which took the sentences and turned them

into a sequence of tokens and added special tokens

used for models, while also truncating long sentences

and padding short ones. This allowed for supervised

Leveraging DistilBERT for Predictive Analytics and Insights in Mental Health Disorder Classiﬁcation

813

learning, with each sentence labelled as either

depressive (1) or not depressive (0). Data was

processed and structured into the PyTorch Dataset

and FirelayDataLoader classes for batch and shuffle

during training, both with a batch size of 32. The

training on segmented and pre-processed data

guaranteed high-quality, consistent input data across

the whole dataset that enabled proper model training

and evaluation processes.

4.3 Model Selection and Training

Selecting and training the model to be able to

differentiate between, depressive language versus

non-depressive language. In this phase, we carefully

initialize and fine-tune a pre-trained DistilBERT

model to optimize the parameters for our binary

classification task while also implementing an

efficient training regimen in order to obtain high-

performance outcomes.

4.3.1 Model Initialization

A pre-trained DistilBERT model was selected as a

backbone for the classifier due to its efficiency in

computation and representation quality on NLP tasks.

DistilBERT is 40% smaller but retains 97% of

BERT’s performance, so it seemed like an ideal fit for

the task. We appended a classification head to the pre-

trained model, which is composed of a fully

connected layer with two output neurons representing

the depressive (label 1) and non-depressive classes

(label 0). We initialize our model architecture by

loading in the weights pre-trained on the general

language understanding task for building a good

starting point to fine-tune on the mental health

classification dataset.

4.3.2 Optimization

The tuning of the model parameters to minimize

classification error. We employed AdamW as the

optimizer, as it's one of the optimizers that works well

with transformer models since it decouples weight

decay and gradient-based updates. The learning rate

was tuned to enable stable and mild training updates.

To improve the optimization, process a linear

learning rate scheduler was used. The learning rate

then was dynamically adjusted over the training steps

according to equation (1):

𝑙𝑟



=𝑙𝑟



∙(1−





) (1)

where 𝑡 is the current training step, 𝑇 is the total

number of training steps and 𝑙𝑟



is the initial

learning rate. This scheduling is generally beneficial,

as it decreases the learning rate when the model is

getting closer to a minimum in the loss landscape,

allowing for smoother convergence.

4.3.3 Training Process

Using a batch size of 32, the training was done in

mini-batches to balance between computation and

gradient stability. To measure the discrepancy

between predicted probability estimates and true

labels, the binary cross-entropy loss function was

used, as defined in equation (2):

𝐿=−





∑

[𝑦



∙log

(

𝑦



)

(

1−𝑦



)

∙log

(

1−𝑦



)

]





(2)

Where 𝑦



is the true label, 𝑦



is the predicted

probability and 𝑁 is the batch size. This loss function

is applied since we have a binary classification

problem, and it punishes wrong predictions

aggressively.

4.4 Evaluation

Next, there was an evaluation phase, where the

trained DistilBERT model's ability to classify

depressive and non-depressive language using a test

dataset containing nothing but unseen data. The

model was put in inference mode to maximise speed

and resource usage, before getting predictions by

passing input tokens through the model. Binary labels

(1–depressive, 0-non-depressive) were generated

through the argmax function and was used to compute

accuracy by comparing with true labels. Specifically,

the accuracy, defined as the number of correct

predictions divided by total predictions (3), was 97%,

indicating high model performance. This implies that

the pre-processing steps as well as the training and

dataset generation methodology helped derive a

strong classifier for an out-of-the-box mental health

monitoring application.

𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦 =

   

 

(3)

4.5 Deploy and Predict in Real Time

The Deployment phase converted the trained

DistilBERT classifier into a deployable tool for real-

time mental health status evaluation. To facilitate

compatibility with Hugging Face's pipeline utility for

ICRDICCT‘25 2025 - INTERNATIONAL CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION,

COMMUNICATION, AND COMPUTING TECHNOLOGIES

814

inference, the model and tokenizer were both

serialized and reloaded during deployment. For this,

user input is tokenized into representation form and

prediction will be made on the model to get

probabilities, labels will be assigned e.g.

Depressive/Non-Depressive classes using argmax

value and finally confidence scores will be displayed

for interpretability.

Gradio is a python framework that we used for

creating an easy-to-use interface for the user to input

text and get real time predictions along with clear

feedback. The system was designed for response

time, allowing low-latency predictions on different

types of hardware using the lightweight architecture

offered by DistilBERT. Ethical considerations

focused on the limitations of synthetic training data

and that the tool was not a diagnostic, but rather a

supportive measure. In doing so, it illustrates an

effective framework for making machine learning

accessible, usable and ethically sound in practical

situations.

5 RESULTS AND DISCUSSION

The classification system based on DistilBERT

yielded a perfect accuracy of 97% shown in Table 1.

on the test dataset used, supporting both the

implementation of synthetic data and pre-processing

techniques. By using a balanced dataset with fine-

grained depressive profiles (such as addiction and

suicidal ideation), the model was able to generalize

better amongst unseen examples. Demonstration of

real-time prediction via a Gradio-powered interface

providing immediate classification along with

confidence score made the entire system useful and

available.

Although the results underscore the robustness of

the model, however, its limitations include an

overdependence on synthetic data that does capture

the full complexities of real world language use.

Future work could include anonymized, real-world

data and extend the system to multilingual and

culturally diverse contexts. These findings validate

the scalability, efficiency and real-world utility of this

model to fill the gap between technology and mental

health care, offering an accessible tool for early

detection and intervention.

Table 1: Performance and configuration metrics.

Metric Value

Accuracy 97%

Batch size 32

Learning rate 3×10

−5

Training epoch 1

Model size 66M Parameters

Inference speed

Optimized for real-

time

5.1 Future Enhancements

The approach proposed for classifying depressive and

non-depressive classified language with DistilBERT

while exceptionally accurate and practical in its

current state has quite a few areas to build into the

applied field of practice. This would enable the model

to generalise better and tailor it towards a wider range

of data by incorporating anonymized real-world data

alongside the synthetic dataset used in training.

Adapting the system to allow for multilingualism

would account for the linguistic diversity of its

worldwide audience, and incorporating cultural

context could improve its responsiveness to local

expressions of mental health. Exploring multimodal

data, combining this with voice input or physiological

signals could also contribute to better models as well.

These adaptive learning techniques can help the

system to adapt itself, evolve over time and be

relevant. Lastly, I believe that we should implement

the system into a real-world framework like mental

health apps or support platforms to gain valuable

insight from users and develop iteratively, turning

this mechanism into an invaluable resource for

monitoring their mental state outside of professionals

to enable quicker interventions.

6 CONCLUSIONS

This study successfully [highlighted how we can take

advantage of using DistilBERT, lightweight

transformer model, to accurately classify depression

Leveraging DistilBERT for Predictive Analytics and Insights in Mental Health Disorder Classiﬁcation

815

and non-depression language as mental health

monitoring is one of a significant demand that need

such approach] Generating synthetic variety and a

robust training/inference pipeline lead to an end-

results accuracy of 97% on the task, substantiating the

approach taken. The model being deployed via an

intuitive Gradio interface further strengthens the

practical applicability of the solution to the real

world, as fairly low-latency predictions can be

obtained using a user-friendly web interface with high

interpretability. The findings are encouraging, yet the

authors caution that synthetic data has its limits and

call for multilingual and cross-cultural use cases to be

considered. If successful, future development of real-

world data and multimodal inputs will further

enhance the robustness and applicability of the

system. By developing robust models for screening

clinical populations, this work helps connect the

advancements we have made in NLP to the real-world

needs of practitioners dealing with mental health

challenges.

Figure 2: Prediction output.

Figure 2. shows the predictions outputted over a few

instances. The chart compares and contrast the

prediction for depressive (1) versus non-depressive

(0) classifications, showing how the model groups

text data in a sequence of predictions.

REFERENCES

E. Choudhury and S. Asan, "Role of Artificial Intelligence

in Patient Safety Outcomes: Systematic Literature

Review," JMIR Medical Informatics, vol. 8, no. 7, p.

e18599, 2020.

H. Salimi, N. Alavi, S. Farhadpoor, and A. Ghorbani,

"Machine Learning Models for Predicting Suicidal

Ideation," IEEE Access, vol. 8, pp. 136701–136710,

2020.

H. Rasheed et al., "Emotion Detection from Text Using

Deep Learning Techniques: A Review," IEEE Access,

vol. 9, pp. 33277–33293, 2021.

H. A. S. Jelodar et al., "Natural Language Processing-Based

Prediction of Mental Health Disorders Using Social

Media Data," IEEE Transactions on Affective

Computing, vol. 13, no. 3, pp. 1150–1159, Jul.-Sep.

2022.

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova,

"BERT: Pre-training of Deep Bidirectional

Transformers for Language Understanding," in

Proceedings of the 2019 Conference of the North

American Chapter of the Association for

Computational Linguistics: Human Language

Technologies, 2019, pp. 4171–4186. [Online].

J. Li, R. Luo, S. Zhang, Z. Wang, and Y. Xue, "Towards

Building a Depression Diagnosis System Using Social

Media Data," in Proceedings of the 2021 IEEE

International Conference on Data Mining (ICDM), pp.

1238–1243.

J. D. Power and T. Mitchell, "Real-Time Prediction of

Mental Health Disorders from Behavioral Data," IEEE

Transactions on Computational Intelligence and AI in

Games, vol. 13, no. 1, pp. 10–21, 2021.

L. Wang et al., "Sentiment Analysis of Depression Data

Using Deep Learning Models," IEEE Transactions on

Computational Social Systems, vol. 8, no. 4, pp. 857–

865, Aug. 2021.

M. Shatte, D. Hutchinson, and P. Teague, "Machine

Learning in Mental Health: A Scoping Review of

Methods and Applications," Psychological Medicine,

vol. 49, no. 9, pp. 1426–1448, 2019.

M. Amin and M. Rasheed, "Deep Learning for Mental

Health Prediction: A Systematic Review," ACM

Transactions on Computing for Healthcare, vol. 2, no.

4, pp. 1–21, 2021.

N. Milne, J. Moyle, S. Kellett, and T. W. Rogerson,

"Application of Natural Language Processing in

Psychiatric Text Analysis," Frontiers in Psychiatry, vol.

12, p. 580877, 2021.

P. Y. Chen, M. Bowring, and D. M. Fatima, "Deep Transfer

Learning for Suicide Detection Using Text," in

Proceedings of the IEEE International Conference on

Data Science and Advanced Analytics (DSAA), 2020,

pp. 478–485.

V. Sanh, L. Debut, J. Chaumond, and T. Wolf,

"DistilBERT, a Distilled Version of BERT: Smaller,

Faster, Cheaper and Lighter," arXiv preprint

arXiv:1910.01108, 2019.

Y. Zhang, Y. Ma, and H. Liu, "Sentiment-Based Analysis

for Mental Health Applications Using Transformer

Models," IEEE Access, vol. 10, pp. 15468–15479,

2022.

Z. Huang et al., "Applications of Natural Language

Processing in Clinical Psychiatry Research: A

Systematic Review," Journal of Medical Internet

Research, vol. 23, no. 10, p. e28465, 2021.

ICRDICCT‘25 2025 - INTERNATIONAL CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION,

COMMUNICATION, AND COMPUTING TECHNOLOGIES

816