AI-Supported Diagnostic of Depression Using Clinical Interviews: A Pilot Study

Bakir Hadžić 1,a, Julia Ohse 2,b, Michael Danner 3,c, Nicolina Peperkorn 2,d, Parvez Mohammed 1,e, Youssef Shiban 2,f and Matthias Rätsch 1,g

1 ViSiR, Reutlingen University, Reutlingen, Germany
2 Private University of Applied Sciences, Göttingen, Germany
3 CVSSP, University of Surrey, Guildford, U.K.

a https://orcid.org/0009-0003-1197-7255
b https://orcid.org/0009-0005-3344-4753
c https://orcid.org/0000-0002-8652-6905
d https://orcid.org/0009-0008-9481-9354
e https://orcid.org/0009-0001-7448-7857
f https://orcid.org/0000-0002-6281-0901
g https://orcid.org/0000-0002-8254-8293

Both authors contributed equally
Keywords:
Deep Learning, Depression Diagnostics, Mental Health, NLP, BERT, GPT.
Abstract:
In the face of rising depression rates, the need for early and accurate diagnosis has never been
greater. Traditional diagnostic methods, while invaluable, can be limited in access and sus-
ceptible to biases, potentially leading to underdiagnosis. This paper explores the innovative potential of AI
technology, specifically machine learning, as a diagnostic tool for depression. Drawing from prior research,
we note the success of machine learning in discerning depression indicators on social media platforms and
through automated interviews. A particular focus is given to the BERT-based NLP transformer model, previ-
ously shown to be effective in detecting depression from simulated interview data. Our study assessed this
model’s capability to identify depression from transcribed, semi-structured clinical interviews within a general
population sample. While the BERT model displayed an accuracy of 0.71, it was surpassed by an untrained
GPT-3.5 model, which achieved an impressive accuracy of 0.88. These findings emphasise the transformative
potential of NLP transformer models in the realm of depression detection. However, given the relatively small
dataset (N = 17) utilised, we advise a measured interpretation of the results. This paper is designed as a pilot
study; follow-up studies will incorporate larger datasets.
1 INTRODUCTION
Depression is a common mental disorder affecting
around 280 million people worldwide (point preva-
lence in 2019), according to the Institute for Health
Metrics and Evaluation (2023). While a survey car-
ried out between 2007 and 2014 found a depression
prevalence of 15.08% among US adults (Liu et al., 2017),
a meta-analysis found a pooled prevalence of 25%
during the COVID-19 outbreak (Bueno-Notivol et al.,
2021). Early identification and intervention are essen-
tial since prolonged untreated depression is associated
with poorer treatment results (Kraus et al., 2019). Yet,
many people suffering from depression remain with-
out a diagnosis (Berner et al., 2008; Jacob et al., 2012;
Jacobi et al., 2002) or show unmet needs regarding
mental health treatment (Coley and Baum, 2022).
1.1 Diagnostic Methods
Depression is traditionally diagnosed by deploying
self-report measures like questionnaires (e.g. PHQ-9,
HADS, BDI-II) (Smarr and Keefer, 2011) or in semi-
structured clinical interviews (e.g. SCID (First, 2014),
MINI (Sheehan et al., 1998)) and/or by external rating
scales (e.g. HRSD (Hamilton, 1960)). Research ex-
plores the diagnostic utility of physiological sensory
markers (e.g. heart rate (Roh et al., 2014), sleep phys-
iological data (Zhang et al., 2020) or skin conduc-
tance (Smith et al., 2020)) and/or movement data from
phones and wearables (Roberts et al., 2018; Torous
et al., 2015). Traditional methods for depression di-
agnostics, as well as emergent ones, are not without
their challenges and limitations. One limitation of
traditional diagnostic methods is availability: they
require access to standardised questionnaires/scales
and trained clinicians, potentially limiting their utility
in resource-constrained settings (Thom et al., 2015).
1.1.1 Emergent Technologies
Promising tools for overcoming the difficulties of tra-
ditional diagnostic procedures have emerged over the
past decade from the field of machine learning. Ac-
cording to the literature review by (Zhang et al., 2022),
research on NLP-driven detection of mental health is-
sues has risen rapidly over the past decade, which
points to a promising future for automated mental
health screening. This applies in particular to the de-
tection of depression and suicidality, with these two
areas covering more than 60% of research papers on
the use of NLP in mental health issue detection. Growth
has been especially apparent since the release of BERT
(Bidirectional Encoder Representations from Trans-
formers) by (Devlin et al., 2019), whose groundbreak-
ing architecture significantly advanced the field of nat-
ural language processing (NLP). Since then, many re-
searchers have employed the BERT architecture for the
detection of mental health issues (Villatoro-Tello et al.,
2021; Rodrigues Makiuchi et al., 2019; Senn et al.,
2022). In the near future, it is reasonable to anticipate
an even greater research focus on NLP, considering the
recent release of the GPT models by OpenAI, which
have generated significant enthusiasm and attention
among both scholars and practitioners.
1.1.2 The Potential of Machine Learning
Contemporary approaches aim at deploying artificial
intelligence/machine learning systems for depression
diagnostics to overcome the challenges and limita-
tions of traditional diagnostic procedures and to ex-
ploit the potential of these emergent technologies. In
particular, the detection of depression from social net-
work data has been of scientific interest, with studies
such as (Islam et al., 2018; Deshpande and Rao, 2017)
achieving accuracies between 60% and 80% in detect-
ing depression-indicative Facebook and Twitter posts
and comments by analysing emotional processes, lin-
guistic styles and temporal processes. However, one
pertinent aspect is worth highlighting: in these cases,
the ground truth was "depression-suggestive com-
ments", a classification that may not correspond to
psychological diagnostic criteria for depression. In
contrast to the previously men-
tioned studies, (Victor et al., 2019) used a validated
self-report depression scale (PHQ-9 (Kroenke et al.,
2001)) as ground truth for their Artificial intelligence
Mental Evaluation (AiME). In a multi-channel ap-
proach, facial expressions, tone of voice and vocabu-
lary used during the interview were used to infer a de-
pressive state. A specificity of 87.77% and sensitivity
of 86.81% was reached. However, other studies have
shown that a multichannel approach might not even
be necessary, as more economical models have demon-
strated sufficient accuracy: (Shin et al., 2022) showed
that extracting textual markers from a clinical inter-
view (in their case the MINI (Sheehan et al., 1998))
with a Naive Bayes classifier can yield adequate re-
sults, reaching a sensitivity of 69.9% and a specificity
of 96.4% at an accuracy of 83.1%.
This pilot study is driven by the research ques-
tion of whether a speech-to-text approach utilising
valid clinical interviews can effectively diagnose in-
dividuals experiencing unipolar depression within the
broader population. In our preliminary study (Dan-
ner et al., 2023), a BERT model was fine-tuned on
clinical interview data related to depression. There,
the fine-tuned BERT classifier and a zero-shot-prompt-
ed GPT-3.5 model were compared, and the results
demonstrated that GPT-3.5, even without any fine-
tuning, significantly outperforms the BERT classifier.
The present study aims to verify whether this effect
persists when a truthful-respondent dataset from the
general population, obtained through clinical inter-
views conducted by human interviewers, is used. Ad-
ditionally, it compares benchmark results against other
approaches and evaluates the efficacy of this newly
introduced dataset.
2 MATERIALS AND METHODS
2.1 Materials
2.1.1 Clinical Interview for Depression
The GRID-HAMD (Williams et al., 2008) is an upgraded
version of the Hamilton Depression Rating Scale
(HDRS) (Hamilton, 1960), consisting of three com-
ponents: the GRID scoring system, the manual of
scoring conventions and a semi-structured interview
guide. GRID-HAMD-17 refers to the 17-item version
of the GRID-HAMD, in which emotional symptoms are
addressed. Responses provided by participants dur-
ing the GRID-HAMD-17 interview are scored on two
dimensions, with a) intensity making up the vertical
axis and b) frequency making up the horizontal axis.
These two dimensions comprise the scoring grid, in
which each intersection is attributed a numerical value
contributing to the overall score. A validation study
of the GRID-HAMD (Williams et al., 2008) showed
high internal consistency (Cronbach's α = 0.78) and
high interrater reliability (ICC = 0.94). In our experi-
ment, the German version of the GRID-HAMD-17 was
deployed (Schmitt et al., 2015).
2.1.2 Depression Screening
Participants' subjective levels of depression were
evaluated using a German-language version of the
9-item depression scale of the Patient Health Ques-
tionnaire (PHQ-9, (Kroenke et al., 2001; Gräfe et al.,
2004)), provided as an online questionnaire. The 9
items address different depressive symptoms and are
scored on a 4-point Likert scale ranging from 0 = "not
at all" to 3 = "nearly every day". The PHQ-9 has high
reliability (Cronbach's α = 0.88 (Gräfe et al., 2004))
and has been validated in multiple studies (Kroenke
et al., 2001; Gräfe et al., 2004). Total scores range
from 0 to 27, with higher values indicating a higher
frequency of depressive symptoms. The PHQ-8 refers
to the same questionnaire with item 9 (suicidal ide-
ation) omitted (Kroenke et al., 2009). It is mainly used
for research purposes and served as the ground truth
for fine-tuning the BERT-based model.
2.1.3 Transcription and Translation Tool
The initial stage of data analysis was transcribing and
translating the recordings into English to achieve
compatibility with the dataset used for model fine-
tuning. For this task, we employed the speech-to-
text model Whisper (OpenAI, 2022), developed by
OpenAI. Its primary purpose is to turn spoken words
from audio recordings into text, while allowing users
to simultaneously translate that text into another lan-
guage of their choice.
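
As an illustration, the core of this step can be sketched in a few lines of Python using the openai-whisper package; the audio file name below is a hypothetical placeholder, not part of our pipeline.

```python
import whisper

# Load the pre-trained Whisper checkpoint; large-v2 is the model
# version used in this study (see Section 2.3).
model = whisper.load_model("large-v2")

# task="translate" makes Whisper transcribe the German audio and
# translate it into English in a single pass.
# "interview_01.wav" is a hypothetical file name.
result = model.transcribe("interview_01.wav", task="translate")

english_transcript = result["text"]  # the English transcript as one string
```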
2.1.4 Training Dataset
To detect depression utilising deep-learning meth-
ods, commonly used training datasets among re-
searchers worldwide are DAIC-WOZ and EXTENDED
DAIC, both released by the University of South-
ern California Institute for Creative Technologies
(USC ICT) as part of the Distress Analysis Interview
Corpus (DAIC) database (Gratch et al., 2014). The
DAIC-WOZ dataset consists of 193 audio and video
recordings collected using the Wizard-of-Oz inter-
view data collection method. In the publicly pub-
lished dataset, the authors used PHQ-8 questionnaire
scores (Kroenke et al., 2009) as ground truth values.
Except for the interview procedure, the EXTENDED
DAIC (Gratch et al., 2014) is identical to the DAIC-
WOZ dataset: in the EXTENDED DAIC, the 263 inter-
views were conducted completely autonomously by
an automated virtual agent following a predefined set
of instructions.
2.2 Text Classification Models
2.2.1 BERT
Bidirectional Encoder Representations from Trans-
formers (BERT) is a transformer-based language
model introduced by researchers at Google. Since
its introduction, it has been used in a wide range of
NLP tasks. BERT is a pre-trained model, trained on
a large corpus of text from Wikipedia and the Books-
Corpus (Devlin et al., 2019). Unlike context-free en-
coders such as word2vec or GloVe, which generate a
single embedding representing each word, pre-trained
BERT models generate embeddings that take the con-
text of the word into account, and hence are more ro-
bust, as words tend to have different meanings in dif-
ferent contexts. These pre-trained models can easily
be used in a classifier network for text classification.
This process is called fine-tuning. In our setup, the
classifier model is the sole component undergoing
training during this process, while the BERT model
remains unaltered.
Training. All hyperparameters of our BERT classi-
fier are listed in Table 1. Owing to GPU memory
requirements, the BERT base model was used, and the
classifier was trained on an NVIDIA GEFORCE RTX
3060 GPU. The problem is formulated as a multi-class
classification task, with the final label being the class
with the highest probability score. These probabilities
can then be used to calculate confidence scores.
Table 1: Hyperparameter optimisation results through grid search.

Model                 BERT-Base uncased
Architecture          12-layer, 768-hidden, 12-heads
Parameters            110 M
Embedding length      27
Optimizer             AdamW
Learning rate         3 × 10⁻⁵
Dropout (hidden)      0.3
Dropout (attention)   0.5
Weight decay          4 × 10⁻²
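
To make the setup concrete, the sketch below wires a frozen BERT-Base encoder to a small trainable classification head using the Hugging Face transformers library, with the hyperparameters from Table 1. It is a minimal illustration under stated assumptions (a single linear head; our exact head architecture and training loop are not reproduced here).

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class DepressionClassifier(nn.Module):
    """Frozen BERT-Base encoder with a trainable classification head."""

    def __init__(self, n_classes: int = 2):
        super().__init__()
        # Dropout values from Table 1 are passed through to the config.
        self.bert = BertModel.from_pretrained(
            "bert-base-uncased",
            hidden_dropout_prob=0.3,
            attention_probs_dropout_prob=0.5,
        )
        # Only the head is trained; the BERT model remains unaltered.
        for param in self.bert.parameters():
            param.requires_grad = False
        self.classifier = nn.Linear(self.bert.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.classifier(outputs.pooler_output)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = DepressionClassifier()

# AdamW with the learning rate and weight decay from Table 1,
# applied only to the trainable head.
optimizer = torch.optim.AdamW(
    model.classifier.parameters(), lr=3e-5, weight_decay=4e-2
)

# Each 25-word fraction is padded/truncated to 27 tokens
# (the embedding length in Table 1); the sentence is a made-up example.
batch = tokenizer(
    ["i have been feeling very tired and hopeless lately"],
    padding="max_length", truncation=True, max_length=27, return_tensors="pt",
)
logits = model(batch["input_ids"], batch["attention_mask"])
probs = torch.softmax(logits, dim=-1)  # per-class probability scores
```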
2.2.2 GPT-3.5-Turbo
Generative Pre-trained Transformer 3.5 (GPT-3.5) is
a subclass of the GPT family of large language mod-
els released by OpenAI. GPT-3.5 turbo is the most
capable model in this subclass and is optimised for
chat completion at one tenth of the cost of its prede-
cessors. It is trained on data up to September 2021
and has almost double the context length of the GPT-3
base models. However, unlike its predecessors, GPT-
3.5 turbo was not available for fine-tuning at the time
of this study. Therefore, explicit prompt engineering
is required to obtain only the PHQ-8 scores from the
model as a text completion. Consequently, no training
is done, and the test set is evaluated directly. The ini-
tial prompt that we used is

Give me one score (0 to 24 like the PHQ-8) for the
whole interview. Only look at the Participant's an-
swers. 0 is no depression, and 24 is severe depression.
If the Score is greater or equal to 10, the Participant
is classified as depressive. Only give me the score as
an INTEGER WITHOUT EXPLANATION!

The preprocessed texts are then supplied as sub-
sequent prompts to obtain the PHQ-8 scores from the
model.
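
A minimal sketch of this zero-shot scoring step, using the openai Python package in its 2023-era (pre-v1) ChatCompletion interface; the system prompt is the one quoted above, while the temperature setting and the key placeholder are our assumptions for illustration.

```python
import openai

openai.api_key = "sk-..."  # placeholder key

SYSTEM_PROMPT = (
    "Give me one score (0 to 24 like the PHQ-8) for the whole interview. "
    "Only look at the Participant's answers. 0 is no depression, and 24 is "
    "severe depression. If the Score is greater or equal to 10, the "
    "Participant is classified as depressive. Only give me the score as an "
    "INTEGER WITHOUT EXPLANATION!"
)

def score_interview(transcript: str) -> int:
    """Return the model's PHQ-8-style score for one interview transcript."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,  # assumption: deterministic output for scoring
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": transcript},
        ],
    )
    return int(response["choices"][0]["message"]["content"].strip())

# A returned score >= 10 classifies the participant as depressive.
```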
2.3 Data Collection and Transformation
Adults aged 18 and over were recruited from a
convenience sample of the German population be-
tween May 2023 and July 2023 via social media
and the Board for Science Course Credits at the
PFH Göttingen. Demographics such as gender, age,
highest educational degree and literacy in the Ger-
man language were assessed via an online question-
naire. Consent to participate in the study and to
voice recording, in accordance with the Declaration
of Helsinki, was obtained from all participants. The
research procedure was approved by the Ethics Com-
mittee of the PFH Göttingen (OS 18 200423). Af-
ter being fully informed about the purpose of
the study, participants were asked to fill in their de-
mographic data and complete the PHQ-9 (Kroenke
et al., 2001) in an online questionnaire. Interviews
were conducted by psychology students at the PFH
Göttingen, who received expert training from a clini-
cal psychologist supervised by a psychotherapist. In-
terviews were conducted via a secure video hub, and
only audio data was recorded. Participants were inter-
viewed using the GRID-HAMD-17 manual, with inter-
viewers scoring their responses following the GRID
scoring system (Schmitt et al., 2015). The partici-
pants were notified of their PHQ-9 score upon com-
pletion of the interview. They were also given a dig-
ital leaflet conveying the general classification of PHQ-
9 scores and details about the resources available
to individuals experiencing depressive symptoms.
As mentioned in Section 2.1.3, the interviews were
translated and transcribed via the pre-trained Whisper
model large-v2. During data pre-processing, the
interviewer's transcribed text was removed and the
participant's transcribed text was saved as a string.
Since BERT can process a maximum of 512 tokens,
we divided the transcribed text into fractions of 25
words each for this model, as sketched below. In the
remainder of this paper, this dataset is referred to
as the KID dataset.
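
A minimal sketch of this chunking step (a hypothetical helper, not our exact pre-processing code):

```python
def split_into_fractions(text: str, words_per_fraction: int = 25) -> list[str]:
    """Split a participant's transcript into 25-word fractions so that each
    fraction stays well below BERT's 512-token input limit."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_fraction])
        for i in range(0, len(words), words_per_fraction)
    ]
```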
2.4 Data Analysis
2.4.1 Scoring of PHQ-8/PHQ-9
To determine the PHQ-9 score, the values assigned
to each item response were added together. As the
BERT-based model was trained using PHQ-8 data as
the reference standard, PHQ-8 scores were obtained
for each participant by deducting the score of item 9
from the PHQ-9 score.
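
The scoring itself is simple arithmetic; a short sketch (hypothetical helper names):

```python
def phq9_total(item_scores: list[int]) -> int:
    """Sum the nine item responses (each scored 0-3); totals range 0-27."""
    assert len(item_scores) == 9 and all(0 <= s <= 3 for s in item_scores)
    return sum(item_scores)

def phq8_from_phq9(item_scores: list[int]) -> int:
    """PHQ-8 score: the PHQ-9 total minus item 9 (suicidal ideation)."""
    return phq9_total(item_scores) - item_scores[8]
```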
2.4.2 Classification Effectiveness Metrics
The metrics used to assess classification effectiveness
in machine learning and data analysis are accuracy,
precision, recall and F1 score. According to (Juba
and Le, 2019), precision is the ratio of accurate posi-
tive predictions to all of the classifier's positive pre-
dictions; simply put, the precision score indicates the
proportion of predicted positives that are true posi-
tives. Recall (also known as sensitivity) assesses the
model's capacity to discover every positive case in the
dataset (Grandini et al., 2020). Accuracy evaluates
how well the classifier performs across all cases in the
dataset, considering both true positive and true nega-
tive predictions. The F1 score combines recall and
precision into one number, providing an estimate that
emphasises each metric's equal importance. Precision
and recall are particularly important in binary classi-
fication tasks with imbalanced datasets, where the F1
score is the most commonly used summary statistic
(Juba and Le, 2019).
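
All four metrics can be computed with scikit-learn; a small sketch on hypothetical labels:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [1, 0, 0, 1, 0, 1]  # hypothetical ground truth (1 = depressed)
y_pred = [1, 0, 1, 1, 0, 0]  # hypothetical model predictions

print(f"accuracy:  {accuracy_score(y_true, y_pred):.2f}")
print(f"precision: {precision_score(y_true, y_pred):.2f}")
print(f"recall:    {recall_score(y_true, y_pred):.2f}")
print(f"F1 score:  {f1_score(y_true, y_pred):.2f}")
```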
3 RESULTS
Descriptive Results. The preliminary sample con-
sisted of N = 17 participants drawn from a conve-
nience sample of the German population. Of these,
n = 5 indicated male as their gender, n = 11 indi-
cated female, and n = 1 person indicated diverse.
The mean age was M = 29.24 years (SD = 9.38),
with the youngest participant being 19 and the oldest
40 years old. All participants identified as German
native speakers. The highest educational degree var-
ied, with n = 1 person naming an intermediate school-
leaving qualification (5.88%), n = 9 a general univer-
sity entrance qualification (52.94%), n = 2 a Bache-
lor's degree (11.76%), and n = 5 a Master's degree or
diploma (29.41%). The average PHQ-9 score was
M = 7.59 (SD = 5.50), with scores ranging from
min = 1 to max = 22. The average PHQ-8 score was
M = 7.29 (SD = 4.97), with scores likewise ranging
from min = 1 to max = 22. Considering the criterion
of a PHQ-8 score ≥ 10 for depression, n = 4 of the sub-
jects (23.53%) were categorised as depressed, while
n = 13 subjects (76.47%) were categorised as not de-
pressed.
Model Performance Metrics. The performance met-
rics of the developed models are presented in Table 2.
Two models, BERT and GPT-3.5, are evaluated and
compared on standardised metrics: accuracy, preci-
sion, recall and F1 score. Both models delivered sat-
isfactory performance on the given metrics. When
the two are compared, however, GPT-3.5 displayed
superior performance, obtaining a precision of 0.92,
an accuracy of 0.88, a recall of 0.88 and an F1 score
of 0.89, demonstrating excellent overall performance
in the detection of depression. This high F1 score in-
dicates a satisfactory balance between reducing false
positives and increasing the number of correctly de-
tected cases. This balance is especially important for
a model predicting outcomes as sensitive as depres-
sion, where both false negative and false positive pre-
dictions can have very serious consequences. Our
BERT model is consistent with prior work, whilst our
GPT-3.5 model outperforms all competitors on the
DAIC-WOZ benchmark. On the KID dataset, our out-
comes demonstrate a marked improvement over the
prior experiment, suggesting that our models perform
better when analysing human-conducted interviews.
Besides the previously introduced metrics of accu-
racy, precision, recall and F1 score, we also analysed
our models' effectiveness using the Receiver Oper-
ating Characteristic (ROC) curve. The ROC curve
gives healthcare professionals a useful tool to assess
and contrast the effectiveness of various diagnostic
methods or tests. The area under the ROC curve (AUC)
is a commonly used indicator of a test's overall ac-
curacy; an AUC of 0.5 implies no greater accuracy
than may be attributed to randomness, while a value
of 1 represents perfect diagnostic accuracy (Kim and
Hwang, 2020). Figure 1 illustrates the experimental
outcomes obtained on the DAIC-WOZ dataset using
the GPT-3.5 turbo API and ChatGPT-4 prompting.
During the early stage of the study, we also experi-
mented with ChatGPT-4 to see what results we could
obtain. It is important to note that at that time we did
not yet have access to the GPT-4 API and were there-
fore limited to prompting via ChatGPT-4. Due to this
restriction, we decided not to report these results on
the other performance metrics. Figure 1 also high-
lights the remarkable performance achieved on the
KID dataset, showcasing its exceptionally high
efficacy.
Figure 1: ROC Curve for GPT-3.5 and ChatGPT 4 on the
DAIC-WOZ and KID datasets.
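
For reference, an ROC curve and its AUC are derived from the models' continuous scores (e.g. predicted PHQ-8 values) rather than from binary labels; a sketch with scikit-learn on hypothetical data:

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [1, 0, 0, 1, 0, 1]     # hypothetical ground-truth labels
scores = [14, 3, 11, 18, 5, 9]  # hypothetical predicted PHQ-8 scores

fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)  # 0.5 = chance level, 1.0 = perfect
```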
4 DISCUSSION
The main objective of this study was to assess how
well NLP-based models can identify depression in
transcribed clinical interviews (GRID-HAMD-17).
Our results showed that a BERT-based model, trained
on clinical interview data from the DAIC-WOZ
and EXTENDED DAIC datasets, performed adequately
with an accuracy of 0.71. Notably, it was out-
performed by an untrained GPT-3.5 model, which
showed a higher accuracy of 0.88. Compared to our
previous research (Danner et al., 2023), the BERT-
based model showed superior performance on the
truthful-respondent dataset in comparison to the sim-
ulation dataset (accuracy 0.43) and even surpassed
its prior performance on the DAIC-WOZ dataset (ac-
curacy 0.64). The superior performance of GPT-3.5
could also be replicated for the truthful-respondent
dataset drawn from the general population. A possi-
ble explanation could be the different nature of the in-
teraction: a human interviewer in the KID dataset ver-
sus the virtual avatar Ellie in the DAIC. Further stud-
ies are encouraged to explore the nature of human-
computer interaction in the domain
of clinical interviews. Here, it is important to em-
phasise that our approach in its current form should
only be used as a supporting tool for mental health
experts in the diagnostic process. The results of our
models must not be considered a final diagnosis; this
step has to be taken by trained professionals. Nor is
this approach intended to be used solely in clinical
practice: considering its easy accessibility and ease
of use, it is intended to be utilised in the everyday
environment of potential users as a tool for early de-
tection. Lastly, it is important to acknowledge that
the difference in results could potentially be at-
tributed to the small number of depressed individuals
in our sample. Further studies should shed light on
this possibility.

Table 2: Performance metrics.

Work                                    Accuracy  Precision  Recall  F1 score
DAIC-WOZ Dataset
(Villatoro-Tello et al., 2021) (BERT)   –         0.59       0.59    0.59
(Senn et al., 2022) (BERT)              –         –          –       0.60
Ours (Danner et al., 2023) (BERT)       –         0.63       0.66    0.64
Ours (Danner et al., 2023) (GPT-3.5)    –         0.78       0.79    0.78
KID Dataset
Ours (Current study, BERT)              0.71      0.69       0.71    0.70
Ours (Current study, GPT-3.5)           0.88      0.92       0.88    0.89
4.1 Limitations
The interpretability of the results is limited by the
small overall sample size (N = 17), especially given
that only n = 4 participants reached the threshold of
a PHQ-8 score ≥ 10 and were therefore classified as
depressive. Larger sample sizes are required to make
inferences and test specific hypotheses. It should also
be remarked that depression detection models are to
be seen as mere screening methods, which cannot yet
provide diagnoses. The diagnosis of a disorder like
unipolar depression requires direct interaction be-
tween patient and clinician, especially since these
models cannot yet provide differential diagnoses.
However, differential diagnostics are crucial, for ex-
ample, due to the proximity of depressive symptoms
to symptoms of other disorders, such as the prodro-
mal phase of schizophrenia (Häfner, 2005). Despite
this crucial limitation, screening systems might en-
courage people to seek support and therefore receive
a professional diagnosis and early intervention.
4.2 Future Directions
In the future, we aim to significantly increase our
sample size to test our model more rigorously, in-
cluding in a clinical environment. One possible di-
rection is to evaluate its sensitivity in detecting score
changes over time. This would give our approach
an insightful, real-world dimension and enable us to
assess how well our model performs in an environ-
ment that closely corresponds to its hypothetical clin-
ical use. As with any new technological system, us-
ability and acceptance studies of such a model among
potential users and clinicians are recommended. A
more comprehensive analysis integrating audio and
video data should be considered in future work. To
address data protection, privacy and ethical issues,
we are planning, as next steps, to utilise open-source
GPT-like models that are locally executable, config-
urable, more transparent and open to the broader sci-
entific community.
4.3 Implications for Clinical Practice
If the results can be replicated in a larger sample, it
would validate the potential for AI to aid clinicians
in diagnosing depression. Furthermore, new possi-
bilities for depression screening in the home could
emerge if the interview process could be automated.
Diagnosis via AI systems could be a first indicator
for people suffering from depression, especially in
rural areas with a low density of clinicians (Douthit
et al., 2015; Poß-Doering et al., 2021; Thom et al.,
2015). Implementing AI systems for depression de-
tection additionally provides particular advantages in
dealing with the prevalent societal stigma associated
with mental health issues. Stigma is very often the
reason why individuals with specific mental health is-
sues are discouraged from seeking help or speaking
about their problems (Brower, 2021). Future AI sys-
tems for depression detection could provide a stigma-
free, ethical, safe and unbiased tool for depression
screening.
5 CONCLUSION
The prevalence of depression within the general pop-
ulace, coupled with the imperative of prompt inter-
vention, underscores the necessity of early diagnosis.
While conventional screening and diagnostic tech-
niques encounter several impediments, contemporary
AI systems have the capacity to surmount these chal-
lenges. Initial outcomes highlight the promise of NLP
in discerning depression from transcribed clinical in-
terview data. However, it is crucial to note that the
outlined method is designed as an early detection tool
to be accessible to a broader audience from the com-
fort of home, rather than as a diagnostic tool in a clin-
ical setting, which would necessitate additional certi-
fications. This pilot study stands out as a unique en-
deavour, achieving promising, state-of-the-art results
on a small dataset. However, substantiating these
findings requires additional data to strengthen the
robustness of the study. In the future, these results
should be compared with the latest models; such a
comparison is currently hindered by a gap in the lit-
erature on approaches that follow the latest develop-
ment trends.
ACKNOWLEDGEMENTS
This work is partially funded by VwV Invest BW - In-
novation II funding program. Project funding num-
ber: BW1 4056/03
REFERENCES
Berner, M. M., Kriston, L., Sitta, P., and Härter, M. (2008).
Treatment of depressive symptoms and attitudes to-
wards treatment options in a representative German
general population sample. International Journal of
Psychiatry in Clinical Practice, 12(1):5–10.
Brower, K. J. (2021). Professional stigma of mental health
issues: physicians are both the cause and solution.
Academic medicine, 96(5):635.
Bueno-Notivol, J., Gracia-García, P., Olaya, B., Lasheras,
I., López-Antón, R., and Santabárbara, J. (2021).
Prevalence of depression during the COVID-19 out-
break: A meta-analysis of community-based studies.
International Journal of Clinical and Health Psychol-
ogy, 21(1):100196.
Coley, R. L. and Baum, C. F. (2022). Trends in mental
health symptoms, service use, and unmet need for ser-
vices among US adults through the first 8 months of
the COVID-19 pandemic. Translational Behavioral
Medicine, 12(2):273–283.
Danner, M., Hadzic, B., Gerhardt, S., Ludwig, S., Uslu,
I., Shao, P., Weber, T., Shiban, Y., and Rätsch, M.
(2023). Advancing mental health diagnostics: GPT-
based method for depression detection. In Proceed-
ings of the 62nd Annual Conference of the Society of
Instrument and Control Engineers of Japan (SICE),
pages 1290–1296, Tsu City, Japan.
Deshpande, M. and Rao, V. (2017). Depression detection
using emotion artificial intelligence. In 2017 inter-
national conference on intelligent sustainable systems
(iciss), pages 858–862. IEEE.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2019). BERT: Pre-training of deep bidirectional
transformers for language understanding.
Douthit, N., Kiv, S., Dwolatzky, T., and Biswas, S. (2015).
Exposing some important barriers to health care ac-
cess in the rural USA. Public Health, 129(6):611–620.
First, M. B. (2014). Structured clinical interview for the
dsm (SCID). The encyclopedia of clinical psychology,
pages 1–6.
Gräfe, K., Zipfel, S., Herzog, W., and Löwe, B. (2004).
Screening psychischer Störungen mit dem "Gesund-
heitsfragebogen für Patienten (PHQ-D)". Diagnos-
tica, 50(4):171–181.
Grandini, M., Bagli, E., and Visani, G. (2020). Metrics for
multi-class classification: an overview. arXiv preprint
arXiv:2008.05756.
Gratch, J., Artstein, R., Lucas, G. M., Stratou, G., Scherer,
S., Nazarian, A., Wood, R., Boberg, J., DeVault, D.,
Marsella, S., et al. (2014). The distress analysis in-
terview corpus of human and computer interviews. In
LREC, pages 3123–3128. Reykjavik.
Hamilton, M. (1960). A rating scale for depression. Journal
of neurology, neurosurgery, and psychiatry, 23(1):56.
Häfner, H. (2005). Schizophrenia and depression: Chal-
lenging the paradigm of two separate diseases—a con-
trolled study of schizophrenia, depression and healthy
controls. Schizophrenia Research, 77:11–24.
Institute for Health Metrics and Evaluation (2023).
Global burden of disease study results.
https://vizhub.healthdata.org/gbd-results. Accessed:
10/08/2023.
Islam, M. R., Kabir, M. A., Ahmed, A., Kamal, A. R. M.,
Wang, H., and Ulhaq, A. (2018). Depression detec-
tion from social network data using machine learning
techniques. Health information science and systems,
6:1–12.
Jacob, V., Chattopadhyay, S. K., Sipe, T. A., Thota, A. B.,
Byard, G. J., Chapman, D. P., Force, C. P. S. T., et al.
(2012). Economics of collaborative care for man-
agement of depressive disorders: a community guide
systematic review. American journal of preventive
medicine, 42(5):539–549.
Jacobi, F., Höfler, M., Meister, W., and Wittchen, H. (2002).
Prevalence, detection and prescribing behavior in de-
pressive syndromes: a German federal family physi-
cian study. Der Nervenarzt, 73(7):651–658.
Juba, B. and Le, H. S. (2019). Precision-recall versus accu-
racy and the role of large data sets. In Proceedings of
the AAAI Conference on Artificial Intelligence, vol-
ume 33-01, pages 4039–4048.
Kim, J. and Hwang, I. C. (2020). Drawing guidelines for
receiver operating characteristic curve in preparation
of manuscripts. Journal of Korean medical science,
35(24).
Kraus, C., Kadriu, B., Lanzenberger, R., Zarate Jr, C. A.,
and Kasper, S. (2019). Prognosis and improved out-
comes in major depression: a review. Translational
psychiatry, 9(1):127.
Kroenke, K., Spitzer, R. L., and Williams, J. B. (2001). The
PHQ-9: validity of a brief depression severity mea-
sure. Journal of general internal medicine, 16(9):606–
613.
Kroenke, K., Strine, T. W., Spitzer, R. L., Williams, J. B.,
Berry, J. T., and Mokdad, A. H. (2009). The PHQ-8
as a measure of current depression in the general pop-
ulation. Journal of affective disorders, 114(1-3):163–
173.
Liu, Y., Ozodiegwu, I. D., Yu, Y., Hess, R., and Bie, R.
(2017). An association of health behaviors with de-
pression and metabolic risks: data from 2007 to 2014
US National Health and Nutrition Examination Survey.
Journal of Affective Disorders, 217:190–196.
OpenAI (2022). Whisper. https://github.com/openai/
whisper.
Poß-Doering, R., Hegelow, M., Borchers, M., Hartmann,
M., Kruse, J., Kampling, H., Heuft, G., Spitzer, C.,
Wild, B., Szecsenyi, J., et al. (2021). Evaluating the
structural reform of outpatient psychotherapy in ger-
many (es-rip trial)-a qualitative study of provider per-
spectives. BMC Health Services Research, 21(1):1–
14.
Roberts, L. W., Chan, S., and Torous, J. (2018). New tests,
new tools: mobile and connected technologies in ad-
vancing psychiatric diagnosis. NPJ Digital Medicine,
1(1):20176.
Rodrigues Makiuchi, M., Warnita, T., Uto, K., and Shin-
oda, K. (2019). Multimodal fusion of bert-cnn and
gated cnn representations for depression detection. In
Proceedings of the 9th International on Audio/Visual
Emotion Challenge and Workshop, pages 55–63.
Roh, T., Hong, S., and Yoo, H.-J. (2014). Wearable de-
pression monitoring system with heart-rate variability.
In 2014 36th Annual International Conference of the
IEEE Engineering in Medicine and Biology Society,
pages 562–565. IEEE.
Schmitt, A., Kulzer, B., and Hermanns, N. (2015).
German version of the GRID Hamilton Rat-
ing Scale for Depression (GRID-HAMD). DOI:
10.13140/RG.2.1.3569.0725.
Senn, S., Tlachac, M., Flores, R., and Rundensteiner, E.
(2022). Ensembles of BERT for depression classifica-
tion. In 2022 44th Annual International Conference of
the IEEE Engineering in Medicine & Biology Society
(EMBC), pages 4691–4694.
Sheehan, D. V., Lecrubier, Y., Sheehan, K. H., Amorim, P.,
Janavs, J., Weiller, E., Hergueta, T., Baker, R., Dun-
bar, G. C., et al. (1998). The Mini-International Neu-
ropsychiatric Interview (MINI): the development and
validation of a structured diagnostic psychiatric inter-
view for DSM-IV and ICD-10. Journal of Clinical
Psychiatry, 59(20):22–33.
Shin, D., Kim, K., Lee, S.-B., Lee, C., Bae, Y. S., Cho, W. I.,
Kim, M. J., Hyung Keun Park, C., Chie, E. K., Kim,
N. S., et al. (2022). Detection of depression and sui-
cide risk based on text from clinical interviews using
machine learning: possibility of a new objective diag-
nostic marker. Frontiers in psychiatry, 13:801301.
Smarr, K. L. and Keefer, A. L. (2011). Measures of
depression and depressive symptoms: Beck depres-
sion inventory-II (BDI-II), center for epidemiologic
studies depression scale (CES-D), geriatric depression
scale (GDS), hospital anxiety and depression scale
(HADS), and patient health questionnaire-9 (PHQ-9).
Arthritis care & research, 63(S11):S454–S466.
Smith, L. T., Levita, L., Amico, F., Fagan, J., Yek, J. H.,
Brophy, J., Zhang, H., and Arvaneh, M. (2020). Us-
ing resting state heart rate variability and skin conduc-
tance response to detect depression in adults. In 2020
42nd annual international conference of the IEEE
engineering in medicine & biology society (EMBC),
pages 5004–5007. IEEE.
Thom, J., Bretschneider, J., Müllender, S., Becker, M., and
Jacobi, F. (2015). Regionale Variationen der ambu-
lanten primär- und fachärztlichen Versorgung psychis-
cher Störungen. Die Psychiatrie, 12(04):247–254.
Torous, J., Staples, P., and Onnela, J.-P. (2015). Realizing
the potential of mobile mental health: new methods
for new data in psychiatry. Current psychiatry reports,
17:1–7.
Victor, E., Aghajan, Z. M., Sewart, A. R., and Christian,
R. (2019). Detecting depression using a framework
combining deep multimodal neural networks with a
purpose-built automated evaluation. Psychological
assessment, 31(8):1019.
Villatoro-Tello, E., Ramirez-de-la Rosa, G., Gática-Pérez,
D., Magimai.-Doss, M., and Jiménez-Salazar, H.
(2021). Approximating the mental lexicon from clin-
ical interviews as a support tool for depression detec-
tion. In Proceedings of the 2021 International Con-
ference on Multimodal Interaction, pages 557–566.
Williams, J. B., Kobak, K. A., Bech, P., Engelhardt, N.,
Evans, K., Lipsitz, J., Olin, J., Pearson, J., and Kalali,
A. (2008). The GRID-HAMD: standardization of the
hamilton depression rating scale. International clini-
cal psychopharmacology, 23(3):120–129.
Zhang, B., Zhou, W., Cai, H., Su, Y., Wang, J., Zhang, Z.,
and Lei, T. (2020). Ubiquitous depression detection of
sleep physiological data by using combination learn-
ing and functional networks. IEEE Access, 8:94220–
94235.
Zhang, T., Schoene, A. M., Ji, S., and Ananiadou, S. (2022).
Natural language processing applied to mental illness
detection: a narrative review. NPJ digital medicine,
5(1):46.