Data Augmentation for Reliability and Fairness in Counselling Quality Classification

Vivek Kumar [1,3,a], Simone Balloccu [2,3,b], Zixiu Wu [1,3,c], Ehud Reiter [2,d], Rim Helaoui [3,e], Diego Reforgiato Recupero [1,f] and Daniele Riboni [1,g]

[1] University of Cagliari, Cagliari, Italy

[2] University of Aberdeen, Aberdeen, U.K.

[3] Philips Research, Eindhoven, The Netherlands

Keywords:

AI Fairness, Motivational Interviewing, Counselling, Dialogue, Natural Language Processing, Machine Learning.

Abstract:

The mental health domain poses serious challenges to the validity of existing Natural Language Processing (NLP) approaches. Scarce and unbalanced data limit models’ reliability and fairness, thereby hampering real-world application. In this work, we address these challenges using our recently released Anno-MI dataset, which contains professionally annotated transcriptions of motivational interviewing (MI) sessions. To do so, we inspect the effects of data augmentation on classical machine learning (CML) and deep learning (DL) approaches for counselling quality classification. First, we adopt augmentation to balance the target label in order to improve the classifiers’ reliability. Next, we conduct a bias and fairness analysis, choosing the therapy topic as the sensitive variable. Finally, we implement a fairness-aware augmentation technique, showing how topic-wise bias can be mitigated by augmenting the target label with respect to the sensitive variable. Our work is a first step towards increasing the reliability and reducing the bias of classification models, as well as dealing with data scarcity and imbalance in mental health.

1 INTRODUCTION

Recent advancements in Natural Language Processing (NLP) have captured the interest of the healthcare research community (Kumar et al., 2020b; Dessì et al., 2020; Locke et al., 2021; Kumar et al., 2020a), including mental health and its subdomains such as depression, anxiety or substance abuse (Le Glaz et al., 2021). However, real-world application of clinical NLP is hampered by multiple factors such as domain complexity, rigorous accuracy and reliability standards, and data scarcity (Ibrahim et al., 2021). In addition, recent research has highlighted critical concerns about artificial intelligence (AI) fairness (Chouldechova and Roth, 2020; John-Mathews et al., 2022), which are imperative to address when applying NLP to mental health.

[a] https://orcid.org/0000-0003-3958-4704

[b] https://orcid.org/0000-0002-9812-5092

[c] https://orcid.org/0000-0002-3679-5701

[d] https://orcid.org/0000-0002-7548-9504

[e] https://orcid.org/0000-0001-6915-8920

[f] https://orcid.org/0000-0001-8646-6183

[g] https://orcid.org/0000-0002-0695-2040

As a first step towards addressing these issues, we adopt data augmentation to improve AI reliability and fairness in the context of scarce mental health data. We leverage our recently released dataset Anno-MI (Wu et al., 2022), consisting of professionally annotated therapy transcriptions in MI (Miller and Rollnick, 2012; Rollnick et al., 2008). We model a classification task targeting overall therapy quality, one of Anno-MI’s most unbalanced labels, using each therapist utterance as input. For the fairness analysis, we treat the therapy topic, e.g., “smoking cessation”, “reducing alcohol consumption” or “diabetes management”, as the sensitive variable. We conduct a quantitative analysis of the effects of data augmentation used to balance the target and sensitive variables. Our experimental results show little to no effect on Classical Machine Learning (CML) classifiers, but indicate that Deep Learning (DL) ones benefit from augmented data, showing consistent improvement in both accuracy and reliability. The fairness assessment shows that more work on augmentation is required to properly mitigate potential classification bias.

Kumar, V., Balloccu, S., Wu, Z., Reiter, E., Helaoui, R., Recupero, D. and Riboni, D.

Data Augmentation for Reliability and Fairness in Counselling Quality Classification.

DOI: 10.5220/0011531400003523

In Proceedings of the 1st Workshop on Scarce Data in Artificial Intelligence for Healthcare (SDAIH 2022), pages 23-28

ISBN: 978-989-758-629-3

© 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)

Table 1: The overall distribution of high and low quality therapy utterances.

Dataset        Total utterances (no.)   High quality (%)   Low quality (%)
Anno-MI        2601                     91                 9
Anno-AugMI     5302                     45                 55
Anno-FairMI    9154                     50                 50

Figure 1: Sensitive variable statistics for each dataset. We show the topic-wise (a) utterance distribution and (b) average therapy quality. For brevity, only common topics for each dataset are shown.

2 MATERIAL AND METHODS

Anno-MI (Wu et al., 2022) contains 110 high-quality and 23 low-quality MI conversational dialogues covering a total of 44 topics, e.g., “smoking cessation”, “diabetes management” and “anxiety management”. Therapy quality indicates the therapist’s adherence to “general counseling principles taken from the literature on client-centered counseling” (Pérez-Rosas et al., 2019). The therapy quality distribution in Anno-MI is heavily skewed towards high-quality (HQ-MI) utterances. This is because the conversations that constitute the dataset come from MI training videos, which rarely showcase low-quality (LQ-MI) counselling scenarios. We employ data augmentation to overcome these issues.

Data available at https://github.com/uccollab/AnnoMI

We leverage NL-Augmenter (Dhole et al., 2021) to develop an 11-step augmentation pipeline, in which each step takes one utterance as input. Therefore, for each given utterance, we obtain n ≥ 11 augmentations (since certain augmenters can produce multiple alternatives for the same utterance). The adopted augmentation techniques include noising, paraphrasing and sampling (Li et al., 2022). Since our augmentation process is unsupervised, we avoid techniques that could change the meaning of the original utterance. With this setup, we generate two augmented versions of Anno-MI, targeting classifier reliability and fairness, respectively.

Code available at https://github.com/GEM-benchmark/NL-Augmenter
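The way a multi-step pipeline yields at least one augmentation per step can be illustrated with a minimal sketch. The three toy augmenters below are hypothetical stand-ins chosen for illustration; they are not the NL-Augmenter transformations used in this work, and the actual pipeline has 11 steps rather than 3.

```python
import random
from typing import Callable, List

# An augmenter maps one utterance to one or more augmented variants.
# All augmenters below are hypothetical toys, NOT NL-Augmenter transformations.
Augmenter = Callable[[str], List[str]]


def swap_adjacent_chars(text: str) -> List[str]:
    """Noising: swap one pair of adjacent characters (single output)."""
    if len(text) < 2:
        return [text]
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return ["".join(chars)]


def contract(text: str) -> List[str]:
    """Light paraphrasing: apply two common contractions (single output)."""
    return [text.replace("do not", "don't").replace("I am", "I'm")]


def add_filler(text: str) -> List[str]:
    """A step that returns several alternatives for the same utterance."""
    return [f"{filler} {text}" for filler in ("Well,", "So,", "Okay,")]


def run_pipeline(utterance: str, steps: List[Augmenter]) -> List[str]:
    """Feed the same utterance to every step and pool all outputs."""
    augmented: List[str] = []
    for step in steps:
        augmented.extend(step(utterance))
    return augmented


steps = [swap_adjacent_chars, contract, add_filler]
outputs = run_pipeline("I do not think I am ready", steps)
print(len(outputs), "augmentations from", len(steps), "steps")
```

Because every step contributes at least one variant and some contribute more, the number of outputs is never smaller than the number of steps, which is the reasoning behind n ≥ 11 for an 11-step pipeline.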


2.1 Problem Statement

We model a binary classification task to detect therapy quality from a single therapist utterance. We assign to each therapist utterance the quality label of the conversation it belongs to, which yields the positive and negative examples for our task. Assessing the quality of MI sessions can support therapist training and skills assessment, as confirmed by existing related work on empathy modelling (Xiao et al., 2012; Gibson et al., 2015; Gibson et al., 2016; Wu et al., 2020), automatic coding of therapeutic utterances (Atkins et al., 2014; Xiao et al., 2016; Cao et al., 2019) and session-level therapist performance (Flemotomos et al., 2022). Given the previously mentioned quality skewness, the target variable represents the first potential source of classification unreliability. In this context, we introduce Anno-AugMI, consisting of all the therapist utterances from Anno-MI, augmented in order to balance the quality proportion. Anno-AugMI creation proceeds in a topic-agnostic fashion, with the goal of obtaining a roughly balanced amount of HQ-MI and LQ-MI utterances across the entire dataset. Since therapy quality is the target of our classifiers, we call this procedure target-aware augmentation. No check is in place with regard to which utterances are augmented: target-aware augmentation merely iterates over the dataset and augments every low-quality utterance until the target label is balanced.
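The target-aware procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_augment` is a hypothetical placeholder for the augmentation pipeline, and the stopping rule (stop once the low-quality count reaches the high-quality count) is our reading of "until the target label is balanced", not a detail confirmed by the text.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (utterance, quality), quality is "high" or "low"


def target_aware_augment(
    data: List[Example],
    augment: Callable[[str], List[str]],
) -> List[Example]:
    """Topic-agnostic balancing: augment low-quality utterances in dataset
    order, with no selection among them, until low >= high."""
    n_high = sum(1 for _, quality in data if quality == "high")
    n_low = sum(1 for _, quality in data if quality == "low")
    result = list(data)
    for utterance, quality in data:
        if n_low >= n_high:
            break  # assumed stopping rule: target label is balanced
        if quality != "low":
            continue
        variants = augment(utterance)
        result.extend((variant, "low") for variant in variants)
        n_low += len(variants)
    return result


def toy_augment(utterance: str) -> List[str]:
    """Hypothetical stand-in for the real augmentation pipeline."""
    return [f"{filler} {utterance}" for filler in ("Well,", "So,", "Okay,")]


data = [(f"high utterance {i}", "high") for i in range(6)]
data += [("low utterance 0", "low"), ("low utterance 1", "low")]
augmented = target_aware_augment(data, toy_augment)
n_high_out = sum(1 for _, q in augmented if q == "high")
n_low_out = sum(1 for _, q in augmented if q == "low")
print(n_high_out, n_low_out)
```

Since each augmented utterance adds several variants at once, the low-quality class can slightly overshoot the high-quality one, so the result is only roughly balanced.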

To assess classification fairness, it is necessary to identify the sensitive variable and field-test it with the employed classifiers. We choose the therapy topic (MI-topic) as our sensitive variable, because inter-topic fairness guarantees stable performance across a wide range of therapy goals, and because therapy quality in Anno-MI is also unbalanced at topic level (as shown in Figure 1). To address fairness, we introduce Anno-FairMI, consisting of all the therapist utterances from Anno-MI, augmented to balance the therapy quality proportion with respect to MI-topic. Anno-FairMI creation proceeds in a topic-aware fashion, with the goal of having the same amount of HQ-MI and LQ-MI utterances for each MI-topic. Since MI-topic is the sensitive variable of our classifier, we call this procedure fairness-aware augmentation. This procedure requires discarding those MI-topics that have no low-quality example, since augmentation would be impossible for them. As a result, Anno-MI and Anno-AugMI share all 44 topics (133 conversations), while Anno-FairMI keeps only 9 topics (55 conversations), resulting in a much smaller pre-augmentation data size. The comparative distribution of topic-wise utterances and the average therapy quality per topic are shown in Figure 1. The overall distribution of labels in Anno-MI, Anno-AugMI and Anno-FairMI is shown in Table 1.
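The topic-aware variant can be sketched in the same spirit. The code below is an illustration under assumptions, not the implementation used in this work: `toy_augment` is a hypothetical placeholder, and augmenting whichever class is the minority within a topic is our assumption (the text only states that topics without low-quality examples are discarded).

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str, str]  # (utterance, quality, topic)


def fairness_aware_augment(
    data: List[Example],
    augment: Callable[[str], List[str]],
) -> List[Example]:
    """Balance high/low quality utterances separately within each topic."""
    by_topic: Dict[str, List[Example]] = defaultdict(list)
    for example in data:
        by_topic[example[2]].append(example)

    result: List[Example] = []
    for topic, examples in by_topic.items():
        counts = {
            label: sum(1 for _, quality, _ in examples if quality == label)
            for label in ("high", "low")
        }
        if counts["low"] == 0:
            continue  # topic discarded: no low-quality example to augment
        result.extend(examples)
        minority = min(counts, key=counts.get)
        needed = max(counts.values()) - counts[minority]
        # Pool of candidate augmentations for the minority class; if the pool
        # is too small, the topic stays only partially balanced.
        pool = [
            variant
            for utterance, quality, _ in examples
            if quality == minority
            for variant in augment(utterance)
        ]
        result.extend((variant, minority, topic) for variant in pool[:needed])
    return result


def toy_augment(utterance: str) -> List[str]:
    """Hypothetical stand-in for the real augmentation pipeline."""
    return [f"{filler} {utterance}" for filler in ("Well,", "So,", "Okay,")]


data = [(f"hq {i}", "high", "smoking cessation") for i in range(4)]
data += [("lq 0", "low", "smoking cessation")]
data += [(f"hq {i}", "high", "diabetes management") for i in range(2)]
balanced = fairness_aware_augment(data, toy_augment)
print(len(balanced))
```

In this toy input, the topic without low-quality utterances disappears from the output, mirroring the reduction from 44 to 9 topics described above.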

3 EXPERIMENTS AND RESULTS

We design a series of experiments, where each experiment’s input is based on the output of the preceding ones. The experimental setup is as follows:

• Therapist utterance quality classification on Anno-MI.
• Augmentation of Anno-MI to balance therapy quality.
• Therapist utterance quality classification on Anno-AugMI.
• Fairness assessment of Anno-AugMI.
• Augmentation of Anno-MI based on MI-topic.
• Therapist utterance quality classification on Anno-FairMI.
• Fairness assessment and bias mitigation of Anno-FairMI.

We use Support Vector Machine (SVM) and Random Forest (RF) as CML classifiers, and a Bidirectional Long Short-Term Memory network (Bi-LSTM) with pre-trained Word2Vec word embeddings in the embedding layer as the DL classifier. We use balanced accuracy and F1 score as performance evaluation metrics. We use one universal test set for all the experiments, created by extracting 400 high-quality and 100 low-quality utterances from Anno-MI. The rest of the data is used as the training set and constitutes the basis for the augmentation.
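To make the two metrics concrete on a test set with this 400/100 class split, the sketch below computes them in plain Python for a degenerate classifier that labels every utterance as high quality. Macro-averaging of the F1 score is an assumption on our side, since the averaging scheme is not stated; the helper names are illustrative only.

```python
from typing import List, Sequence


def recall(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> float:
    """Fraction of examples of class `label` that were predicted as `label`."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in relevant) / len(relevant) if relevant else 0.0


def f1(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> float:
    """Per-class F1 = 2TP / (2TP + FP + FN); 0 when the class is never hit."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def balanced_accuracy(y_true, y_pred, labels: List[str]) -> float:
    """Mean of the per-class recalls."""
    return sum(recall(y_true, y_pred, label) for label in labels) / len(labels)


def macro_f1(y_true, y_pred, labels: List[str]) -> float:
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    return sum(f1(y_true, y_pred, label) for label in labels) / len(labels)


labels = ["high", "low"]
y_true = ["high"] * 400 + ["low"] * 100   # class split of the universal test set
y_pred = ["high"] * 500                   # classifier that never predicts "low"

ba = balanced_accuracy(y_true, y_pred, labels)
mf1 = macro_f1(y_true, y_pred, labels)
print(f"{100 * ba:.2f} {100 * mf1:.2f}")  # prints "50.00 44.44"
```

Under the macro-averaging assumption, a majority-class predictor scores 50.00 balanced accuracy and 44.44 F1 on this split, which is why balanced accuracy is a more informative metric here than plain accuracy (which would be 80%).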

To assess the fairness and mitigate potential bias of our classifiers, we use Microsoft FairLearn (Bird et al., 2020) and inspect Selection Rate (SR), False Negative Rate (FNR) and Balanced Accuracy (BA) as evaluation metrics. Where applicable, we adopt “Threshold Optimization” with BA as the objective and False Negative Parity as the fairness constraint. Since Anno-MI and Anno-AugMI contain multiple topics that lack LQ-MI utterances, it is not possible to split training, test and validation data so that each partition contains both therapy quality classes for every topic. The presence of degenerate labels prevents bias mitigation, so for these datasets we only evaluate the initial metric values.
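The three per-topic metrics can be illustrated with a small hand-rolled sketch. It deliberately does not call FairLearn; it only mirrors the standard definitions of SR, FNR and BA computed per group. Treating high quality as the positive class (label 1) is an assumption, as the positive class is not stated here, and the toy labels are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Sequence

NAN = float("nan")


def group_metrics(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    groups: Sequence[str],
) -> Dict[str, Dict[str, float]]:
    """Selection rate, false negative rate and balanced accuracy per group.
    Label 1 is the positive class (assumed here to be high quality)."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        key = ("t" if t == p else "f") + ("p" if p == 1 else "n")
        counts[g][key] += 1

    metrics = {}
    for g, c in counts.items():
        total = sum(c.values())
        positives = c["tp"] + c["fn"]
        negatives = c["tn"] + c["fp"]
        tpr = c["tp"] / positives if positives else NAN
        tnr = c["tn"] / negatives if negatives else NAN
        metrics[g] = {
            "SR": (c["tp"] + c["fp"]) / total,  # share predicted positive
            "FNR": 1 - tpr,                     # missed positives
            "BA": (tpr + tnr) / 2,              # undefined if a class is absent
        }
    return metrics


# Toy data: the second topic has no low-quality (0) example at all.
y_true = [1, 1, 0, 0, 1, 1]
y_pred = [1, 0, 0, 1, 1, 1]
topics = ["smoking"] * 4 + ["diabetes"] * 2
metrics = group_metrics(y_true, y_pred, topics)
for topic, values in metrics.items():
    print(topic, values)
```

For the topic that lacks one class, BA is undefined (NaN in this sketch), which illustrates why degenerate labels at topic level stand in the way of per-topic evaluation and mitigation.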

Code available at https://github.com/fairlearn/fairlearn

Table 2: Performance of CML and DL approaches with Anno-MI, Anno-AugMI and Anno-FairMI. For each dataset we report Balanced Accuracy and F1 score calculated with regard to MI quality.

               SVM                  Random Forest        Bi-LSTM (DNN)
Dataset        Bal. Acc.   F1       Bal. Acc.   F1       Bal. Acc.   F1
Anno-MI        50.00       44.44    50.75       46.34    50.00       44.44
Anno-AugMI     48.87       38.12    50.37       45.78    73.12       71.85
Anno-FairMI    53.87       48.15    51.00       50.99    64.13       59.50

Figure 2: Confusion matrix for the Bi-LSTM trained on each dataset. For Anno-FairMI we provide the pre- and post-mitigation matrices.

The classification results of the CML and DL approaches for each of the three datasets are summarised in Table 2. The results indicate consistently low performance of the CML classifiers on Anno-MI.

Our augmentation techniques are quite simple, so they do not add to Anno-MI the prominent features that would help distinguish the classes with a bag-of-words representation. This explains the negligible performance improvement of the CML algorithms. Since both SVM and RF did not benefit from data augmentation and are comparable to random classifiers, we do not analyse them further. On the other hand, the Bi-LSTM model shows a significant balanced accuracy improvement over Anno-MI of about 23 and 14 percentage points for Anno-AugMI and Anno-FairMI, respectively. Further considerations can be drawn by looking at the confusion matrices in Figure 2. The initial model, trained on Anno-MI, suffers from the skewed therapy quality distribution and is unable to recognise LQ-MI utterances. This problem also reflects on HQ-MI, with no false positives at all. With target-aware augmentation on Anno-AugMI we see more promising results, with about 40% false positives and 14% false negatives. Finally, with fairness-aware augmentation on Anno-FairMI we see almost no change in LQ-MI classification, but a considerable drop for HQ-MI, with about 30% false negatives. This can be explained by the reduced number of topics in Anno-FairMI, which makes the Bi-LSTM struggle with the unseen ones in the test set. In both cases, data augmentation led to an accuracy improvement, which makes our approach promising for future developments (Rice and Harris, 2005).

Fairness metric values for each dataset are shown in Figure 3. SR and FNR are apparently ideal for Anno-MI, but this is purely a consequence of the low BA value. Anno-AugMI shows more unbalanced values for SR and FNR, but higher BA than Anno-MI across almost every topic. For Anno-FairMI, bias mitigation can be run because of the absence of degenerate labels in the training set. Pre-mitigation, Anno-FairMI shows generally more balanced SR, lower FNR and higher BA than the other two datasets for known topics, with little to no effect after mitigation. However, moving to unseen topics, the overall Bi-LSTM performance greatly worsened, with compromised classification (Figure 2) and fairness metrics dropping significantly (Table 3).

Table 3: The effects of bias mitigation on the Bi-LSTM trained on Anno-FairMI. For each metric, we report the mean value calculated with regard to the sensitive variable (therapy topic). “TO” stands for “Threshold Optimisation”.

Dataset             Selection Rate   False Negative Rate   Bal. Acc.
Anno-FairMI         67.29            23.94                 75.72
Anno-FairMI + TO    19.60            72.86                 21.89

Figure 3: Fairness assessment and bias mitigation for the Bi-LSTM on each dataset. For brevity, only common topics for each dataset are shown.

4 CONCLUSION AND FUTURE WORK

In this work we employed data augmentation to balance the target and sensitive variables on Anno-MI, our dataset of MI transcriptions, resulting in two augmented datasets, namely Anno-AugMI and Anno-FairMI. We evaluated our approaches on a classification task aimed at recognising therapy quality. Our results show a promising accuracy increase for DL classifiers when using augmented datasets, especially Anno-AugMI. This motivates us to consider other target attributes in future work, such as client talk type or therapist behaviour, and to extend to other tasks such as forecasting. The fairness assessment and bias mitigation show that Anno-FairMI is too sensitive to unseen topics, opening interesting future work on the adoption of more advanced augmentation techniques. Overall, we consider target-aware augmentation effective at addressing the challenges of unbalanced and scarce data in the mental health domain. Finally, we aim to perform a human evaluation of the developed classifier, as a sanity check on the reliability of the obtained results.

ACKNOWLEDGEMENTS

This work is supported by the EU’s Marie Curie training network PhilHumans (Personal Health Interfaces Leveraging Human-Machine Natural Interactions) under Agreement 812882.

REFERENCES

Atkins, D. C., Steyvers, M., Imel, Z. E., and Smyth, P. (2014). Scaling up the evaluation of psychotherapy: evaluating motivational interviewing fidelity via statistical text classification. Implementation Science, 9(1):1–11.

Bird, S., Dudík, M., Edgar, R., Horn, B., Lutz, R., Milan, V., Sameki, M., Wallach, H., and Walker, K. (2020). Fairlearn: A toolkit for assessing and improving fairness in AI. Microsoft, Tech. Rep. MSR-TR-2020-32.

Cao, J., Tanana, M., Imel, Z., Poitras, E., Atkins, D., and Srikumar, V. (2019). Observing dialogue in therapy: Categorizing and forecasting behavioral codes. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5599–5611.

Chouldechova, A. and Roth, A. (2020). A snapshot of the frontiers of fairness in machine learning. Communications of the ACM, 63(5):82–89.

Dessì, D., Helaoui, R., Kumar, V., Recupero, D. R., and Riboni, D. (2020). TF-IDF vs word embeddings for morbidity identification in clinical notes: An initial study. In Consoli, S., Recupero, D. R., and Riboni, D., editors, Proceedings of the First Workshop on Smart Personal Health Interfaces co-located with 25th International Conference on Intelligent User Interfaces, SmartPhil@IUI 2020, Cagliari, Italy, March 17, 2020, volume 2596 of CEUR Workshop Proceedings, pages 1–12. CEUR-WS.org.

Dhole, K. D., Gangal, V., Gehrmann, S., Gupta, A., Li, Z., Mahamood, S., Mahendiran, A., Mille, S., Srivastava, A., Tan, S., et al. (2021). NL-Augmenter: A framework for task-sensitive natural language augmentation. arXiv preprint arXiv:2112.02721.

Flemotomos, N., Martinez, V. R., Chen, Z., Singla, K., Ardulov, V., Peri, R., Caperton, D. D., Gibson, J., Tanana, M. J., Georgiou, P., et al. (2022). Automated evaluation of psychotherapy skills using speech and language technologies. Behavior Research Methods, 54(2):690–711.

Gibson, J., Can, D., Xiao, B., Imel, Z. E., Atkins, D. C., Georgiou, P., and Narayanan, S. S. (2016). A Deep Learning Approach to Modeling Empathy in Addiction Counseling. In Proc. Interspeech 2016, pages 1447–1451.

Gibson, J., Malandrakis, N., Romero, F., Atkins, D. C., and Narayanan, S. S. (2015). Predicting therapist empathy in motivational interviews using language features inspired by psycholinguistic norms. In Sixteenth Annual Conference of the International Speech Communication Association.

Ibrahim, H., Liu, X., Zariffa, N., Morris, A. D., and Denniston, A. K. (2021). Health data poverty: an assailable barrier to equitable digital health care. The Lancet Digital Health, 3(4):e260–e265.

John-Mathews, J.-M., Cardon, D., and Balagué, C. (2022). From reality to world. A critical perspective on AI fairness. Journal of Business Ethics, pages 1–15.

Kumar, V., Mishra, B. K., Mazzara, M., Thanh, D. N., and Verma, A. (2020a). Prediction of malignant and benign breast cancer: A data mining approach in healthcare applications. In Advances in Data Science and Management, pages 435–442. Springer.

Kumar, V., Recupero, D. R., Riboni, D., and Helaoui, R. (2020b). Ensembling classical machine learning and deep learning approaches for morbidity identification from clinical notes. IEEE Access, 9:7107–7126.

Le Glaz, A., Haralambous, Y., Kim-Dufor, D.-H., Lenca, P., Billot, R., Ryan, T. C., Marsh, J., Devylder, J., Walter, M., Berrouiguet, S., et al. (2021). Machine learning and natural language processing in mental health: Systematic review. Journal of Medical Internet Research, 23(5):e15708.

Li, B., Hou, Y., and Che, W. (2022). Data augmentation approaches in natural language processing: A survey. AI Open.

Locke, S., Bashall, A., Al-Adely, S., Moore, J., Wilson, A., and Kitchen, G. B. (2021). Natural language processing in medicine: a review. Trends in Anaesthesia and Critical Care, 38:4–9.

Miller, W. R. and Rollnick, S. (2012). Motivational interviewing: Helping people change. Guilford Press.

Pérez-Rosas, V., Wu, X., Resnicow, K., and Mihalcea, R. (2019). What makes a good counselor? Learning to distinguish between high-quality and low-quality counseling conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 926–935, Florence, Italy. Association for Computational Linguistics.

Rice, M. E. and Harris, G. T. (2005). Comparing effect sizes in follow-up studies: ROC area, Cohen’s d, and r. Law and Human Behavior, 29(5):615–620.

Rollnick, S., Miller, W. R., and Butler, C. (2008). Motivational interviewing in health care: helping patients change behavior. Guilford Press.

Wu, Z., Balloccu, S., Kumar, V., Helaoui, R., Reiter, E., Recupero, D. R., and Riboni, D. (2022). Anno-MI: A dataset of expert-annotated counselling dialogues. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6177–6181. IEEE.

Wu, Z., Helaoui, R., Kumar, V., Reforgiato Recupero, D., and Riboni, D. (2020). Towards detecting need for empathetic response in motivational interviewing. In Companion Publication of the 2020 International Conference on Multimodal Interaction, pages 497–502.

Xiao, B., Can, D., Georgiou, P. G., Atkins, D., and Narayanan, S. S. (2012). Analyzing the language of therapist empathy in motivational interview based psychotherapy. In Proceedings of The 2012 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, pages 1–4. IEEE.

Xiao, B., Can, D., Gibson, J., Imel, Z. E., Atkins, D. C., Georgiou, P. G., and Narayanan, S. S. (2016). Behavioral coding of therapist language in addiction counseling using recurrent neural networks. In Interspeech, pages 908–912.