effect. This occurs when the speech signal bounces off a surface within the room and reaches the microphone again slightly later.
The two elements described above both concern additional sounds entering the recording. Another important aspect of sound to take into account is channel variability, which covers the setting in which the audio is recorded. The quality of the microphone and the distance between the microphone and the speaker belong to this kind of variability (Forsberg, 2003).
Lastly, sounds can be characterized using acoustic features that have already been used to analyze audio files. These traditional acoustic features are pitch, loudness, duration, and timbre (Wold et al., 1996). As Wold et al. (1996) state in their paper, every feature except timbre is relatively easy to measure and model. Because these features differ across audio files, investigating correlations between these features and the WER of the corresponding transcripts is worthwhile.
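To illustrate how the first three of these features could be measured in practice, the sketch below estimates pitch, loudness, and duration for a single recording. It assumes the librosa library and a hypothetical file name; it is an illustration only and not part of the studies described here.

    # Sketch: estimating pitch, loudness, and duration of an audio file.
    # Assumes librosa is installed; "sample.wav" is a hypothetical file name.
    import librosa
    import numpy as np

    y, sr = librosa.load("sample.wav", sr=None)  # keep the native sample rate

    # Pitch: fundamental frequency estimated with the pYIN algorithm;
    # unvoiced frames come back as NaN and are ignored by nanmean.
    f0, voiced_flag, voiced_probs = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7")
    )
    mean_pitch = np.nanmean(f0)  # average F0 over voiced frames, in Hz

    # Loudness: approximated by the root-mean-square energy per frame.
    rms = librosa.feature.rms(y=y)
    mean_loudness = float(np.mean(rms))

    # Duration of the recording in seconds.
    duration = librosa.get_duration(y=y, sr=sr)

    print(f"pitch: {mean_pitch:.1f} Hz, loudness (RMS): {mean_loudness:.4f}, "
          f"duration: {duration:.2f} s")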
4 RESEARCH METHOD
To answer the main research question stated in Section 2.1, we used a mixed-method approach, starting with extensive literature research. An overview of the relevant literature is given in Section 3. Table 1 presents a set of null and alternative research hypotheses. To test these hypotheses, we conducted two studies. In the first study, we took an existing dataset of medical audio recordings (Mooney, 2018) that we manually labeled with respect to the presence of Accent and the levels of Frequency and Noise. In the second study, we conducted a small pilot experiment to test the observations from the first study. Figure 2 provides an overview of our research method. In both studies, the independent variables are the presence of the speaker's Accent and the levels of voice Frequency and Noise, and the dependent variable is the WER. These variables were derived from the existing literature as core factors potentially affecting the quality of speech recognition; we limited the scope of our study to this set for practical reasons.
4.1 Study 1 – Existing Medical Data
For the analysis of existing medical audio files, a publicly available data set is used. The data set (Mooney, 2018) was retrieved from the Kaggle platform. The audio files are recordings of at most two sentences, pronounced by different speakers, each with a distinguishable speaking style and rate.
Table 1: Experimental Hypotheses.

Hyp | Null Hypothesis | Alternative Hypothesis
H1 | No difference in WER between samples with and without Accent. | There is a difference in WER between samples with and without Accent.
H2 | No difference in WER between samples with different levels of Frequency. | There is a difference in WER between samples with different levels of Frequency.
H3 | No difference in WER between samples with different levels of Noise. | There is a difference in WER between samples with different levels of Noise.
The spoken language is English and the sentences are along the lines of the following example:
“Oh, my head hurts me. I try to be calm but I can’t.”
The data set contains 8.5 hours of audio in total; however, we included only 30 randomly selected files in our study.
Labeling Audio Files. The selected files were assigned specific labels to enable the classification and comparison of the different files. Each file was assigned an intensity for accent, voice frequency, and noise. Speaker accent is a binary item: “Accent” or “No accent”. British English was considered no accent, which automatically classifies every other accent as an accent. Frequency and noise can both be assigned a level: high, medium, or low. However, according to the source of the data set, the noise of every item was either low or non-existent (Mooney, 2018); therefore, “no noise” was included as an additional intensity level for noise.
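As an illustration of the resulting labeling scheme, the sketch below encodes the labels for a few files. The file names and label values are hypothetical examples, not entries from the data set.

    # Sketch: one possible encoding of the labels assigned to each file.
    # File names and label values are hypothetical examples.
    labels = {
        "recording_01.wav": {"accent": "No accent", "frequency": "medium", "noise": "low"},
        "recording_02.wav": {"accent": "Accent", "frequency": "high", "noise": "no noise"},
        "recording_03.wav": {"accent": "Accent", "frequency": "low", "noise": "low"},
    }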
Obtaining Corresponding Transcriptions Using ASR Software. After the audio files had been labeled, they were processed by an Automated Speech Recognition technology, namely an application called voice recorder (Software, ). This produced a transcription (the hypothesis), which was compared to the expected transcript (the reference). The difference between the hypothesis and the reference transcript is the Word Error Rate. The WERs of all transcripts were calculated so that further analysis could be done.
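The WER is the word-level edit distance between hypothesis and reference, normalized by the reference length: WER = (S + D + I) / N, where S, D, and I are the numbers of substituted, deleted, and inserted words and N is the number of words in the reference. The paper does not specify how the WER was computed; the following is a minimal self-contained sketch, not the script used in the study.

    # Sketch: WER as a word-level Levenshtein distance between the
    # reference and hypothesis transcripts, normalized by reference length.
    def wer(reference: str, hypothesis: str) -> float:
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
        # d[i][j] = edit distance between the first i reference words
        # and the first j hypothesis words.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i          # i deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j          # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + sub)  # substitution/match
        return d[len(ref)][len(hyp)] / len(ref)

    print(wer("oh my head hurts me", "oh my head hurt"))  # 0.4: 1 sub + 1 del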
Grouping Characteristics and Comparing Error Rates. Once the labels and the computed WERs of the files and corresponding transcripts had been recorded, the last step of the medical data analysis was taken. The files were grouped by label to ease the computation of possibly significant differences between the characteristics of the files. The grouping and the statistical analyses were done with a short Python script. To double-check the results of the statistical tests and to calculate effect sizes, we used R.
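The paper does not name the specific statistical tests used. As one plausible sketch, a binary factor such as Accent could be compared with a Mann-Whitney U test and a multi-level factor such as Frequency with a Kruskal-Wallis test, both available in SciPy. The WER values below are hypothetical placeholders, not results from the study.

    # Sketch: grouping WERs by label and testing for group differences.
    # Test choices are assumptions; WER values are hypothetical placeholders.
    from scipy.stats import mannwhitneyu, kruskal

    wer_accent    = [0.32, 0.41, 0.28, 0.37]   # samples labeled "Accent"
    wer_no_accent = [0.15, 0.22, 0.19, 0.24]   # samples labeled "No accent"

    # H1: binary factor -> two-sample Mann-Whitney U test.
    stat, p = mannwhitneyu(wer_accent, wer_no_accent)
    print(f"H1 (Accent): U={stat:.1f}, p={p:.3f}")

    wer_low, wer_medium, wer_high = [0.18, 0.21], [0.25, 0.29], [0.33, 0.38]

    # H2: three-level factor -> Kruskal-Wallis test across the groups.
    stat, p = kruskal(wer_low, wer_medium, wer_high)
    print(f"H2 (Frequency): H={stat:.2f}, p={p:.3f}")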