effect. This occurs when the speech signal bounces off a surface within the room and reaches the microphone again slightly later.
The two elements described above both concern additional sounds entering the recording. Another important aspect of sound to take into account is channel variability, which covers the setting in which the audio is recorded. The quality of the microphone and the distance between the microphone and the speaker belong to this kind of variability (Forsberg, 2003).
Lastly, sounds can be characterized using acoustic features that have already been used to analyze audio files. These traditional acoustic features are pitch, loudness, duration, and timbre (Wold et al., 1996). As Wold et al. (1996) state in their paper, every feature except timbre is relatively easy to measure and model. Because these features differ across audio files, investigating correlations between these features and the WER of the corresponding transcripts is worthwhile.
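To illustrate how the first three of these features could be measured in practice, the sketch below estimates pitch, loudness, and duration for a single recording. It assumes the librosa library and a hypothetical file name; it is an illustration only and not part of the studies described here.

    # Sketch: estimating pitch, loudness, and duration of an audio file.
    # Assumes librosa is installed; "sample.wav" is a hypothetical file name.
    import librosa
    import numpy as np

    y, sr = librosa.load("sample.wav", sr=None)  # keep the native sample rate

    # Pitch: fundamental frequency estimated with the pYIN algorithm;
    # unvoiced frames come back as NaN and are ignored by nanmean.
    f0, voiced_flag, voiced_probs = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7")
    )
    mean_pitch = np.nanmean(f0)  # average F0 over voiced frames, in Hz

    # Loudness: approximated by the root-mean-square energy per frame.
    rms = librosa.feature.rms(y=y)
    mean_loudness = float(np.mean(rms))

    # Duration of the recording in seconds.
    duration = librosa.get_duration(y=y, sr=sr)

    print(f"pitch: {mean_pitch:.1f} Hz, loudness (RMS): {mean_loudness:.4f}, "
          f"duration: {duration:.2f} s")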
4 RESEARCH METHOD
To answer the main research question stated in Section 2.1, we used a mixed-method approach, starting with extensive literature research. An overview of the relevant literature is given in Section 3. Table 1 presents a set of null and alternative research hypotheses. To test these hypotheses, we conducted two studies. In the first study, we took an existing dataset of medical audio recordings (Mooney, 2018) that we manually labeled with respect to the presence of Accent and the levels of Frequency and Noise. In the second study, we conducted a small pilot experiment to test the observations from the first study. Figure 2 provides an overview of our research method. In both studies, the independent variables are the presence of the speaker's Accent and the levels of voice Frequency and Noise, and the dependent variable is the WER. These variables were derived from the existing literature as core factors potentially affecting the quality of speech recognition; we limited the scope of our study to this set for practical reasons.
4.1 Study 1 – Existing Medical Data
For the analysis of existing medical audio files, a publicly available data set is used. The data set (Mooney, 2018) was retrieved from the Kaggle platform. The audio files are recordings of at most two sentences, pronounced by different speakers, each with a distinguishable speaking style and rate.
Table 1: Experimental Hypotheses.

Hyp | Null Hypothesis | Alternative Hypothesis
H1 | No difference in WER between samples with and without Accent. | There is a difference in WER between samples with and without Accent.
H2 | No difference in WER between samples with different levels of Frequency. | There is a difference in WER between samples with different levels of Frequency.
H3 | No difference in WER between samples with different levels of Noise. | There is a difference in WER between samples with different levels of Noise.
The spoken language is English and the sentences are along the lines of the following example:
“Oh, my head hurts me. I try to be calm but I can’t.”
The data set contains 8.5 hours of audio in total; however, we included only 30 randomly selected files in our study.
Labeling Audio Files. The selected files were assigned specific labels to enable the classification and comparison of the different files. Each file was assigned an intensity for accent, voice frequency, and noise. Speaker accent is a binary item: “Accent” or “No accent”. British English was considered no accent, which automatically classifies every other accent as an accent. Frequency and noise can both be assigned a level: high, medium, or low. However, according to the source of the data set, the noise of every item was either low or non-existent (Mooney, 2018); therefore, “no noise” was included as an additional intensity level for noise.
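As an illustration of the resulting labeling scheme, the sketch below encodes the labels for a few files. The file names and label values are hypothetical examples, not entries from the data set.

    # Sketch: one possible encoding of the labels assigned to each file.
    # File names and label values are hypothetical examples.
    labels = {
        "recording_01.wav": {"accent": "No accent", "frequency": "medium", "noise": "low"},
        "recording_02.wav": {"accent": "Accent", "frequency": "high", "noise": "no noise"},
        "recording_03.wav": {"accent": "Accent", "frequency": "low", "noise": "low"},
    }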
Obtaining Corresponding Transcriptions Using ASR Software. After the audio files had been labeled, they were processed by an Automated Speech Recognition technology, namely an application called voice recorder (Software, ). This produced a transcription (the hypothesis), which was compared to the expected transcript (the reference). The difference between the hypothesis and the reference transcript is the Word Error Rate. The WERs of all transcripts were calculated so that further analysis could be done.
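The WER is the word-level edit distance between hypothesis and reference, normalized by the reference length: WER = (S + D + I) / N, where S, D, and I are the numbers of substituted, deleted, and inserted words and N is the number of words in the reference. The paper does not specify how the WER was computed; the following is a minimal self-contained sketch, not the script used in the study.

    # Sketch: WER as a word-level Levenshtein distance between the
    # reference and hypothesis transcripts, normalized by reference length.
    def wer(reference: str, hypothesis: str) -> float:
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
        # d[i][j] = edit distance between the first i reference words
        # and the first j hypothesis words.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i          # i deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j          # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + sub)  # substitution/match
        return d[len(ref)][len(hyp)] / len(ref)

    print(wer("oh my head hurts me", "oh my head hurt"))  # 0.4: 1 sub + 1 del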
Grouping Characteristics and Comparing Error Rates. Once the labels and the computed WERs of the files and corresponding transcripts had been recorded, the last step of the medical data analysis was taken. The files were grouped by label to ease the computation of possibly significant differences between the characteristics of the files. The grouping and the statistical analyses were done with a short Python script. To double-check the results of the statistical tests and to calculate effect sizes, we used R.
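The paper does not name the specific statistical tests used. As one plausible sketch, a binary factor such as Accent could be compared with a Mann-Whitney U test and a multi-level factor such as Frequency with a Kruskal-Wallis test, both available in SciPy. The WER values below are hypothetical placeholders, not results from the study.

    # Sketch: grouping WERs by label and testing for group differences.
    # Test choices are assumptions; WER values are hypothetical placeholders.
    from scipy.stats import mannwhitneyu, kruskal

    wer_accent    = [0.32, 0.41, 0.28, 0.37]   # samples labeled "Accent"
    wer_no_accent = [0.15, 0.22, 0.19, 0.24]   # samples labeled "No accent"

    # H1: binary factor -> two-sample Mann-Whitney U test.
    stat, p = mannwhitneyu(wer_accent, wer_no_accent)
    print(f"H1 (Accent): U={stat:.1f}, p={p:.3f}")

    wer_low, wer_medium, wer_high = [0.18, 0.21], [0.25, 0.29], [0.33, 0.38]

    # H2: three-level factor -> Kruskal-Wallis test across the groups.
    stat, p = kruskal(wer_low, wer_medium, wer_high)
    print(f"H2 (Frequency): H={stat:.2f}, p={p:.3f}")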