Effectiveness of Whisper’s Fine-Tuning for Domain-Specific Use Cases in the Industry
Daniel Pawlowicz¹ (https://orcid.org/0009-0004-3090-0787), Jule Weber² (https://orcid.org/0009-0004-6441-2263) and Claudia Dukino³ (https://orcid.org/0000-0003-2556-3881)
¹University of Stuttgart IAT, Institute of Human Factors and Technology Management, Stuttgart, Germany
²Eberhard Karls University Tübingen, Tübingen, Germany
³Fraunhofer IAO, Fraunhofer Institute for Industrial Engineering IAO, Stuttgart, Germany
Keywords:
Whisper, Fine-Tuning, Domain, Speech-to-Text Transcription.
Abstract:
The integration of Speech-to-Text (STT) technology has the potential to enhance the efficiency of industrial
workflows. However, standard speech models demonstrate suboptimal performance in domain-specific use
cases. In order to gain user trust, it is essential to ensure accurate transcription, which can be achieved through
the fine-tuning of the model to the specific domain. OpenAI’s Whisper was selected as the initial model and
subsequently fine-tuned with domain-specific real-world recordings. The fine-tuned model outperforms the
initial model in terms of transcription of technical jargon, as evidenced by the results of the study. The fine-
tuned model achieved a validation loss of 1.75 and a Word Error Rate (WER) of 1. In addition to improving
accuracy, this approach addresses the challenges of noisy environments and speaker variability that are com-
mon in real-world industrial environments. The present study demonstrates the efficacy of fine-tuning the
Whisper model to new vocabulary with technical jargon, thereby underscoring the value of model adaptation
for domain-specific use cases.
1 INTRODUCTION
The integration of machine learning tools into mod-
ern business practices has revolutionized workflows
by enabling automation, improving efficiency, and
reducing administrative burdens. Digital assistants,
powered by advanced Speech-to-Text (STT) tran-
scription models, play a central role in these trans-
formations, allowing professionals to perform tasks
like scheduling, documentation, and customer rela-
tionship management (CRM) through intuitive voice-
based interactions (Drawehn and Pohl, 2023).
Although the umbrella term of digital assistant
systems is highly flexible (Pereira et al., 2023), the
essence of these systems lies in their ability to lever-
age natural language input for intuitive communica-
tion with human users (Budzinski et al., 2019). Mod-
ern digital assistants rely on voice input to provide
fast, hands-free communication, increasing efficiency
and usability in various professional contexts (Hoy,
2018). By automating routine tasks, such as transcribing meeting notes or generating reports, digital
assistants enable professionals to focus on core activ-
ities like customer engagement or problem-solving,
enhancing productivity across industries (Fraunhofer
IAO, 2024).
At the heart of these functionalities lies precise
STT transcription, which converts spoken language
into text that digital assistants can process and in-
terpret. However, while state-of-the-art STT mod-
els perform well with everyday language, they of-
ten fall short in domain-specific contexts. Challenges
arise in accurately transcribing technical terminology,
proper nouns, and multilingual content, which are
common in specialized fields. This limitation hinders
the broader adoption of digital assistants in industries
where precise transcription is essential for process-
ing customer data, product information, or technical
terms.
The need for reliable transcription accuracy is par-
ticularly evident in industries with high levels of cus-
tomer interaction, where digital assistants can auto-
mate tasks like updating CRM systems. In addition,
they enable the creation of precise summaries from
voice data. These systems help reduce administrative
overhead by automating routine tasks, which mini-
mizes the time employees spend on repetitive activi-
ties. As a result, employees can focus more on strate-
gic responsibilities, such as developing new business
initiatives and enhancing customer relationships. Ad-
ditionally, they have more time to engage in technical
responsibilities, which involves troubleshooting com-
plex issues and optimizing processes for greater effi-
ciency (Fraunhofer IAO, 2024).
This paper explores the establishment of a com-
prehensive procedure for optimizing STT for domain-
specific applications, which includes both fine-tuning
Whisper, a state-of-the-art STT model, and imple-
menting post-processing tailored to domain-specific
requirements. Whisper’s performance is evaluated in
an industrial setting, focusing on its ability to handle
complex technical terminology, multilingual content,
and noisy environments. By addressing the gap be-
tween general-purpose STT capabilities and domain-
specific requirements, this study contributes to the
broader adoption of digital assistants in specialized
fields.
2 METHODS
2.1 Model Selection
Based on an earlier evaluation, three models were
picked for an assessment of the state-of-the-art
STT performance in domain-specific transcription
tasks: DeepSpeech (Hannun et al., 2014), wav2vec
2.0 (Baevski et al., 2020), and Whisper large-v3
(Radford et al., 2022) (Pfeiffer, 2024). These models
were selected based on their established performance
in prior research and their suitability for multilingual
transcription, handling environmental noise, and
adapting to diverse accents.
The evaluation focused on three key criteria:
Accuracy: The accuracy was measured as the Word Error Rate (WER), which is the proportion of words transcribed incorrectly compared to the total number of words spoken. It includes substitutions, deletions, and insertions, calculated as WER = (S + D + I) / N, where S, D, and I are the counts of these errors and N is the total number of words in the reference transcript (Trabelsi et al., 2022; Favre et al., 2013). The WER is popular and widely reported in the literature (Bandi et al., 2023); a short computation sketch follows this list.
Supported Languages: Non-English languages
are underrepresented in training datasets (Milde and Köhn, 2018; Huang et al., 2023a), which is why English models perform better (Bermuth et al., 2021). It is therefore essential to choose
a model that performs well in German to ensure
good accuracy in use cases.
Domain Adaptability: Adaptability refers to how
easily a STT system can be customized to meet
the specific needs of different domains. Cus-
tomization is particularly crucial for specialized
applications, where general models may not per-
form optimally. Even if a model meets general
performance criteria, it may still struggle when
applied to the specific requirements of a digital
assistant. Therefore, it is essential that the system
can be easily adapted and trained to handle the nu-
ances of the new domain.
For instance, proper nouns are highly depen-
dent on the specific use cases of Automatic
Speech Recognition (ASR) systems, highlight-
ing the need for domain-specific adaptation (Lee
et al., 2021). Additionally, speech patterns, includ-
ing vocabulary, pronunciation, and tempo, change
with age (Fukuda et al., 2020). If a model is
trained predominantly on data from younger indi-
viduals, its accuracy may decline when used by
older speakers. This principle can be general-
ized across different social groups, as it is well-
established that speech varies according to factors
such as social class and age.
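As a concrete illustration of the accuracy criterion above, a minimal WER computation sketch; jiwer is one assumed package choice, and the strings are purely illustrative:

import jiwer  # third-party WER package; an assumed choice

reference = "the quick brown fox"    # ground-truth transcript
hypothesis = "the quick brown vox"   # one substitution among four words
print(jiwer.wer(reference, hypothesis))  # 0.25, i.e. (S + D + I) / N = 1 / 4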
Whisper stood out for these criteria, especially for its
multilingual support and its popularity, which implies
extensive resources and high adaptability. In addition,
Whisper’s architecture, based on large-scale trans-
former models, is designed to handle different accents
and noisy environments, making it particularly suit-
able for real-world scenarios. Its ability to operate
effectively in both pre-trained and fine-tuned config-
urations provided the flexibility to address domain-
specific challenges while maintaining high baseline
performance. Although Whisper is the leader in this
area at the time of this writing, other models should
be considered in the future as this field of research
continues to evolve rapidly.
2.2 Whisper Model Evaluation
To assess the performance of the Whisper model in a
domain-specific use case, the Whisper large-v3 model
(Radford et al., 2022) was employed to transcribe a
collection of audio recordings from customer visits in
the mechanical engineering field. This assessment in-
cluded evaluating how Whisper coped with challeng-
ing aspects such as environmental noise, accents, and
technical terminology specific to the industrial sec-
tor. A sample of 21 recordings showed a mean WER of 11% (min. = 0.59%, max. = 50%).
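Such a baseline pass can be reproduced along the following lines; a minimal sketch assuming the public openai/whisper-large-v3 checkpoint via the Hugging Face pipeline, with an illustrative file name rather than data from the study:

from transformers import pipeline

# Baseline transcription with the out-of-the-box checkpoint; the
# audio file name is a placeholder.
asr = pipeline("automatic-speech-recognition",
               model="openai/whisper-large-v3")
result = asr("customer_visit.wav",
             generate_kwargs={"language": "german"})
print(result["text"])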
The detailed
evaluation revealed that, while the model transcribed
common words correctly, it showed some flaws.
1. The out-of-the-box Whisper model struggled
with the accurate transcription of proper nouns,
particularly customer and product names, which
are critical in industrial applications. For in-
stance, product names like “HyperCut 1530”
were occasionally mistranscribed due to the
model’s lack of contextual knowledge about
domain-specific terminology. The accurate
transcription of proper nouns not only relies on
recognizing the correct word but also on under-
standing its specific spelling, which can vary
significantly between different terms. This was
further complicated by the fact that proper nouns
were pronounced differently by different speakers
in the recording, introducing room for error. This
issue extended to other technical terms, where the
absence of relevant training data in the default
Whisper model led to inconsistent and sometimes
nonsensical outputs. These limitations made
manual correction necessary, particularly when
the transcription involved critical customer or
product information.
2. The Whisper model is multilingual, yet it is ill-
equipped to handle bilingualism. While the ma-
jority of the recordings were in German, English
words occasionally emerged in professional set-
tings. This is particularly problematic for techni-
cal terms or proper nouns with English-like char-
acteristics. In such instances, the default Whis-
per model primarily transcribed in German, as this
was the language setting, resulting in errors such as “Focus”, a component of a product name, being transcribed as “Fokus”.
These issues highlight the importance of domain-
specific fine-tuning to enhance Whisper’s perfor-
mance for specialized tasks. While Whisper demon-
strated robust general-purpose transcription capabili-
ties, adapting the model to specific use cases is nec-
essary to ensure its reliability in professional environ-
ments. For instance, fine-tuning Whisper on domain-
relevant data can improve its ability to handle techni-
cal terms, multilingual contexts, and structured data
such as email addresses and product names.
This study aims to explore the effectiveness of
fine-tuning Whisper for such domain-specific appli-
cations. By adapting the model to the linguistic and
contextual nuances of a particular domain, the goal is
to bridge the gap between general-purpose STT per-
formance and the specific requirements of real-world
use cases.
2.3 Data Collection
The fine-tuning process utilized a dataset of real-
world audio recordings from field service employees
of a mechanical engineering company. These record-
ings were collected to reflect actual usage scenarios,
with content focusing on CRM notes, including prod-
uct names, customer details, and locations. A total of
73 recordings, ranging from 16 seconds to 1 minute
and 30 seconds in length, were gathered. To increase
variability, some texts were recorded multiple times
by different speakers, capturing accents, environmen-
tal noise (e.g., cars), and unclear pronunciations. The
recordings were primarily in German, with occasional
English words for technical terms and product names.
To enhance diversity and address gaps identified
during initial evaluations, 14 additional texts were
crafted and recorded by a female speaker. These sup-
plemental recordings simulated real CRM notes while
introducing domain-specific nuances, such as product
terminology. Additionally, the selection of topics for
the new texts was carefully made to ensure that they
cover relevant scenarios that frequently occur in real-
world applications. This not only improves the over-
all performance of the model but also helps in under-
standing and processing a broader range of user in-
quiries. This approach highlights the flexibility of the
fine-tuning process, which can be adapted to other do-
mains by tailoring datasets to their specific linguistic
challenges.
2.4 Data Annotation
The audio recordings were first transcribed using
Whisper’s out-of-the-box model. Then an automatic
post-processing dictionary was applied, using a rule-
based approach to correct common mistakes in the
original transcription. While this post-processing
dictionary worked well for common misspellings, it
was not robust enough to replace the fine-tuning of
Whisper. The mean WER was still at 3% after the
post-processing for the aforementioned 21 recordings
(min. = 0%, max. = 14%). However, it worked well
to reduce manual annotation effort by reviewing the
original transcript and the corrections instead of tran-
scribing everything by hand.
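The concrete dictionary entries are not published; the following is a minimal sketch of such a rule-based pass, with purely hypothetical mappings:

# Rule-based post-processing; both mappings below are hypothetical
# examples, not entries from the study's actual dictionary.
CORRECTIONS = {
    "Hyper Cut 1530": "HyperCut 1530",  # assumed spacing error in a product name
    "Fokus 40": "Focus 40",             # assumed German/English mix-up in a product name
}

def post_process(transcript: str) -> str:
    for wrong, right in CORRECTIONS.items():
        transcript = transcript.replace(wrong, right)
    return transcript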
Background noise, unclear speech, or heavily ac-
cented pronunciations were not explicitly annotated
or flagged. Instead, such segments were corrected
to the best of the annotator’s ability. When speak-
ers used unconventional terms or deliberately spelled
out words (e.g., “dot” for “.”), the transcription was
directly edited to reflect the appropriate punctuation
mark. However, this method relied heavily on the
assumption that the fine-tuned model would learn
to handle such patterns automatically in future iter-
ations. No special markings were used to indicate
noisy sections, which might have affected the model’s
learning in these cases.
The transcriptions and audio paths were formatted
into a JSON file as follows.
[
    {
        "audio_path": "path\to\audio.m4a",
        "transcription": "This is the correct transcription of the audio"
    }
]
2.5 Training
The recordings, initially in .m4a format, were con-
verted to .wav with a sampling rate of 16 kHz to en-
sure compatibility with Whisper. Padding and trun-
cation ensured uniform input lengths, with shorter in-
puts padded and longer ones truncated. An attention
mask was used to differentiate meaningful data from
padding tokens, preventing unexpected behavior dur-
ing training as follows:
# Inside the dataset class: prepare one (audio, transcription) pair
inputs = self.processor(audio, sampling_rate=sr,
                        return_tensors="pt",
                        return_attention_mask=True)
input_features = inputs.input_features.squeeze()

# Tokenize the transcription
labels = self.processor.tokenizer(transcription,
                                  return_tensors="pt",
                                  padding="max_length",
                                  max_length=448,
                                  truncation=True).input_ids.squeeze()

# Mark padding tokens with -100 so the loss ignores them
labels = torch.where(labels == self.processor.tokenizer.pad_token_id,
                     -100, labels)

attention_mask = inputs.attention_mask.squeeze()
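The format conversion described at the start of this section could look as follows; a minimal sketch assuming pydub with ffmpeg available for .m4a decoding (mono downmixing is an added assumption, since Whisper's feature extractor expects single-channel input):

from pydub import AudioSegment  # needs ffmpeg installed for .m4a input

# Convert an .m4a recording to a 16 kHz mono .wav file for Whisper;
# mono is an assumption, not stated in the paper.
audio = AudioSegment.from_file("recording.m4a", format="m4a")
audio = audio.set_frame_rate(16000).set_channels(1)
audio.export("recording.wav", format="wav")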
The training loop consisted of forward and backward
passes to optimize model weights using the AdamW
optimizer; a condensed sketch follows the list below. Key hyperparameters included:
Batch size: Due to GPU memory limitations,
batch sizes of 1 and 5 were tested. As seen in
Section 3, batch size 1 demonstrated better gener-
alization and was used for further analysis.
Epochs: Training was conducted for five epochs,
as the validation loss stabilized after the fourth
epoch, preventing overfitting.
Learning rate: The AdamW optimizer adapted the
learning rate. The starting learning rate was 5 × 10⁻⁵.
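A condensed sketch of this loop, assuming model is a Hugging Face WhisperForConditionalGeneration instance and train_loader yields the batches prepared above (both names are placeholders):

from torch.optim import AdamW

# One pass per epoch over the prepared batches: the forward pass
# computes the loss, the backward pass and step update the weights.
optimizer = AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(5):
    for batch in train_loader:
        optimizer.zero_grad()
        out = model(input_features=batch["input_features"],
                    attention_mask=batch["attention_mask"],
                    labels=batch["labels"])
        out.loss.backward()
        optimizer.step()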
To prevent overfitting, the dataset was split into
training (80%) and validation (20%) sets using
train_test_split from scikit-learn (Pedregosa et al.,
2011), ensuring that the model’s performance could
be monitored on unseen data. The model’s perfor-
mance was evaluated using the WER and the absolute
training and validation loss. This approach provided
a clear view of model improvements across epochs.
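The split itself is a single call; samples stands for the annotated (audio path, transcription) pairs, and the fixed random_state is an assumed choice for reproducibility:

from sklearn.model_selection import train_test_split

# 80/20 split of the annotated samples; random_state is assumed.
train_samples, val_samples = train_test_split(
    samples, test_size=0.2, random_state=42)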
3 RESULTS
Comparing the results of models trained with batch
sizes of 1 and 5 revealed distinct differences in their
ability to generalize. For batch size 1, both training
and validation losses steadily decreased (Figure 1a),
with validation loss stabilizing after the third epoch.
This indicates effective learning and a reduced risk of
overfitting. In contrast, for batch size 5, while training
loss decreased, validation loss fluctuated, suggesting
difficulties in generalizing to unseen data (Figure 1b).
Similarly, WER trends (Figures 1c and 1d) show
a consistent improvement with batch size 1, whereas
batch size 5 exhibited fluctuations, particularly in
later epochs. These observations led to the decision
to focus on batch size 1 for further analysis, as it pro-
vided greater stability and better generalization de-
spite higher computational costs.
Training for five epochs was found to be optimal.
The loss graphs for batch size 1 showed continued re-
duction in training loss, but the stabilization of vali-
dation loss after the fourth epoch suggested that ad-
ditional epochs might lead to overfitting. The train-
ing loss decreased from approximately 2.46 to 0.44 by
the end of the fifth epoch, reflecting successful model
learning. Validation loss initially mirrored the train-
ing loss but began to stabilize and increase slightly
after epoch three, indicating potential overfitting if
training continued beyond the fifth epoch.
WER trends confirmed this pattern. After an ini-
tial increase in WER in the first two epochs, the metric
dropped significantly from epoch three onward, with
the lowest WER of around 2% achieved at epoch five
(Figure 1c). This demonstrates that the model pro-
gressively improved its transcription accuracy with
fine-tuning, particularly for domain-specific terms,
validating the choice of hyperparameters.
Fine-tuning the Whisper model addressed key
limitations observed in its out-of-the-box perfor-
mance, particularly with the transcription of proper
nouns and bilingual terms. These challenges, critical in domain-specific applications, were mitigated through the customization of Whisper to industrial data.
Figure 1: Comparison between batch size 1 and batch size 5; (a) loss for batch size 1, (b) loss for batch size 5, (c) WER for batch size 1, (d) WER for batch size 5.
The out-of-the-box Whisper model struggled with
the accurate transcription of proper nouns, including
customer and product names, often producing incon-
sistent or incorrect results due to variability in pronun-
ciation among speakers. This issue was compounded
by the lack of domain-specific contextual knowledge,
introducing further ambiguity.
After fine-tuning, the model showed significant
improvements in its ability to transcribe proper nouns
accurately, even when pronounced differently by vari-
ous speakers. The inclusion of domain-specific train-
ing data allowed the model to better recognize and
adapt to industrial terminology, reducing the need for
manual corrections. Even new proper nouns with a similar pattern, such as product names, were transcribed
correctly with the fine-tuned model, addressing one
of the primary flaws of the out-of-the-box model.
Fine-tuning on bilingual recordings improved the
model’s ability to handle mixed-language content.
While some challenges with English terms persisted,
likely caused by the pronunciation of some speak-
ers, the model became better equipped to distin-
guish between German and English terms within the
same transcription, correctly transcribing examples
like “Focus” in product names. This adaptation mini-
mized errors in critical technical terminology, enhanc-
ing the usability of transcriptions in professional set-
tings.
3.1 Testing the Final Model
The fine-tuned model was tested on new audio exam-
ples to evaluate its performance in real-world scenar-
ios. Fine-tuning effectively addressed significant lim-
itations of the out-of-the-box Whisper model, partic-
ularly in the transcription of domain-specific terms,
technical jargon, and proper nouns. For example,
product names like “HyperCut 1530” and customer
names spelled out in audio were consistently tran-
scribed correctly, even when pronounced differently
by various speakers. These improvements highlight
the model’s enhanced understanding of industrial ter-
minology and its ability to adapt to speaker variability.
However, certain issues persisted. While the tran-
scription of individual terms improved, the model ex-
hibited challenges in maintaining coherence in longer
sentences. Repetitive loops, such as repeating phrases
unnecessarily, were still observed in complex inputs,
indicating difficulty in capturing the semantic struc-
ture of extended discourse. Additionally, although
the fine-tuned model showed better handling of bilin-
gual content, occasional errors remained in cases
where English terms were embedded in predomi-
nantly German audio (e.g., “Focus” being transcribed as “Fokus”).
Overall, the fine-tuning process significantly im-
proved the model’s ability to handle domain-specific
vocabulary and context, addressing many of the criti-
cal flaws in the out-of-the-box configuration. How-
ever, limitations in sentence-level coherence and
bilingual adaptability suggest room for further opti-
mization, particularly in handling longer and more
complex inputs.
4 DISCUSSION
4.1 Interpretation of Results
The STT transcription component was fine-tuned to
a specific domain, focusing on accurately transcrib-
ing technical terms, product names, and customer-
related information, to optimize the model for eco-
nomical use in specific branches. A detailed quali-
tative evaluation of the transcriptions reveals several
key strengths and areas of concern, providing valu-
able insights into the model’s capabilities and limi-
tations. The primary focus was on the accuracy of
domain-specific vocabulary, the handling of environ-
mental noise, and the system’s overall ability to cap-
ture the intent and structure of spoken content.
One notable issue encountered during the evalu-
ation of the fine-tuned model was the occurrence of
repeated phrases. At certain points during transcrip-
tion, the model would no longer produce coherent
transcriptions of the audio but instead repeat previous
or newly generated 2-3 word phrases until the tran-
scription was terminated. This repetitiveness problem
makes it challenging to quantitatively analyze and in-
terpret the results. Resolving this issue is crucial for
future work, as reliable and accurate transcription is
foundational for any downstream NLP tasks.
It is possible that this problem is due to the lim-
itations of Whisper, which is only designed to handle 30-second recordings. This theory is supported by the fact that the repetitive output begins after the 30-second mark. Although Whisper has a built-
in parameter to overcome this, and the original model
could transcribe longer sequences, it seems that fine-
tuning the model conflicts with this setting. Future
work will need to be done to overcome this signifi-
cant challenge.
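One plausible mitigation is chunked long-form decoding, as exposed by the Hugging Face pipeline; the sketch below assumes a hypothetical path to the fine-tuned checkpoint:

from transformers import pipeline

# Chunked long-form transcription; "./whisper-finetuned" is a
# hypothetical checkpoint path, not one from this study.
asr = pipeline("automatic-speech-recognition",
               model="./whisper-finetuned",
               chunk_length_s=30,   # Whisper's native window size
               stride_length_s=5)   # overlap used to stitch chunk borders
print(asr("long_recording.wav")["text"])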
As quantitative evaluation metrics such as the WER suffered from the problem described above, they are of limited use for assessing the model's performance independently of the repetitive phrases. Nevertheless, the fine-tuned model achieves a lower WER of 2% compared to the out-of-the-box Whisper model and the rule-based post-processing approach.
A qualitative interpretation reveals a notable im-
provement in correctly transcribing complex techni-
cal terms and industry-specific jargon, which is cru-
cial in the context of field service reports. Terms such
as product-specific names were consistently tran-
scribed with high accuracy.
This level of precision is essential when docu-
menting technical discussions or generating reports,
as incorrect transcriptions could lead to misunder-
standings and potential delays in customer support.
A key challenge in transcribing documentation
is the accurate recognition of alphanumeric strings,
like serial numbers, email addresses or phone num-
bers, which are often pronounced differently by dif-
ferent speakers. The model generally performed well
in this regard, but occasional errors occurred when
speakers used abbreviations or non-standard pronun-
ciations. For example, variations in how certain prod-
uct names were articulated led to inconsistencies in
transcription, indicating a need for greater robustness
in handling phonetically similar but contextually dis-
tinct terms.
One of the most challenging aspects of real-world
audio is the presence of background noise, such as
conversations, machinery sounds, or traffic noise,
which can significantly impact transcription qual-
ity. The model exhibited a moderate level of noise
robustness, effectively filtering out low-level back-
ground disturbances in controlled settings. However,
in recordings with high levels of environmental noise
or overlapping speech, the model struggled, leading
to incomplete or inaccurate transcriptions. This lim-
itation is particularly relevant in the field service do-
main, where recordings are often made in noisy envi-
ronments such as workshops or cars.
The same observations apply to speaker variation and
accents. While the model generally performed well
with native German speakers using a standard ac-
cent, it showed a decline in accuracy with regional
dialects or non-native speakers. This was highlighted
in the transcription of uncommon words or of English
phrases in a German context, where the pronuncia-
tion is highly critical for recognition. Since techni-
cal terms or product names often contained English
elements, this posed a problem for the model. This
points to a need for more diverse training data that
can better represent the full range of speaker varia-
tions encountered in the field.
Overall, while the STT model shows strong per-
formance in key areas, its practical utility in unstruc-
tured and noisy environments remains an open chal-
lenge. Addressing these limitations by diversifying
the training data would be a critical step in making
the digital assistant a more versatile tool for field ser-
vice professionals working in a variety of real-world
conditions.
4.2 Implications
The results presented in this paper demonstrate the
potential of leveraging fine-tuned machine learning
models to improve the functionality and usability of
digital assistant systems, specifically in the context
of specific domains that are underrepresented in the training data used for current models. The STT
component showed strong performance in capturing
and processing domain-specific language, technical
terms, and context-specific entities. The insights
gained through the qualitative analysis of the STT
model confirm that targeted adaptations of NLP mod-
els can significantly enhance digital assistants in spe-
cialized industry settings.
The primary goal of this work was to simplify
the interaction between users and the digital assistant
by enabling intuitive, speech-based communication,
thereby reducing administrative workload and allow-
ing employees to focus on technical problem solving
rather than laborious documentation tasks. The fine-
tuned model achieved this goal by reliably transcrib-
ing key technical terms. Despite some limitations in
handling environmental noise and non-standard pro-
Effectiveness of Whisper’s Fine-Tuning for Domain-Specific Use Cases in the Industry
1401
nunciations in the STT model, the system demon-
strated its ability to support technicians in real-world
scenarios by capturing essential details from their
spoken notes and structuring them into actionable
data, especially compared to the out-of-the-box Whis-
per model.
Another central objective of this project was to
create a system that would seamlessly integrate into
existing workflows without requiring extensive user
training or customization. To achieve this, the digital
assistant must be highly accurate, reliable, and able
to handle the wide variety of linguistic input encoun-
tered in day-to-day operations. This is in line with the
need to build trust between the system and its human
users. Trust is a key factor in the successful adoption
and long-term use of any digital assistant. A digi-
tal assistant that consistently performs well and deliv-
ers accurate results encourages employees to incorpo-
rate it into their daily routines. Conversely, negative
experiences, such as low accuracy, misinterpretation of critical terms, or unexpected system behavior, can
undermine user trust and make integration difficult.
This work represents a promising step toward this goal.
5 FUTURE STEPS
The immediate priority for future work should be to
fully realize the potential of the assistant and address
the problems identified in the STT model, particu-
larly the issue of repeated phrases at the end of tran-
scriptions. As highlighted in the discussion, this is
likely due to the incorrect handling of the receptive
field. Implementing the long-form algorithm in the
fine-tuning would be the first step towards solving this
problem.
In addition, improving the model’s handling of en-
vironmental noise and speaker variation should be a
high priority. To achieve this, more data augmenta-
tion techniques could be employed, such as adding
synthetic noise to training samples or using vocal
transformation methods to simulate different accents
and speech rates. These adjustments would help the
model to better generalize to the diverse conditions
encountered in real-world scenarios.
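A minimal sketch of the synthetic-noise idea, assuming waveforms are available as torch tensors; the default SNR value is an arbitrary example:

import torch

def add_noise(waveform: torch.Tensor, snr_db: float = 10.0) -> torch.Tensor:
    """Mix white Gaussian noise into a waveform at a target SNR in dB."""
    signal_power = waveform.pow(2).mean()
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = torch.randn_like(waveform) * noise_power.sqrt()
    return waveform + noise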
In the short term, solving the current problems
would make the transcription and therefore the dig-
ital assistant more reliable and user-friendly, encour-
aging adoption and reducing employee skepticism. In
the medium term, enhanced models with greater ro-
bustness and expanded entity coverage would provide
users with a tool capable of handling complex work-
flows, further reducing manual work and minimizing
data entry errors.
Further enhancements should also take into ac-
count the fact that the field of ASR is fast-moving
and progressive, and therefore domain-specific fine-
tuning may be better supported by other models in the
future.
In the long term, the development of a multimodal
digital assistant with advanced contextual understand-
ing would fundamentally change industrial workflows
in a variety of sectors. Such an assistant would act
as a true collaborator, capable of understanding nu-
anced technical discussions, providing real-time sup-
port, and integrating seamlessly into complex service
environments. This would not only optimize the ef-
ficiency of service processes, but also improve cus-
tomer satisfaction by enabling faster and more accu-
rate resolution of technical issues.
6 CONCLUSION
In summary, this work has demonstrated the effec-
tiveness of fine-tuning machine learning models for
domain-specific applications in digital assistant sys-
tems within the field service context.
The industrial setting was chosen because it is
underrepresented in the literature, where fine-tuning
of language models for medical or legal domains
is much more common (Yang et al., 2024; Huang
et al., 2023b). However, this paper serves as a proof
of concept of the efficacy and feasibility of fine-
tuning Whisper to any specific domain with real-
world recordings and use cases.
The STT component was successfully adapted to
capture complex technical terms and industry-specific
entities, providing a solid foundation for automating
and streamlining documentation processes. The qual-
itative evaluation highlighted the system’s strengths in
handling structured language, while also identifying
areas for further development to improve robustness
and usability.
REFERENCES
Baevski, A., Zhou, H., Mohamed, A., and Auli, M. (2020).
wav2vec 2.0: A Framework for Self-Supervised
Learning of Speech Representations. arXiv (Cornell
University).
Bandi, A., Adapa, P. V. S. R., and Kuchi, Y. E. V. P. K.
(2023). The Power of Generative AI: A Review of
Requirements, Models, Input–Output Formats, Evalu-
ation Metrics, and Challenges. Future Internet, 15(8).
Bermuth, D., Poeppel, A., and Reif, W. (2021). Scri-
bosermo: Fast Speech-to-Text models for German and
other Languages. arXiv (Cornell University).
Budzinski, O., Noskova, V., and Zhang, X. (2019). The
brave new world of digital personal assistants: ben-
efits and challenges from an economic perspective.
Netnomics, 20(2-3):177–194.
Drawehn, J. and Pohl, V. (2023). Einsatz von KI mit Fokus
Kundenkommunikation.
Favre, B., Cheung, K., Kazemian, S., Lee, A., Liu, Y.,
Munteanu, C., Nenkova, A., Ochei, D., Penn, G.,
Tratz, S., et al. (2013). Automatic human utility eval-
uation of ASR systems: Does WER really predict per-
formance? In INTERSPEECH, pages 3463–3467.
Fraunhofer IAO (2024). DafNe: Digitaler Außendienst-
Assistent für Nebenzeitoptimierung. Retrieved De-
cember 10, 2024.
Fukuda, M., Nishizaki, H., Iribe, Y., Nishimura, R., and
Kitaoka, N. (2020). Improving Speech Recognition
for the Elderly: A New Corpus of Elderly Japanese
Speech and Investigation of Acoustic Modeling for
Speech Recognition. In Calzolari, N., Béchet, F.,
Blache, P., Choukri, K., Cieri, C., Declerck, T., Goggi,
S., Isahara, H., Maegaard, B., Mariani, J., Mazo,
H., Moreno, A., Odijk, J., and Piperidis, S., editors,
Proceedings of the Twelfth Language Resources and
Evaluation Conference, pages 6578–6585, Marseille,
France. European Language Resources Association.
Hannun, A. Y., Case, C., Casper, J., Catanzaro, B., Diamos,
G., Elsen, E., Prenger, R., Satheesh, S., Sengupta, S.,
Coates, A., and Ng, A. Y. (2014). Deep Speech: Scal-
ing up end-to-end speech recognition. arXiv (Cornell
University).
Hoy, M. B. (2018). Alexa, Siri, Cortana, and More: An
Introduction to Voice Assistants. Medical Reference
Services Quarterly, 37(1):81–88.
Huang, H., Tang, T., Zhang, D., Zhao, X., Song, T., Xia, Y.,
and Wei, F. (2023a). Not All Languages Are Created
Equal in LLMs: Improving Multilingual Capability
by Cross-Lingual-Thought Prompting. In Bouamor,
H., Pino, J., and Bali, K., editors, Findings of the Asso-
ciation for Computational Linguistics: EMNLP 2023,
pages 12365–12394, Singapore. Association for Com-
putational Linguistics.
Huang, Q., Tao, M., Zhang, C., An, Z., Jiang, C., Chen, Z.,
Wu, Z., and Feng, Y. (2023b). Lawyer llama technical
report.
Lee, T., Lee, M.-J., Kang, T. G., Jung, S., Kwon, M.,
Hong, Y., Lee, J., Woo, K.-G., Kim, H.-G., Jeong,
J., Lee, J., Lee, H., and Choi, Y. S. (2021). Adapt-
able Multi-Domain Language Model for Transformer
ASR. In ICASSP 2021 - 2021 IEEE International
Conference on Acoustics, Speech and Signal Process-
ing (ICASSP), pages 7358–7362.
Milde, B. and Köhn, A. (2018). Open source automatic speech recognition for German.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,
Thirion, B., Grisel, O., Blondel, M., Prettenhofer,
P., Weiss, R., Dubourg, V., Vanderplas, J., Passos,
A., Cournapeau, D., Brucher, M., Perrot, M., and
Duchesnay, E. (2011). Scikit-learn: Machine learning
in Python. Journal of Machine Learning Research,
12:2825–2830.
Pereira, R., Lima, C., Pinto, T., and Reis, A. (2023). Vir-
tual Assistants in Industry 4.0: A Systematic Litera-
ture Review. Electronics, 12(19).
Pfeiffer, M. (2024). Comparison of existing technologies
and models for the design of an AI-supported digital
assistance system.
Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey,
C., and Sutskever, I. (2022). Robust Speech Recogni-
tion via Large-Scale Weak Supervision.
Trabelsi, A., Warichet, S., Aajaoun, Y., and Soussilane, S.
(2022). Evaluation of the efficiency of state-of-the-art
Speech Recognition engines. Procedia Computer Sci-
ence, 207:2242–2252. Knowledge-Based and Intel-
ligent Information & Engineering Systems: Proceed-
ings of the 26th International Conference KES2022.
Yang, Q., Wang, R., Chen, J., Su, R., and Tan, T. (2024).
Fine-tuning medical language models for enhanced
long-contextual understanding and domain expertise.