Development of a Real‑Time Speech‑to‑Text Converter Using Raspberry Pi

Guna Sekhar G., Pavan Kumar K. and A. R. Kalairasi

Department of Electronics and Instrumentation Engineering, Saveetha Engineering College, Chennai, Tamil Nadu, India

Keywords: Speech Recognition, Raspberry Pi, Real‑Time Transcription, Embedded System, Accessibility Technology.

Abstract: This paper discusses the architecture and implementation of a Raspberry Pi-based real-time speech-to-text conversion system suited to cost-optimized, portable speech recognition applications. The system combines inexpensive, widely available hardware (a USB microphone and a Raspberry Pi) with open-source software tools to transcribe spoken language into text with reasonable performance. It can be used in a broad range of applications, such as improving accessibility for deaf people, helping physically disabled people to type without using their hands, and automatically transcribing conversations in meetings and lecture halls. Its components are a USB microphone to capture audio, real-time speech recognition software (the Google Speech-to-Text API or CMU Sphinx), and a display interface to show the converted text. By using the processing power of the Raspberry Pi, the system performs speech recognition efficiently while remaining inexpensive and portable. The system was evaluated in several scenarios and showed high accuracy and low latency under controlled conditions, indicating that it could potentially be deployed in the real world. Across domains ranging from accessibility and education to transcription services, the system offers an effective, low-cost real-time speech processing solution compared with traditional systems.

1 INTRODUCTION

The availability of low-cost computer systems such as the Raspberry Pi has fueled new solutions in nearly every technology sector. A prominent example is speech recognition technology, which is widely used in accessibility, transcription services, and human-computer interaction. Traditional speech-to-text systems required substantial processing power and were confined to high-performance computing environments, which restricted them to industries or settings with sophisticated computational capabilities. However, owing to recent advances in machine learning algorithms and in the performance of embedded computing platforms, small, low-cost, real-time speech-to-text systems have become increasingly practical. The Raspberry Pi, a small and relatively inexpensive computing platform, creates an opportunity to build such systems and make advanced technology broadly accessible. The system presented here converts speech into text in real time using a Raspberry Pi. The aim is to provide a cost-effective and portable solution for transforming spoken language into readable text, benefiting a wide range of people, including individuals with hearing impairments, students, and professionals. This makes the framework applicable in settings where a commercial solution would be too costly, such as educational institutions, small offices, and homes.

2 LITERATURE SURVEY

2.1 The Application of Hidden Markov Models in Speech Recognition

The earliest speech recognition systems could recognize only a handful of digits or words. Hidden Markov Models (HMMs), surveyed by Gales and Young (2007), markedly improved recognition accuracy by introducing statistical methods. Since then, significant progress has been made with machine learning and deep learning methods, in which neural networks analyze vast amounts of speech data to improve recognition accuracy. Cloud computing drives modern speech recognition systems, such as Google’s Speech-to-Text API and Apple’s Siri, which can perform real-time, highly accurate transcription of speech. Nonetheless, these systems require significant computational power as well as internet connectivity, which makes them cumbersome and sometimes infeasible in offline and low-resource scenarios.

G., G. S., K., P. K. and Kalairasi, A. R.
Development of a Real-Time Speech-to-Text Converter Using Raspberry Pi.
DOI: 10.5220/0013909900004919
Paper published under CC license (CC BY-NC-ND 4.0)
In Proceedings of the 1st International Conference on Research and Development in Information, Communication, and Computing Technologies (ICRDICCT‘25 2025) - Volume 4, pages 182-187
ISBN: 978-989-758-777-1

2.2 Home Automation Using Raspberry Pi through Siri Enabled Mobile Devices

As embedded systems such as the Raspberry Pi evolved, developers began running speech recognition on these devices. The Raspberry Pi is a credit-card-sized computer and a popular development platform because it is inexpensive, consumes little power, and is easy to use. Embedded systems were applied successfully to speech recognition even before the data explosion era, because lightweight algorithms and open-source tools emerged that were adapted to the constraints of these systems. A number of studies have focused on real-time speech recognition on state-of-the-art embedded systems. Bhuyan et al. (2018) developed a Raspberry Pi-based speech recognition system for home automation. Although the system handled basic command recognition well, it was challenged by non-uniform sentences and needed fine-tuning to work well in noisy environments. Similarly, Yuan et al. (2019) presented a low-cost, real-time speech recognition system. These early systems, however, struggled with accuracy and speed, especially in noisy conditions.

2.3 Cloud-Based vs. On-Device Speech Recognition

There are two main strategies for speech-to-text systems: cloud processing and on-device processing. Cloud-based services such as Google’s Speech-to-Text API offer high accuracy and simple integration into larger systems, but they require an internet connection. This makes them unsuitable for applications in remote locations or in scenarios where privacy is vital, since audio data must be uploaded to remote servers for processing. On-device speech recognition, by contrast, runs locally, which makes it suitable for offline applications. Local speech recognition systems have commonly been developed using tools such as CMU Sphinx and Mozilla’s DeepSpeech. CMU Sphinx is a lightweight, open-source speech recognition engine that runs well on resource-constrained devices such as the Raspberry Pi. Its accuracy is lower than that of cloud-based solutions, particularly when transcribing natural speech and in use cases that involve complex vocabulary. On-device solutions are nonetheless preferable when the internet connection is unreliable or privacy is a concern.

2.4 Recent Developments in Speech-to-Text Using Raspberry Pi

Much recent work on embedded speech-to-text has focused on improving performance on platforms such as the Raspberry Pi. For instance, the speech recognition system based on Raspberry Pi 4 and Python libraries (Dhal et al.; Zhang et al., 2021) used the Google Speech-to-Text API for transcription. It therefore required internet access, which restricted offline usage. Other work has focused on maximizing on-device performance. Jain et al. (2020) developed a real-time speech-to-text pipeline using CMU Sphinx on a Raspberry Pi. Although the system converted short phrases into text with relatively high accuracy, it performed poorly on longer stretches of speech and in the presence of background noise. Noise reduction and language modeling are among the techniques proposed to improve the performance of such systems on embedded platforms.

3 BLOCK DIAGRAMS

The block diagram shows the hardware implementation of an image-based OCR system on a Raspberry Pi. The process starts with capturing an image, followed by processing and filtering the image. Edge detection comes next, because separating the edges gives better visibility of the objects and helps differentiate text from the background. Lastly, OCR transforms the document image into a readable digital output that is processed by the Raspberry Pi, which can route the audio to a connected speaker. Figure 1 shows the system block diagram.


Figure 1: System block diagram.

4 SYSTEM DESIGN AND ARCHITECTURE

The system proposed in this work converts speech to text in real time on a Raspberry Pi, providing a compact, low-cost hardware solution. It supports both a cloud-based and an open-source speech recognition engine. The essential components of the system are:

4.1 Raspberry Pi

The Raspberry Pi 4 Model B is chosen for its cost-effectiveness and computational power. Its quad-core ARM Cortex-A72 CPU and 4GB RAM offer sufficient resources for real-time audio processing tasks. The device's small size and low power consumption make it ideal for portable applications, while its GPIO pins enable easy integration with external peripherals like microphones and displays. This makes the Raspberry Pi suitable for educational, professional, and personal settings.

4.2 Microphone Interface

The system uses a USB microphone to capture speech. The microphone converts spoken words into a digital signal, which the Raspberry Pi processes in real time. USB microphones provide good sound quality and interface easily with the Raspberry Pi. In noisy surroundings, noise-canceling microphones can be used to improve speech clarity and recognition accuracy. The system thus works in real time: it takes audio from the microphone and transcribes it as it arrives.

4.3 Speech Recognition Engine

The speech recognition engine is the core of the system, and two options are available:

• Google Speech-to-Text API: A cloud-based service that offers high accuracy and processes speech data on Google’s servers. It is well suited to settings where internet access is available and accuracy is paramount. On the other hand, sending audio to external servers raises privacy concerns.

• CMU Sphinx: An open-source alternative that works offline on the Raspberry Pi. It is not as accurate as the cloud-based solution, but it is a better fit for applications that need privacy or for locations with limited internet availability.

By offering both an online and an offline engine, the system can be adapted to the needs of the user.
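The choice between the two engines can be illustrated with a small sketch. This is not the paper's implementation: the function name `transcribe`, the `prefer_cloud` flag, and the stand-in recognizers are hypothetical, and the fallback-on-network-error policy is an assumption about how online and offline operation might be combined.

```python
from typing import Callable, Tuple

# A recognizer takes raw audio bytes and returns text (hypothetical interface).
Recognizer = Callable[[bytes], str]


def transcribe(audio: bytes,
               cloud: Recognizer,
               offline: Recognizer,
               prefer_cloud: bool = True) -> Tuple[str, str]:
    """Return (engine_name, text), trying the preferred engine first.

    If the cloud engine fails with a network-related error, fall back to
    the offline engine so that transcription keeps working.
    """
    if prefer_cloud:
        try:
            return "cloud", cloud(audio)
        except OSError:
            # ConnectionError and TimeoutError are subclasses of OSError.
            pass
    return "offline", offline(audio)


# Stand-in recognizers so the sketch runs without hardware or a network.
def fake_cloud(audio: bytes) -> str:
    raise ConnectionError("no internet connection")


def fake_offline(audio: bytes) -> str:
    return "hello world"


engine, text = transcribe(b"\x00\x01", fake_cloud, fake_offline)
print(engine, text)
```

On a real device, wrappers around the Google Speech-to-Text API and CMU Sphinx would take the place of the stand-ins. A cloud client library may signal network failures with its own exception type, in which case that type would need to be caught as well.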

4.4 Display

The transcribed text can be shown on an external monitor or a compact LCD module. An external monitor connected via HDMI or VGA suits desktop or workstation use, as in transcription services. Alternatively, a small LCD connected to the GPIO pins suits portable projects, particularly assistive technology. In both cases, the system offers real-time feedback, displaying transcribed speech almost immediately. This architecture provides a flexible, low-cost solution for speech-to-text conversion that can operate online or offline depending on the application requirements.
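For a compact LCD, the transcribed text has to be fitted to a small character grid. The sketch below assumes a 16-column by 2-row character display, which the paper does not specify, and the helper name `lcd_page` is hypothetical. It keeps only the most recent lines so that the newest speech stays visible.

```python
import textwrap
from typing import List


def lcd_page(text: str, cols: int = 16, rows: int = 2) -> List[str]:
    """Return the last `rows` wrapped lines of `text`, each padded to `cols`."""
    lines = textwrap.wrap(text, width=cols) or [""]
    visible = lines[-rows:]                      # keep the most recent lines
    visible += [""] * (rows - len(visible))      # pad short pages with blanks
    return [line.ljust(cols) for line in visible]


for line in lcd_page("the quick brown fox jumps over the lazy dog"):
    print(repr(line))
```

Each returned string has exactly `cols` characters, so it could be written to one display row without leaving stale characters from the previous update.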

5 METHODOLOGY

The system was implemented in the following stages:

Hardware Setup:

• Raspberry Pi 4 Model B with 2 GB RAM.

• Microphone to record voice (preferably USB).

• Optional: external display or LCD module to show the text.

Software Setup:

• The system runs on Raspbian OS.

• The Google Speech-to-Text API or CMU Sphinx was integrated for speech recognition.

• The audio capture and speech processing pipeline was managed using Python.


• PyAudio was used for microphone input.

• The whole application can be built in Python, with libraries such as gTTS (Google Text-to-Speech) adding further functionality such as text-to-speech.

• Speech Processing: The audio input is read in short clips, and each clip is forwarded to the speech recognition engine. The recognized text is then displayed on the output in real time, with minimal latency.
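The clip-by-clip processing described above can be sketched as a simple loop. The sample rate, sample width, and clip length below are assumptions (the paper does not state them), the function names are hypothetical, and an in-memory buffer stands in for the microphone so that the example runs anywhere.

```python
import io
from typing import BinaryIO, Callable, Iterator

SAMPLE_RATE = 16000      # samples per second (assumed)
BYTES_PER_SAMPLE = 2     # 16-bit mono PCM (assumed)
CLIP_SECONDS = 0.5       # length of each clip (assumed)


def read_clips(stream: BinaryIO, clip_bytes: int) -> Iterator[bytes]:
    """Yield consecutive clips of at most `clip_bytes` bytes until the stream ends."""
    while True:
        clip = stream.read(clip_bytes)
        if not clip:
            break
        yield clip


def run_pipeline(stream: BinaryIO,
                 recognize: Callable[[bytes], str],
                 show: Callable[[str], None]) -> int:
    """Send each clip to the recognizer and display any text; return the clip count."""
    clip_bytes = int(SAMPLE_RATE * BYTES_PER_SAMPLE * CLIP_SECONDS)
    count = 0
    for clip in read_clips(stream, clip_bytes):
        text = recognize(clip)
        if text:
            show(text)
        count += 1
    return count


# One second of silence stands in for microphone input.
audio = io.BytesIO(bytes(SAMPLE_RATE * BYTES_PER_SAMPLE))
n = run_pipeline(audio, lambda clip: f"<{len(clip)} bytes>", print)
```

On the device, a PyAudio input stream could replace the buffer. Note that its `read` method takes a number of frames rather than a number of bytes, so the clip size would be expressed in samples.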

6 RESULTS AND PERFORMANCE EVALUATION

The real-time speech-to-text system was evaluated through a series of tests under various environmental conditions. The outcomes, presented in Table 1, show that the system is robust in controlled environments and needs refinement in more complex ones.

Test 1: Quiet Environment

In a quiet environment with minimal background noise, the system achieved an accuracy of 95%. This high accuracy reflects the system's performance in ideal conditions, where the speech is clear and only a few transcription mistakes can be expected.

Test 2: Moderate Background Noise

In moderate background noise (e.g., a typical office or household environment), the system's accuracy dropped to 85%. Despite the background noise, speech-to-text conversion worked well enough, with a few misinterpretations, usually of similar-sounding words.

Test 3: Noisy Environment

In a noisy environment (e.g., public places or other settings with high ambient noise), the accuracy of the system degraded to 70%. This hurdle was mitigated by applying noise-cancellation techniques, which helped yield higher recognition rates. Further improvement would likely be possible with more advanced noise filtering or an adaptive speech model that adjusts to the microphone environment.

Latency

The average latency between speech input and text output was about 1.5 seconds. This response time is reasonable for real-time applications and allows a smooth user experience with little delay in transcription.
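The paper does not describe how latency was measured. One plausible approach, sketched here under that caveat, is to time each recognition call with a monotonic clock and average the results. The names `measure_latency` and `slow_recognizer` are hypothetical, and the sketch times only the recognition step, not audio capture or display.

```python
import time
from statistics import mean
from typing import Callable, Iterable, List


def measure_latency(clips: Iterable[bytes],
                    recognize: Callable[[bytes], str]) -> List[float]:
    """Return the wall-clock seconds that each recognize() call took."""
    latencies = []
    for clip in clips:
        start = time.perf_counter()
        recognize(clip)
        latencies.append(time.perf_counter() - start)
    return latencies


def slow_recognizer(clip: bytes) -> str:
    """Stand-in recognizer that simulates processing time."""
    time.sleep(0.01)
    return "text"


samples = measure_latency([b"a", b"b", b"c"], slow_recognizer)
print(f"average latency: {mean(samples):.3f} s over {len(samples)} clips")
```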

Table 1: Performance results.

Environment                  Accuracy   Latency
Quiet Environment            95%        1.5 seconds
Moderate Background Noise    85%        1.5 seconds
Noisy Environment            70%        1.5 seconds
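The paper does not state how the accuracy figures were computed. A common convention, assumed here purely for illustration, is word accuracy defined as one minus the word error rate, where errors are counted with a word-level edit distance between a reference transcript and the system output. The function names and example sentences are hypothetical.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level Levenshtein distance between two sentences."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - WER, clipped at zero."""
    n = len(reference.split())
    if n == 0:
        raise ValueError("reference must contain at least one word")
    return max(0.0, 1.0 - word_errors(reference, hypothesis) / n)


ref = "turn on the kitchen light please"
hyp = "turn on the chicken light"
print(f"{word_accuracy(ref, hyp):.0%}")
```

In this example one word is substituted and one is dropped, giving two errors against a six-word reference.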

Summary

The results show that the system performs very well in quiet recording conditions and remains usable under moderate levels of background noise. However, accuracy decreases in noisy environments, where noise-cancellation techniques can improve performance. The implementation could be improved further with better-optimized speech models and better microphones.

7 DISCUSSIONS

This project shows how an embedded system such as the Raspberry Pi can be used to build a practical application: a real-time speech-to-text system. The test results show that the system can accurately transcribe speech in quiet environments, making it useful for transcription services, accessibility, and education.

While these results are encouraging, the performance of the system in noisy environments needs improvement, as results varied with microphone quality and noise level. The drop in accuracy under high-noise conditions indicates that, while the current implementation works, it could be improved by exploring more advanced noise-cancellation algorithms or by incorporating more sophisticated speech recognition models. Applying machine-learning-based noise suppression methods might alleviate many of these concerns and make the system considerably more usable in challenging audio environments.
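As a point of comparison for the more advanced methods mentioned above, a much simpler non-ML baseline is an energy gate that forwards a clip to the recognizer only when its root-mean-square amplitude exceeds a threshold. This is not the noise-cancellation technique used in the paper, which is not specified. The threshold value, the function names, and the 16-bit mono PCM format in native byte order are all assumptions.

```python
import math
from array import array


def rms(pcm: bytes) -> float:
    """Root-mean-square amplitude of 16-bit signed mono PCM (native byte order)."""
    samples = array("h")
    samples.frombytes(pcm[: len(pcm) - len(pcm) % 2])  # ignore a trailing odd byte
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_speech(pcm: bytes, threshold: float = 500.0) -> bool:
    """Treat a clip as speech when its RMS amplitude reaches the threshold."""
    return rms(pcm) >= threshold


# Synthetic clips: low-amplitude background versus a louder signal.
quiet = array("h", [10, -12, 8, -9] * 100).tobytes()
loud = array("h", [4000, -3500, 3800, -4200] * 100).tobytes()
print(is_speech(quiet), is_speech(loud))
```

A gate like this only skips clips that contain no loud signal; it does not remove noise that overlaps with speech, which is where the learned suppression methods discussed above would be needed.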

8 FUTURE ENHANCEMENTS

Several avenues exist for future enhancements to the system:


• Advanced Noise Cancellation via Machine Learning: Adding machine learning models for noise cancellation could significantly improve the accuracy of the system in noisy environments. Methods such as neural networks trained to remove background noise could reduce errors and make the system more robust across a range of use cases.

• Multilingual Support: Extending the system to accommodate multiple languages would improve its accessibility and applicability, especially in multilingual regions. This could be accomplished by adding further speech recognition engines or by extending the language models already available in engines such as Google’s API or CMU Sphinx.

• Portability and Compact Design: The Raspberry Pi is small enough for portable use, but additional changes could improve usability. Battery power for mobility, together with a more compact design (for example, a smaller display or wireless connectivity), would extend the system's versatility, particularly for on-the-move use cases such as wearables or assistive technology for the hearing impaired.

• Improved Speech Models: Developing speech models better suited to noisy or outdoor environments could also boost accuracy. Fine-tuning or training models on particular background noise profiles, accents, or use cases may improve performance in those scenarios.

9 CONCLUSIONS

In this paper, we have described the process of building a real-time speech-to-text converter using a Raspberry Pi, demonstrating the potential of a low-cost, portable system built from open-source software and off-the-shelf hardware components. The proposed system shows promise in both quiet and moderately noisy conditions, which can be helpful in many domains, such as transcription, accessibility, and education. The design allows flexibility in deployment: it can use Google’s Speech-to-Text API in online setups or CMU Sphinx for offline usage, depending on user requirements concerning internet connectivity and data privacy. Using the Raspberry Pi as the processing unit also highlights the feasibility of integrating such speech-to-text systems on cost- and energy-constrained embedded platforms. While accuracy dips in high-noise settings, incorporating noise-cancellation methods and machine learning models offers a clear path to improvement. Future versions may add features such as multilingual support, battery integration for portability, and more advanced audio models for more varied acoustic environments.

Overall, we believe that with further optimization, particularly in noise handling and computational efficiency, this speech-to-text system can serve as a strong foundation for other real-time applications, and we hope that this work is a step towards making such systems widely usable in more and more settings.

REFERENCES

Home automation using Raspberry Pi through Siri enabled mobile devices. December 2015. DOI: 10.1109/HNICEM.2015.7393270. Available at: https://www.researchgate.net/publication/304297304_Home_automation_using_raspberry_Pi_through_Siri_enabled_mobile_devices

Ivan Froiz-Miguel, Paula Fraga-Lamas. Design, Implementation, and Practical Evaluation of a Voice Recognition Based IoT Home Automation System for Low-Resource Languages. June 2023. DOI: 10.1109/ACCESS.2023.3286391. Available at: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10151879

K. Lakshmi, Mr. T. Chandra Sekhar Rao. Design and Implementation of Text to Speech Conversion Using Raspberry PI. Vol 4, No 6 (2016). Available at: https://www.ijitr.com/index.php/ojs/article/view/1287

M. Gales and S. Young. The Application of Hidden Markov Models in Speech Recognition. Foundations and Trends in Signal Processing, Vol. 1, No. 3 (2007), 195–304. DOI: 10.1561/2000000004. Available at: https://mi.eng.cam.ac.uk/~mjfg/mjfg_NOW.pdf

Prachi Khilari, Prof. Bhope V. P. Implementation of Speech to Text Conversion. Vol. 4, Issue 7, July 2015. Available at: https://www.ijirset.com/upload/2015/july/167_Implementation.pdf

Surinder Kaur, Sanchit Sharma. Voice Command System Using Raspberry PI. July 2016. DOI: 10.5121/acii.2016.3306. Available at: https://www.researchgate.net/publication/305922778_Voice_Command_System_Using_Raspberry_Pi


Uma N M, Syeda Rabiya Hussainy, Syeda Hafsa Ameen. Real Time Speaking System for Speech and Hearing Impaired People - Literature Survey. Volume: 08 Issue: 04, Apr 2021. Available at: https://www.irjet.net/archives/V8/i4/IRJET-V8I4191.pdf
