Using Neural Network to Develop Speech Recognition
Yida Liu
School of Communication, Taishan University of Science and Technology, Yamaguchi Town, Tai'an, China
Keywords: Speech Recognition, End-to-End, Convolutional Neural Network, Recurrent Neural Network.
Abstract: The application of neural networks in speech recognition has made remarkable progress in recent years, greatly improving the accuracy and robustness of recognition systems. This paper reviews the key technologies and recent progress of neural networks in speech recognition, with emphasis on different types of neural network architectures, such as multilayer perceptrons (MLP), convolutional neural networks (CNN), and recurrent neural networks (RNN), and their specific applications in processing speech signals. The paper also discusses the differences between hybrid deep-neural-network models and end-to-end systems, surveys existing methods, evaluates these two mainstream classes of speech recognition systems from perspectives such as practical usefulness, response speed, and overall integration, and analyzes their performance across different languages, noisy environments, and speaker variation. Finally, this study analyzes future development trends in neural-network speech recognition, such as more efficient model structures, cross-language transfer learning, and the ability to analyze and discriminate speech intonation. Through these discussions, this paper provides a valuable reference for the future development of speech recognition technology.
1 INTRODUCTION
Speech recognition technology, as an important means of human-computer interaction, has received wide attention and in-depth study in recent years. With the growth of computing power and the accumulation of big data, speech recognition technology based on neural networks has gradually become the mainstream approach in this field (Xu, 2021). Traditional speech recognition systems usually rely on complex manual feature extraction and rule-matching algorithms, which have obvious limitations when dealing with varied speech signals and complex language environments (Shi, 2017). With the rapid development of deep learning, the application of neural networks in speech recognition has made remarkable progress. After Bell Labs in 1952 and IBM in 1962 developed speech recognition systems based on isolated words, the application of hidden Markov models to speech recognition achieved milestone progress in the 1980s (Hadi, 2013). More recently, with the rise of neural networks, researchers began to apply them to speech recognition, moving from the basic multilayer perceptron to time-delay neural networks, recurrent neural networks, and convolutional neural networks. These different networks have achieved good results on different speech recognition tasks.
In contrast to traditional systems, neural networks can automatically learn features and patterns in speech signals through large-scale data training, achieving breakthroughs in the accuracy and robustness of speech recognition. In recent years, deep learning models such as the convolutional neural network (CNN) and recurrent neural network (RNN) have been widely used in speech recognition. These models can effectively capture the temporal and spatial features of speech signals and significantly improve the performance of speech recognition systems (Feng, 2017). In addition, the rise of end-to-end speech recognition methods has made the entire recognition pipeline simpler and more efficient, no longer relying on cumbersome intermediate steps and manual intervention (Vazhenina, 2020). These technical advances not only drive the deployment of speech recognition in application scenarios such as intelligent assistants, speech translation, and automatic subtitle generation, but also lay a solid foundation for natural human-computer interaction in the future.
This paper reviews the key technologies,
development history, application scenarios,
challenges and future prospects of neural networks in
speech recognition applications. First, speech
recognition is the process of recognizing and
understanding speech signals through a computer. It
uses signal processing and machine learning
techniques to convert sound into words or commands.
Common applications include voice assistants, voice
search, and voice command control. Second, a neural
network is a machine learning model that mimics the
neural network of the human brain and is used to
process complex data. It consists of multiple layers,
each of which deals with different features. Through
training, neural networks can learn and recognize
patterns, which are used in a wide range of fields,
including image recognition and natural language
processing. This paper covers four broad approaches to applying neural networks to speech recognition: basic neural network models, recurrent neural networks, convolutional neural networks, and hybrid models together with end-to-end systems. Finally, this paper looks ahead to future research directions, such as model efficiency, extensibility, and cross-language transfer learning, in order to provide reference and inspiration for future research.
2 RESEARCH METHODS
2.1 Basic Neural Network Model
The earliest speech recognition systems used basic neural network models such as the multilayer perceptron (MLP). MLP models perform classification or recognition tasks by learning feature representations of the input speech, but they are limited when processing complex speech signals. The MLP is a basic feedforward neural network: it consists of multiple layers of neurons, with each layer fully connected to the next. An MLP typically comprises an input layer, one or more hidden layers, and one output layer. The following is an overview of the application of MLPs in speech recognition and some related experimental results:
2.1.1 Feature Extraction
Speech signals usually need to be pre-processed and converted into features before they can be fed into an MLP. Common speech features include Mel-frequency cepstral coefficients (MFCC), spectral features, and linear predictive coding (LPC). The main purpose of feature extraction is to transform the speech signal into a stable, low-dimensional sequence of feature vectors, reducing computational complexity and data noise.
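As a concrete illustration, the following is a minimal feature extraction sketch; the choice of the librosa toolkit and all parameter values are assumptions for illustration, since this paper does not prescribe a specific library.

```python
# Minimal MFCC extraction sketch (librosa and all parameters are
# illustrative assumptions, not this paper's setup).
import librosa

def extract_mfcc(wav_path, n_mfcc=13, sr=16000):
    # Load the waveform at a fixed sampling rate.
    y, sr = librosa.load(wav_path, sr=sr)
    # Frame the signal and compute n_mfcc cepstral coefficients per frame.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Transpose to (time, features): one low-dimensional vector per frame.
    return mfcc.T  # shape: (num_frames, n_mfcc)
```

Each row of the returned matrix is one frame's feature vector, which is what the classifiers discussed below consume.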
2.1.2 Experimental Analysis
During training, the MLP adjusts its weights by learning patterns in the training data so that it can accurately map input features to the correct output categories. In speech recognition tasks, MLPs can be used as standalone classifiers to identify the categories of speech fragments (as in spoken command recognition), or combined with other models (such as hidden Markov models, HMMs) to handle more complex speech recognition tasks.
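A minimal sketch of such a frame-level MLP classifier follows; the PyTorch framework, layer sizes, and class count are illustrative assumptions rather than a configuration from the experiments discussed here.

```python
# MLP frame-classifier sketch (framework, sizes, and class count are
# illustrative assumptions).
import torch
import torch.nn as nn

class MLPClassifier(nn.Module):
    def __init__(self, n_features=13, n_hidden=256, n_classes=40):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, n_hidden),  # input -> hidden, fully connected
            nn.ReLU(),
            nn.Linear(n_hidden, n_hidden),    # extra hidden layer ("deep MLP")
            nn.ReLU(),
            nn.Linear(n_hidden, n_classes),   # hidden -> output categories
        )

    def forward(self, x):
        return self.net(x)  # raw logits; softmax gives class probabilities

# Usage: classify a batch of 32 MFCC frame vectors.
model = MLPClassifier()
probs = torch.softmax(model(torch.randn(32, 13)), dim=-1)
```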
In early speech recognition experiments, MLPs were often applied to small-scale speech data sets, such as isolated-word or small-vocabulary tasks. Experiments have shown that MLPs perform well on small-scale tasks, but as task complexity and data size increase, the traditional MLP falls short in modeling long-term dependencies. To improve the performance of MLPs in speech recognition, researchers have proposed a variety of improvements, such as deeper MLPs (more hidden layers) or techniques such as context windows. Experimental results show that, with these improvements, MLPs can still achieve good results on some specific speech recognition tasks, but their applicability and scalability remain limited. Multilayer perceptrons have a long history in speech recognition, especially in early small-scale tasks. Although the MLP has gradually been replaced by more complex models in large, complex speech recognition tasks, as a fundamental neural network structure it remains an important starting point for understanding and studying deep learning models. In some specific scenarios, an optimized MLP can still provide an effective solution.
2.2 Recurrent Neural Networks (RNN)
The RNN is a neural network structure especially suited to processing sequence data. Because of its internal recurrent structure, the RNN can process sequences effectively, so it is widely used in speech recognition. In particular, variant models such as the long short-term memory (LSTM) network and the gated recurrent unit (GRU) can better capture long-term dependencies in speech signals, improving the accuracy and robustness of recognition.
2.2.1 Feature Extraction
Like other neural networks, the speech data processed by an RNN first goes through feature extraction; commonly used features include MFCCs and Mel spectrograms. The extracted sequence of feature vectors preserves the temporal correlation of the speech signal, which is crucial for sequence modeling with an RNN. The defining feature of the RNN is the recurrent structure of its hidden layer, which feeds the output of the previous time step back in as part of the input at the current time step. This structure enables RNNs to capture the time dependence of sequence data. In speech recognition, the input to an RNN is a sequence of feature vectors and the output is usually a probability distribution over the corresponding phoneme or lexical categories.
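The sketch below shows one way such a model can be laid out; the framework, layer sizes, and class count are illustrative assumptions, not the paper's configuration.

```python
# LSTM sequence-model sketch: feature-vector sequence in, per-frame
# phoneme logits out (all sizes are illustrative assumptions).
import torch
import torch.nn as nn

class LSTMAcousticModel(nn.Module):
    def __init__(self, n_features=13, n_hidden=128, n_classes=40):
        super().__init__()
        # batch_first=True -> input shape (batch, time, features)
        self.lstm = nn.LSTM(n_features, n_hidden, num_layers=2,
                            batch_first=True)
        self.out = nn.Linear(n_hidden, n_classes)

    def forward(self, x):
        h, _ = self.lstm(x)   # hidden state at every time step
        return self.out(h)    # per-frame class logits

model = LSTMAcousticModel()
frames = torch.randn(8, 100, 13)  # 8 utterances, 100 frames each
logits = model(frames)            # shape: (8, 100, 40)
```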
2.2.2 Training
Basic RNNs are prone to vanishing or exploding gradients when dealing with long sequences, making it difficult for the model to capture long-distance dependencies. The LSTM and GRU were proposed to overcome this problem: by introducing gating mechanisms, they effectively mitigate the gradient problems and perform well in speech recognition tasks. RNNs are usually trained with the backpropagation through time (BPTT) algorithm, using optimization methods such as gradient descent or Adam. To improve training, techniques such as dropout and regularization can be used to prevent overfitting.
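A compressed training-step sketch follows; here loss.backward() performs BPTT on the unrolled recurrent graph, and the optimizer settings, dropout rate, and tensor shapes are all illustrative assumptions.

```python
# One training step for a frame-labeling LSTM (hyperparameters are
# illustrative assumptions).
import torch
import torch.nn as nn

lstm = nn.LSTM(13, 128, num_layers=2, batch_first=True, dropout=0.2)
head = nn.Linear(128, 40)
opt = torch.optim.Adam(list(lstm.parameters()) + list(head.parameters()),
                       lr=1e-3)
criterion = nn.CrossEntropyLoss()

x = torch.randn(8, 100, 13)         # (batch, time, features)
y = torch.randint(0, 40, (8, 100))  # per-frame phoneme labels

h, _ = lstm(x)
loss = criterion(head(h).reshape(-1, 40), y.reshape(-1))
opt.zero_grad()
loss.backward()  # gradients flow back through time (BPTT)
opt.step()
```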
2.2.3 Experimental Results
Early experiments showed that RNNs work better than traditional multilayer perceptron (MLP) models at processing speech data, especially on tasks that require capturing time-series information. In experiments on isolated-word recognition and small-scale data sets, RNNs outperform traditional models such as the HMM (hidden Markov model). Experimental results also show that the LSTM and GRU have significant advantages in long-sequence modeling. In large-scale speech recognition tasks, such as continuous speech recognition, LSTMs and GRUs significantly outperform basic RNNs. Many studies have shown that the LSTM can effectively capture long-term dependencies and reduce the recognition error rate when processing speech data.
Recurrent neural networks and their variants have demonstrated powerful capabilities in speech recognition, particularly when working with time-series data. Although the use of RNNs has declined with the development of more advanced models, they played a key role in the evolution of speech recognition technology, and in some specific scenarios they remain an effective solution.
2.3 Convolutional Neural Networks (CNN)
CNNs are mainly used to extract acoustic features, such as spectrograms. In speech recognition, the CNN usually serves as a front-end feature extractor, providing a time-frequency feature representation of the input sequence as input to subsequent models (such as an RNN or Transformer). CNNs were originally designed to process image data, but they also have a wide range of applications in speech recognition. The CNN is good at capturing local patterns and spatial features; with proper structural design, it can effectively process the time-frequency features of speech signals and achieve excellent recognition results.
2.3.1 Feature Extraction and Input
Representation
In speech recognition tasks, speech signals are usually converted into two-dimensional time-frequency representations, such as Mel spectrograms or MFCCs (Rabiner, 1993). This two-dimensional representation is similar to an image and can be used directly as input to a CNN. The convolution layers of a CNN slide local receptive fields over this two-dimensional representation and extract local time-frequency features. The structure of a CNN usually includes multiple convolution layers, pooling layers, and fully connected layers: the convolution layers extract local features through convolution kernels, the pooling layers perform downsampling and feature dimensionality reduction, and the fully connected layers map the extracted features to the output class space. In speech recognition, the input to a CNN is usually a two-dimensional time-frequency image (such as a 40x100 MFCC matrix), and the output is a category distribution over the corresponding phonemes or words. CNNs are often used in combination with other models. For example, in end-to-end speech recognition systems, CNNs often serve as front-end feature extractors, followed by an RNN or fully connected layers for sequence modeling. Another common method is to combine the CNN with a long short-term memory (LSTM) network, using the local features extracted by the CNN and the time-series modeling capability of the LSTM to improve overall recognition performance (Ayo, 2020).
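The following sketch illustrates that CNN-plus-LSTM combination on a 40x100 time-frequency input; the framework, channel counts, and layer sizes are illustrative assumptions.

```python
# CNN front end + LSTM sequence model sketch (all sizes are
# illustrative assumptions).
import torch
import torch.nn as nn

class CNNLSTMModel(nn.Module):
    def __init__(self, n_mels=40, n_classes=40):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),  # local receptive fields
            nn.ReLU(),
            nn.MaxPool2d((2, 1)),   # downsample the frequency axis only
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.lstm = nn.LSTM(32 * (n_mels // 4), 128, batch_first=True)
        self.out = nn.Linear(128, n_classes)

    def forward(self, x):                        # x: (batch, 1, freq=40, time)
        f = self.cnn(x)                          # (batch, 32, 10, time)
        f = f.permute(0, 3, 1, 2).flatten(2)     # -> (batch, time, 32*10)
        h, _ = self.lstm(f)                      # sequence modeling over time
        return self.out(h)                       # per-frame logits

model = CNNLSTMModel()
logits = model(torch.randn(4, 1, 40, 100))       # shape: (4, 100, 40)
```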
2.3.2 Experimental Results
Early experiments show that CNNs can effectively capture the local time-frequency patterns of speech signals, and that their performance is significantly better than that of traditional MLPs in isolated-word and small-vocabulary speech recognition tasks. The CNN is especially suited to processing time-frequency features (such as Mel spectrograms); compared with processing waveform data directly, using a CNN to extract time-frequency features gives better results. In end-to-end speech recognition systems, CNNs are often used as front-end feature extractors. Experimental results show that end-to-end systems combined with CNNs perform well on large-scale data sets and have met or exceeded the performance of traditional methods on several speech recognition benchmarks. For example, experiments on the popular LibriSpeech dataset show that models using a CNN front end can achieve lower word error rates (WER), especially in the presence of background noise or speaker changes.
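For reference, WER is the Levenshtein edit distance between the hypothesis and the reference word sequence, normalized by the reference length; the short sketch below computes it (plain Python, written purely for illustration).

```python
# Word error rate: (substitutions + deletions + insertions) / reference length.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sad"))  # one substitution -> 0.333...
```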
2.3.3 Multi-Language and Dialect
Recognition
CNNs are also widely used in multilingual speech recognition and dialect recognition tasks. Experiments show that CNNs adapt well to speech signals from different languages and can effectively capture the time-frequency characteristics unique to each language or dialect. In multi-language recognition experiments, CNN-based models often show better cross-language generalization than traditional methods.
2.3.4 Experimental Conclusion
CNNs show strong feature extraction ability in speech recognition, with particularly clear advantages in processing the time-frequency features of speech signals. Although the CNN may not match the RNN or LSTM at time-series modeling, hybrid models or combinations with other techniques allow CNNs to deliver superior performance on complex speech recognition tasks. As deep learning technology advances, the application of CNNs in speech recognition will continue to expand and be optimized.
2.4 Comparison of Results and
Advantages and Disadvantages
2.4.1 MLP
Modern deep learning methods such as the CNN and LSTM have gradually displaced the MLP from its dominant position in large speech recognition tasks. Studies have shown that while MLPs perform well on some tasks, models such as the RNN, LSTM, and Transformer tend to provide better performance and accuracy when working with time-series data such as continuous speech.
2.4.2 RNN
Although RNNs and their variants, such as the LSTM and GRU, have excelled in speech recognition, their dominance has gradually been eroded by the development of models such as the Transformer, which offers higher parallelism and lower computing costs, especially on large-scale speech recognition tasks. Nevertheless, RNNs and their variants remain an important foundation for understanding and implementing speech recognition systems and continue to perform well in many applications.
2.4.3 CNN
Experimental results show that in some speech recognition tasks the CNN has higher parallelism and lower computational cost than the RNN; the advantage is especially clear when processing short speech fragments. However, for tasks that require capturing long-term dependencies, CNNs are often combined with RNNs or LSTMs to compensate for their shortcomings in time-series modeling.
3 HYBRID MODELS AND END-
TO-END SYSTEMS
In recent years, hybrid models and end-to-end systems have become the focus of research. Hybrid models combine classical acoustic models (such as the HMM) with neural networks to achieve better recognition results. The end-to-end system starts directly from the raw speech waveform and outputs text or commands through a deep neural network, simplifying the pipeline of traditional speech recognition systems (Davis, 2015). The hybrid model and the end-to-end system are two important directions in the development of speech recognition technology, each with unique advantages and application scenarios. The following analyzes and compares their application in speech recognition, their experimental results, and their advantages and disadvantages:
3.1 Hybrid Model
A hybrid model in speech recognition usually refers to combining multiple models or techniques to exploit their respective advantages. A typical hybrid model combines a hidden Markov model (HMM) with a deep neural network (e.g., CNN, RNN, DNN).
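In the common HMM-DNN arrangement, the network predicts per-frame posteriors over HMM states, which are divided by the state priors to obtain scaled likelihoods that a standard HMM (Viterbi) decoder can consume in place of GMM likelihoods. The sketch below illustrates the DNN half of that arrangement; the framework, layer sizes, state count, and the uniform prior are illustrative assumptions.

```python
# DNN half of an HMM-DNN hybrid (sizes and the uniform prior are
# illustrative assumptions; in practice priors come from alignments).
import torch
import torch.nn as nn

n_states = 2000                               # tied HMM states
dnn = nn.Sequential(
    nn.Linear(13 * 11, 1024), nn.ReLU(),      # 11-frame context-window input
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, n_states),
)

frames = torch.randn(100, 13 * 11)                  # spliced feature vectors
posteriors = torch.softmax(dnn(frames), dim=-1)     # P(state | frame)
priors = torch.full((n_states,), 1.0 / n_states)    # placeholder state priors
scaled_likelihoods = posteriors / priors            # fed to the HMM decoder
```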
3.2 Experimental Results and
Performance
Hybrid models perform well on most traditional speech recognition tasks, especially with multiple speakers and heavy background noise, where combining the HMM with a deep neural network effectively improves recognition accuracy. Experiments show that HMM-DNN systems usually achieve high recognition accuracy on large speech data sets (such as TIMIT and Switchboard). The hybrid model also adapts well to the variability of speech signals (different pronunciation patterns, tones, noise, etc.), which makes it more robust in practical applications. On the other hand, hybrid models are usually more complex: the HMM and the neural network must each be optimized and tuned separately, training times are long, and the pipeline is difficult to parallelize.
3.3 End-to-End Systems
The end-to-end system is designed to generate text output directly from the original speech input, eliminating the multiple independent steps (feature extraction, acoustic modeling, language modeling, etc.) of traditional speech recognition systems. Typical end-to-end models include sequence-to-sequence (Seq2Seq) models and attention-based models. End-to-end systems have gradually come to outperform traditional hybrid models on large-scale data sets, especially on data sets with little background noise or high standardization (such as LibriSpeech). In recent experiments, Transformer-based end-to-end models have tended to achieve lower word error rates (WER) across multiple benchmarks. The end-to-end model eliminates many steps of the traditional system, such as HMM modeling and feature engineering, making the model structure more concise and easier to optimize and deploy. Since no features need to be designed by hand, the end-to-end model is also more robust to errors introduced during feature extraction. However, its data requirements are higher: a large amount of labeled data is needed to train a high-accuracy model (Zhao, 2019), which can become a bottleneck in application scenarios where data is scarce.
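One widely used end-to-end training objective is connectionist temporal classification (CTC), as in the CTC-based system of (Xu, 2021). The sketch below trains a recurrent model to emit character logits directly from feature frames, with no HMM or separate alignment step; the framework, alphabet size, and tensor shapes are illustrative assumptions.

```python
# Minimal end-to-end training sketch with CTC loss (alphabet size and
# shapes are illustrative assumptions).
import torch
import torch.nn as nn

n_chars = 29                                  # letters + space + apostrophe + blank
lstm = nn.LSTM(13, 128, batch_first=True)
head = nn.Linear(128, n_chars)
ctc = nn.CTCLoss(blank=0)

x = torch.randn(4, 100, 13)                   # 4 utterances, 100 frames each
h, _ = lstm(x)
log_probs = head(h).log_softmax(-1).transpose(0, 1)  # CTCLoss expects (T, N, C)
targets = torch.randint(1, n_chars, (4, 20))  # character label sequences

loss = ctc(log_probs, targets,
           input_lengths=torch.full((4,), 100),
           target_lengths=torch.full((4,), 20))
loss.backward()  # the whole pipeline is optimized jointly
```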
3.4 Summary and Comparison
Hybrid models are generally more complex and
difficult to develop and debug, but they have strong
adaptability and robustness in diverse application
scenarios. End-to-end systems are more concise, less
expensive to develop and maintain, and are
particularly suitable for large-scale data sets and
parallel computing. Hybrid models are generally
stable when faced with traditional tasks and scenarios
with complex background noise. (Dai, 2022) End-to-
end systems perform better on large data sets with
high standardization, and their potential increases as
models and data sizes increase.
By combining the advantages of these two
approaches, future speech recognition systems may
evolve in a more intelligent, adaptive direction.
4 CHALLENGES AND
PROSPECTS
Based on relevant data and reports, this section analyzes the current challenges and future directions of speech recognition.
Acoustic models built on neural networks include the DNN, CNN, and RNN, which are used to extract and represent speech features. The optimization of the acoustic model directly affects the performance of the speech recognition system.
Language models use context information to
improve the accuracy and fluency of recognition
results, often optimized in conjunction with the
output of neural networks.
Neural networks benefit from the development of
large-scale data sets and pre-trained models in speech
recognition, and the application of pre-trained
language models such as BERT in speech recognition
is gradually increasing.
With further optimization of deep learning and neural network models, speech recognition systems will achieve higher recognition accuracy and more reliable performance, especially across diverse noise environments and speech variations.
Technology will move toward real-time speech
recognition and instant response to support faster and
more natural human-computer interaction, such as
real-time translation, real-time response from voice
assistants, and more.
In the future, with the continuous progress and
deepening of neural network technology, researchers
can foresee the further development of speech
recognition in multi-language, multi-modal (such as
speech and image combination), adaptive
environment and other aspects. At the same time, the
structure optimization of neural network model and
the improvement of computing power will further
improve the performance and universality of speech
recognition systems. With globalization, speech recognition technology will further enhance its support for multiple languages and dialects, promoting its application and popularization worldwide.
This paper has discussed the application of several common neural networks and compared the advantages and disadvantages of the common methods, from which the following directions for improvement emerge:
Future research may focus more on how to
develop high-performance speech recognition
systems in low-resource environments. Techniques
such as semi-supervised learning, transfer learning,
and multi-task learning are likely to play an
important role in this area.
Improving the robustness and adaptability of
speech recognition systems is still an important
direction. Researchers may continue to explore
training and optimization strategies on diverse data
(e.g., noise data, multi-dialect data).
The development of adaptive systems, such as
those that can automatically adjust parameters to
different environments or speakers, may lead to
breakthroughs in the future.
Combining speech recognition with other modes
(e.g., vision, text) is a direction worth exploring. By
integrating multimodal information, future systems
may be able to provide more accurate and robust
speech recognition services. For example, the
combination of lip-reading information in video with
voice data is expected to improve recognition rates in
noisy environments.
5 CONCLUSION
In this paper, through experimental comparison, it is found that deep neural networks have shown significant advantages in speech recognition research and have become the core technology of many modern speech recognition systems. Neural-network-based methods, especially CNNs and RNNs, are extremely capable of processing the spatiotemporal properties of speech signals: they can accurately extract features and classify them in complex audio environments, improving the accuracy and robustness of speech recognition. In particular, deep learning methods significantly improve recognition performance, making systems more stable in a variety of environments. Neural networks can learn from large amounts of data and automatically adapt to different speech variants, accents, and background noise, giving them stronger generalization ability. While traditional speech recognition systems often rely on multiple independent components (feature extraction, acoustic models, language models, etc.), the neural-network-based end-to-end system simplifies these steps and can be jointly optimized through backpropagation to further improve the overall result.
The application of neural networks has brought remarkable progress to speech recognition, and the rise of end-to-end models in particular marks the entry of speech recognition technology into a new era. However, many challenges remain, especially concerning application in low-resource environments, diversity, and robustness. Through continued research and exploration, speech recognition technology is expected to achieve breakthroughs in more areas and provide more accurate and reliable services to a wider range of users.
REFERENCES
Shi, Y., 2017. Optimization and design of a speech recognition scheme based on recurrent neural networks. Beijing Jiaotong University.
Dai, W., Liu, H., 2022. Chinese speech recognition based on neural networks. Journal of Sichuan Normal University.
Zhao, Z. B., Lan, L., Jiang, D., et al., 2019. Research on small-sample speech recognition based on transfer learning. Journal of Beijing Institute of Printing and Technology.
Xu, D. D., Jiang, Z. X., 2021. End-to-end speech recognition based on HOPE-CTC. Computer Engineering and Design.
Vazhenina, D., Markov, K., 2020. End-to-end noisy speech recognition using Fourier and Hilbert spectrum features. Electronics.
Davis, K. H., Biddulph, R., Balashek, S., 2015. Automatic recognition of spoken digits. Journal of the Acoustical Society of America.
Rabiner, L. R., Juang, B. H., 1993. Fundamentals of Speech Recognition. New Jersey: Prentice Hall.
Feng, H. Z., Wang, Y. F., 2017. A vector identification method for spectral features. Journal of Chongqing University.
Hadi, V., Hossein, S., 2013. Speech enhancement using hidden Markov models in Mel-frequency domain. Speech Communication.
Ayo, F. E., Folorunso, O., Ibharalu, F. T., Osinuga, I. A., 2020. Machine learning techniques for hate speech classification of twitter data: State-of-the-art, future challenges and research directions. Computer Science Review.