Using Neural Network to Develop Speech Recognition
Yida Liu
School of Communication, Taishan University of Science and Technology, Yamaguchi Town, Tai'an, China
Keywords: Speech Recognition, End-to-End, Convolutional Neural Network, Recurrent Neural Network.
Abstract: The application of neural networks in speech recognition has made remarkable progress in recent years, greatly improving the accuracy and robustness of recognition systems. This paper reviews the key technologies and recent progress of neural networks in speech recognition, with emphasis on different types of neural network architectures, such as multilayer perceptrons (MLP), convolutional neural networks (CNN), and recurrent neural networks (RNN), and their specific applications in processing speech signals. The paper also discusses the differences between hybrid deep-neural-network models and end-to-end systems, surveys existing methods, evaluates these two mainstream classes of speech recognition systems from perspectives such as practical usefulness, response speed, and overall integration, and analyzes their performance across different languages, noisy environments, and speaker variation. Finally, this study analyzes future development trends in neural-network speech recognition, such as more efficient model structures, cross-language transfer learning, and the ability to analyze and discriminate speech intonation. Through these discussions, this paper provides a valuable reference for the future development of speech recognition technology.
1 INTRODUCTION
Speech recognition technology, as an important means of human-computer interaction, has received wide attention and in-depth study in recent years. With the growth of computing power and the accumulation of big data, speech recognition technology based on neural networks has gradually become the mainstream approach in this field (Xu, 2021). Traditional speech recognition systems usually rely on complex manual feature extraction and rule-matching algorithms, which have obvious limitations when dealing with varied speech signals and complex language environments (Shi, 2017). With the rapid development of deep learning, the application of neural networks in speech recognition has made remarkable progress. After Bell Labs in 1952 and IBM in 1962 developed speech recognition systems based on isolated words, the application of hidden Markov models to speech recognition achieved milestone progress in the 1980s (Hadi, 2013). More recently, with the rise of neural networks, researchers began to apply them to speech recognition, moving from the basic multilayer perceptron to time-delay neural networks, recurrent neural networks, and convolutional neural networks. These different networks have achieved good results on different speech recognition tasks.
In contrast to traditional systems, neural networks can automatically learn features and patterns in speech signals through large-scale data training, achieving breakthroughs in the accuracy and robustness of speech recognition. In recent years, deep learning models such as the convolutional neural network (CNN) and recurrent neural network (RNN) have been widely used in speech recognition. These models can effectively capture the temporal and spatial features of speech signals and significantly improve the performance of speech recognition systems (Feng, 2017). In addition, the rise of end-to-end speech recognition methods has made the entire recognition pipeline simpler and more efficient, no longer relying on cumbersome intermediate steps and manual intervention (Vazhenina, 2020). These technical advances not only drive the deployment of speech recognition in application scenarios such as intelligent assistants, speech translation, and automatic subtitle generation, but also lay a solid foundation for natural human-computer interaction in the future.
This paper reviews the key technologies,
development history, application scenarios,
challenges and future prospects of neural networks in
speech recognition applications. First, speech
recognition is the process of recognizing and
understanding speech signals through a computer. It
uses signal processing and machine learning
techniques to convert sound into words or commands.
Common applications include voice assistants, voice
search, and voice command control. Second, a neural
network is a machine learning model that mimics the
neural network of the human brain and is used to
process complex data. It consists of multiple layers,
each of which deals with different features. Through
training, neural networks can learn and recognize
patterns, which are used in a wide range of fields,
including image recognition and natural language
processing. This paper covers four broad approaches to applying neural networks to speech recognition: basic neural network models, recurrent neural networks, convolutional neural networks, and hybrid models together with end-to-end systems. Finally, this paper looks ahead to future research directions, such as model efficiency, extensibility, and cross-language transfer learning, in order to provide reference and inspiration for future research.
2 RESEARCH METHODS
2.1 Basic Neural Network Model
The earliest speech recognition systems used basic neural network models such as the multilayer perceptron (MLP). MLP models perform classification or recognition tasks by learning feature representations of the input speech, but they are limited when processing complex speech signals. The MLP is a basic feedforward neural network: it consists of multiple layers of neurons, with each layer fully connected to the next. An MLP typically comprises an input layer, one or more hidden layers, and one output layer. The following is an overview of the application of MLPs in speech recognition and some related experimental results:
2.1.1 Feature Extraction
Speech signals usually need to be pre-processed and converted into features before they can be fed into an MLP. Common speech features include Mel-frequency cepstral coefficients (MFCC), spectral features, and linear predictive coding (LPC). The main purpose of feature extraction is to transform the speech signal into a stable, low-dimensional sequence of feature vectors, reducing computational complexity and data noise.
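As a concrete illustration, the following is a minimal feature extraction sketch; the choice of the librosa toolkit and all parameter values are assumptions for illustration, since this paper does not prescribe a specific library.

```python
# Minimal MFCC extraction sketch (librosa and all parameters are
# illustrative assumptions, not this paper's setup).
import librosa

def extract_mfcc(wav_path, n_mfcc=13, sr=16000):
    # Load the waveform at a fixed sampling rate.
    y, sr = librosa.load(wav_path, sr=sr)
    # Frame the signal and compute n_mfcc cepstral coefficients per frame.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Transpose to (time, features): one low-dimensional vector per frame.
    return mfcc.T  # shape: (num_frames, n_mfcc)
```

Each row of the returned matrix is one frame's feature vector, which is what the classifiers discussed below consume.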
2.1.2 Experimental Analysis
During training, the MLP adjusts its weights by learning patterns in the training data so that it can accurately map input features to the correct output categories. In speech recognition tasks, MLPs can be used as standalone classifiers to identify the categories of speech fragments (as in spoken command recognition), or combined with other models (such as hidden Markov models, HMMs) to handle more complex speech recognition tasks.
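A minimal sketch of such a frame-level MLP classifier follows; the PyTorch framework, layer sizes, and class count are illustrative assumptions rather than a configuration from the experiments discussed here.

```python
# MLP frame-classifier sketch (framework, sizes, and class count are
# illustrative assumptions).
import torch
import torch.nn as nn

class MLPClassifier(nn.Module):
    def __init__(self, n_features=13, n_hidden=256, n_classes=40):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, n_hidden),  # input -> hidden, fully connected
            nn.ReLU(),
            nn.Linear(n_hidden, n_hidden),    # extra hidden layer ("deep MLP")
            nn.ReLU(),
            nn.Linear(n_hidden, n_classes),   # hidden -> output categories
        )

    def forward(self, x):
        return self.net(x)  # raw logits; softmax gives class probabilities

# Usage: classify a batch of 32 MFCC frame vectors.
model = MLPClassifier()
probs = torch.softmax(model(torch.randn(32, 13)), dim=-1)
```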
In early speech recognition experiments, MLPs were often applied to small-scale speech data sets, such as isolated-word or small-vocabulary tasks. Experiments have shown that MLPs perform well on small-scale tasks, but as task complexity and data size increase, the traditional MLP falls short in modeling long-term dependencies. To improve the performance of MLPs in speech recognition, researchers have proposed a variety of improvements, such as deeper MLPs (more hidden layers) or techniques such as context windows. Experimental results show that, with these improvements, MLPs can still achieve good results on some specific speech recognition tasks, but their applicability and scalability remain limited. Multilayer perceptrons have a long history in speech recognition, especially in early small-scale tasks. Although the MLP has gradually been replaced by more complex models in large, complex speech recognition tasks, as a fundamental neural network structure it remains an important starting point for understanding and studying deep learning models. In some specific scenarios, an optimized MLP can still provide an effective solution.
2.2 Recurrent Neural Networks (RNN)
The RNN is a neural network structure especially suited to processing sequence data. Because of its internal recurrent structure, the RNN can process sequences effectively, so it is widely used in speech recognition. In particular, variant models such as the long short-term memory (LSTM) network and the gated recurrent unit (GRU) can better capture long-term dependencies in speech signals, improving the accuracy and robustness of recognition.
2.2.1 Feature Extraction
Like other neural networks, the speech data processed by an RNN first goes through feature extraction; commonly used features include MFCCs and Mel spectrograms. The extracted sequence of feature vectors preserves the temporal correlation of the speech signal, which is crucial for sequence modeling with an RNN. The defining feature of the RNN is the recurrent structure of its hidden layer, which feeds the output of the previous time step back in as part of the input at the current time step. This structure enables RNNs to capture the time dependence of sequence data. In speech recognition, the input to an RNN is a sequence of feature vectors and the output is usually a probability distribution over the corresponding phoneme or lexical categories.
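The sketch below shows one way such a model can be laid out; the framework, layer sizes, and class count are illustrative assumptions, not the paper's configuration.

```python
# LSTM sequence-model sketch: feature-vector sequence in, per-frame
# phoneme logits out (all sizes are illustrative assumptions).
import torch
import torch.nn as nn

class LSTMAcousticModel(nn.Module):
    def __init__(self, n_features=13, n_hidden=128, n_classes=40):
        super().__init__()
        # batch_first=True -> input shape (batch, time, features)
        self.lstm = nn.LSTM(n_features, n_hidden, num_layers=2,
                            batch_first=True)
        self.out = nn.Linear(n_hidden, n_classes)

    def forward(self, x):
        h, _ = self.lstm(x)   # hidden state at every time step
        return self.out(h)    # per-frame class logits

model = LSTMAcousticModel()
frames = torch.randn(8, 100, 13)  # 8 utterances, 100 frames each
logits = model(frames)            # shape: (8, 100, 40)
```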
2.2.2 Training
Basic RNNs are prone to vanishing or exploding gradients when dealing with long sequences, making it difficult for the model to capture long-distance dependencies. The LSTM and GRU were proposed to overcome this problem: by introducing gating mechanisms, they effectively mitigate the gradient problems and perform well in speech recognition tasks. RNNs are usually trained with the backpropagation through time (BPTT) algorithm, using optimization methods such as gradient descent or Adam. To improve training, techniques such as dropout and regularization can be used to prevent overfitting.
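A compressed training-step sketch follows; here loss.backward() performs BPTT on the unrolled recurrent graph, and the optimizer settings, dropout rate, and tensor shapes are all illustrative assumptions.

```python
# One training step for a frame-labeling LSTM (hyperparameters are
# illustrative assumptions).
import torch
import torch.nn as nn

lstm = nn.LSTM(13, 128, num_layers=2, batch_first=True, dropout=0.2)
head = nn.Linear(128, 40)
opt = torch.optim.Adam(list(lstm.parameters()) + list(head.parameters()),
                       lr=1e-3)
criterion = nn.CrossEntropyLoss()

x = torch.randn(8, 100, 13)         # (batch, time, features)
y = torch.randint(0, 40, (8, 100))  # per-frame phoneme labels

h, _ = lstm(x)
loss = criterion(head(h).reshape(-1, 40), y.reshape(-1))
opt.zero_grad()
loss.backward()  # gradients flow back through time (BPTT)
opt.step()
```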
2.2.3 Experimental Results
Early experiments showed that RNNs work better than traditional multilayer perceptron (MLP) models at processing speech data, especially on tasks that require capturing time-series information. In experiments on isolated-word recognition and small-scale data sets, RNNs outperform traditional models such as the HMM (hidden Markov model). Experimental results also show that the LSTM and GRU have significant advantages in long-sequence modeling. In large-scale speech recognition tasks, such as continuous speech recognition, LSTMs and GRUs significantly outperform basic RNNs. Many studies have shown that the LSTM can effectively capture long-term dependencies and reduce the recognition error rate when processing speech data.
Recurrent neural networks and their variants have demonstrated powerful capabilities in speech recognition, particularly when working with time-series data. Although the use of RNNs has declined with the development of more advanced models, they played a key role in the evolution of speech recognition technology, and in some specific scenarios they remain an effective solution.
2.3 Convolutional Neural Networks (CNN)
CNNs are mainly used to extract acoustic features, such as spectrograms. In speech recognition, the CNN usually serves as a front-end feature extractor, providing a time-frequency feature representation of the input sequence as input to subsequent models (such as an RNN or Transformer). CNNs were originally designed to process image data, but they also have a wide range of applications in speech recognition. The CNN is good at capturing local patterns and spatial features; with proper structural design, it can effectively process the time-frequency features of speech signals and achieve excellent recognition results.
2.3.1 Feature Extraction and Input
Representation
In speech recognition tasks, speech signals are usually converted into two-dimensional time-frequency representations, such as Mel spectrograms or MFCCs (Rabiner, 1993). This two-dimensional representation is similar to an image and can be used directly as input to a CNN. The convolution layers of a CNN slide local receptive fields over this two-dimensional representation and extract local time-frequency features. The structure of a CNN usually includes multiple convolution layers, pooling layers, and fully connected layers: the convolution layers extract local features through convolution kernels, the pooling layers perform downsampling and feature dimensionality reduction, and the fully connected layers map the extracted features to the output class space. In speech recognition, the input to a CNN is usually a two-dimensional time-frequency image (such as a 40x100 MFCC matrix), and the output is a category distribution over the corresponding phonemes or words. CNNs are often used in combination with other models. For example, in end-to-end speech recognition systems, CNNs often serve as front-end feature extractors, followed by an RNN or fully connected layers for sequence modeling. Another common method is to combine the CNN with a long short-term memory (LSTM) network, using the local features extracted by the CNN and the time-series modeling capability of the LSTM to improve overall recognition performance (Ayo, 2020).
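The following sketch illustrates that CNN-plus-LSTM combination on a 40x100 time-frequency input; the framework, channel counts, and layer sizes are illustrative assumptions.

```python
# CNN front end + LSTM sequence model sketch (all sizes are
# illustrative assumptions).
import torch
import torch.nn as nn

class CNNLSTMModel(nn.Module):
    def __init__(self, n_mels=40, n_classes=40):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),  # local receptive fields
            nn.ReLU(),
            nn.MaxPool2d((2, 1)),   # downsample the frequency axis only
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.lstm = nn.LSTM(32 * (n_mels // 4), 128, batch_first=True)
        self.out = nn.Linear(128, n_classes)

    def forward(self, x):                        # x: (batch, 1, freq=40, time)
        f = self.cnn(x)                          # (batch, 32, 10, time)
        f = f.permute(0, 3, 1, 2).flatten(2)     # -> (batch, time, 32*10)
        h, _ = self.lstm(f)                      # sequence modeling over time
        return self.out(h)                       # per-frame logits

model = CNNLSTMModel()
logits = model(torch.randn(4, 1, 40, 100))       # shape: (4, 100, 40)
```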
2.3.2 Experimental Results
Early experiments show that CNNs can effectively capture the local time-frequency patterns of speech signals, and that their performance is significantly better than that of traditional MLPs in isolated-word and small-vocabulary speech recognition tasks. The CNN is especially suited to processing time-frequency features (such as Mel spectrograms); compared with processing waveform data directly, using a CNN to extract time-frequency features gives better results. In end-to-end speech recognition systems, CNNs are often used as front-end feature extractors. Experimental results show that end-to-end systems combined with CNNs perform well on large-scale data sets and have met or exceeded the performance of traditional methods on several speech recognition benchmarks. For example, experiments on the popular LibriSpeech dataset show that models using a CNN front end can achieve lower word error rates (WER), especially in the presence of background noise or speaker changes.
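For reference, WER is the Levenshtein edit distance between the hypothesis and the reference word sequence, normalized by the reference length; the short sketch below computes it (plain Python, written purely for illustration).

```python
# Word error rate: (substitutions + deletions + insertions) / reference length.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sad"))  # one substitution -> 0.333...
```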
2.3.3 Multi-Language and Dialect
Recognition
CNNs are also widely used in multilingual speech recognition and dialect recognition tasks. Experiments show that CNNs adapt well to speech signals from different languages and can effectively capture the time-frequency characteristics unique to each language or dialect. In multi-language recognition experiments, CNN-based models often show better cross-language generalization than traditional methods.
2.3.4 Experimental Conclusion
CNNs show strong feature extraction ability in speech recognition, with particularly clear advantages in processing the time-frequency features of speech signals. Although the CNN may not match the RNN or LSTM at time-series modeling, hybrid models or combinations with other techniques allow CNNs to deliver superior performance on complex speech recognition tasks. As deep learning technology advances, the application of CNNs in speech recognition will continue to expand and be optimized.
2.4 Comparison of Results and
Advantages and Disadvantages
2.4.1 MLP
Modern deep learning methods such as the CNN and LSTM have gradually displaced the MLP from its dominant position in large speech recognition tasks. Studies have shown that while MLPs perform well on some tasks, models such as the RNN, LSTM, and Transformer tend to provide better performance and accuracy when working with time-series data such as continuous speech.
2.4.2 RNN
Although RNNs and their variants, such as the LSTM and GRU, have excelled in speech recognition, their dominance has gradually been eroded by the development of models such as the Transformer, which offers higher parallelism and lower computing costs, especially on large-scale speech recognition tasks. Nevertheless, RNNs and their variants remain an important foundation for understanding and implementing speech recognition systems and continue to perform well in many applications.
2.4.3 CNN
Experimental results show that in some speech recognition tasks the CNN has higher parallelism and lower computational cost than the RNN; the advantage is especially clear when processing short speech fragments. However, for tasks that require capturing long-term dependencies, CNNs are often combined with RNNs or LSTMs to compensate for their shortcomings in time-series modeling.
3 HYBRID MODELS AND END-
TO-END SYSTEMS
In recent years, hybrid models and end-to-end systems have become the focus of research. Hybrid models combine classical acoustic models (such as the HMM) with neural networks to achieve better recognition results. The end-to-end system starts directly from the raw speech waveform and outputs text or commands through a deep neural network, simplifying the pipeline of traditional speech recognition systems (Davis, 2015). The hybrid model and the end-to-end system are two important directions in the development of speech recognition technology, each with unique advantages and application scenarios. The following analyzes and compares their application in speech recognition, their experimental results, and their advantages and disadvantages:
3.1 Hybrid Model
A hybrid model in speech recognition usually refers to combining multiple models or techniques to exploit their respective advantages. A typical hybrid model combines a hidden Markov model (HMM) with a deep neural network (e.g., CNN, RNN, DNN).
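In the common HMM-DNN arrangement, the network predicts per-frame posteriors over HMM states, which are divided by the state priors to obtain scaled likelihoods that a standard HMM (Viterbi) decoder can consume in place of GMM likelihoods. The sketch below illustrates the DNN half of that arrangement; the framework, layer sizes, state count, and the uniform prior are illustrative assumptions.

```python
# DNN half of an HMM-DNN hybrid (sizes and the uniform prior are
# illustrative assumptions; in practice priors come from alignments).
import torch
import torch.nn as nn

n_states = 2000                               # tied HMM states
dnn = nn.Sequential(
    nn.Linear(13 * 11, 1024), nn.ReLU(),      # 11-frame context-window input
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, n_states),
)

frames = torch.randn(100, 13 * 11)                  # spliced feature vectors
posteriors = torch.softmax(dnn(frames), dim=-1)     # P(state | frame)
priors = torch.full((n_states,), 1.0 / n_states)    # placeholder state priors
scaled_likelihoods = posteriors / priors            # fed to the HMM decoder
```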
3.2 Experimental Results and
Performance
Hybrid models perform well on most traditional speech recognition tasks, especially with multiple speakers and heavy background noise, where combining the HMM with a deep neural network effectively improves recognition accuracy. Experiments show that HMM-DNN systems usually achieve high recognition accuracy on large speech data sets (such as TIMIT and Switchboard). The hybrid model also adapts well to the variability of speech signals (different pronunciation patterns, tones, noise, etc.), which makes it more robust in practical applications. On the other hand, hybrid models are usually more complex: the HMM and the neural network must each be optimized and tuned separately, training times are long, and the pipeline is difficult to parallelize.
3.3 End-to-End Systems
The end-to-end system is designed to generate text output directly from the original speech input, eliminating the multiple independent steps (feature extraction, acoustic modeling, language modeling, etc.) of traditional speech recognition systems. Typical end-to-end models include sequence-to-sequence (Seq2Seq) models and attention-based models. End-to-end systems have gradually come to outperform traditional hybrid models on large-scale data sets, especially on data sets with little background noise or high standardization (such as LibriSpeech). In recent experiments, Transformer-based end-to-end models have tended to achieve lower word error rates (WER) across multiple benchmarks. The end-to-end model eliminates many steps of the traditional system, such as HMM modeling and feature engineering, making the model structure more concise and easier to optimize and deploy. Since no features need to be designed by hand, the end-to-end model is also more robust to errors introduced during feature extraction. However, its data requirements are higher: a large amount of labeled data is needed to train a high-accuracy model (Zhao, 2019), which can become a bottleneck in application scenarios where data is scarce.
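One widely used end-to-end training objective is connectionist temporal classification (CTC), as in the CTC-based system of (Xu, 2021). The sketch below trains a recurrent model to emit character logits directly from feature frames, with no HMM or separate alignment step; the framework, alphabet size, and tensor shapes are illustrative assumptions.

```python
# Minimal end-to-end training sketch with CTC loss (alphabet size and
# shapes are illustrative assumptions).
import torch
import torch.nn as nn

n_chars = 29                                  # letters + space + apostrophe + blank
lstm = nn.LSTM(13, 128, batch_first=True)
head = nn.Linear(128, n_chars)
ctc = nn.CTCLoss(blank=0)

x = torch.randn(4, 100, 13)                   # 4 utterances, 100 frames each
h, _ = lstm(x)
log_probs = head(h).log_softmax(-1).transpose(0, 1)  # CTCLoss expects (T, N, C)
targets = torch.randint(1, n_chars, (4, 20))  # character label sequences

loss = ctc(log_probs, targets,
           input_lengths=torch.full((4,), 100),
           target_lengths=torch.full((4,), 20))
loss.backward()  # the whole pipeline is optimized jointly
```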
3.4 Summary and Comparison
Hybrid models are generally more complex and
difficult to develop and debug, but they have strong
adaptability and robustness in diverse application
scenarios. End-to-end systems are more concise, less
expensive to develop and maintain, and are
particularly suitable for large-scale data sets and
parallel computing. Hybrid models are generally
stable when faced with traditional tasks and scenarios
with complex background noise. (Dai, 2022) End-to-
end systems perform better on large data sets with
high standardization, and their potential increases as
models and data sizes increase.
By combining the advantages of these two
approaches, future speech recognition systems may
evolve in a more intelligent, adaptive direction.
4 CHALLENGES AND
PROSPECTS
Based on relevant data and reports, this section analyzes the current challenges and future directions of speech recognition.
Acoustic models built on neural networks include the DNN, CNN, and RNN, which are used to extract and represent speech features. The optimization of the acoustic model directly affects the performance of the speech recognition system.
Language models use context information to
improve the accuracy and fluency of recognition
results, often optimized in conjunction with the
output of neural networks.
Neural networks benefit from the development of
large-scale data sets and pre-trained models in speech
recognition, and the application of pre-trained
language models such as BERT in speech recognition
is gradually increasing.
With further optimization of deep learning and neural network models, speech recognition systems will achieve higher recognition accuracy and more reliable performance, especially across diverse noise environments and speech variations.
Technology will move toward real-time speech
recognition and instant response to support faster and
more natural human-computer interaction, such as
real-time translation, real-time response from voice
assistants, and more.
In the future, with the continuous progress and
deepening of neural network technology, researchers
can foresee the further development of speech
recognition in multi-language, multi-modal (such as
speech and image combination), adaptive
environment and other aspects. At the same time, the
structure optimization of neural network model and
the improvement of computing power will further
improve the performance and universality of speech
recognition systems. With globalization, speech recognition technology will further enhance its support for multiple languages and dialects, promoting its application and popularization worldwide.
This paper has discussed the application of several common neural networks and compared the advantages and disadvantages of the common methods, from which the following directions for improvement emerge:
Future research may focus more on how to
develop high-performance speech recognition
systems in low-resource environments. Techniques
such as semi-supervised learning, transfer learning,
and multi-task learning are likely to play an
important role in this area.
Improving the robustness and adaptability of
speech recognition systems is still an important
direction. Researchers may continue to explore
training and optimization strategies on diverse data
(e.g., noise data, multi-dialect data).
The development of adaptive systems, such as
those that can automatically adjust parameters to
different environments or speakers, may lead to
breakthroughs in the future.
Combining speech recognition with other modes
(e.g., vision, text) is a direction worth exploring. By
integrating multimodal information, future systems
may be able to provide more accurate and robust
speech recognition services. For example, the
combination of lip-reading information in video with
voice data is expected to improve recognition rates in
noisy environments.
5 CONCLUSION
In this paper, through experimental comparison, it is found that deep neural networks have shown significant advantages in speech recognition research and have become the core technology of many modern speech recognition systems. Neural-network-based methods, especially CNNs and RNNs, are extremely capable of processing the spatiotemporal properties of speech signals: they can accurately extract features and classify them in complex audio environments, improving the accuracy and robustness of speech recognition. In particular, deep learning methods significantly improve recognition performance, making systems more stable in a variety of environments. Neural networks can learn from large amounts of data and automatically adapt to different speech variants, accents, and background noise, giving them stronger generalization ability. While traditional speech recognition systems often rely on multiple independent components (feature extraction, acoustic models, language models, etc.), the neural-network-based end-to-end system simplifies these steps and can be jointly optimized through backpropagation to further improve the overall result.
The application of neural networks has brought remarkable progress to speech recognition, and the rise of end-to-end models in particular marks the entry of speech recognition technology into a new era. However, many challenges remain, especially concerning application in low-resource environments, diversity, and robustness. Through continued research and exploration, speech recognition technology is expected to achieve breakthroughs in more areas and provide more accurate and reliable services to a wider range of users.
REFERENCES
Shi, Y., 2017. Optimization and design of a speech recognition scheme based on recurrent neural networks. Beijing Jiaotong University.
Dai, W., Liu, H., 2022. Chinese speech recognition based on neural networks. Journal of Sichuan Normal University.
Zhao, Z. B., Lan, L., Jiang, D., et al., 2019. Research on small-sample speech recognition based on transfer learning. Journal of Beijing Institute of Printing and Technology.
Xu, D. D., Jiang, Z. X., 2021. End-to-end speech recognition based on HOPE-CTC. Computer Engineering and Design.
Vazhenina, D., Markov, K., 2020. End-to-end noisy speech recognition using Fourier and Hilbert spectrum features. Electronics.
Davis, K. H., Biddulph, R., Balashek, S., 2015. Automatic recognition of spoken digits. Journal of the Acoustical Society of America.
Rabiner, L. R., Juang, B. H., 1993. Fundamentals of Speech Recognition. New Jersey: Prentice Hall.
Feng, H. Z., Wang, Y. F., 2017. A vector identification method for spectral features. Journal of Chongqing University.
Hadi, V., Hossein, S., 2013. Speech enhancement using hidden Markov models in Mel-frequency domain. Speech Communication.
Ayo, F. E., Folorunso, O., Ibharalu, F. T., Osinuga, I. A., 2020. Machine learning techniques for hate speech classification of twitter data: State-of-the-art, future challenges and research directions. Computer Science Review.