Artificial Intelligence Speech Synthesis Based on Deep Learning
Sihang Li
Beijing-Dublin International College at BJUT, Beijing University of Technology, China
Keywords: Artificial Intelligence, Speech Synthesis, Deep Learning.
Abstract: Speech synthesis is one of the most popular topics in machine learning, and it aims to generate expressive voices that satisfy the demands of different fields. This survey introduces the technology of generating speech based on deep learning. It reviews the development of autoregressive frameworks (represented by Transformer TTS) as well as non-autoregressive frameworks (represented by FastSpeech) for speech synthesis. Autoregressive frameworks excel at generating expressive speech, while non-autoregressive frameworks tend to generate it more efficiently. Refined frameworks built on these foundations are also covered, including the Autoregressive Acoustic Model with Mixed Self-attention and Lightweight Convolution (AAMSLC), the Autoregressive Diffusion Transformer (ARDiT), RobuTrans, FastSpeech 2, FastPitch, ProbSparseFS, LinearizedFS, LightTTS, and so forth. These techniques represent the current cutting-edge advances in the field of speech synthesis. The purpose of this survey is to provide a systematic knowledge review for beginners in the field, helping them better understand the latest developments in speech synthesis technology while providing new ideas for future research and applications.
1 INTRODUCTION
Speech synthesis is the technique by which a machine generates speech from given text (also called text-to-speech). It is widely deployed in human-computer natural language interaction, intelligent customer service, voice navigation, assistance for the blind, audiobooks, and game voice-overs. Speech synthesis originated in the 18th century with mechanical synthesis devices. It later developed from the Vocoder electronic synthesizer in the 20th century to DECtalk formant synthesis technology, and then to unit-concatenation-based and statistical-parameter-based synthesis techniques. Although speech synthesis has experienced rapid development and made great progress, the synthesized speech is still not as good as it could be. Besides, high labor costs, complex parameter adjustment, and problems such as distorted speech and stiff intonation remained obstacles to applying speech synthesis. As technology advances, deep neural network-based speech synthesis methods overcome the above problems and generate high-quality speech through a simple and flexible
process, which has led to significant development of
speech synthesis technology (Tan et al., 2021).
The autoregressive framework is one of the most mature types of end-to-end speech generation system. It generates sequences on the principle of conditional probability, which means that the next element is generated based on what has already been produced. In general, the autoregressive framework follows a sequential, one-way synthesis pattern. The advantage of this framework is that it captures the sequential dependencies of the sequence well and thus exhibits strong coherence and logic, producing natural content, timbre, rhythm, emotion, and style (Tang et al., 2023). However, its low synthesis speed, slow training speed, and high training costs make it ill-suited to fast interaction scenarios such as face-to-face conversation.
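Formally, the conditional-probability principle just described can be written as a chain of per-step factors. The notation below is generic, with x denoting the input text and y_t the acoustic frame produced at step t, rather than the formulation of any particular system:

\[
P(y_1, \dots, y_T \mid x) = \prod_{t=1}^{T} P\left(y_t \mid y_1, \dots, y_{t-1}, x\right)
\]

Because each factor conditions on everything generated so far, synthesis must proceed step by step, which accounts for both the strong sequential coherence and the slow generation speed noted above.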
In recent years, the non-autoregressive framework has been proposed as a natural language processing framework derived from the autoregressive framework, from which it differs distinctly. The differences lie in rapid, parallel generation of sequences and shorter training
time. However, the sequences generated by non-autoregressive frameworks are lower in accuracy and fluency than those of the autoregressive framework, due to their inherently simpler structure and weaker order dependence.
This paper surveys the speech synthesis literature of recent years. It reviews current developments in this field in terms of autoregressive and non-autoregressive frameworks, with Transformer TTS and FastSpeech as the respective classical representatives. All frameworks are evaluated in terms of performance, quality, training and computation cost, and so forth. This paper aims to help practitioners of speech synthesis understand its development, provide fundamental knowledge of speech generation techniques, and highlight the current problems in this field.
2 SPEECH SYNTHESIS
Deep learning for speech generation has been a research frontier in recent years. In the generation process, text is passed to the TTS front end to extract pronunciation and prosody features (i.e., the basic linguistic features). These features are then received by the acoustic framework and transformed into acoustic features (most commonly the mel-spectrogram), which is the most essential part of the text-to-speech process. Acoustic frameworks can be divided into autoregressive and non-autoregressive frameworks. Lastly, the vocoder converts the mel-spectrogram into sound waves and eventually synthesizes speech.
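As a rough illustration of this three-stage pipeline, the following Python sketch stubs each stage with toy logic; the function names and the placeholder computations are assumptions made for illustration only, not a real TTS implementation:

```python
# Toy sketch of the front end -> acoustic framework -> vocoder pipeline.
import numpy as np

def front_end(text: str) -> list[str]:
    """Toy front end: split text into 'phoneme-like' units (here, characters)."""
    return [c for c in text.lower() if c.isalpha()]

def acoustic_model(phonemes: list[str], n_mels: int = 80) -> np.ndarray:
    """Toy acoustic framework: map each unit to one (random) mel-spectrogram frame."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((len(phonemes), n_mels))

def vocoder(mel: np.ndarray, hop: int = 256) -> np.ndarray:
    """Toy vocoder: expand each mel frame into `hop` waveform samples."""
    return np.repeat(mel.mean(axis=1), hop)

waveform = vocoder(acoustic_model(front_end("Hello world")))
print(waveform.shape)  # (2560,) for the 10 alphabetic characters
```

In a real system each stage is a trained model (for example, a Transformer TTS or FastSpeech acoustic framework and a neural vocoder), but the data flow follows the same shape.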
2.1 Transformer and its improved versions
Transformer TTS frameworks, built on the Transformer architecture proposed by Vaswani et al., are typical autoregressive frameworks (Vaswani et al., 2017). The principle is that the text is converted into numerical form and passed into the Transformer framework, which consists of several encoders and decoders. The numericized text first passes through the encoders, each of which contains a self-attention layer. The self-attention mechanism allows each word to be encoded while taking into account the influence of the other words in the sentence. After passing through all the encoders, the representations are passed on to the decoders, where they go through self-attention layers again. The final output mel-spectrogram can be translated into speech that approximates a human voice. Notably, the text is processed by a number of attention heads in parallel. This is the multi-head attention mechanism, which enables the framework to extract different kinds of semantic information. The advantages of Transformer TTS are parallel training, fast training, and good synthesis results. However, it also suffers from problems such as slow synthesis, poor fine-grained control, and instability, for example possible loss of information between word positions over long distances.
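To make the mechanism concrete, the following PyTorch-style sketch shows the scaled dot-product attention that underlies both the self-attention and multi-head attention described above. It is a generic illustration of the mechanism from Vaswani et al. (2017), not the exact Transformer TTS code:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Each position attends to all others, weighted by query-key similarity."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5     # (batch, heads, T, T)
    if mask is not None:
        # In the decoder, a causal mask keeps generation autoregressive.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v                                 # (batch, heads, T, d_k)

# Multi-head attention runs several such heads in parallel so that the
# framework can extract different kinds of semantic information.
batch, heads, T, d_k = 2, 4, 16, 64
q = k = v = torch.randn(batch, heads, T, d_k)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([2, 4, 16, 64])
```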
In the past few years, there has been constant improvement and innovation in the Transformer TTS framework. AAMSLC, proposed by Zhao, replaces part of the self-attention operations with a small number of convolution operations and showed powerful performance (Zhao, 2023). The experimental results revealed that such hybrid frameworks of different structures can train and perform inference more efficiently (a 36 percent improvement in training efficiency and a 95 percent improvement in inference efficiency), and can also avoid a considerable number of redundant operations. However, the introduction of convolution increases the complexity of the framework as well as the demand on hardware. Lastly, when dealing with languages other than Chinese, its performance may not reach the level it achieves on Chinese.
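Zhao's exact layer configuration is not reproduced in this survey, but the general idea of mixing self-attention with lightweight (depthwise) convolution can be sketched as follows; the block design, kernel size, and layer counts below are illustrative assumptions rather than the AAMSLC architecture itself:

```python
import torch
import torch.nn as nn

class LightweightConvBlock(nn.Module):
    """Depthwise 1-D convolution block used as a cheap local-context mixer.

    Illustrative only: the actual lightweight-convolution details of AAMSLC
    (kernel sizes, weight sharing, placement) are not specified here.
    """
    def __init__(self, d_model: int, kernel_size: int = 5):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2, groups=d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):                     # x: (batch, T, d_model)
        y = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.norm(x + y)               # residual connection

# Hybrid stack: keep one self-attention layer for global context and replace
# the remaining layers with cheaper convolutional blocks.
d_model = 256
attention_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
stack = nn.Sequential(attention_layer,
                      *[LightweightConvBlock(d_model) for _ in range(3)])
print(stack(torch.randn(2, 100, d_model)).shape)  # torch.Size([2, 100, 256])
```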
Liu et al. proposed the ARDiT framework as another revised version. It relies on the decoder of the Transformer framework to encode audio as a sequence of vectors in continuous space, rather than the traditional sequence of discrete symbols (Liu et al., 2024), and thereby generates natural and high-quality speech. In addition, it exhibits strong performance in zero-shot speech generation. However, the training complexity and computational cost of this method are relatively high, and it consumes more computational resources.
The last autoregressive framework to be introduced is RobuTrans, which was developed to improve the robustness of the Transformer, that is, to maintain stability and output quality when encountering disruptions or unknown contingencies (Li et al., 2020). This framework addresses the instability of existing neural TTS when dealing with anomalous text. It converts the input text into sequences containing phonemes and rhythmic features, uses duration-based hard attention and pseudo-non-causal attention mechanisms, and removes positional embedding, thereby achieving high-quality and stable speech synthesis. Studies have demonstrated that RobuTrans achieved superior naturalness and robustness compared with other
frameworks. Besides, no additional teacher model is needed to achieve good synthesis results.
2.2 FastSpeech and its improved versions
A typical example of a non-autoregressive framework is FastSpeech (Ren et al., 2019), a Transformer-based end-to-end TTS system. It was proposed to tackle problems such as slow inference and low controllability. In its generation process, phoneme sequences are first input and converted into embeddings. They are then processed through multiple encoder-like FFT (feed-forward Transformer) blocks. Subsequently, the sequence is passed to a length regulator that predicts the length of the mel-spectrogram and controls the rhythm to improve controllability. Finally, the mel-spectrogram is output via another stack of FFT blocks. In addition, FastSpeech introduces knowledge distillation to improve audio quality: distillation simplifies the distribution of the output data and alleviates the one-to-many mapping problem in language processing. In general, FastSpeech significantly improves speech generation speed compared with Transformer TTS. However, it shows a disadvantage in generation quality and has drawbacks such as a complex and time-consuming knowledge distillation procedure.
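The length regulator is the component that makes this parallel generation possible: it expands each phoneme's hidden representation according to a predicted duration so that the encoder output matches the length of the mel-spectrogram. The following is a minimal sketch assuming integer frame durations have already been predicted; it is a simplified illustration, not the authors' implementation:

```python
import torch

def length_regulator(hidden: torch.Tensor, durations: torch.Tensor,
                     alpha: float = 1.0) -> torch.Tensor:
    """Expand phoneme-level hidden states to frame level.

    hidden:    (num_phonemes, d_model) phoneme encodings from the FFT blocks
    durations: (num_phonemes,) predicted number of mel frames per phoneme
    alpha:     speed control; alpha < 1.0 speeds speech up, > 1.0 slows it down
    """
    scaled = torch.clamp((durations.float() * alpha).round().long(), min=0)
    # repeat_interleave copies each phoneme vector `duration` times
    return torch.repeat_interleave(hidden, scaled, dim=0)

hidden = torch.randn(4, 256)              # 4 phonemes
durations = torch.tensor([3, 5, 2, 6])    # predicted frames per phoneme
frames = length_regulator(hidden, durations)
print(frames.shape)  # torch.Size([16, 256]), i.e. 3 + 5 + 2 + 6 frames
```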
FastSpeech 2 is one of the revised versions proposed by the same team (Ren et al., 2020). It is an advanced
non-autoregressive text-to-speech (TTS) model
designed to address the limitations of its predecessor,
FastSpeech, while enhancing synthesis quality and
efficiency. By eliminating the complex teacher-
student distillation pipeline, FastSpeech 2 directly
trains on ground-truth mel-spectrograms, avoiding
information loss and simplifying the training process.
To alleviate the one-to-many mapping challenge in
TTS, the model introduces variance information such
as pitch, energy, and precise phoneme duration
extracted through Montreal Forced Aligner (MFA).
Notably, pitch prediction is improved via continuous
wavelet transform, which models pitch variations in
the frequency domain for higher accuracy.
Additionally, FastSpeech 2s extends the framework
by enabling fully end-to-end text-to-waveform
synthesis, bypassing intermediate mel-spectrogram
generation and achieving faster inference speeds.
Evaluations on the LJSpeech dataset demonstrate that
FastSpeech 2 surpasses autoregressive models like
Tacotron 2 in voice quality (MOS: 3.83 vs. 3.70) and
reduces training time by threefold compared to
FastSpeech. The model's ability to integrate
variance control (e.g., adjustable pitch and energy)
enhances prosody customization while maintaining
naturalness. By combining simplified training,
enhanced variance modeling, and end-to-end
capabilities, FastSpeech 2 advances the practicality of
high-quality, real-time speech synthesis with robust
controllability.
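The variance information mentioned above is typically supplied by small predictor networks whose outputs are quantized, embedded, and added back to the hidden sequence before decoding. The sketch below shows such a variance predictor in simplified form; the layer sizes and the plain convolution-plus-linear design are assumptions for illustration, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """Small convolutional stack that predicts one scalar (pitch, energy, or
    log-duration) per position in the hidden sequence. Simplified sketch."""
    def __init__(self, d_model: int = 256, hidden: int = 256, kernel: int = 3):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(d_model, hidden, kernel, padding=kernel // 2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2),
            nn.ReLU(),
        )
        self.proj = nn.Linear(hidden, 1)

    def forward(self, x):                         # x: (batch, T, d_model)
        h = self.convs(x.transpose(1, 2)).transpose(1, 2)
        return self.proj(h).squeeze(-1)           # (batch, T)

# During training the predictors are supervised with ground-truth pitch and
# energy (e.g. obtained via forced alignment); at inference their predictions
# condition the decoder, which enables the adjustable prosody described above.
x = torch.randn(2, 50, 256)
print(VariancePredictor()(x).shape, VariancePredictor()(x).shape)
# torch.Size([2, 50]) torch.Size([2, 50])
```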
FastSpeech also provides an excellent basis for other frameworks. FastPitch is a parallel text-to-speech model built upon FastSpeech, designed to enhance speech expressiveness and quality by explicitly predicting fundamental frequency (F0) contours (Łańcucki, 2021). The
architecture utilizes two feed-forward Transformer
stacks: one processes input tokens, while the other
generates mel-spectrogram frames. By conditioning
on F0 values predicted at the granularity of input
symbols, the model resolves pronunciation
ambiguities and improves speech naturalness. During
inference, users can intuitively adjust predicted pitch
values to modify prosody, enabling natural voice
modulation while preserving speaker identity.
FastPitch achieves exceptional synthesis speed, with
a real-time factor exceeding 900 times for mel-
spectrogram generation on GPUs, outperforming
autoregressive models like Tacotron 2. Evaluations
on the LJSpeech-1.1 dataset demonstrate superior
Mean Opinion Scores (4.080) compared to Tacotron
2 (3.946) and multi-speaker models such as Flowtron.
Unlike FastSpeech 2, which predicts frame-level F0,
FastPitch operates at the symbol level, simplifying
interactive pitch editing without compromising
quality. The model also supports multi-speaker
synthesis through speaker embeddings, delivering
state-of-the-art performance. By combining high-
speed parallel synthesis with flexible prosodic
control, FastPitch advances applications in expressive
and real-time speech generation, offering practical
advantages in both quality and usability.
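Because pitch is predicted per input symbol, inference-time prosody editing reduces to transforming one F0 value per symbol before it is embedded and added to the symbol representations. The sketch below illustrates that idea; the function name, the linear pitch embedding, and the scale/shift controls are hypothetical simplifications, not FastPitch's actual API:

```python
import torch
import torch.nn as nn

d_model = 256
pitch_embedding = nn.Linear(1, d_model)   # illustrative mapping of F0 to model space

def condition_on_pitch(symbol_hidden, predicted_f0, pitch_scale=1.0, pitch_shift=0.0):
    """Add (optionally edited) symbol-level pitch information to encoder outputs.

    symbol_hidden: (batch, num_symbols, d_model) output of the first FFT stack
    predicted_f0:  (batch, num_symbols) one F0 value per input symbol
    pitch_scale / pitch_shift: simple user-controlled prosody edits
    """
    edited_f0 = predicted_f0 * pitch_scale + pitch_shift
    return symbol_hidden + pitch_embedding(edited_f0.unsqueeze(-1))

hidden = torch.randn(1, 20, d_model)
f0 = torch.rand(1, 20) * 100 + 150                       # fake contour, 150-250 Hz
monotone = condition_on_pitch(hidden, f0, pitch_scale=0.0, pitch_shift=200.0)
raised = condition_on_pitch(hidden, f0, pitch_shift=40.0)  # raise the whole contour
print(monotone.shape, raised.shape)  # torch.Size([1, 20, 256]) twice
```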
ProbSparseFS and LinearizedFS are also two newly
proposed FastSpeech-based TTS frameworks that
significantly improve inference speed and memory
efficiency while maintaining speech quality through
efficient self-attention mechanisms and compact
feed-forward networks (Xiao et al., 2022).
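The exact attention variants of ProbSparseFS and LinearizedFS are beyond the scope of this survey, but the general principle behind linearized attention is to replace the softmax with a kernel feature map so that the cost grows linearly with sequence length rather than quadratically. A minimal sketch of that general technique follows, using the common elu(x) + 1 feature map as an assumption rather than the specific choice made in LinearizedFS:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps: float = 1e-6):
    """O(T) attention via a kernel feature map instead of softmax.

    q, k, v: (batch, heads, T, d_k). Generic linear-attention sketch; not
    necessarily the exact formulation used in LinearizedFS.
    """
    q = F.elu(q) + 1                                   # non-negative phi(q)
    k = F.elu(k) + 1                                   # non-negative phi(k)
    kv = torch.einsum("bhtd,bhte->bhde", k, v)         # sum_t phi(k_t) v_t^T
    z = 1.0 / (torch.einsum("bhtd,bhd->bht", q, k.sum(dim=2)) + eps)
    return torch.einsum("bhtd,bhde,bht->bhte", q, kv, z)

q = k = v = torch.randn(2, 4, 100, 64)
print(linear_attention(q, k, v).shape)  # torch.Size([2, 4, 100, 64])
```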
The last non-autoregressive framework to be introduced is LightTTS, a lightweight, multi-speaker, multi-lingual text-to-speech (TTS) system (Li et al., 2021). It achieves fast speech synthesis of different languages, or of code-switched speech, through deep
learning. Notably, compared with traditional attention-based autoregressive frameworks, LightTTS employs non-autoregressive generation, which considerably improves the synthesis speed and drastically reduces the framework parameters. Experiments showed that LightTTS can generate mel-spectrograms 2.5 times faster than FastSpeech while using 12.83 times fewer parameters. It is comparable to real speech in terms of naturalness and similarity, demonstrating its promise for a wide range of applications in multilingual environments.
3 LIMITATIONS AND DIRECTIONS FOR DEVELOPMENT
Research in the field of Speech Synthesis has made
remarkable progress in recent years, leading to
significant advancements in the naturalness and
expressiveness of generated speech. These
developments have far-reaching implications across
various industries, including entertainment,
education, healthcare, and human-computer
interaction. However, despite these achievements,
several challenges remain unresolved, indicating that
the field still has a long way to go before achieving
truly human-like and versatile speech synthesis
systems.
One of the most pressing issues is the latent problem
of privacy leakage, which arises from the misuse of
large datasets required to train sophisticated
frameworks. The collection and utilization of voice
data often occur without explicit consent or proper
anonymization, raising ethical and legal concerns.
Additionally, the quality of voice data in these
datasets is highly inconsistent, leading to uneven
training effects for specific frameworks. While
screening datasets to select appropriate resources
could mitigate this issue, the process is often costly
and time-consuming, posing a significant barrier to
efficient model development.
Another critical limitation is the current framework's
inability to achieve effective one-to-many mapping.
Existing systems still rely heavily on additional
inputs, such as intonation, accent, rhythm, and
emotional cues, to generate high-quality speech. This
dependency not only increases the complexity of the
frameworks but also demands substantial human
effort in data annotation and preprocessing.
Furthermore, the application of explicit modeling
techniques, while effective in certain scenarios,
significantly raises the cost and time required for both
training and deploying these frameworks. On the
other hand, implicit modeling, though more efficient,
often falls short in terms of performance and
versatility.
The generation speed of current frameworks, despite
noticeable improvements, remains a bottleneck for
real-time voice interactions. While recent
advancements have accelerated the synthesis process,
the computational demands of high-quality speech
generation still hinder seamless real-time
applications, particularly in scenarios requiring low
latency and high responsiveness.
Looking ahead, this paper proposes several directions
for future research to address these challenges. First,
the disclosure and misuse of personal privacy can be
mitigated through stricter legislation and regulations
governing the collection and use of datasets.
Establishing clear guidelines and ethical standards for
data usage will be crucial in building trust and
ensuring compliance. Second, conducting rigorous
pre-screening of datasets to ensure quality and
consistency will enhance their utility and improve
training outcomes. Third, optimizing algorithms and
exploring innovative ways to combine different
frameworks or integrate frameworks with
convolutional techniques could pave the way for
simultaneous improvements in generation speed and
speech quality. Finally, leveraging advanced
networks such as Variational Autoencoders (VAEs)
and Generative Adversarial Networks (GANs) could
enhance the one-to-many mapping capabilities of
speech synthesis systems, enabling more flexible and
expressive voice generation.
In conclusion, while significant strides have been
made in speech synthesis, addressing the remaining
challenges will require a multidisciplinary approach
that combines technological innovation, ethical
considerations, and regulatory frameworks. By
focusing on these areas, the field can move closer to
achieving truly human-like, efficient, and privacy-
conscious speech synthesis systems.
4 CONCLUSIONS
This paper discussed deep learning-based speech
synthesis techniques focusing on autoregressive and
non-autoregressive frameworks. Autoregressive
frameworks like Transformer TTS are capable of
generating more natural and expressive speech, but slowly and at high cost. Non-autoregressive frameworks such as FastSpeech increase generation speed but lack accuracy and fluency. This paper also introduces
various improved versions that have been explored
and refined to increase synthesis speed, reduce framework
parameters, improve speech quality, and adapt to
multilingual environments. Despite significant
progress, the field still faces problems such as dataset misuse,
high framework complexity, one-to-many mapping,
high framework training cost, and slow generation
speed. Future research can explore different
frameworks or the combination of frameworks and
convolution to improve the generation speed and
quality, while drawing on reference encoders, global style tokens, VAEs, GANs, and other techniques to
improve the one-to-many mapping capability of
frameworks. Ultimately, autoregressive and non-
autoregressive research continues to promote speech
synthesis technology toward greater efficiency and naturalness, providing a technical basis for intelligent
interaction and multi-scene applications.
REFERENCES
Tan, X., Qin, T., Soong, F., & Liu, T. Y. (2021). A survey
on neural speech synthesis. arXiv preprint
arXiv:2106.15561.
Tang, H., Zhang, X., Wang, J., Cheng, N., & Xiao, J. (2023). A survey of expressive speech synthesis. Big Data Research, 9(6), 53-71.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30, 5998-6008.
Zhao, W. (2023). Research on deep neural network acoustic modeling for efficient speech synthesis (PhD dissertation, Zhejiang University). https://link.cnki.net/doi/10.27461/d.cnki.gzjdx.2023.000815
Liu, Z., Wang, S., Inoue, S., Bai, Q., & Li, H. (2024).
Autoregressive Diffusion Transformer for Text-to-
Speech Synthesis. arXiv preprint arXiv:2406.05551.
Li, N., Liu, Y., Wu, Y., Liu, S., Zhao, S., & Liu, M. (2020,
April). Robutrans: A robust transformer-based text-to-
speech model. In Proceedings of the AAAI conference
on artificial intelligence (Vol. 34, No. 05, pp. 8228-
8235).
Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., &
Liu, T. Y. (2019). Fastspeech: Fast, robust and
controllable text to speech. Advances in neural
information processing systems, 32.
Ren, Y., Hu, C., Tan, X., Qin, T., Zhao, S., Zhao, Z., & Liu,
T. Y. (2020). Fastspeech 2: Fast and high-quality end-
to-end text to speech. arXiv preprint arXiv:2006.04558.
Łańcucki, A. (2021, June). Fastpitch: Parallel text-to-
speech with pitch prediction. In ICASSP 2021-2021
IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP) (pp. 6588-6592).
IEEE.
Xiao, Y., Wang, X., He, L., & Soong, F. K. (2022, May).
Improving fastspeech tts with efficient self-attention
and compact feed-forward network. In ICASSP 2022-
2022 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP) (pp. 7472-
7476). IEEE.
Li, S., Ouyang, B., Li, L., & Hong, Q. (2021, June). Light-
tts: Lightweight multi-speaker multi-lingual text-to-
speech. In ICASSP 2021-2021 IEEE International
Conference on Acoustics, Speech and Signal
Processing (ICASSP) (pp. 8383-8387). IEEE.