frameworks. Moreover, no additional teacher model is needed to achieve good synthesis results.
2.2 FastSpeech and its improved versions
A typical example of a non-autoregressive model is FastSpeech (Ren et al., 2019), a Transformer-based end-to-end TTS system proposed to tackle the slow inference and limited controllability of autoregressive models. Its generation process works as follows: a phoneme sequence is first converted into embeddings and processed by a stack of encoder-side FFT blocks. The resulting hidden states are then passed to a Length Regulator, which expands each phoneme state according to a predicted duration; this determines the length of the output mel-spectrogram and allows the rhythm of the synthesized speech to be controlled. Finally, a second stack of FFT blocks decodes the expanded sequence into the mel-spectrogram. In addition, FastSpeech adopts knowledge distillation from an autoregressive teacher to improve audio quality: distillation simplifies the distribution of the target data and thereby alleviates the one-to-many mapping problem in TTS. Overall, FastSpeech significantly improves generation speed compared with the autoregressive Transformer TTS but gives up some generation quality, and its teacher-student knowledge distillation pipeline is complex and time-consuming.
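To make the role of the Length Regulator concrete, the sketch below expands phoneme-level hidden states into frame-level ones according to predicted durations; it is a minimal PyTorch illustration under our own naming and shape assumptions, not the original implementation.

import torch

def length_regulate(hidden: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Expand each phoneme hidden state to its predicted number of frames.
    hidden:    (num_phonemes, hidden_dim) encoder FFT output
    durations: (num_phonemes,) integer frame counts from a duration predictor
    returns:   (sum(durations), hidden_dim) frame-level sequence
    """
    # repeat_interleave copies row i of `hidden` durations[i] times, so the
    # output length matches the target mel-spectrogram length.
    return torch.repeat_interleave(hidden, durations, dim=0)

# Example: 3 phonemes with durations of 2, 1 and 3 frames.
hidden = torch.randn(3, 4)
durations = torch.tensor([2, 1, 3])
print(length_regulate(hidden, durations).shape)  # torch.Size([6, 4])

Scaling all durations by a constant before expansion (e.g. 1.2 for slower speech) is what gives FastSpeech its rhythm control.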
FastSpeech 2 is one of the revised versions proposed by the same team (Ren et al., 2020). It is an advanced
non-autoregressive text-to-speech (TTS) model
designed to address the limitations of its predecessor,
FastSpeech, while enhancing synthesis quality and
efficiency. By eliminating the complex teacher-
student distillation pipeline, FastSpeech 2 directly
trains on ground-truth mel-spectrograms, avoiding
information loss and simplifying the training process.
To alleviate the one-to-many mapping challenge in
TTS, the model introduces variance information such
as pitch, energy, and precise phoneme duration
extracted through Montreal Forced Aligner (MFA).
Notably, pitch prediction is improved via continuous
wavelet transform, which models pitch variations in
the frequency domain for higher accuracy.
Additionally, FastSpeech 2s extends the framework
by enabling fully end-to-end text-to-waveform
synthesis, bypassing intermediate mel-spectrogram
generation and achieving faster inference speeds.
Evaluations on the LJSpeech dataset demonstrate that
FastSpeech 2 surpasses autoregressive models like
Tacotron 2 in voice quality (MOS: 3.83 vs. 3.70) and
reduces training time by threefold compared to
FastSpeech. The model's ability to integrate
variance control (e.g., adjustable pitch and energy)
enhances prosody customization while maintaining
naturalness. By combining simplified training,
enhanced variance modeling, and end-to-end
capabilities, FastSpeech 2 advances the practicality of
high-quality, real-time speech synthesis with robust
controllability.
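As a rough illustration of how the variance adaptor injects this information, the sketch below quantizes a predicted pitch (or energy) contour into bins and adds the corresponding embeddings to the hidden sequence; the bin count, value range, and module names are illustrative assumptions rather than the paper's exact configuration.

import torch
import torch.nn as nn

class VarianceEmbedding(nn.Module):
    def __init__(self, n_bins: int = 256, hidden_dim: int = 256,
                 vmin: float = 0.0, vmax: float = 1.0):
        super().__init__()
        # Boundaries for quantizing a continuous value (pitch or energy).
        self.register_buffer("bins", torch.linspace(vmin, vmax, n_bins - 1))
        self.embed = nn.Embedding(n_bins, hidden_dim)

    def forward(self, hidden: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
        # hidden: (T, hidden_dim); values: (T,) predicted pitch/energy contour.
        idx = torch.bucketize(values, self.bins)
        return hidden + self.embed(idx)

# Assumed F0 range in Hz; at inference, scaling `f0` before the lookup is one
# simple way to realize the adjustable pitch/energy control described above.
pitch_embed = VarianceEmbedding(vmin=0.0, vmax=600.0)
hidden = torch.randn(10, 256)
f0 = torch.rand(10) * 300.0
print(pitch_embed(hidden, f0).shape)  # torch.Size([10, 256])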
FastSpeech also provides a strong foundation for other models. FastPitch, for instance, is a parallel text-to-speech model built upon FastSpeech that enhances speech expressiveness and quality by explicitly predicting fundamental frequency (F0) contours (Łańcucki, 2021). The
architecture utilizes two feed-forward Transformer
stacks: one processes input tokens, while the other
generates mel-spectrogram frames. By conditioning
on F0 values predicted at the granularity of input
symbols, the model resolves pronunciation
ambiguities and improves speech naturalness. During
inference, users can intuitively adjust predicted pitch
values to modify prosody, enabling natural voice
modulation while preserving speaker identity.
FastPitch achieves exceptional synthesis speed, with
a real-time factor exceeding 900 times for mel-
spectrogram generation on GPUs, outperforming
autoregressive models like Tacotron 2. Evaluations
on the LJSpeech-1.1 dataset demonstrate superior
Mean Opinion Scores (4.080) compared to Tacotron
2 (3.946) and multi-speaker models such as Flowtron.
Unlike FastSpeech 2, which predicts frame-level F0,
FastPitch operates at the symbol level, simplifying
interactive pitch editing without compromising
quality. The model also supports multi-speaker
synthesis through speaker embeddings, delivering
state-of-the-art performance. By combining high-
speed parallel synthesis with flexible prosodic
control, FastPitch advances applications in expressive
and real-time speech generation, offering practical
advantages in both quality and usability.
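The symbol-level conditioning can be sketched as follows: one F0 value per input token is projected and added to that token's hidden state before length regulation, so interactive pitch editing amounts to modifying a short per-symbol vector. The projection layer and value ranges below are assumptions for illustration, not FastPitch's published hyperparameters.

import torch
import torch.nn as nn

hidden_dim = 256
pitch_proj = nn.Linear(1, hidden_dim)  # maps a scalar F0 to the hidden size

tokens_hidden = torch.randn(7, hidden_dim)   # hidden states for 7 input symbols
predicted_f0 = torch.rand(7) * 200.0 + 80.0  # per-symbol F0 in Hz (assumed range)

# Interactive edit: raise pitch 20% on every symbol; indexing into
# `predicted_f0` instead would edit individual words or phonemes.
edited_f0 = predicted_f0 * 1.2

conditioned = tokens_hidden + pitch_proj(edited_f0.unsqueeze(-1))
print(conditioned.shape)  # torch.Size([7, 256])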
ProbSparseFS and LinearizedFS are two more recently proposed FastSpeech-based TTS frameworks that significantly improve inference speed and memory efficiency while maintaining speech quality, through efficient self-attention mechanisms and compact feed-forward networks (Xiao et al., 2022).
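As a generic illustration of the linearized self-attention such frameworks build on, the sketch below replaces softmax(QK^T)V with a kernel feature map φ(·) so that φ(Q)(φ(K)^T V) costs O(T) rather than O(T^2) in sequence length; the elu+1 feature map follows common linear-attention practice and is not necessarily the exact choice of Xiao et al. (2022).

import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps: float = 1e-6):
    # q, k, v: (T, d). The elu+1 feature map keeps attention weights positive.
    q, k = F.elu(q) + 1.0, F.elu(k) + 1.0
    kv = k.t() @ v                           # (d, d): summarize keys/values once
    z = q @ k.sum(dim=0, keepdim=True).t()   # (T, 1): per-query normalizer
    return (q @ kv) / (z + eps)

q, k, v = (torch.randn(100, 64) for _ in range(3))
print(linear_attention(q, k, v).shape)  # torch.Size([100, 64])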
The last non-autoregressive framework to be introduced here is LightTTS, a lightweight, multi-speaker, multi-language text-to-speech (TTS) system (Li et al., 2021). It achieves fast speech synthesis of different languages or codes by deep