
2.4 Hidden Markov Model and Parametric Synthesis
Although waveform concatenation and other traditional synthesis methods partially address prosodic discontinuities, further optimization is needed. In practice, traditional concatenation methods often suffer from unstable synthesis quality and low robustness caused by incorrect unit selection. To solve this problem, a better way of selecting speech units is required, one that can accurately predict the acoustic parameters corresponding to a given text under different conditions.
The Hidden Markov Model (HMM) is a doubly stochastic process in which the specific state sequence is unobservable but its transition probabilities are known: the state transitions are hidden, while the observable events are random functions of the hidden states (Jing, 2012).
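A minimal numerical sketch can make this concrete. The two-state transition and emission probabilities below are illustrative assumptions, not values from the paper; only the emissions are visible, while the state sequence stays hidden:

import numpy as np

# Toy two-state HMM: the state sequence is hidden; only the emissions
# are observed. All probabilities below are illustrative assumptions.
A = np.array([[0.7, 0.3],          # state transition probabilities
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],          # emission probabilities per state
              [0.2, 0.8]])
pi = np.array([0.6, 0.4])          # initial state distribution

rng = np.random.default_rng(0)
hidden, observed = [], []
state = rng.choice(2, p=pi)
for _ in range(10):
    hidden.append(state)
    observed.append(rng.choice(2, p=B[state]))  # observable event: a random function of the hidden state
    state = rng.choice(2, p=A[state])           # hidden state transition

print("hidden  :", hidden)    # unobservable in practice
print("observed:", observed)  # the only sequence a system actually sees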
From a statistical perspective, human speech production is also a doubly stochastic process. Based on expressive needs, the brain organizes language according to grammatical rules and generates a series of unobservable commands that control the speech organs, ultimately producing observable acoustic parameters. This generation process closely mirrors the structure of an HMM, which makes the HMM a cornerstone of statistical parametric speech synthesis (Wu, 2006).
An HMM-based system involves two phases: training and synthesis. The training phase consists of five steps: model initialization, HMM training, context-dependent HMM training, decision tree-based training, and duration modeling. Once trained, the model can generate a state-level sequence of acoustic feature vectors from the input text, which a filter then converts into a speech signal (Wu, 2006). Compared with traditional methods, HMM-based modeling offers greater flexibility, does not require an extensive speech corpus, and significantly reduces the time needed to build a model, making it better suited to lightweight and embedded platforms.
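As a simplified sketch of the statistical core of this pipeline (not the paper's actual system), a Gaussian HMM from the hmmlearn library can be fitted to acoustic feature frames and then used to generate a state-aligned feature sequence for a filter or vocoder. The state count, feature dimension, and random training data below are assumptions for illustration:

import numpy as np
from hmmlearn import hmm

# Assumed stand-in for real acoustic feature frames (e.g. spectral
# parameters plus F0) for one speech unit: shape (n_frames, n_dims).
rng = np.random.default_rng(0)
train_frames = rng.normal(size=(500, 25))

# Training phase (simplified): initialize and re-estimate a 5-state
# Gaussian HMM. Context-dependent modeling, decision-tree-based
# training, and explicit duration modeling are omitted in this sketch.
model = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=50)
model.fit(train_frames)

# Synthesis phase (simplified): generate a state-aligned sequence of
# acoustic feature vectors; a filter/vocoder would turn these into speech.
frames, states = model.sample(200)
print(frames.shape, states[:20])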
3 SPEECH SYNTHESIS TECHNOLOGY BASED ON DEEP LEARNING
3.1 Speech Synthesis Technology Based on Deep Learning
Early statistical approaches, typified by HMM-based synthesis, inevitably smooth away the fine structure of the natural speech spectrum when it is represented by statistical parameters. Moreover, because of limited computing power, they can only consider the influence of one or two adjacent phonemes, so potentially meaningful information from the preceding text is discarded and lost (Pan, 2021).
In end-to-end speech generation, the system consists of two parts: an acoustic model and a vocoder. The acoustic model handles the temporal alignment between text and speech, and the vocoder restores the output of the acoustic model into a speech waveform (Zhang, 2021). The essence of speech generation is to simulate sound through a series of acoustic parameters. Acoustic parameters are complex data that are difficult to model by hand. By building models with many hidden layers and training them on large amounts of data, deep learning can learn useful features automatically, which directly addresses the difficulty of modeling acoustic parameters and selecting features manually.
The deep neural network (DNN) is a common model for acoustic parameters. In training, the minimum mean square error criterion is usually used: the model parameters are continuously adjusted to minimize the error between the predicted and the target acoustic parameters. In synthesis, text features are extracted, the trained DNN predicts the acoustic parameters using duration information provided by other systems (such as the PSOLA algorithm), and the result is then fed into the vocoder to obtain synthesized speech (Zhang, 2020).
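A minimal sketch of such a DNN acoustic model, written in PyTorch, illustrates the minimum mean square error training loop described above. The input and output dimensions (300 text-feature and 60 acoustic-parameter dimensions) and the random tensors standing in for a real corpus are assumptions for demonstration only:

import torch
import torch.nn as nn

# Frame-level DNN acoustic model: text/linguistic features in,
# acoustic parameters out. All dimensions are illustrative.
class AcousticDNN(nn.Module):
    def __init__(self, in_dim=300, out_dim=60, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

model = AcousticDNN()
criterion = nn.MSELoss()                                   # minimum mean square error criterion
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

text_feats = torch.randn(32, 300)                          # dummy batch of frame-level text features
target_params = torch.randn(32, 60)                        # dummy target acoustic parameters

for step in range(100):                                    # adjust parameters to shrink the prediction error
    loss = criterion(model(text_feats), target_params)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

with torch.no_grad():                                      # synthesis: predicted parameters would go to a vocoder
    predicted_params = model(text_feats)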
3.2 Processing and Limitations of Voice Change
Under ideal conditions, the deep learning-based speech synthesis algorithms described above can generate fluent and natural speech with good accuracy. In real life, however, acoustic parameters include more than prosodic parameters. A single speaker does not produce only one kind of voice: volume, pitch, vocal emotion, and switching between modal voice and falsetto all affect the acoustic parameters. Moreover, emotion also influences how a person controls various parts of the body, including the vocal organs, and these special control patterns under emotional conditions produce characteristic changes in the parameters of emotional speech. In a synthesis system, when the prosodic acoustic parameters vary over a wide range, the coordination among the articulatory organs has a greater impact on the speech, and the spectrum and the filter can no longer simply be treated as completely independent parameters (Wang, 2013). The sole use