traditional features such as prosodic features, wavelet features, spectral features, and cepstral features from the frames, segments, and utterances of speech signals, alongside features extracted automatically by deep learning models (Lieskovská, Jakubec, Jarina, & Chmulík, 2021; Ramyasree & Kumar, 2023). Several classic tools, such as PRAAT, APARAT, OpenSMILE, and OpenEAR, can be used to extract features from speech signals (Shukla & Jain, 2022).
Finally, a classifier is employed to map the extracted emotional features to emotion categories and build the SER model. Among these stages, feature extraction is a critical process with a decisive impact on recognition performance.
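As a brief illustration of this pipeline, the following sketch uses the openSMILE Python wrapper (the opensmile package) together with scikit-learn; the file names and labels are hypothetical, and any standard classifier could take the place of the SVM.

```python
import opensmile
from sklearn.svm import SVC

# Extract utterance-level functionals (eGeMAPS feature set) with openSMILE.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

# Hypothetical corpus: one WAV file per utterance with a known emotion label.
files = ["happy_01.wav", "sad_01.wav", "angry_01.wav"]
labels = ["happy", "sad", "angry"]

X = [smile.process_file(f).to_numpy().ravel() for f in files]

# The classifier stage: categorize the extracted features into emotions.
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict(X))
```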
3.2 Feature Extraction
3.2.1 Traditional Feature Extraction
Traditional handcrafted features include prosodic
features, wavelet features, spectral features, and
cepstral features. Prosodic features mainly cover
aspects such as stress, pauses, and intonation,
reflecting the rhythm and pitch variations in speech.
Wavelet features are derived using wavelet transform techniques, which effectively capture the localized time-frequency characteristics of speech signals and offer clear advantages in analyzing non-stationary signals. Spectral features are obtained by
transforming time-domain signals into the frequency
domain through methods such as Fourier Transform,
providing insights into the energy distribution of
speech signals across different frequency bands.
Cepstral features are parameters obtained by further processing spectral features; the most representative are Mel-Frequency Cepstral Coefficients (MFCCs) and Perceptual Linear Prediction (PLP) coefficients. These features can reflect subtle spectral adjustments that accompany emotional changes and, because they are built on perceptually motivated frequency scales, approximate how the human auditory system processes speech.
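For illustration, a minimal sketch of cepstral feature extraction with the librosa library (the file name is hypothetical): MFCCs are computed per frame, and their first-order deltas capture how the cepstrum evolves over time.

```python
import librosa

# Load a hypothetical utterance (librosa resamples to 22,050 Hz by default).
y, sr = librosa.load("utterance.wav")

# 13 Mel-Frequency Cepstral Coefficients per frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# First-order deltas describe how the cepstrum changes between frames,
# which is often informative for emotional content.
delta = librosa.feature.delta(mfcc)

print(mfcc.shape, delta.shape)  # (13, n_frames) each
```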
3.2.2 Machine Learning-Based Feature
Extraction
Machine learning models can learn to extract emotional features from speech signals automatically, with different deep learning architectures capturing distinct types of features.
Features extracted using Convolutional Neural
Networks (CNNs) primarily include local spectral
features and hierarchical features. Compared with
traditional spectral features, the spectral features extracted by CNNs are higher-dimensional, more abstract, and harder to interpret. However, this approach avoids the subjectivity of manual feature engineering, saves time, and adapts well to different types of speech data, while still describing spectral variations comprehensively and in fine detail. The hierarchical features extracted by CNNs become progressively more abstract and condensed as network depth increases: each layer further processes the output of the previous layer, yielding a more comprehensive and accurate representation of the intrinsic structure of the speech signal and thereby improving the recognition accuracy of the model.
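A minimal sketch of this idea in PyTorch (an illustrative architecture, not one prescribed by the works cited here): stacked convolution and pooling layers turn a mel-spectrogram into increasingly abstract feature maps, which a final linear layer maps to emotion classes.

```python
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    """Hierarchical feature extractor over a (1, n_mels, n_frames) input."""
    def __init__(self, n_classes: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            # Early layers respond to local spectro-temporal patterns.
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            # Deeper layers combine them into more abstract representations.
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # utterance-level summary vector
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

# Example: a batch of two 64-mel spectrograms, 128 frames each.
logits = SpectrogramCNN()(torch.randn(2, 1, 64, 128))
print(logits.shape)  # torch.Size([2, 4])
```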
Features extracted using Recurrent Neural
Networks (RNNs) and their variants possess temporal
dynamics and contextual dependency. Temporal dynamic features are extracted by combining the features of the current time step with information from earlier steps through the recurrent structure of RNNs, capturing rising, falling, or steady trends in these features. This also allows the model to detect
periodicity in speech signals and, through learning
from previous cycles, extract related features more
accurately. Context-dependent features are derived from contextual cues such as pauses and speech rate, rather than from isolated features at a single time point. RNNs and their variants can
leverage these features to account for the coherence
of speech, leading to a more accurate understanding
of overall emotional states.
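A corresponding PyTorch sketch (again an illustrative configuration rather than a specific model from the cited literature): a bidirectional LSTM consumes a sequence of frame-level features, and its final hidden states summarize the temporal and contextual dynamics of the utterance.

```python
import torch
import torch.nn as nn

class EmotionLSTM(nn.Module):
    """Sequence model over frame-level features, e.g. per-frame MFCC vectors."""
    def __init__(self, n_features: int = 13, hidden: int = 64, n_classes: int = 4):
        super().__init__()
        # Bidirectional, so each time step sees both past and future context.
        self.rnn = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_frames, n_features); h_n holds the final hidden state
        # of each direction, summarizing the whole sequence.
        _, (h_n, _) = self.rnn(x)
        summary = torch.cat([h_n[0], h_n[1]], dim=1)
        return self.classifier(summary)

# Example: a batch of two utterances, 200 frames of 13 MFCCs each.
logits = EmotionLSTM()(torch.randn(2, 200, 13))
print(logits.shape)  # torch.Size([2, 4])
```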
3.2.3 Advantages of Machine Learning in
SER
With the advancement of deep learning, end-to-end deep SER has gained increasing attention; such systems can feed raw emotional speech signals or handcrafted features directly into deep learning models (Luo, Ran, Yang, & Dou, 2022). The
integration of machine learning technology with SER
brings numerous advantages. Firstly, the powerful
self-learning ability of machine learning allows it to
automatically extract features from large amounts of
speech data, offering stronger adaptability and more
representative features compared to traditional
speech recognition methods (Lieskovská, Jakubec,
Jarina, & Chmulík, 2021). Secondly, when dealing
with imbalanced datasets, machine learning algorithms, such as convolutional recurrent neural networks (CRNNs) with variable-length inputs and focal loss, can adjust the contribution of different samples to the total loss, enabling the model to perform well even on minority classes (Liang, Li, &