A big-data approach achieved 93.3% accuracy, as documented in (Guo, 2022). The model provided reliable functionality for extensive data evaluation, although its complex preprocessing requirements and extensive feature specifications created obstacles for real-time system deployment.
By combining a CNN with multi-head convolutional transformers, researchers achieved 82.31% accuracy (Ullah et al., 2023) on the IEMOCAP and RAVDESS databases. The transformer architecture reached adequate accuracy benchmarks, but this achievement came at the expense of processing speed and heavy memory consumption.
A systematic review by Madanian et al. (2023) showed how ML approaches combining SVM and Random Forest algorithms with noise-reduction methods applied to MFCC extraction reached 91% precision. This approach improved tolerance to noisy speech data but reduced response quality in noise-free situations because of overused data augmentation methods.
3 RESEARCH METHODOLOGY
The research employed deep learning models for
systematic speech emotion recognition (SER),
encompassing data collection, preprocessing, feature
extraction, and classification. The dataset was first
organized by labeling speech recordings, followed by
visualization using waveform and spectrogram
representations. Mel-Frequency Cepstral Coefficients
(MFCC) captured key spectral features.
Preprocessing involved dimension expansion and
one-hot encoding for compatibility with deep
learning architectures. Various models, including
TCN, RNN, ANN, RCNN, and LSTM, were tested.
The dataset was split into training, validation, and test
sets, and model performance was assessed using
confusion matrices, validation accuracy, and loss
metrics. The Toronto Emotional Speech Set (TESS), featuring 2,800 recordings from two female actors expressing seven emotions (anger, disgust, fear, happiness, pleasant surprise, sadness, and neutrality), was used. TESS is widely utilized in affective computing and machine learning to enhance emotion-aware applications.
Figure 2 shows the research methodology.
Figure 2: Research Methodology.
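To make this evaluation protocol concrete, the sketch below builds one of the listed models, an LSTM, splits the data, and computes a confusion matrix. It assumes Keras and scikit-learn; the layer sizes, split ratios, and training settings are illustrative assumptions rather than the exact configuration reported in this study.

    from sklearn.model_selection import train_test_split
    from sklearn.metrics import confusion_matrix
    from tensorflow.keras import layers, models

    def build_lstm(input_shape, num_classes):
        # Compact LSTM classifier; layer sizes are illustrative assumptions.
        return models.Sequential([
            layers.Input(shape=input_shape),
            layers.LSTM(128),
            layers.Dense(64, activation="relu"),
            layers.Dropout(0.3),
            layers.Dense(num_classes, activation="softmax"),
        ])

    def train_and_evaluate(X, y):
        # X: (n_samples, n_mfcc, 1) features; y: one-hot labels (Section 3.1).
        # Hold out a test set, then let Keras carve out a validation split.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, stratify=y.argmax(axis=1), random_state=42)
        model = build_lstm(X.shape[1:], y.shape[1])
        model.compile(optimizer="adam", loss="categorical_crossentropy",
                      metrics=["accuracy"])
        model.fit(X_tr, y_tr, validation_split=0.1, epochs=50, batch_size=32)
        y_pred = model.predict(X_te).argmax(axis=1)
        return confusion_matrix(y_te.argmax(axis=1), y_pred)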
3.1 Data Preprocessing
The voice dataset was methodically prepared for model training and analysis. For precise emotion identification, audio files were loaded and sorted by filename. Librosa was utilized to clean signals, extract features, and minimize noise, with a consistent duration and offset enforced across files. Waveform and spectrogram representations were used to examine changes in pitch, tone, and frequency, while MFCCs captured the important spectral features for Speech Emotion Recognition (SER) and stored them in numerical form.
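A minimal sketch of this loading and feature-extraction step is given below; the directory layout, duration, offset, and number of coefficients are assumptions for illustration, not the exact values used in this study.

    import os
    import numpy as np
    import librosa

    def extract_mfcc(path, duration=3.0, offset=0.5, n_mfcc=40):
        # Load a fixed window of each clip so durations stay consistent
        # (the duration/offset values here are assumptions).
        signal, sr = librosa.load(path, duration=duration, offset=offset)
        # n_mfcc coefficients per frame, averaged over time into one vector.
        mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
        return np.mean(mfcc.T, axis=0)

    def load_dataset(root):
        # TESS filenames end with the emotion label, e.g. OAF_back_angry.wav.
        features, labels = [], []
        for name in sorted(os.listdir(root)):
            if name.endswith(".wav"):
                features.append(extract_mfcc(os.path.join(root, name)))
                labels.append(name.split("_")[-1].replace(".wav", ""))
        return np.array(features), np.array(labels)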
To ensure alignment during training, the processed features were shaped for model input, and the categorical labels were one-hot encoded to fit the deep learning models. This pipeline decreased variability, optimized the dataset, and improved model performance in emotion classification.
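A sketch of this shaping and encoding step, assuming scikit-learn's LabelEncoder and the Keras to_categorical utility (function and variable names are illustrative):

    import numpy as np
    from sklearn.preprocessing import LabelEncoder
    from tensorflow.keras.utils import to_categorical

    def prepare_inputs(features, labels):
        # (n_samples, n_mfcc) -> (n_samples, n_mfcc, 1): the added channel
        # dimension makes the vectors compatible with Conv1D/LSTM layers.
        X = np.expand_dims(np.asarray(features), axis=-1)
        # Map string emotion labels to integers, then to one-hot vectors.
        y = to_categorical(LabelEncoder().fit_transform(labels))
        return X, y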
3.2 Exploratory Data Analysis
Exploratory data analysis (EDA) was used to explore and understand the distribution of emotions in the dataset, which is a major influencing factor for this type of analysis. The speech samples and their emotional properties were analyzed using a variety of statistical and graphical techniques. Wave plots displaying speech signals in the time domain allowed us to study amplitude and intensity variability across emotions. Such variations provided insights into the way emotional expressions influence the dynamics of speech.
Moreover, MFCC visualizations were used in this analysis to capture the speech spectrum, showing how frequency components are affected in different emotional states. Figure 3 shows a bar plot of the emotion distribution.
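The sketch below shows how such wave plots and MFCC visualizations could be produced with Librosa and Matplotlib; the function name and plotting parameters are hypothetical.

    import librosa
    import librosa.display
    import matplotlib.pyplot as plt

    def plot_emotion_sample(path, emotion):
        # One labeled clip: time-domain waveform plus an MFCC heat map.
        signal, sr = librosa.load(path)
        fig, (ax_wave, ax_mfcc) = plt.subplots(2, 1, figsize=(10, 6))
        librosa.display.waveshow(signal, sr=sr, ax=ax_wave)
        ax_wave.set_title(f"Waveform - {emotion}")
        mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40)
        img = librosa.display.specshow(mfcc, sr=sr, x_axis="time", ax=ax_mfcc)
        ax_mfcc.set_title(f"MFCC - {emotion}")
        fig.colorbar(img, ax=ax_mfcc)
        plt.tight_layout()
        plt.show()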