Temporal Convolutional Networks for Speech Emotion Recognition:
A Benchmark Study against Deep Learning Models
Nitasha Rathore, Pratibha Barua, Arhaan Sood, Bandaru Yogesh Kumar and Ashutosh Singh
Department of Computer Science and Engineering, Bharati Vidyapeeth’s College of Engineering, New Delhi, Delhi, India
Keywords: Speech Emotion Recognition (SER), Long Short‑Term Memory (LSTM), Recurrent Convolutional Neural
Networks (RCNN), Artificial Neural Network (ANN), Recurrent Neural Networks (RNN), Temporal
Convolutional Networks (TCN).
Abstract: As technology becomes more and more human-centric, the ability to recognize and interpret emotions from
speech is becoming more than just an innovation; it is a necessity. Speech Emotion Recognition (SER) is a
field that sits at the nexus of artificial intelligence and human communication, offering perspectives on both
our spoken words and our emotions. In addition to enhancing digital assistants, SER is revolutionizing mental
health monitoring and how people interact with robots. This research investigates the intricate world of
emotion-laden speech to learn how cutting-edge deep learning models, including Temporal Convolutional
Networks (TCN), Artificial Neural Networks (ANN), Recurrent Convolutional Neural Networks (RCNN),
Recurrent Neural Networks (RNN), and Long Short-Term Memory (LSTM) networks, decode the smallest
emotional cues. While some models exploit spatial patterns in speech, others are adept at recognizing
temporal links; each model has its own merits. We analyze their performance in emotionally enriched
environments and present a proof of concept of how they can impact human-computer interaction. This study
imagines a future in which SER supports empathetic AI, essential for the connections between AI and humans,
beyond the current measures of numbers and accuracy scores. So, the question is: how close are we really, as
we push those outer limits, to teaching machines the vocabulary of feelings?
1 INTRODUCTION
Imagine living in a world where machines not only
hear you but understand how you feel. The rapidly
growing discipline of Speech Emotion Recognition (SER), which nests in the fields of artificial intelligence,
psychology, and human-computer interaction, holds this promise. SER provides a new understanding of human
emotions by analyzing speech elements (pitch, tone, intensity, etc.) to make digital conversations more
intuitive, empathetic, and smarter.
Along with the rise of accessible and lighter frameworks such as TensorFlow and PyTorch, SER has made an
evolutionary leap from early, simplistic rule-based systems to complex deep learning architectures capable of
detecting emotion with greater precision than ever. Earlier methods, by contrast, relied on machine learning
techniques such as support vector machines (SVMs), decision trees, and Gaussian Mixture Models (GMMs),
together with manually engineered features. These models typically had difficulty resolving ambiguity, lacked
contextual awareness, and could miss the fine emotional changes that occur in speech.
Figure 1 shows the flow of SER training.
In fact, the true leap in deep learning architectures was the LSTM (Long Short-Term Memory) network, which
enables a more sophisticated understanding of feelings and the capacity to remember longer-term dependencies
that help interpret speech. Multimodal fusion with SER techniques expands these limits further: when speech is
combined with elements of non-verbal communication (body language, facial expressions, etc.), we get closer to
capturing the full range of human emotion.
To examine the effectiveness of several deep
learning models, we compare LSTMs, Temporal
Convolutional Networks (TCNs), and Recurrent
Convolutional Neural Networks (RCNNs) for SER.
We also examine the role of transformer models and
attention mechanisms, which could revolutionize the field by enhancing contextual awareness.
With the emergence of emotionally intelligent AI,
SER is not just a research problem but the future of
human-machine interaction. This paper examines the
developments that have influenced SER and paved
the way for a time when technology will be able to
understand our feelings in addition to our words.
Figure 1: Flow of SER Training.
2 RELATED WORKS
An early research paper used the Gaussian Mixture Model to establish a general understanding of SER, achieving
an accuracy rate of 89.12%. The method yielded positive findings; however, the researchers recognized its
dependence on predefined features, since it failed to adapt to complex voice patterns.
The authors of (Aouani & Benayed, 2020) reported that deep learning combined with an SVM using all kernels
produced 83.3% accuracy. The experimental set-up performed effectively, although its kernel-based processing
demanded substantial computational power.
A noise-cleaning method combined with a CNN+RNN architecture achieved 71.75% accuracy for emotion analysis of
audio and video content, as described in (M. Singh and Y. Fang, 2020). The method achieved successful outcomes,
although it did not generalize well across audio types and its performance dropped on noisy datasets.
A research analysis (Mustaqeem, M. Sajjad and S. Kwon, 2020) showed that clustering-based SER achieved 95%
accuracy by incorporating learned features with deep bidirectional LSTMs (BiLSTMs). The model met its accuracy
objective but faced two significant challenges: processing extended datasets and running in real time.
An autoencoder-based system by Patel, Patel and Mankad obtained 90% accuracy by combining SVM and Decision Tree
classifiers with a CNN that reduced data dimensions. The limited dataset constrained its performance, and the
system did not provide sufficient functionality for real-time application.
The authors of (Lieskovská et al., 2021) showed that attention mechanisms in deep learning models achieved an
85.2% success rate in SER tasks. The evidence indicates that attention approaches improve test outcomes, but
they demand additional processing power and their performance varies with the specifics of the evaluated
datasets.
The research in (Bagus Tris Atmaja et al., 2022) used SVM together with MLP and LSTM models and handcrafted
features to obtain 78.8% accuracy for bimodal SER. The method had difficulty analyzing dynamic audio data
because its handcrafted features restricted its robustness.
The researchers in (Aggarwal et al., 2022) created a system that merged a pre-trained VGG16 architecture with
DNN models for two-way feature extraction, achieving 96.26% accuracy for SER. The pre-trained models delivered
high precision, but adapting the method required large datasets because of its complexity.
The combination of CNN+LSTM with a stochastic fractal search optimization algorithm achieved 97.38% accuracy
according to (A. A. Abdelhamid et al., 2022). However, the advanced optimization method proved too complex to
deploy easily.
The EEG-based SER evaluation in (Houssein, E.H. et al., 2022) reviewed various ML approaches for assessing brain
activity through EEG, with SVM, ANN, Random Forest, Decision Tree, KNN, RNN, and CNN reaching 87.25% accuracy.
The method is impractical for large-scale applications because it requires EEG data acquisition.
A joint approach uniting NLP and DLSTA conducted deep semantic analysis of service-excellence big data to
achieve 93.3% accuracy, as documented in (Guo, 2022). The model functioned reliably on extensive data, although
its complex preprocessing and extensive feature requirements created obstacles for real-time system deployment.
By combining CNN with a multi-head convolutional transformer, researchers achieved 82.31% accuracy on the
IEMOCAP and RAVDESS databases (Ullah et al., 2023). The transformer architecture reached adequate accuracy
benchmarks, but at the expense of processing speed and memory consumption.
The study by (Samaneh Madanian et al., 2023) performed a systematic review showing how ML techniques such as
SVM and Random Forest, combined with noise reduction applied to MFCC extraction, reached 91% precision. This
approach improved tolerance to noisy speech data, but overused data augmentation reduced response quality in
noise-free situations.
3 RESEARCH METHODOLOGY
The research employed deep learning models for
systematic speech emotion recognition (SER),
encompassing data collection, preprocessing, feature
extraction, and classification. The dataset was first
organized by labeling speech recordings, followed by
visualization using waveform and spectrogram
representations. Mel-Frequency Cepstral Coefficients
(MFCC) captured key spectral features.
Preprocessing involved dimension expansion and
one-hot encoding for compatibility with deep
learning architectures. Various models, including
TCN, RNN, ANN, RCNN, and LSTM, were tested.
The dataset was split into training, validation, and test
sets, and model performance was assessed using
confusion matrices, validation accuracy, and loss
metrics. The Toronto Emotional Speech Set (TESS),
featuring 2,800 recordings from two female actors
expressing seven emotions (anger, disgust, fear, happiness, pleasant surprise, sadness, and neutrality), was
used. TESS is widely utilized in affective computing
and machine learning to enhance emotion-aware
applications.
Figure 2 shows the research methodology.
Figure 2: Research Methodology.
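As an illustrative sketch of the data-collection step, the Python snippet below walks a local copy of TESS and derives each recording's emotion label from its filename; the directory path and the assumption that the label is the last underscore-separated token reflect the public TESS naming convention (e.g. OAF_back_angry.wav), not code from this study.

import os

DATA_DIR = "TESS"  # assumed local path to the Toronto Emotional Speech Set

paths, labels = [], []
for root, _, files in os.walk(DATA_DIR):
    for name in files:
        if name.endswith(".wav"):
            paths.append(os.path.join(root, name))
            # TESS files are named like "OAF_back_angry.wav"; the emotion label
            # is assumed to be the final underscore-separated token.
            labels.append(name.split("_")[-1].replace(".wav", "").lower())

print(f"{len(paths)} recordings, emotions: {sorted(set(labels))}")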
3.1 Data Preprocessing
The voice dataset was methodically prepared for model training and analysis. For precise emotion identification,
audio files were loaded and sorted by filename. Librosa was utilized to clean signals, extract features, and
minimize noise, while ensuring a consistent duration and offset across files. Waveform and spectrogram
representations were used to examine changes in pitch, tone, and frequency, while MFCC captured important
spectral features for Speech Emotion Recognition (SER) and stored them in numerical form.
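A minimal sketch of this feature-extraction step, assuming librosa is available and reusing the paths list from the previous sketch; the 3-second duration, 0.5-second offset, and 40 MFCC coefficients are illustrative values rather than settings reported in this paper.

import librosa
import numpy as np

def extract_mfcc(path, duration=3.0, offset=0.5, n_mfcc=40):
    # Load a fixed window of each recording so all files share duration and offset.
    y, sr = librosa.load(path, duration=duration, offset=offset)
    # Compute MFCCs and average over time to get one feature vector per file.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.mean(mfcc.T, axis=0)

features = np.array([extract_mfcc(p) for p in paths])  # shape: (n_files, 40)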
To ensure alignment for training, processed
features were prepared for model input and
categorical labels were one-hot encoded to fit deep
learning models. This pipeline improved model
performance in emotion classification, decreased
variability, and optimized the dataset.
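The dimension expansion and one-hot encoding described above could look like the sketch below, assuming Keras and scikit-learn utilities and reusing features and labels from the earlier sketches; the 80/10/10 split ratio is an illustrative choice.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

# Add a trailing axis so the (n_files, 40) matrix fits sequence-oriented models.
X = np.expand_dims(features, -1)                          # shape: (n_files, 40, 1)
y = to_categorical(LabelEncoder().fit_transform(labels))  # one-hot emotion labels

# Illustrative 80/10/10 train/validation/test split.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.2, stratify=labels, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42)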
3.2 Exploratory Data Analysis
EDA (exploratory data analysis) helps to explore and understand the distribution of emotions in the dataset,
which strongly influences this type of analysis. The speech samples and their emotional
properties were analyzed using a variety of statistical
and graphical techniques. Wave plots displaying
speech signals in the time domain allowed us to study
amplitude and intensity variability between moods.
Such variations provided insights into the way
emotional expressions influence dynamics of speech.
Moreover, MFCC visualizations were also used in
this analysis to capture the speech spectrum,
considering how frequency components are affected
in different emotional states. Figure 3 shows the bar plot of audio files per emotion in the dataset. Through
tone, pitch, and speech modulation analytics, distinguishing emotional patterns were identified and each
emotion was tracked separately.
Figure 4 shows
Mel-Spectrograms for each emotion.
EDA played a significant role in optimizing feature selection and in ensuring that the deep learning models
could accurately learn from the data and classify emotions based on speech. It also revealed the underlying
patterns and trends that distinguish the emotions. Figure 5 shows the wave plots obtained for each emotion.
Figure 3: Bar Plot for Audio Files in Dataset.
Figure 4: Mel-Spectrograms for Each Emotion.
Figure 5: Wave Plots Are Obtained for Each Emotion.
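The wave plots and Mel-spectrograms above can be reproduced with librosa and matplotlib along the lines of the following sketch; the function names follow the current librosa API (waveshow, specshow) and are an assumption about how the figures were generated, not the authors' exact plotting code.

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def plot_emotion(path, emotion):
    y, sr = librosa.load(path)
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6))
    # Time-domain wave plot: amplitude variation reflects the intensity of the emotion.
    librosa.display.waveshow(y, sr=sr, ax=ax1)
    ax1.set_title(f"Waveform ({emotion})")
    # Mel-spectrogram in dB: shows how frequency content shifts with emotional state.
    S = librosa.feature.melspectrogram(y=y, sr=sr)
    img = librosa.display.specshow(librosa.power_to_db(S, ref=np.max),
                                   sr=sr, x_axis="time", y_axis="mel", ax=ax2)
    ax2.set_title(f"Mel-spectrogram ({emotion})")
    fig.colorbar(img, ax=ax2, format="%+2.0f dB")
    plt.tight_layout()
    plt.show()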
3.3 Model Selection and Implementation
Deep learning models were selected to process the temporal and sequential data for speech emotion recognition
(SER). Each model was chosen for its ability to extract and classify speech-signal features. Model performance
on MFCC-extracted features was evaluated using confusion matrices, validation accuracy, and validation loss
metrics to ensure robust emotion classification.
Long Short-Term Memory (LSTM): Long-
term dependencies in sequential data allow LSTM
networks to perform exceptionally well in speech-
based emotion identification. Their gated architecture
improves speech context retention by resolving the
vanishing gradient issue that conventional RNNs
face. To greatly increase recognition accuracy,
MFCC characteristics were processed via several
LSTM layers before going through a dense output
layer for emotion categorization.
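A hedged Keras sketch of the stacked-LSTM classifier described above; the layer widths, dropout rate, and optimizer are illustrative assumptions, with the (40, 1) input shape matching the MFCC vectors prepared in Section 3.1.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

def build_lstm(input_shape=(40, 1), n_classes=7):
    model = Sequential([
        # Stacked LSTM layers retain long-term context from the MFCC sequence.
        LSTM(128, return_sequences=True, input_shape=input_shape),
        LSTM(64),
        Dropout(0.3),
        Dense(n_classes, activation="softmax"),  # one probability per emotion
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model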
Recurrent Neural Networks (RNN): Since RNNs can analyse sequential input, they were used
as a baseline model for time-sensitive speech
analysis. Recurrent and fully connected layers were
used to classify the MFCC features. However, they
were less effective than LSTMs due to vanishing
gradient restrictions, which hindered their capacity to
learn long-term dependencies.
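For comparison, a simple-RNN baseline under the same illustrative assumptions might look like this; the single SimpleRNN layer is what exposes the vanishing-gradient limitation mentioned above.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense

def build_rnn(input_shape=(40, 1), n_classes=7):
    model = Sequential([
        # Plain recurrence without gating, hence prone to vanishing gradients.
        SimpleRNN(128, input_shape=input_shape),
        Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model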
Artificial Neural Network (ANN): Using a fully connected feedforward architecture, ANNs served as a
benchmark model to evaluate the efficacy of MFCC-
based feature extraction. Compared to sequential
models such as LSTMs and RNNs, they fared quite
well, but their accuracy was restricted by their
incapacity to grasp temporal correlations in speech.
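The feedforward benchmark can be sketched as a small multilayer perceptron over the flattened MFCC vector; the hidden-layer sizes are again illustrative assumptions.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten

def build_ann(input_shape=(40, 1), n_classes=7):
    model = Sequential([
        Flatten(input_shape=input_shape),   # discards temporal ordering entirely
        Dense(256, activation="relu"),
        Dropout(0.3),
        Dense(128, activation="relu"),
        Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model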
Recurrent Convolutional Neural Network
(RCNN):
Recurrent and convolutional architectures
were integrated in RCNNs to enhance feature
learning. Recurrent layers found temporal patterns in
speech, while convolutional layers collected spatial
characteristics from MFCC representations. By
utilizing both sequential and spatial learning
capabilities, this hybrid model improved the accuracy
of emotion classification.
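One plausible reading of the hybrid RCNN is sketched below: 1-D convolutions extract local spectral patterns from the MFCC sequence before an LSTM layer models their temporal order. This particular Conv1D+LSTM pairing is an assumption, not a specification taken from the paper.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, LSTM, Dense, Dropout

def build_rcnn(input_shape=(40, 1), n_classes=7):
    model = Sequential([
        # Convolution captures local spectral patterns from the MFCC representation.
        Conv1D(64, kernel_size=3, padding="same", activation="relu",
               input_shape=input_shape),
        MaxPooling1D(2),
        # Recurrence then models the temporal ordering of those patterns.
        LSTM(64),
        Dropout(0.3),
        Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model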
Temporal Convolutional Network (TCN):
TCNs
successfully modelled long-range dependencies
without the processing burden of LSTMs by using
dilated causal convolutions rather than recurrence. By
modelling the MFCC information with convolutional
layers, TCNs managed to learn temporal linkages
efficiently, whilst achieving comparable accuracy.
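A minimal TCN-style network built from dilated causal 1-D convolutions in plain Keras (the residual connections of a full TCN are omitted for brevity); the number of blocks and dilation rates are illustrative.

from tensorflow.keras import layers, models

def build_tcn(input_shape=(40, 1), n_classes=7, filters=64):
    inputs = layers.Input(shape=input_shape)
    x = inputs
    # Each block doubles the dilation rate, exponentially enlarging the receptive
    # field without recurrence, which is how TCNs capture long-range dependencies.
    for dilation in (1, 2, 4, 8):
        x = layers.Conv1D(filters, kernel_size=3, padding="causal",
                          dilation_rate=dilation, activation="relu")(x)
    x = layers.GlobalAveragePooling1D()(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model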
Each model contributed to the comparative study of deep learning methods for SER. Hybrids such as RCNNs
achieved strong performance and efficient architectures like TCNs proved the most promising, while LSTMs
remained competitive thanks to their ability to capture long-term dependencies. The findings provide critical
insights toward the selection of suitable deep learning architectures for speech emotion recognition use cases.
4 RESULTS
Evaluation of the deep learning models for Speech Emotion Recognition (SER) yielded varying degrees of
accuracy. LSTM performed remarkably well, reaching 96.42% accuracy thanks to its ability to capture long-term
dependencies. Notably, TCN and RCNN were the best among all tested methods, with accuracies of 98.92% and
98.65% respectively, underlining the strength of temporal convolutional and hybrid convolutional approaches
(the hybrid performs strongly, but at the cost of time and resources). With a 98.57% success rate, the ANN also
demonstrated efficiency when trained on carefully extracted MFCC features. RNN performed the worst, at 87.14%,
presumably due to the vanishing gradient problem when learning over longer input sequences. Table 1 illustrates
the results of the deep learning architectures.
Overall, the top-performing models for SER were TCN, RCNN, and ANN, which yielded higher accuracy and offered
effective alternatives to traditional recurrent architectures.
Figure 6 shows Confusion Matrices
obtained for each model.
Table 1: Results of Deep Learning Architectures.

Deep Learning Model                                Accuracy Rate (%)
LSTM (Long Short-Term Memory)                      96.42
RNN (Recurrent Neural Networks)                    87.14
RCNN (Recurrent Convolutional Neural Networks)     98.65
ANN (Artificial Neural Networks)                   98.57
TCN (Temporal Convolutional Networks)              98.92
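The accuracy figures and confusion matrices reported here could be reproduced with an evaluation step along the following lines, reusing the illustrative split and model builders sketched in Section 3; the epoch and batch-size values are assumptions.

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

model = build_tcn()  # any of the sketched architectures could be substituted here
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=50, batch_size=32, verbose=0)

y_pred = np.argmax(model.predict(X_test), axis=1)
y_true = np.argmax(y_test, axis=1)

print("Test accuracy:", accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))  # rows: true emotions, columns: predictions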
Figure 6: Confusion Matrices Obtained for Each Model.
Figure 7: Training/Validation Loss vs. Epochs Plots for Each Model.
Figure 8: Testing/Validation Accuracy vs. Epochs Plots for Each Model.
Figure 9: TCN Lime Chart.
Figure 10: RCNN Lime Chart.
Figure 11: ANN Lime Chart.
Figure 12: RNN Lime Chart.
Figure 13: LSTM Lime Chart.
REFERENCES
“Speech emotion recognition.” International Journal of Soft
Computing and Engineering (IJSCE), vol. 2, no. 1, Mar.
2012, pp. 235–36.
A. A. Abdelhamid et al., "Robust Speech Emotion Recognition Using CNN+LSTM Based on Stochastic Fractal Search
Optimization Algorithm," in IEEE Access, vol. 10, pp. 49265-49284, 2022, doi: 10.1109/ACCESS.2022.3172954.
Aggarwal, A., Srivastava, A., Agarwal, A., Chahal, N.,
Singh, D., Alnuaim, A. A., Alhadlaq, A., & Lee, H.-N.
(2022). Two-Way Feature Extraction for Speech
Emotion Recognition Using Deep Learning. Sensors,
22(6), 2378.
Aouani, Hadhami & Benayed, Yassine. (2020). Speech Emotion Recognition with deep learning. Procedia Computer
Science, 176, 251-260. doi: 10.1016/j.procs.2020.08.027.
Bagus Tris Atmaja, Akira Sasou, Masato Akagi, Survey on bimodal speech emotion recognition from acoustic and
linguistic information fusion, Speech Communication, Volume 140, 2022, Pages 11-28, ISSN 0167-6393.
Guo, Jia. "Deep learning approach to text analysis for
human emotion detection from big data" Journal of
Intelligent Systems, vol. 31, no. 1, 2022, pp. 113-126.
Houssein, E.H., Hammad, A. & Ali, A.A. Human emotion
recognition from EEG-based brain–computer interface
using machine learning: a comprehensive review.
Neural Comput & Applic 34, 12527–12557 (2022).
Kogila, R., Sadanandam, M. & Bhukya, H. Deep Learning
Algorithms for Speech Emotion Recognition with
Hybrid Spectral Features. SN COMPUT. SCI. 5, 17
(2024).
Lieskovská, E., Jakubec, M., Jarina, R., & Chmulík, M.
(2021). A Review on Speech Emotion Recognition
Using Deep Learning and Attention Mechanism.
Electronics, 10(10), 1163.
M. Singh and Y. Fang, "Emotion Recognition in Audio and
Video Using Deep Neural Networks," arXiv preprint
arXiv:2006.08129, June 2020.
Mustaqeem, M. Sajjad and S. Kwon, "Clustering-Based Speech Emotion Recognition by Incorporating Learned
Features and Deep BiLSTM," in IEEE Access, vol. 8, pp. 79861-79875, 2020, doi: 10.1109/ACCESS.2020.2990405.
Patel, N., Patel, S. & Mankad, S.H. Impact of autoencoder based compact representation on emotion detection
from audio. J Ambient Intell Human Comput 13, 867-885 (2022).
Samaneh Madanian, Talen Chen, Olayinka Adeleye, John Michael Templeton, Christian Poellabauer, Dave Parry,
Sandra L. Schneider, Speech emotion recognition using machine learning: A systematic review, Intelligent
Systems with Applications, Volume 20, 2023, 200266, ISSN 2667-3053.
Ullah, R.; Asif, M.; Shah, W.A.; Anjam, F.; Ullah, I.;
Khurshaid, T.; Wuttisittikulkij, L.; Shah, S.; Ali, S.M.;
Alibakhshikenari, M. Speech Emotion Recognition
Using Convolution Neural Networks and Multi-Head
Convolutional Transformer. Sensors 2023, 23, 6212.