Speech Emotion Recognition Technology in Human-Computer
Interaction
Jingming Wang
Stony Brook Institute at Anhui University, Anhui University, Hefei, China
Keywords: Speech Emotion Recognition, Development History, Speech Feature Extraction.
Abstract: Speech Emotion Recognition (SER), as an important research direction in the field of human-computer
interaction, enables computers to perceive and understand the user's emotional state, thereby improving the
naturalness and intelligence of the interaction. This paper systematically reviews the development context and
key technologies of speech emotion recognition. First, the development history of the field is reviewed and
the main stages of its algorithm evolution are outlined. Then, following the overall speech emotion
recognition workflow, the paper focuses on feature extraction, the core stage, examines its decisive role
in recognition performance, and systematically compares traditional methods with machine learning methods.
In addition, the core challenges facing current research are analyzed from the perspectives of features and
models. Through a comprehensive review of existing research
results, this paper aims to provide theoretical references and technical support for building a more efficient
and robust speech emotion recognition system.
1 INTRODUCTION
With the rapid development of artificial intelligence
technologies, human-computer interaction has
gradually permeated various aspects of daily life.
Voice interaction, in particular, has been widely
applied in products such as Siri, Xiao Ai, and smart home
systems. By introducing speech emotion recognition
technology into human-computer interaction
systems, the focus of these systems can extend
beyond merely understanding semantic information
to also analyzing voiceprint signals and perceiving
users' emotions. This makes interactions more
humanized while improving both system intelligence
and user experience.
Speech emotion recognition technology has a
wide range of applications. In the field of intelligent
customer service, it can replace manual quality
inspection methods, providing a more efficient and
cost-effective way to detect customer service staff’s
emotions and reduce conflicts with users (Zhang,
2023). In the power grid industry, SER can
effectively monitor the emotional states of
dispatchers, thereby significantly reducing human
errors and preventing safety incidents (Luo, 2023).
However, in real life, speech emotions are
characterized by diversity, hybridity, and uncertainty
(Luo, Ran, Yang, & Dou, 2022), making emotion
recognition quite challenging. Fortunately, with the
continuous development of machine learning, its
powerful data processing and feature learning
capabilities have brought new opportunities for
advancing speech emotion recognition (Lieskovská,
Jakubec, Jarina, & Chmulík, 2021).
This paper first reviews and summarizes the
development history of SER, outlining the main
stages of algorithm evolution. It then explains the
workflow of emotion recognition, evaluates
traditional methods and machine learning approaches
from a feature perspective, and identifies the core
challenges of each. Finally, feasible directions for
future research are proposed.
2 THE OVERALL DEVELOPMENT OF SPEECH EMOTION RECOGNITION
Speech emotion recognition technology has
developed rapidly over the past 40 years. Figure 1
illustrates the research progress in this field. In 1996,
Dellaert et al. conducted pioneering research in SER
(Schuller, 2018). Early studies mainly relied on
handcrafted features and traditional classification
models. It was not until 2000, when Nicholson
applied neural networks to this field, that machine
learning models began to enter speech emotion recognition
research (Milton & Tamil Selvi, 2014). In 2005,
Grimm et al. introduced a three-dimensional emotion
description model to spontaneous SER (Elbarougy &
Akagi, 2012). In 2006, Neiberg et al. applied
Gaussian Mixture Models (GMM) to spontaneous
SER (Neiberg, Elenius, & Laskowski, 2006).
In 2010, Eyben et al. developed openSMILE, a
toolkit for extracting speech emotion features (Eyben,
Wöllmer, & Schuller, 2010). By 2014, Mao et al.
introduced Convolutional Neural Networks (CNNs)
to learn emotionally salient features for SER, marking
the adoption of deep learning models in this field
(Mao, Dong, Huang, & Zhan, 2014). Subsequently,
deep learning models have continued to evolve. In
2016, Trigeorgis et al. proposed an end-to-end
approach that combined CNNs with Long Short-
Term Memory (LSTM) networks (Trigeorgis et al.,
2016). In 2018, Schuller summarized the
development, challenges, and future trends of CNNs
and LSTMs in SER (Schuller, 2018).
Since 2021, an increasing number of studies have
focused on incorporating Transformer models (Chen,
Xing, Xu, Pang, & Du, 2023). In recent years, SER
has been transitioning from traditional handcrafted
features and machine learning classifiers toward deep
learning and multimodal fusion approaches (Zhu,
Sun, Wei, & Zhao, 2023). This includes integrating the speech signal with text, facial expressions,
body movements, and other modalities, thereby
making emotion recognition more accurate and
efficient.
Figure 1: Schematic diagram of speech emotion recognition research development (Picture credit: Original).
3 EMOTION RECOGNITION METHODS
3.1 Overall Workflow
Speech Emotion Recognition (SER) technology
involves using computers to analyze the emotional information contained in preprocessed speech
signals, extracting features that describe emotions,
associating these features with specific emotional
categories, and ultimately classifying the emotional
states (Luo, Ran, Yang, & Dou, 2022). In the
preprocessing stage, incomplete segments and unwanted noise are removed. Subsequently, feature
extraction is performed, typically extracting
traditional features such as prosodic features, wavelet
features, spectral features, and cepstral features at the frame, segment, and utterance levels of the
speech signal, alongside features automatically extracted by deep
learning models (Lieskovská, Jakubec, Jarina, &
Chmulík, 2021; Ramyasree & Kumar, 2023). Several
classic tools such as PRAAT, APARAT,
OpenSMILE, and OpenEAR can be used for mining
features from speech signals (Shukla & Jain, 2022).
Finally, a classifier is employed to categorize the
emotional features and build the SER model. Among
these stages, feature extraction is a critical process
that has a decisive impact on recognition
performance.
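To make this workflow concrete, the following Python sketch strings the stages together: silence trimming as a simple preprocessing step, utterance-level MFCC statistics as features, and a support vector machine as the classifier. The sketch relies on the librosa and scikit-learn libraries; the file names, emotion labels, and classifier choice are illustrative assumptions rather than settings drawn from the studies reviewed here.

```python
# A minimal, illustrative SER pipeline: preprocessing -> feature extraction -> classification.
# File names, labels, and the SVM classifier are assumptions for demonstration only.
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def extract_features(path, sr=16000, n_mfcc=13):
    """Load an utterance, trim silence, and return an utterance-level feature vector."""
    y, _ = librosa.load(path, sr=sr)           # resample to a common rate
    y, _ = librosa.effects.trim(y, top_db=30)  # preprocessing: remove leading/trailing silence
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # cepstral features, shape (n_mfcc, frames)
    # Pool frame-level coefficients into one fixed-length vector (mean and std per coefficient).
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical labelled training utterances.
train_files = ["angry_01.wav", "happy_01.wav", "sad_01.wav"]
train_labels = ["angry", "happy", "sad"]

X = np.stack([extract_features(f) for f in train_files])
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, train_labels)

print(clf.predict([extract_features("test_utterance.wav")]))
```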
3.2 Feature Extraction
3.2.1 Traditional Feature Extraction
Traditional handcrafted features include prosodic
features, wavelet features, spectral features, and
cepstral features. Prosodic features mainly cover
aspects such as stress, pauses, and intonation,
reflecting the rhythm and pitch variations in speech.
Wavelet features are derived using wavelet transform
techniques, which can effectively capture the local time-frequency characteristics of speech signals and
offer certain advantages in analyzing non-stationary
signals. Spectral features are obtained by
transforming time-domain signals into the frequency
domain through methods such as Fourier Transform,
providing insights into the energy distribution of
speech signals across different frequency bands.
Cepstral features are further processed parameters
based on spectral features, with the most
representative being Mel-Frequency Cepstral
Coefficients (MFCCs) and Perceptual Linear
Prediction (PLP) coefficients. These features can
reflect subtle adjustments in the spectrum during
emotional changes and are, to a degree, aligned with human auditory perception.
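As a concrete illustration of such handcrafted descriptors, the following sketch extracts a few prosodic and cepstral features (pitch statistics, short-time energy, and MFCCs) from a single utterance with librosa; the chosen feature subset and the file name are simplifying assumptions, not the full feature inventories used in the literature.

```python
# Illustrative extraction of a few prosodic and cepstral descriptors from one utterance.
import numpy as np
import librosa

def handcrafted_features(path, sr=16000):
    y, _ = librosa.load(path, sr=sr)

    # Prosodic side: fundamental frequency (pitch) contour and short-time energy.
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)
    f0 = f0[~np.isnan(f0)]                 # keep voiced frames only
    rms = librosa.feature.rms(y=y)[0]      # frame-wise energy

    # Cepstral side: Mel-Frequency Cepstral Coefficients.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

    return {
        "f0_mean": float(f0.mean()) if f0.size else 0.0,              # average pitch
        "f0_range": float(f0.max() - f0.min()) if f0.size else 0.0,   # intonation span
        "energy_mean": float(rms.mean()),
        "mfcc_mean": mfcc.mean(axis=1),    # 13-dim utterance-level summary
    }

print(handcrafted_features("sample.wav"))
```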
3.2.2 Machine Learning-Based Feature Extraction
The automatic learning capability of machine
learning enables it to autonomously extract emotional
features from speech signals, with different deep
learning models capturing distinct types of features.
Features extracted using Convolutional Neural
Networks (CNNs) primarily include local spectral
features and hierarchical features. Compared with
traditional spectral features, the spectral features
extracted by CNNs are higher in dimensionality, more
abstract, and harder to interpret. However, this
extraction approach avoids the subjectivity of manual
feature engineering, saves time, and offers strong
adaptability to different types of speech data.
Furthermore, it can comprehensively and
meticulously describe variations in the spectrum. The
hierarchical features extracted by CNNs are
progressively abstracted and summarized as the depth
of the network layers increases. Each subsequent
layer further processes the features from the previous
layer, resulting in a more comprehensive and accurate
representation of the intrinsic structure of speech
information, thereby improving the recognition
accuracy of the model.
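A minimal PyTorch sketch of this idea is shown below: a small CNN turns a log-mel spectrogram into progressively more abstract hierarchical features and finally into emotion scores. The layer sizes, input dimensions, and four-class output are illustrative assumptions and do not correspond to any specific architecture in the surveyed work.

```python
# Toy CNN that maps a log-mel spectrogram to emotion logits; all sizes are illustrative.
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    def __init__(self, n_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # lower layer: local time-frequency patterns
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # deeper layer: more abstract features
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                      # pool to one vector per utterance
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):                  # x: (batch, 1, mel_bands, frames)
        h = self.features(x).flatten(1)
        return self.classifier(h)

# Assumed input: 8 spectrograms with 64 mel bands and 200 frames each.
logits = SpectrogramCNN()(torch.randn(8, 1, 64, 200))
print(logits.shape)  # torch.Size([8, 4])
```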
Features extracted using Recurrent Neural
Networks (RNNs) and their variants possess temporal
dynamics and contextual dependency. Temporal
dynamic features are extracted by combining the
current feature with information from previous
moments through the recurrent structure of RNNs,
capturing the rising, falling, or steady trends of these
features. This also allows the model to detect
periodicity in speech signals and, through learning
from previous cycles, extract related features more
accurately. Context-dependent features are derived
from speech information such as contextual pauses
and speech rate, rather than isolated features at a
single time point. RNNs and their variants can
leverage these features to account for the coherence
of speech, leading to a more accurate understanding
of overall emotional states.
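The following sketch illustrates the recurrent counterpart under similar assumptions: a bidirectional LSTM processes frame-level features (here taken to be 13 MFCCs per frame) so that each time step is combined with contextual information from neighbouring frames before an utterance-level prediction is made; the dimensions and the four emotion classes are again illustrative.

```python
# Toy bidirectional LSTM over frame-level features; dimensions and classes are illustrative.
import torch
import torch.nn as nn

class FrameLSTM(nn.Module):
    def __init__(self, n_features=13, hidden=64, n_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                  # x: (batch, frames, n_features), e.g., MFCC frames
        out, _ = self.lstm(x)              # each step combines the current frame with its context
        summary = out.mean(dim=1)          # average over time as a simple utterance summary
        return self.classifier(summary)

# Assumed input: 8 utterances, 200 frames each, 13 MFCCs per frame.
print(FrameLSTM()(torch.randn(8, 200, 13)).shape)  # torch.Size([8, 4])
```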
3.2.3 Advantages of Machine Learning in SER
With the advancement of deep learning, end-to-end
deep SER has gained increasing attention; such systems can take raw emotional speech signals or
handcrafted features directly as input to deep learning models (Luo, Ran, Yang, & Dou, 2022). The
integration of machine learning technology with SER
brings numerous advantages. Firstly, the powerful
self-learning ability of machine learning allows it to
automatically extract features from large amounts of
speech data, offering stronger adaptability and more
representative features compared to traditional
speech recognition methods (Lieskovská, Jakubec,
Jarina, & Chmulík, 2021). Secondly, when dealing
with imbalanced datasets, machine learning
algorithms such as convolutional recurrent neural
networks (CRNNs) with variable-length inputs and
focal loss can adjust the contribution of different
samples to the total loss, enabling the model to
perform well even on minority samples (Liang, Li, &
Song, 2020). Additionally, the incorporation of
attention mechanisms allows models like CNNs to
compute attention weights, determining the
importance of different parts of the speech, thus
making emotion recognition more accurate (Zhu,
Sun, Wei, & Zhao, 2023).
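As one concrete instance of the imbalance-handling idea mentioned above, the sketch below implements a standard multi-class focal loss in PyTorch: well-classified, typically majority-class samples are down-weighted so that minority emotions contribute more to the total loss. The gamma value and the random example data are illustrative defaults, not the settings of the cited CRNN study.

```python
# Standard multi-class focal loss; gamma and the random example data are illustrative.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Cross-entropy scaled by (1 - p_t)^gamma so easy (majority-class) samples count less.

    logits:  (batch, n_classes) raw model outputs
    targets: (batch,) integer emotion labels
    """
    ce = F.cross_entropy(logits, targets, reduction="none")
    p_t = torch.exp(-ce)                   # probability the model assigns to the true class
    return ((1.0 - p_t) ** gamma * ce).mean()

# Illustrative call with random scores for 8 samples and 4 emotion classes.
logits = torch.randn(8, 4)
targets = torch.randint(0, 4, (8,))
print(focal_loss(logits, targets))
```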
4 KEY CHALLENGES
4.1 Features
Compared with the automatically extracted features
from deep learning methods, traditional features
possess stronger interpretability and lower
computational complexity. Most traditional features
have explicit physical meanings, making them easily
recognizable within speech signals. These intuitive
features facilitate researchers' understanding and
analysis. In contrast, features extracted through deep
learning are more complex and represent deeper
emotional characteristics. These features are typically
difficult to detect using conventional approaches, but
they exhibit stronger discriminative power and
adaptability. When dealing with more complex
speech information, deep learning-based features
demonstrate superior expressive capabilities and are
better suited for accommodating emotional
expression differences across various regions and
cultures.
Traditional feature extraction algorithms are
relatively simple and computationally efficient.
Under limited hardware resources, these algorithms
can quickly extract features, making them suitable for
scenarios where high precision in emotion
recognition is not required. Conversely, features
extracted by deep learning models require no manual
design, and offer greater discriminative power and
adaptability. Through automatic learning from large
amounts of speech data, deep learning models can
autonomously extract complex and deep emotional
features, significantly enhancing recognition
performance in challenging conditions.
4.2 Models
There are notable differences between traditional
models and deep learning models in the field of SER.
Traditional models, such as Hidden Markov Models
(HMM) and Support Vector Machines (SVM), are
relatively simple. These approaches typically offer
stronger interpretability, demand less hardware, and
are efficient when handling small-scale data training
and recognition tasks. However, due to their simple
structure, the recognition accuracy of these models
tends to degrade when dealing with complex speech
signals.
Deep learning models effectively address this
issue. With their powerful feature learning and fitting
capabilities, machine learning and deep learning
models achieve higher accuracy and better robustness
in SER tasks. Nevertheless, deep learning models also
have certain drawbacks, such as higher hardware
requirements and a greater dependence on large-scale
training datasets.
5 CONCLUSIONS
This paper focused on SER technology, first
providing a comprehensive review of the 40-year
development history of SER. Subsequently, the
workflow of SER technology was explained,
including preprocessing, feature extraction, feature-
to-emotion mapping, and emotion classification.
From a feature extraction perspective, this study
extensively discussed the principles, advantages, and
limitations of both traditional methods and machine
learning approaches.
For traditional methods, due to their reliance on
handcrafted features, these models offer higher
interpretability and simpler structures, making them
advantageous in scenarios with limited hardware
resources and modest recognition accuracy
requirements. In contrast, machine learning methods
possess automatic feature extraction capabilities,
enabling them to mine complex and deep emotional
features from large volumes of speech data. These
features exhibit greater adaptability and
discriminative power, achieving better performance
in complex speech environments and across diverse
cultural contexts.
Future research can further explore fusion
strategies for different deep learning models,
combining the strengths of CNNs and RNNs to
construct more powerful hybrid models. Such
approaches can achieve collaborative optimization of
local feature capture and long-term dependency
processing, thereby enhancing model performance in
complex SER tasks. Additionally, integrating
multimodal information—such as text, facial
expressions, and body movements—can facilitate the
construction of multimodal fusion SER models,
enabling a more comprehensive understanding of
emotional expression and improving both recognition
accuracy and system robustness.
REFERENCES
Chen, W., Xing, X., Xu, X., Pang, J., & Du, L. (2023).
SpeechFormer++: A hierarchical efficient framework
for paralinguistic speech processing. IEEE/ACM
Transactions on Audio, Speech, and Language
Processing, 31, 775–788.
Elbarougy, R., & Akagi, M. (2012). Speech emotion
recognition system based on a dimensional approach
using a three-layered model. In Proceedings of the 2012
Asia Pacific Signal and Information Processing
Association Annual Summit and Conference (pp. 1–9).
IEEE.
Eyben, F., Wöllmer, M., & Schuller, B. (2010).
openSMILE: The Munich versatile and fast open-source audio feature extractor. In Proceedings of the
18th ACM International Conference on Multimedia (pp. 1459–1462). ACM.
Liang, Z., Li, X., & Song, W. (2020). Research on speech
emotion recognition algorithm for unbalanced data set.
Journal of Intelligent & Fuzzy Systems, 39(3), 2791–
2796.
Lieskovská, E., Jakubec, M., Jarina, R., & Chmulík, M.
(2021). A review on speech emotion recognition using
deep learning and attention mechanism. Electronics,
10(10), 1163.
Luo, D. (2023). Research on power grid dispatching
operation safety early warning model based on speech
emotion recognition (Master’s thesis, Shaanxi
University of Technology). CNKI.
Luo, D., Ran, Q., Yang, C., & Dou, W. (2022). A review of
speech emotion recognition. Computer Engineering
and Applications, 58(21), 40–52.
Mao, Q., Dong, M., Huang, Z., & Zhan, Y. (2014).
Learning salient features for speech emotion
recognition using convolutional neural networks. IEEE
Transactions on Multimedia, 16(8), 2203–2213.
Milton, A., & Tamil Selvi, S. (2014). Class-specific
multiple classifiers scheme to recognize emotions from
speech signals. Computer Speech & Language, 28(3),
727–742.
Neiberg, D., Elenius, K., & Laskowski, K. (2006). Emotion
recognition in spontaneous speech using GMMs. In
Proceedings of Interspeech 2006.
Ramyasree, K., & Kumar, C. S. (2023). Multi-attribute
feature extraction and selection for emotion recognition
from speech through machine learning. Traitement du
Signal, 40(1), 265–275.
Schuller, B. (2018). Speech emotion recognition: Two
decades in a nutshell, benchmarks, and ongoing trends.
Communications of the ACM, 61(5), 90–99.
Shukla, S., & Jain, M. (2022). Deep GANITRUS algorithm
for speech emotion recognition. Journal of Intelligent &
Fuzzy Systems, 43(5), 5353–5368.
Trigeorgis, G., Ringeval, F., Brueckner, R., Marchi, E.,
Nicolaou, M. A., Schuller, B., & Zafeiriou, S. (2016).
Adieu features? End-to-end speech emotion recognition
using a deep convolutional recurrent network. In
Proceedings of IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP).
Zhang, M. (2023). Design and implementation of a
customer service emotion monitoring system based on
speech emotion recognition (Master’s thesis, Southeast
University). CNKI.
Zhu, R., Sun, C., Wei, X., & Zhao, L. (2023). Speech
emotion recognition using channel attention
mechanism. In 2023 4th International Conference on
Computer Engineering and Application (ICCEA) (pp.
680–684). IEEE.