Authors:
Atitaya Yakaew¹, Matthew N. Dailey¹ and Teeradaj Racharak²
Affiliations:
¹Department of Information and Communication Technologies, Asian Institute of Technology, Klong Luang, Pathumthani, Thailand; ²School of Information Science, Japan Advanced Institute of Science and Technology, Ishikawa, Japan
Keyword(s):
Deep Learning for Multimodal Real-Time Analysis, Emotion Recognition, Video Processing and Analysis, Lightweight Deep Convolutional Neural Networks, Sentiment Classification.
Abstract:
Real-time sentiment analysis on video streams involves classifying a subject's emotional expressions over time based on visual and/or audio information in the data stream. Sentiment can be analyzed using various modalities such as speech, mouth motion, and facial expression. This paper proposes a deep learning approach based on multiple modalities in which extracted features of an audiovisual data stream are fused in real time for sentiment classification. The proposed system comprises four small deep neural network models that analyze visual and audio features concurrently. We fuse the visual and audio sentiment features into a single stream and accumulate evidence over time using an exponentially weighted moving average to make a final prediction. Our work provides a promising solution for building real-time sentiment analysis systems under constrained software or hardware capabilities. Experiments on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) show that deep audiovisual feature fusion yields substantial improvements over analysis of either modality alone. We obtain an accuracy of 90.74% on a challenging test dataset, substantially better than baseline accuracies of 11.11% to 31.48%.
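The abstract's evidence-accumulation step is an exponentially weighted moving average over the fused audiovisual stream. A minimal sketch of that idea, assuming per-frame class-probability vectors from the fused modalities and a hypothetical smoothing factor `alpha` (neither the vector format nor the factor value is specified in the abstract):

```python
import numpy as np

def ewma_fuse(prob_stream, alpha=0.3):
    """Accumulate per-frame sentiment probabilities with an
    exponentially weighted moving average and return the index
    of the winning class. `alpha` is a hypothetical smoothing
    factor: higher values weight recent frames more heavily."""
    acc = None
    for p in prob_stream:
        p = np.asarray(p, dtype=float)
        # First frame initializes the accumulator; later frames
        # blend new evidence with the decayed running average.
        acc = p if acc is None else alpha * p + (1 - alpha) * acc
    return int(np.argmax(acc))

# Example: two early frames favor class 0, but sustained later
# evidence for class 1 dominates the moving average.
frames = [[0.9, 0.1]] * 2 + [[0.1, 0.9]] * 10
print(ewma_fuse(frames, alpha=0.3))  # → 1
```

The decay means a brief misclassification in a few frames cannot flip the final prediction, which is the practical benefit of accumulating evidence over time rather than trusting any single frame.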