systems typically employ Natural Language
Processing (NLP) techniques such as sentiment
analysis or tokenization to derive emotional
information from text. Although these systems
achieve reasonable accuracy within their respective
domains, they still struggle in real-world scenarios
where emotion is conveyed through multiple
modalities simultaneously.
However, these modalities are rarely integrated:
existing systems tend to evaluate each modality
independently, ignoring the complementary
information hidden in multimodal data. Moreover,
existing systems often struggle with ambiguous or
noisy input, especially when the modalities are
misaligned or error-ridden. Affect-based approaches
may fail at recognition in noisy environments or at
identifying faces that are obscured or captured under
varying illumination. Deep learning has opened a new
era of emotion analysis systems, but current systems
still often lack support for transparent, real-time
operation in real-world scenarios where multiple
modalities must be processed together.
7 PROPOSED SYSTEM
The proposed system aims to overcome the
limitations of existing emotion identification systems
by adopting a multimodal approach. To build a more
dependable and precise model for emotion
recognition, the system merges three distinct types of
data: text, audio, and images. Using deep learning
architectures such as CNNs for audio classification
and DenseNet121 for image processing, the system
can analyse and integrate data from these diverse
sources. A data preprocessing pipeline cleans and
prepares the data for training through steps such as
text tokenization, audio feature extraction (MFCC),
image rescaling, and noise removal.
Because emotions are often expressed through
multiple modalities simultaneously, combining these
modalities yields a more true-to-life representation of
the complexity of human emotions. This holistic
approach gives the model a better chance of detecting
emotions accurately in adverse real-world scenarios
where one modality may be noisy or limited. In
addition, keeping prediction latency low allows the
whole system to be optimized for real-time operation,
opening the door to live interaction and other
applications that require immediate responses, such
as customer service queries or interactive AIs.
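To make this architecture concrete, the sketch below shows one way the three branches could be assembled in Keras. The input shapes, layer widths, and concatenation-based fusion are illustrative assumptions rather than the system's final configuration.

```python
# Illustrative sketch of the proposed multimodal architecture (Keras).
# Input shapes, layer widths, and concatenation fusion are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import DenseNet121

NUM_CLASSES = 7  # e.g., the seven MELD emotion labels

# Image branch: pre-trained DenseNet121 as a feature extractor.
image_in = layers.Input(shape=(224, 224, 3), name="image")
backbone = DenseNet121(include_top=False, weights="imagenet", pooling="avg")
image_feat = backbone(image_in)

# Audio branch: a small CNN over MFCC matrices (frames x coefficients).
audio_in = layers.Input(shape=(130, 40, 1), name="mfcc")
x = layers.Conv2D(32, 3, activation="relu")(audio_in)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)
audio_feat = layers.Dense(128, activation="relu")(x)

# Text branch: token IDs -> embedding -> pooled representation.
text_in = layers.Input(shape=(50,), dtype="int32", name="tokens")
t = layers.Embedding(input_dim=20000, output_dim=128)(text_in)
t = layers.GlobalAveragePooling1D()(t)
text_feat = layers.Dense(128, activation="relu")(t)

# Fusion: concatenate the three modality embeddings, then classify.
fused = layers.Concatenate()([image_feat, audio_feat, text_feat])
out = layers.Dense(NUM_CLASSES, activation="softmax")(fused)

model = Model([image_in, audio_in, text_in], out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```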
8 LITERATURE SURVEY
One surveyed study focuses on CNNs for image
analysis, Transformer models for text analysis, and
Long Short-Term Memory (LSTM) networks for
audio processing (J. Patel, S. Lee, R. Garcia). The
authors use the MELD dataset to evaluate their
multimodal emotion identification system. A pre-
trained DenseNet architecture is employed for the
image modality, an LSTM-based model processes the
extracted MFCC audio features, and the textual input
is tokenized and passed to a transformer-based
sentiment analysis model to extract contextual
meaning from the conversation.
In another study, A. Kumar, V. Sharma, and M. Tan
use CNNs for facial expression identification, CNN-
based models for audio feature extraction, and BERT
(Bidirectional Encoder Representations from
Transformers) for text sentiment analysis within a
deep learning framework. The authors propose a late
fusion approach that predicts emotional states by
fusing the outputs of all three models, using the well-
known EmotiW dataset for validation.
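To illustrate the late fusion idea surveyed above, the following minimal sketch (an illustration, not the cited authors' code) averages the per-class probability vectors produced by three independently trained unimodal models:

```python
import numpy as np

def late_fusion(p_image, p_audio, p_text, weights=(1/3, 1/3, 1/3)):
    """Weighted average of per-class probabilities from three unimodal models."""
    probs = np.stack([p_image, p_audio, p_text])        # shape: (3, num_classes)
    fused = np.average(probs, axis=0, weights=weights)  # shape: (num_classes,)
    return int(np.argmax(fused))                        # predicted emotion index
```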
9 METHODOLOGY
9.1 Data Preprocessing
Data Preprocessing Module: Preparing the MELD
dataset for analysis and model training is an essential
step. This module transforms raw input data (text,
audio, and images) into forms that the machine
learning model can leverage effectively. To improve
computational efficiency and model accuracy, image
preprocessing consists of scaling each image to a
predefined size, converting it to grayscale, and
normalizing the pixel values.
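A minimal sketch of this image preprocessing step, assuming OpenCV and a hypothetical 48x48 target size, could look as follows:

```python
# Illustrative image preprocessing: resize, grayscale, normalize.
# The 48x48 target size is an assumption, not a fixed system parameter.
import cv2
import numpy as np

def preprocess_image(path, size=(48, 48)):
    img = cv2.imread(path)                       # load BGR image as a NumPy array
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # convert to grayscale
    img = cv2.resize(img, size)                  # scale to a predefined size
    return img.astype(np.float32) / 255.0        # normalize pixel values to [0, 1]
```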
For the audio data, Mel-frequency cepstral
coefficients (MFCCs) are extracted during
preprocessing; these represent the spectral properties
of the sound and allow the model to capture the
emotive tones present in speech.
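A minimal sketch of this extraction step, assuming the librosa library and 40 coefficients (an illustrative choice), could look as follows:

```python
# Illustrative MFCC extraction with librosa; the sample rate and
# number of coefficients are assumptions.
import librosa

def extract_mfcc(path, n_mfcc=40):
    y, sr = librosa.load(path, sr=16000)                     # load and resample audio
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, frames)
    return mfcc.T                                            # frames x coefficients
```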
Noise removal techniques are also applied to ensure
that the audio input is clear and free of background
noise that could obscure emotion recognition.
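One possible way to realize this step, assuming spectral-gating noise reduction via the noisereduce package (an illustrative choice), is sketched below:

```python
# Illustrative noise removal via spectral gating; the noisereduce
# package is an assumed choice, not prescribed by the system.
import librosa
import noisereduce as nr

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical input file
clean = nr.reduce_noise(y=y, sr=sr)              # suppress stationary background noise
```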
Tokenization is the process of converting raw text
into a list of words or tokens that a natural language
processing (NLP)