Expanding the Singular Channel - MODALINK: A Generalized Automated Multimodal Dataset Generation Workflow for Emotion Recognition
Amany H. AbouEl-Naga¹, May Hussien¹, Wolfgang Minker², Mohammed A.-M. Salem³ and Nada Sharaf¹
¹Faculty of Informatics and Computer Science, German International University, New Capital, Egypt
²Institute of Communications Engineering, Ulm University, Ulm, Germany
³Faculty of Media Engineering and Technology, German University in Cairo, New Cairo, Egypt
Keywords:
Emotion Recognition, Multimodal Emotion Recognition, Dataset Generation, Artificial Intelligence, Affective
Computing, Deep Learning, Machine Learning, Multimodality.
Abstract:
Human communication relies deeply on the emotional states of the individuals involved. The process of iden-
tifying and processing emotions in the human brain is inherently multimodal. With recent advancements in
artificial intelligence and deep learning, fields like affective computing and human-computer interaction have
witnessed tremendous progress. This has shifted the focus from unimodal emotion recognition systems to mul-
timodal systems that comprehend and analyze emotions across multiple channels, such as facial expressions,
speech, text, and physiological signals, to enhance emotion classification accuracy. Despite these advance-
ments, the availability of datasets combining two or more modalities remains limited. Furthermore, very few
datasets have been introduced for the Arabic language, despite its widespread use (Safwat et al., 2023; Akila
et al., 2015). In this paper, MODALINK, an automated workflow to generate the first novel Egyptian-Arabic
dialect dataset integrating visual, audio, and text modalities is proposed. Preliminary testing phases of the pro-
posed workflow demonstrate its ability to generate synchronized modalities efficiently and in a timely manner.
1 INTRODUCTION
Emotions are a collection of mental states triggered
by different ideas and actions. People continually
express their emotions when they communicate, and
recognizing emotions is crucial for many facets of
daily living and interpersonal communication. Emo-
tion is vital in interpersonal interactions, insight, per-
ception, and other areas of life (Radoi et al., 2021).
Although emotions are expressed differently
across societies and situations, they can be seen as
a universal language that transcends linguistic and
cultural barriers. While cultural norms influence ex-
pressions of love or sorrow, fundamental similarities
highlight universal human experiences (Kalateh et al.,
2024b).
Identifying emotions is a key component of hu-
man communication, driving research into emotion
recognition to develop systems capable of recogniz-
ing and understanding human emotions through the
analysis of the modality that expresses the emotion it-
self (Ahmed et al., 2023). Emotion recognition has
broad applications, from mental health monitoring
to enhancing customer service, user experience, and
natural human-robot interactions (Islam et al., 2024;
Kalateh et al., 2024b; Ahmed et al., 2023).
For years, research has focused on unimodal emo-
tion recognition methods, such as facial expression,
speech, and text analysis. Despite achieving reason-
able results, unimodal approaches are limited due to
the inherently multifaceted nature of emotional ex-
pressions. A single modality provides restricted in-
formation, making it difficult to interpret emotions
accurately in complex social contexts. Factors such
as lighting, camera angles, stress, and physical activ-
ity can alter how emotions are perceived, while vari-
ations in speech signals due to language and cultural
differences further challenge accuracy. Additionally,
unimodal algorithms are vulnerable to input data fluc-
tuation and noise (Kalateh et al., 2024b; Ahmed et al.,
2023; Vijayaraghavan et al., 2024).
To address these shortcomings, researchers have
turned to multimodal emotion recognition (MER),
integrating modalities such as facial expressions,
speech, and text. Recent studies demonstrate that
MER outperforms unimodal approaches, as differ-
ent modalities complement each other, providing a
more comprehensive understanding of emotions (Vi-
jayaraghavan et al., 2024; Kalateh et al., 2024b).
Many datasets have been introduced, primarily
covering audio, visual, and text modalities, often in-
tegrating two or all three. These datasets facilitate the
development and evaluation of MER models by pro-
viding diverse emotional states with careful curation
and annotation (Ahmed et al., 2023; Vijayaraghavan
et al., 2024).
Despite the availability of MER datasets, Ara-
bic remains underrepresented. Most studies focus
on English, Western European, and East Asian lan-
guages, leaving a gap in Arabic MER research. With
an estimated 420 million Arabic speakers, including
310 million native speakers across eight major dialect
groups, the lack of Arabic MER datasets is a signif-
icant challenge. Arabic is often considered a low-
resource language in emotion recognition and NLP
due to the limited availability of annotated datasets.
This scarcity is particularly problematic given the
linguistic complexity and dialectal diversity of Ara-
bic, including the diglossia between Modern Standard
Arabic and regional dialects (Arabiya, 2025; Al Ro-
ken and Barlas, 2023). The Egyptian dialect has
unique phonological and lexical traits that may influ-
ence emotion expression and recognition. This work
proposes an automated process for creating a mul-
timodal dataset in Egyptian Arabic, integrating text,
visual, and audio modalities for emotion recognition
tasks. The MODALINK framework is designed to
bridge the gap in Arabic MER research, specifically
for the Egyptian dialect.
MODALINK ensures a wide range of emotions
and participant diversity by automating the process-
ing and annotation of Egyptian talk shows and series
using advanced deep learning techniques. It incorpo-
rates natural language processing, facial expression
analysis, and speech recognition to extract relevant
features from each modality. The objective is to re-
duce the time and resources needed for dataset cre-
ation while maintaining high-quality, consistent clas-
sification and synchronization across modalities.
The rest of the paper is structured as follows: Sec-
tion 2 reviews related work, Section 3 details the au-
tomated workflow and dataset structure, Section 4
presents the findings, and Section 5 concludes the
study.
2 RELATED WORK
Emotion recognition has been studied for a long time,
initially using textual data and gradually moving to
facial expressions and, ultimately, multimodal tech-
niques to boost accuracy (Kalateh et al., 2024a). This
move was prompted by advances in AI and the need
for improved emotion recognition. Researchers im-
proved applications in affective computing, human-
computer interaction, virtual assistants, mental health
monitoring, and entertainment by combining text, fa-
cial expressions, and audio.
Multimodal emotion recognition overcomes the
limitations of single-modality techniques and pro-
vides a more accurate understanding of emotions.
High-quality, large-labeled datasets are critical for
creating and testing AI emotion models (Kalateh
et al., 2024a). Diverse datasets that account for demo-
graphic, cultural, and environmental variations enable
advancement by facilitating new architectures and fu-
sion methods for complicated emotion detection.
2.1 Popular Datasets in Multimodal
Emotion Recognition
To facilitate experimental research and study, many
datasets have been developed in multimodal emotion
recognition (MER). Nevertheless, several areas still
need more work and attention, even with the large di-
versity of datasets accessible in the literature. Every
dataset comes with its own set of strengths and limita-
tions. Here, we will provide an overview of the most
famous MER datasets.
The Interactive Emotional Dyadic Motion Capture Database (IEMOCAP) (Busso et al., 2008) was developed by the USC Signal Analysis and Interpretation Lab (SAIL) and is commonly utilized in research on emotion recognition. It contains video, speech, and text data, allowing for a comprehensive examination of emotional states in interactive environments. The dataset includes approximately twelve hours of recordings from ten actors, covering both scripted and spontaneous dialogue. It comprises 4,784 improvised and 5,255 scripted conversations across nine emotions: happiness, sadness, anger, surprise, fear, disgust, frustration, excitement, and neutrality. IEMOCAP's strengths are its high quality, synchronized multimodal data, and combination of acted and spontaneous emotional reactions. Its small size, however, restricts its potential for deep learning models, whereas the imbalance in emotion classes can affect model performance. Furthermore, as an English-only dataset, it lacks cultural diversity.
The CMU-MOSEI dataset is one of the largest avail-
able for multimodal sentiment analysis and emotion
recognition (Bagher Zadeh et al., 2018). This dataset
includes more than 12 hours of annotated segments
from YouTube videos by more than 1,000 speakers
on approximately 250 different topics. The emotions
represented in the dataset are: anger, happiness, dis-
gust, sadness, fear, and surprise. The data collection
techniques employed ensure that every video involves only one speaker, establishing the monologic nature of this dataset. Although the CMU-MOSEI
dataset is well-known for its vast size and is still
widely utilized for model development and validation,
it has a few limitations. The dataset is limited to En-
glish, which may limit the ability of models trained
on it to generalize to other languages. Furthermore,
like with other emotion datasets, some emotions are
underrepresented.
The Multimodal Multi-Party Dataset for Emotion Recognition in Conversations (MELD) (Poria et al., 2018) offers a novel perspective on multimodal emotion recognition by focusing on interactions between several participants. Each utterance is classified as anger, disgust, sadness, joy, neutrality, surprise, or fear. The dataset includes 13,708 utterances and 1,433 conversations from the American television series Friends. It addresses a research gap by supporting the investigation of emotional dynamics in multi-party interactions. Its cultural specificity and realism present difficulties, though. The dialogue and emotions may be exaggerated because the series is scripted, which reduces realism. Furthermore, models trained on MELD may not work well in other cultural contexts due to its English language and American cultural standards, underscoring the necessity for additional varied datasets in emotion recognition studies.
Sentiment and Emotion, Well-being, and Affec-
tive dynamics dataset (SEWA) (Kossaifi et al.,
2019) is a valuable resource in the field. This
dataset closes a significant gap in emotion recog-
nition studies by incorporating linguistic and
cultural diversity. It contains audio-visual data in
six languages—English, German, French, Hungar-
ian, Greek, and Chinese—that capture natural emo-
tions and conversational dynamics from people aged
18 to 65. While it improves multilingual research,
the small number of participants limits generalization,
and many commonly utilized languages are underrep-
resented, emphasizing the need for additional expan-
sion in the future.
2.2 Arabic Multimodal Emotion
Recognition Dataset
Multimodal emotion recognition (MER) has received
a lot of attention, thanks to increased research on cul-
tural and language diversity. Despite the widespread use of Arabic and its numerous dialects, Arabic MER has received very little attention. The enormous variety among
Arabic dialects makes it difficult to construct a uni-
versal emotion recognition model. This section fo-
cuses on the efforts accomplished to develop Arabic-
specific MER datasets.
The first audio-visual Arabic emotional dataset, (AVANEmo), was one of the early efforts to create an Arabic MER dataset and was introduced in (Shaqra et al., 2019). The dataset contains 3,000 clips of video and audio data, each covering one of the following emotions: Happy, Sad, Angry, Surprise, Disgust, Neutral. The authors focused on videos containing personal stories or life experiences and also considered talk shows where speakers shared their perspectives on a specific issue. They ended up with 90 video sources and 82 speakers from different age groups. While this was a good start, the dataset only covers audio and visual features. It also mixes several Arabic dialects, which could be challenging during processing.
An Arabic Multi-modal emotion recognition
dataset was introduced in (Al Roken and Barlas,
2023). This dataset has only visual and audio modal-
ities. Clips were collected from more than 40 shows
available on YouTube, featuring various guests from
different backgrounds and using different Arabic di-
alects. The clips were annotated with one of the fol-
lowing emotions: Angry, Happy, Neutral, Sad, and
Surprised. While this dataset seeks to fill a gap by
providing an Arabic MER dataset, it faces several lim-
itations. The dataset is relatively small, with a limited
number of videos and speakers. Additionally, it lacks
diversity, as it only includes clips featuring adults,
and the number of male speakers exceeds that of fe-
male speakers. Finally, the model results presented in
the cited work indicate that building dialect-specific
datasets may be beneficial at this stage, as there is still
more room for improvement.
Tables 1 and 2 show that existing emotion recognition
datasets have made significant contributions to the do-
main, yet they still do not represent Arabic dialects,
Egyptian Arabic in particular, adequately. The ma-
jority of the datasets focus on Modern Standard Ara-
bic (MSA) or a combination of dialects, which do not
capture real speaking patterns and emotional expres-
sion of Egyptian Arabic. To fill this gap, we propose to create a multimodal emotion recognition dataset for Egyptian Arabic spanning audio, visual, and text modalities. This dataset aims to naturally represent Egyptian cultural and linguistic features.
Table 1: Overview of Emotion Recognition Datasets.
Dataset | Language | Emotions | Source | Modality
IEMOCAP (Busso et al., 2008) | English | Happiness, sadness, anger, surprise, fear, disgust, frustration, enthusiasm, and neutral | Recorded | Audio, Video, Text
CMU-MOSEI (Bagher Zadeh et al., 2018) | English | Anger, happiness, disgust, sadness, fear, and surprise | YouTube | Audio, Video, Text
MELD (Poria et al., 2018) | English | Joy, sadness, anger, fear, disgust, surprise, and neutral | Friends TV Series | Audio, Video, Text
SEWA (Kossaifi et al., 2019) | English, German, French, Hungarian, Greek, and Chinese | Continuous valence and arousal | Recorded | Audio, Video
ANADemo (Shaqra et al., 2019) | Arabic | Happy, sad, angry, surprise, disgust, and neutral | YouTube | Audio, Video
Arabic MER dataset (Al Roken and Barlas, 2023) | Arabic | Angry, happy, neutral, sad, and surprised | YouTube | Audio, Video
Table 2: In-depth view of Arabic Emotion Recognition Datasets.
Dataset | Dialect | Size | Number of Speakers | Availability
ANADemo (Shaqra et al., 2019) | Multi-dialect and MSA | 892 segments | 82 | Not publicly available
Arabic MER dataset (Al Roken and Barlas, 2023) | Multi-dialect and MSA | 1336 segments | Unknown | Not publicly available
3 PROPOSED METHODOLOGY
MODALINK, the proposed automated workflow,
generates a multimodal dataset from Egyptian Arabic
video content for emotion recognition. The process
includes video acquisition, audio extraction, speech
and video segmentation, face detection, and speech-
to-text transcription (Figure 1). MODALINK inte-
grates audio, visual, and textual modalities to ex-
tract emotion-rich segments with transcriptions, en-
suring high-quality data while addressing face visi-
bility, speech detection, and transcription accuracy.
3.1 Data Collection
In this step, videos chosen from publicly available Egyptian TV series, variety shows, and talk shows are downloaded from YouTube using the yt-dlp library. Each video is downloaded at its highest available quality in MP4 format, ensuring that both the audio and video channels are captured in a single file. Only videos with clear audio and visible faces are selected to maintain the quality of the dataset. Next, the video file is preprocessed to ensure a consistent format.
Figure 1: MODALINK: Proposed Automated Workflow for Egyptian-Arabic Dialect Dataset Generation.
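For illustration, the download step could be driven through the yt-dlp Python API as in the sketch below; the format selection and output template shown are assumptions rather than the exact configuration used in MODALINK.

```python
import yt_dlp

def download_video(url: str, out_dir: str = "raw_videos") -> None:
    """Download a YouTube video as a single MP4 with audio and video muxed."""
    options = {
        # Prefer the best MP4 video plus M4A audio; fall back to the best MP4.
        "format": "bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]/best",
        "merge_output_format": "mp4",
        "outtmpl": f"{out_dir}/%(id)s.%(ext)s",
    }
    with yt_dlp.YoutubeDL(options) as ydl:
        ydl.download([url])
```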
3.2 Audio Extraction
After downloading the video, the audio track is ex-
tracted from the MP4 file using the FFmpeg library
(Tomar, 2006). The audio is converted to a standard-
ized PCM format with a sampling rate of 16 kHz and a single mono channel so that speech detection performs optimally. A fallback mechanism is im-
plemented to handle any corrupted or unusable audio
files during the extraction.
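A minimal sketch of this extraction step is shown below; it mirrors the settings described above (16 kHz, mono, 16-bit PCM), while the simple boolean fallback is an illustrative stand-in for the workflow's error handling.

```python
import subprocess

def extract_audio(video_path: str, audio_path: str) -> bool:
    """Extract a 16 kHz mono PCM WAV track from an MP4 file."""
    cmd = [
        "ffmpeg", "-y",
        "-i", video_path,
        "-vn",                   # drop the video stream
        "-acodec", "pcm_s16le",  # 16-bit PCM
        "-ar", "16000",          # 16 kHz sampling rate
        "-ac", "1",              # single mono channel
        audio_path,
    ]
    try:
        subprocess.run(cmd, check=True, capture_output=True)
        return True
    except subprocess.CalledProcessError:
        # Fallback: flag the file as unusable instead of stopping the pipeline.
        return False
```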
DATA 2025 - 14th International Conference on Data Science, Technology and Applications
418
3.3 Speech Segmentation
The extracted audio is segmented using a WebRTC
Voice Activity Detection (VAD), where only the
speech regions are retained and the non-speech re-
gions are discarded. The audio signal is divided into 30 ms frames, and each frame is then checked for the presence of speech.
Segments shorter than 0.5 seconds are discarded to ensure that each segment has sufficient duration and content for emotion recognition. Segments are split when a silence period exceeds a threshold of 0.3 seconds. These parameters can be adjusted according to the needs and the nature of the required data.
Continuous speech segments are identified based
on the VAD output. MODALINK applies heuristics
to merge closely spaced speech segments and discard
very short segments, improving the overall quality of
detected speech regions.
Information about detected speech segments is stored as time intervals (start time and end time); these timestamps are vital for ensuring synchronization across all modalities. This process also reduces the time and effort required for transcription, as the transcription model only processes the detected speech segments.
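A condensed sketch of this segmentation logic using the webrtcvad package is given below; the aggressiveness level and the way segments are closed and filtered are simplified assumptions built around the 30 ms frame, 0.5 s minimum-length, and 0.3 s silence-threshold parameters described above.

```python
import wave
import webrtcvad

def detect_speech_segments(wav_path, frame_ms=30, min_len=0.5, max_silence=0.3):
    """Return (start, end) times in seconds of detected speech regions."""
    vad = webrtcvad.Vad(2)                      # aggressiveness 0-3 (assumed value)
    with wave.open(wav_path, "rb") as wf:
        sample_rate = wf.getframerate()         # expected: 16 kHz, mono, 16-bit PCM
        pcm = wf.readframes(wf.getnframes())

    frame_bytes = int(sample_rate * frame_ms / 1000) * 2   # 2 bytes per sample
    segments, start, last_speech = [], None, None
    for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        t = i / (2 * sample_rate)
        if vad.is_speech(pcm[i:i + frame_bytes], sample_rate):
            start = t if start is None else start
            last_speech = t + frame_ms / 1000
        elif start is not None and t - last_speech > max_silence:
            if last_speech - start >= min_len:              # drop very short segments
                segments.append((start, last_speech))
            start = None
    if start is not None and last_speech - start >= min_len:
        segments.append((start, last_speech))
    return segments
```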
3.3.1 Audio Feature Extraction and Speaker
Change Detection
To enhance speech segmentation and identify speaker
changes, MFCCs are extracted from overlapping win-
dows within speech segments to capture key spec-
tral characteristics used in speaker recognition. By
analyzing the cosine similarity between consecutive
MFCC features, significant changes are detected.
Segments are then split at points where the similar-
ity metric falls below a predefined threshold, indicat-
ing potential speaker changes. This threshold is ad-
justable to match the sensitivity required for the spe-
cific task.
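The idea can be sketched as follows, assuming librosa for MFCC extraction and a plain cosine similarity between consecutive windows; the window size and similarity threshold are illustrative values, not the exact parameters of MODALINK.

```python
import librosa
import numpy as np

def speaker_change_points(wav_path, win_s=1.0, threshold=0.75):
    """Return offsets (seconds) where consecutive MFCC windows diverge."""
    y, sr = librosa.load(wav_path, sr=16000)
    win = int(win_s * sr)
    changes, prev = [], None
    for i in range(0, len(y) - win + 1, win):
        mfcc = librosa.feature.mfcc(y=y[i:i + win], sr=sr, n_mfcc=13)
        vec = mfcc.mean(axis=1)                 # average MFCCs over the window
        if prev is not None:
            cos = np.dot(prev, vec) / (np.linalg.norm(prev) * np.linalg.norm(vec))
            if cos < threshold:                 # low similarity -> possible new speaker
                changes.append(i / sr)
        prev = vec
    return changes
```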
3.4 Video Segmentation
To ensure face visibility for each speech segment, video clips matching the timestamps obtained in the prior step are extracted from the original video using the FFmpeg library (Tomar, 2006).
Individual frames are then extracted from each clip at a specified frame rate, typically 1 frame per second, again using FFmpeg's efficient frame extraction capabilities.
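Both operations map to short FFmpeg invocations, sketched below with illustrative paths and naming; the exact commands used in MODALINK may differ.

```python
import subprocess

def cut_clip(video_path, start, end, clip_path):
    """Extract the video clip that matches a detected speech segment."""
    subprocess.run([
        "ffmpeg", "-y", "-ss", f"{start:.3f}", "-to", f"{end:.3f}",
        "-i", video_path, clip_path,
    ], check=True)

def extract_frames(clip_path, frame_pattern="frames/frame_%04d.jpg", fps=1):
    """Sample frames from the clip at the given rate (default 1 frame per second)."""
    subprocess.run([
        "ffmpeg", "-y", "-i", clip_path, "-vf", f"fps={fps}", frame_pattern,
    ], check=True)
```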
3.5 Face Detection and Speaker
Matching
Each extracted frame undergoes pre-processing to op-
timize face detection. MODALINK down-scales the
frame using a scale factor of 0.5 to enhance process-
ing speed without significantly compromising detec-
tion accuracy. The down-scaled frame is then con-
verted from BGR to RGB color space, as required by
the face detection model.
The Multi-task Cascaded Convolutional Networks
(MTCNN) face detection model is utilized to detect
the presence of faces within the first 10 frames of each
video segment (Zhang et al., 2016). The MTCNN al-
gorithm uses a cascading series of neural networks to
detect, align, and extract facial features from digital
images with high accuracy and speed.
Segments without visible faces in the analyzed frames are discarded to ensure that the visual data contains human expressions. The detector was set to an aggressive mode so that segments without clear human facial expressions are reliably filtered out.
A multiple face detection approach is employed, and
each detected face is represented by its bounding
box coordinates and confidence scores. A heuristic
method is applied to match detected faces to the active
speaker. The largest face in each segment is assumed
to be the active speaker.
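A simplified sketch of this stage using OpenCV and the mtcnn package is shown below; the 0.5 scale factor, the BGR-to-RGB conversion, and the largest-face heuristic follow the description above, while the function structure itself is illustrative.

```python
import cv2
from mtcnn import MTCNN

detector = MTCNN()

def active_speaker_face(frame_bgr):
    """Detect faces in a frame and return the largest one as the assumed speaker."""
    small = cv2.resize(frame_bgr, None, fx=0.5, fy=0.5)   # down-scale for speed
    rgb = cv2.cvtColor(small, cv2.COLOR_BGR2RGB)          # MTCNN expects RGB input
    faces = detector.detect_faces(rgb)                    # list of {'box', 'confidence', ...}
    if not faces:
        return None                                       # segment may be discarded
    # Heuristic: the largest bounding box is assumed to belong to the active speaker.
    return max(faces, key=lambda f: f["box"][2] * f["box"][3])
```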
3.6 Speech-to-Text Transcription
Speech segments detected previously are fed to an
automatic speech recognition model to generate text
transcriptions. The Google Speech Recognition API
is utilized as part of the MODALINK workflow with
language configuration set to Egyptian-Arabic. Seg-
ments that fail transcription due to noise or API errors
are flagged but not discarded entirely.
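Assuming the Google recognizer is called through the speech_recognition Python wrapper (the description above does not name a specific client), the transcription of one segment could look like this:

```python
import speech_recognition as sr

recognizer = sr.Recognizer()

def transcribe_segment(wav_path: str) -> str | None:
    """Transcribe one speech segment in Egyptian Arabic; return None on failure."""
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)
    try:
        return recognizer.recognize_google(audio, language="ar-EG")
    except (sr.UnknownValueError, sr.RequestError):
        # Failed segments are flagged for review rather than discarded.
        return None
```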
3.7 Metadata Generation
Each processed segment is annotated with synchro-
nized multimodal data (audio, video, text) along with
metadata.
Metadata Fields:
1. Timestamps: Start time and end time of each seg-
ment.
2. Video Clip: A short video clip corresponding to
the speech segment is extracted and saved in MP4
format.
3. Audio Clip: The isolated audio for the speech seg-
ment is saved separately in WAV format.
4. Transcription: The text transcription of the speech
segment is recorded.
5. Representative Frame: A single frame, typically
from the midpoint of the segment, is extracted to
visually represent the speaker.
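For illustration, a single segment's metadata record combining the fields above could be serialized as sketched below; the field names and paths are hypothetical, not a prescribed schema.

```python
import json

segment_record = {
    "start_time": 12.40,                      # seconds from the start of the source video
    "end_time": 15.85,
    "video_clip": "segments/seg_0001.mp4",    # synchronized video clip (MP4)
    "audio_clip": "segments/seg_0001.wav",    # isolated audio clip (WAV)
    "transcription": "(transcribed Egyptian-Arabic text)",
    "representative_frame": "frames/seg_0001_mid.jpg",
}

with open("metadata/seg_0001.json", "w", encoding="utf-8") as f:
    json.dump(segment_record, f, ensure_ascii=False, indent=2)
```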
MODALINK is implemented in Python, utilizing
open-source tools like FFmpeg, MTCNN, WebRTC
VAD, and the Google Speech Recognition API. Par-
allel processing with a thread pool executor acceler-
ates the pipeline by handling multiple segments si-
multaneously. An error logging system ensures con-
tinuity without halting the process. By leveraging
multi-core processors, MODALINK efficiently pro-
cesses large videos, enabling scalable dataset genera-
tion while minimizing manual intervention.
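A minimal sketch of this parallelization is given below; process_segment is a hypothetical stand-in for the per-segment pipeline (clip extraction, face detection, transcription, metadata writing), and the error handling is reduced to simple logging.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_all_segments(segments, process_segment, max_workers=8):
    """Run the per-segment pipeline concurrently without halting on errors."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_segment, seg): seg for seg in segments}
        for future in as_completed(futures):
            try:
                results.append(future.result())
            except Exception:
                # Log the failure and continue with the remaining segments.
                logging.exception("Segment %s failed", futures[future])
    return results
```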
4 RESULTS AND DISCUSSION
The proposed MODALINK workflow, explained in detail in the methodology section, combines several modules, including video downloading, audio extraction, speech detection, face detection, and speech-to-text transcription. MODALINK aims to automate the creation of a well-structured Egyptian Arabic multimodal emotion recognition dataset.
4.1 Preliminary Results
Testing the proposed MODALINK workflow showed its ability to successfully:
- Download video content using yt-dlp and store it in MP4 format.
- Extract audio files from video files via FFmpeg.
- Detect speech segments using WebRTC VAD.
- Identify faces within video segments using the MTCNN model.
- Transcribe audio clips using Google's Speech API customized for Egyptian Arabic.
Initial testing provided the following insights:
Downloading the videos and extracting the corre-
sponding audio streams was successful without any
complications. In the next step, MODALINK successfully identified several speech segments from the
extracted audio files. Audio segments with loud back-
ground music, noises, and no speech were discarded.
Temporal synchronization between audio and video
was maintained through rigorous timestamp manage-
ment. Next, MODALINK used MTCNN for reliable face detection in video frames. Any video segment without visible faces was filtered out. Finally, MODALINK transcribed the speech audio files and saved the transcriptions along with the timestamps at which they occurred.
Figure 2 shows the various expressions detected in
different video segments generated as output.
Figure 2: Sample of detected facial expressions in the gen-
erated video segments.
Figure 3 shows samples of the transcribed text obtained when we tested our workflow on an episode of a famous Egyptian TV series. Around 44 speech segments were extracted from the audio file of this video.
(a) Sample 1 of transcribed text obtained from testing the proposed workflow.
(b) Sample 2 of transcribed text obtained from testing the proposed workflow.
Figure 3: Examples of transcribed text from the proposed workflow.
MODALINK provides extensive output, including
segmented video clips with speech and visible faces,
audio clips for each segment, detailed transcriptions
with temporal alignment, and structured output files
that maintain all component connections.
4.2 Optimization Strategies
To ensure efficient processing of long-form videos,
the following optimizations were implemented:
1. Parallel Processing: Video and audio extraction,
face detection, and transcription are parallelized
using ThreadPoolExecutor. This approach maxi-
mizes CPU utilization and reduces overall execu-
tion time.
2. Frame Rate Reduction: Frames are extracted at
1 FPS instead of higher rates (e.g., 5 FPS). This
reduces the number of frames processed without
significantly impacting face detection accuracy.
3. Downscaling: Frames are down-scaled to 50%
of their original resolution before face detec-
tion. This improves the efficiency of the MTCNN
model while maintaining acceptable accuracy.
4. Batch Processing: Audio frames are processed
in batches for VAD, and audio segments are tran-
scribed in batches. This minimizes overhead and
improves throughput.
5. FFmpeg Optimization: FFmpeg commands use the -preset ultrafast option and -threads 4 to accelerate video and audio extraction. These settings optimize the trade-off between speed and quality, as illustrated in the sketch below.
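For illustration, a clip re-encoded with these speed-oriented flags could be produced as sketched below; the codec choices are assumptions beyond the flags named above.

```python
import subprocess

def fast_reencode_clip(src, start, end, dst):
    """Cut a clip with speed-oriented encoder settings (illustrative command)."""
    subprocess.run([
        "ffmpeg", "-y", "-ss", f"{start:.2f}", "-to", f"{end:.2f}", "-i", src,
        "-c:v", "libx264", "-preset", "ultrafast",   # fastest x264 preset
        "-threads", "4",                             # limit encoder threads per job
        "-c:a", "aac",
        dst,
    ], check=True)
```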
4.3 Generalization of MODALINK
A key aspect of the MODALINK workflow is its po-
tential for generalization across different languages
and dialects. MODALINK’s versatility is pivotal for
broadening emotion recognition capabilities to under-
represented linguistic communities. MODALINK is
designed to be modular, allowing parameters to be adjusted seamlessly for different languages or dialects. Changing the transcription model's language is the main modification required. Due to this adaptability, MODALINK can be used in a variety of linguistic contexts without requiring substantial reconfiguration. Furthermore, the transcription model
itself can be easily swapped if a more accurate model
becomes available for a specific language or dialect.
This guarantees that the process can take advantage
of new developments in transcription models as they
become available, in addition to increasing its adapt-
ability. MODALINK provides a strong pipeline for
building multimodal datasets that can facilitate emo-
tion recognition research across a variety of languages
and dialects by upholding an emphasis on modularity
and adaptability.
This generalizability of MODALINK also unlocks a deeper understanding of how people from different cultures express emotions, which is crucial for filling
the resource gap. By enabling the creation of mul-
timodal datasets in various languages, MODALINK
helps in exploring the rich tapestry of human emo-
tions across diverse linguistic and cultural landscapes.
4.4 Challenges and Observations
The segmentation is based on detected speech rather
than fixed time intervals, resulting in segments of
varying lengths. While this strategy assures that each
segment includes relevant speech, it could complicate
processing and analysis. Hence, a minimum segment length is enforced to ensure that we have meaningful text useful for emotion analysis.
Future iterations could implement more sophisticated speaker diarization techniques to handle multiple simultaneous speakers; however, these models are computationally intensive.
Also, some recognized speech chunks lack identifiable faces, leading to their elimination. This ensures that both speech and facial emotions are present in each segment, which is critical for further research.
One of the most challenging tasks was converting au-
dio to text. Several established models and APIs were
tested, but they all struggled with dialect-specific vo-
cabulary, resulting in inaccurate transcriptions. Fine-
tuned Wav2Vec models (Baevski et al., 2020) as well as Vosk models (Shmyrev and other contributors, 2020) pre-trained on Arabic were tried. However, the transcription results were not satisfactory. Google's speech recognition was an improve-
ment over other approaches examined, and it has pro-
duced encouraging results so far; nonetheless, im-
proving transcription accuracy remains an ongoing
challenge.
5 CONCLUSIONS AND FUTURE
WORK
The paper presents MODALINK to generate mul-
timodal datasets leveraging visual, audio, and text
modalities for emotion recognition. It addresses a
significant gap in multimodal emotion recognition by
tackling the scarcity of resources for low-resource
languages, particularly Arabic, while capturing lin-
guistic and cultural aspects. MODALINK utilizes
advanced tools such as FFmpeg, MTCNN, WebRTC
VAD, and Google Speech Recognition API to auto-
mate dataset generation efficiently, ensuring precise
synchronization across modalities. Preliminary tests
demonstrate its ability to process large-scale video
data, extract emotion-rich segments, and produce syn-
chronized outputs with minimal time, resources, and
human intervention. Challenges remain in improv-
ing transcription accuracy for specific dialects and ex-
panding diversity. The future goal is to develop a com-
prehensive, diverse Egyptian Arabic dataset incorpo-
rating all modalities for emotion recognition.
ACKNOWLEDGMENT
We acknowledge the use of AI tools to generate and
enhance parts of the paper. The content was revised.
REFERENCES
Ahmed, N., Aghbari, Z. A., and Girija, S. (2023). A system-
atic survey on multimodal emotion recognition using
learning algorithms. Intell. Syst. Appl., 17:200171.
Akila, G., El-Menisy, M., Khaled, O., Sharaf, N., Tarhony,
N., and Abdennadher, S. (2015). Kalema: Digitizing
Arabic content for accessibility purposes using crowd-
sourcing. In Computational Linguistics and Intelli-
gent Text Processing: 16th International Conference,
CICLing 2015, Cairo, Egypt, April 14-20, 2015, Pro-
ceedings, Part II 16, pages 655–662. Springer.
Al Roken, N. and Barlas, G. (2023). Multimodal Arabic emotion recognition using deep learning. Speech
Communication, 155:103005.
Arabiya, S. (2025). 10 things you may not know about.
Accessed: 16-Jan-2025.
Baevski, A., Zhou, Y., Mohamed, A., and Auli, M.
(2020). wav2vec 2.0: A framework for self-
supervised learning of speech representations. arXiv
preprint arXiv:2006.11477.
Bagher Zadeh, A., Liang, P. P., Poria, S., Cambria, E., and
Morency, L.-P. (2018). Multimodal language analysis
in the wild: CMU-MOSEI dataset and interpretable
dynamic fusion graph. In Gurevych, I. and Miyao,
Y., editors, Proceedings of the 56th Annual Meeting
of the Association for Computational Linguistics (Vol-
ume 1: Long Papers), pages 2236–2246, Melbourne,
Australia. Association for Computational Linguistics.
Busso, C., Bulut, M., Lee, C.-C., Kazemzadeh, A., Mower,
E., Kim, S., Chang, J. N., Lee, S., and Narayanan,
S. S. (2008). IEMOCAP: Interactive emotional dyadic
motion capture database. Language resources and
evaluation, 42:335–359.
Islam, M. M., Nooruddin, S., Karray, F., and Muhammad,
G. (2024). Enhanced multimodal emotion recognition
in healthcare analytics: A deep learning based model-
level fusion approach. Biomed. Signal Process. Con-
trol., 94:106241.
Kalateh, S., Estrada-Jimenez, L. A., Hojjati, S. N., and
Barata, J. (2024a). A systematic review on multi-
modal emotion recognition: building blocks, current
state, applications, and challenges. IEEE Access.
Kalateh, S., Estrada-Jimenez, L. A., Nikghadam-Hojjati, S.,
and Barata, J. (2024b). A systematic review on mul-
timodal emotion recognition: Building blocks, cur-
rent state, applications, and challenges. IEEE Access,
12:103976–104019.
Kossaifi, J., Walecki, R., Panagakis, Y., Shen, J., Schmitt,
M., Ringeval, F., Han, J., Pandit, V., Toisoul, A.,
Schuller, B., et al. (2019). Sewa db: A rich database
for audio-visual emotion and sentiment research in the
wild. IEEE transactions on pattern analysis and ma-
chine intelligence, 43(3):1022–1040.
Poria, S., Hazarika, D., Majumder, N., Naik, G., Cambria,
E., and Mihalcea, R. (2018). MELD: A multimodal
multi-party dataset for emotion recognition in conver-
sations. arXiv preprint arXiv:1810.02508.
Radoi, A., Birhala, A., Ristea, N., and Dutu, L. (2021).
An end-to-end emotion recognition framework based
on temporal aggregation of multimodal information.
IEEE Access, 9:135559–135570.
Safwat, S., Salem, M. A.-M., and Sharaf, N. (2023). Build-
ing an Egyptian-Arabic speech corpus for emotion
analysis using deep learning. In Pacific Rim Inter-
national Conference on Artificial Intelligence, pages
320–332. Springer.
Shaqra, F. A., Duwairi, R., and Al-Ayyoub, M. (2019).
The audio-visual Arabic dataset for natural emotions.
2019 7th International Conference on Future Internet
of Things and Cloud (FiCloud), pages 324–329.
Shmyrev, N. V. and other contributors (2020). Vosk
speech recognition toolkit: Offline speech recogni-
tion api for android, ios, raspberry pi and servers
with python, java, c#, and node. https://github.com/
alphacep/vosk-api. Accessed: 2025-01-16.
Tomar, S. (2006). Converting video formats with FFmpeg.
Linux Journal, 2006(146):10.
Vijayaraghavan, G., T., M., D., P., and E., U. (2024). Mul-
timodal emotion recognition with deep learning: Ad-
vancements, challenges, and future directions. Inf. Fu-
sion, 105:102218.
Zhang, K., Zhang, Z., Li, Z., and Qiao, Y. (2016). Joint
face detection and alignment using multitask cascaded
convolutional networks. IEEE Signal Process. Lett.,
23(10):1499–1503.