
5 CONCLUSIONS AND FUTURE WORK
This paper presented MODALINK, a pipeline for generating multimodal emotion recognition datasets that combine visual, audio, and text modalities. It addresses a significant gap in multimodal emotion recognition: the scarcity of resources for low-resource languages, particularly Arabic, while preserving linguistic and cultural nuances. MODALINK chains established tools, namely FFmpeg, MTCNN, WebRTC VAD, and the Google Speech Recognition API, to automate dataset generation and to ensure precise synchronization across modalities. Preliminary tests show that it can process large-scale video data, extract emotion-rich segments, and produce synchronized outputs with minimal time, resources, and human intervention. Challenges remain in improving transcription accuracy for specific dialects and in broadening speaker and content diversity. Future work will focus on building a comprehensive, diverse Egyptian Arabic dataset that incorporates all three modalities for emotion recognition.
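The paper does not reproduce the pipeline code; the listing below is only a minimal sketch of how the tools named above could be chained for a single video segment, assuming the common Python wrappers opencv-python, webrtcvad, mtcnn, and SpeechRecognition plus an FFmpeg command-line install. The file names, the VAD aggressiveness and speech-ratio threshold, the frame-sampling rate, and the ar-EG language code are illustrative assumptions, not the authors' implementation.

# Hedged sketch (not the authors' code): audio extraction with FFmpeg, speech
# detection with WebRTC VAD, face checking with MTCNN, and transcription with
# the Google Speech Recognition API for one hypothetical segment file.
import subprocess
import wave

import cv2                       # pip install opencv-python
import webrtcvad                 # pip install webrtcvad
import speech_recognition as sr  # pip install SpeechRecognition
from mtcnn import MTCNN          # pip install mtcnn


def extract_audio(video_path: str, wav_path: str) -> None:
    """Demux 16 kHz mono 16-bit PCM audio with FFmpeg (format required by WebRTC VAD)."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1",
         "-ar", "16000", "-acodec", "pcm_s16le", wav_path],
        check=True,
    )


def speech_ratio(wav_path: str, frame_ms: int = 30) -> float:
    """Fraction of 30 ms frames that WebRTC VAD labels as speech."""
    vad = webrtcvad.Vad(2)  # aggressiveness 0-3; 2 is an assumed middle setting
    with wave.open(wav_path, "rb") as wf:
        rate = wf.getframerate()
        pcm = wf.readframes(wf.getnframes())
    frame_bytes = int(rate * frame_ms / 1000) * 2  # 16-bit samples, 2 bytes each
    frames = [pcm[i:i + frame_bytes]
              for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes)]
    voiced = sum(vad.is_speech(f, rate) for f in frames)
    return voiced / max(len(frames), 1)


def has_face(video_path: str, sample_every: int = 30) -> bool:
    """Sample frames with OpenCV and check for at least one MTCNN face detection."""
    detector = MTCNN()
    cap = cv2.VideoCapture(video_path)
    idx, found = 0, False
    while not found:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            found = len(detector.detect_faces(rgb)) > 0
        idx += 1
    cap.release()
    return found


def transcribe(wav_path: str, language: str = "ar-EG") -> str:
    """Transcribe the audio track with the Google Speech Recognition API."""
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)
    try:
        return recognizer.recognize_google(audio, language=language)
    except (sr.UnknownValueError, sr.RequestError):
        return ""  # dialectal speech or API errors may leave a segment untranscribed


if __name__ == "__main__":
    video, audio = "segment.mp4", "segment.wav"  # hypothetical segment from an upstream step
    extract_audio(video, audio)
    if speech_ratio(audio) > 0.5 and has_face(video):
        print(transcribe(audio))

In this sketch, a segment is kept only if most of its audio frames contain speech and at least one sampled video frame contains a detected face; the 0.5 threshold is an assumed cut-off, and actual segment selection and cross-modal alignment would depend on the pipeline's own criteria.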
ACKNOWLEDGMENT
We acknowledge the use of AI tools to generate and enhance parts of this paper. All such content was subsequently revised.