Joining LDA and Word Embeddings for Covid-19 Topic Modeling on English and Arabic Data

Amina Amara, Mohamed Ali Hadj Taieb, Mohamed Ben Aouicha

2024

Abstract

The value of user-generated content on social media platforms is well established: its rich, subjective information lends itself to computational analysis. However, social data are often text-heavy and unstructured, which complicates analysis. Topic models bridge social science and unstructured social data analysis, offering new perspectives for interpreting social phenomena. Latent Dirichlet Allocation (LDA) is one of the most widely used topic modeling techniques. However, LDA-based topic models alone do not always produce satisfactory results, and they do not exploit recent advances in natural language processing, in particular word embeddings, which capture word-level semantic and syntactic regularities when learning latent topics. In this work, we extend the LDA model by combining the Skip-gram model with Dirichlet-optimized sparse topic mixtures, learning dense word embeddings jointly with Dirichlet-distributed latent document-level mixtures of topic vectors. The embeddings produced by the proposed model were evaluated experimentally on a Covid-19 multilingual dataset extracted from the Facebook social network. Experimental results show that the proposed model outperforms all compared baselines in terms of both topic quality and predictive performance.
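The abstract gives no equations, so the following is only a minimal sketch of one common way to combine Skip-gram with Dirichlet-regularized topic mixtures (an lda2vec-style formulation), not necessarily the authors' exact model. All function names, the additive context vector, the softmax parameterization of topic proportions, and the value of alpha are assumptions made for illustration.

```python
import math

def softmax(weights):
    """Turn unconstrained document weights into topic proportions."""
    m = max(weights)
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def document_vector(doc_weights, topic_vectors):
    """Document vector = proportion-weighted sum of topic vectors."""
    proportions = softmax(doc_weights)
    dim = len(topic_vectors[0])
    return [sum(p * t[d] for p, t in zip(proportions, topic_vectors))
            for d in range(dim)]

def context_vector(word_vector, doc_weights, topic_vectors):
    """Assumed combination: pivot word vector plus document vector."""
    doc_vec = document_vector(doc_weights, topic_vectors)
    return [w + d for w, d in zip(word_vector, doc_vec)]

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def skipgram_loss(context, target_vector, negative_vectors):
    """Skip-gram negative-sampling loss for one (context, target) pair."""
    loss = -log_sigmoid(dot(context, target_vector))
    for neg in negative_vectors:
        loss -= log_sigmoid(-dot(context, neg))
    return loss

def dirichlet_penalty(doc_weights, alpha):
    """Negative Dirichlet log-likelihood of the topic proportions,
    up to an additive constant. With alpha < 1, minimizing this
    favors sparse mixtures."""
    proportions = softmax(doc_weights)
    return -sum((alpha - 1.0) * math.log(p) for p in proportions)

# Toy usage with two topics in a 2-dimensional embedding space.
topics = [[1.0, 0.0], [0.0, 1.0]]
ctx = context_vector([0.5, 0.5], [2.0, 0.0], topics)
total = skipgram_loss(ctx, [1.0, 0.0], [[-1.0, 0.0]]) \
    + dirichlet_penalty([2.0, 0.0], alpha=0.5)
```

In this sketch, the joint objective is simply the sum of the Skip-gram term and the Dirichlet term; in practice the two would be weighted against each other and minimized by gradient descent over the word vectors, topic vectors, and document weights.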


Paper Citation


in Harvard Style

Amara A., Hadj Taieb M. and Ben Aouicha M. (2024). Joining LDA and Word Embeddings for Covid-19 Topic Modeling on English and Arabic Data. In Proceedings of the 16th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART; ISBN 978-989-758-680-4, SciTePress, pages 275-282. DOI: 10.5220/0012320900003636


in Bibtex Style

@conference{icaart24,
author={Amina Amara and Mohamed Ali Hadj Taieb and Mohamed Ben Aouicha},
title={Joining LDA and Word Embeddings for Covid-19 Topic Modeling on English and Arabic Data},
booktitle={Proceedings of the 16th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART},
year={2024},
pages={275-282},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012320900003636},
isbn={978-989-758-680-4},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 16th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART
TI - Joining LDA and Word Embeddings for Covid-19 Topic Modeling on English and Arabic Data
SN - 978-989-758-680-4
AU - Amara A.
AU - Hadj Taieb M.
AU - Ben Aouicha M.
PY - 2024
SP - 275
EP - 282
DO - 10.5220/0012320900003636
PB - SciTePress