Topic Modelling: A Comparative Study for Short Text 
Sara Lasri and El Habib Nfaoui 
LISAC Laboratory, Sidi Mohammed Ben Abdellah University, Fez, Morocco
Keywords:  Topic Modelling, Latent Dirichlet Allocation, Biterm Model, LDA2Vec, WNTM. 
Abstract:  Massive amounts of short text are collected every day. Finding the information we are looking
for is therefore challenging: we need to organize, search, classify and understand this large quantity of data.
Topic modelling is an effective technique for this problem. It provides methods to organize, understand and
summarize short categorical text. Topic modelling (TM) is an intuitive approach to extracting the most
essential topics from short text.
1  INTRODUCTION 
Topic modelling is the task of identifying which
underlying concepts are discussed within a collection
of documents and determining which topics each
document is addressing (Andra, Pietsch, Stefan,
2019).
Topic modelling is a method for discovering the hidden
semantic topics (e.g. politics, sports, or business)
in the observed documents of a text corpus
(Chris Bail, 2012).
Topic modelling provides methods for
automatically organizing, understanding, searching,
and summarizing a corpus (Bhagyashree Vyankatrao
Barde, A. M. Bainwad, 2017).
 
Figure 1: Topic Modelling. 
In general, documents are modelled as mixtures of
topics, where a topic is a probability
distribution over words (Hamed, Yongli, Chi, Xia
Xinhua, Yanchao, Liang, 2019). Statistical techniques
are then used to learn the topic components and the
mixture coefficients of each document (Hamed,
Yongli, Chi, Xia Xinhua, Yanchao, Liang, 2019).
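The mixture view described above can be illustrated with a small numerical sketch. The vocabulary, the two topics and all probability values below are hypothetical and chosen only for illustration; they are not taken from any dataset used in this study. The probability of a word in a document is the sum, over topics, of the document's mixture coefficient for the topic times the topic's probability for the word.

```python
# Hypothetical toy vocabulary (illustrative only).
vocab = ["match", "goal", "vote", "election"]

# Each row is one topic: a probability distribution over the vocabulary.
topic_word = [
    [0.50, 0.40, 0.05, 0.05],  # a "sports"-like topic
    [0.05, 0.05, 0.40, 0.50],  # a "politics"-like topic
]

# Mixture coefficients of a single document over the two topics.
doc_topic = [0.75, 0.25]

# p(word | document) = sum over topics of p(topic | document) * p(word | topic)
word_probs = [
    sum(doc_topic[k] * topic_word[k][w] for k in range(len(doc_topic)))
    for w in range(len(vocab))
]

for word, prob in zip(vocab, word_probs):
    print(f"{word}: {prob:.4f}")
```

Because each topic is a distribution and the mixture coefficients sum to one, the resulting word probabilities also sum to one. A topic model works in the opposite direction: it observes only the words and estimates the topic components and the mixture coefficients.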
Detecting the topics within short texts, such as
tweets, has become a challenge, because directly
applying conventional topic models to such texts is
problematic (Hamed, Yongli, Chi, Xia Xinhua, Yanchao,
Liang, 2019).
In this paper, we present different methods for
topic modelling and compare them to find the
most efficient one for uncovering the hidden themes in
tweets.
2  TOPIC MODELLING METHODS
2.1  Latent Dirichlet Allocation (LDA) 
Latent Dirichlet Allocation (LDA) is an unsupervised
generative probabilistic method; it is the most popular
topic modelling method (Hamed, Yongli, Chi, Xia Xinhua,
Yanchao, Liang, 2019). The basic idea is that
documents are represented as random mixtures over latent
topics, where each topic is characterized by a
distribution over words (Hamed, Yongli, Chi, Xia
Xinhua, Yanchao, Liang, 2019).
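The generative story behind LDA can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the implementation evaluated in this paper: the helper names `sample_dirichlet` and `generate_document`, the toy vocabulary, and the hyperparameter values `alpha` and `beta` are all hypothetical. The sketch follows the standard formulation, in which both the topic mixture of a document and the word distribution of a topic are drawn from Dirichlet priors, and it covers only the generative direction, not the inference of topics from observed text.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, alpha, vocab, n_words, rng):
    """Generate one document: draw a topic mixture, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)  # topic mixture of this document
    words = []
    for _ in range(n_words):
        # Choose a latent topic according to the document's mixture.
        k = rng.choices(range(len(theta)), weights=theta)[0]
        # Choose a word according to that topic's word distribution.
        words.append(rng.choices(vocab, weights=topic_word[k])[0])
    return theta, words


rng = random.Random(0)  # fixed seed so the sketch is repeatable
vocab = ["match", "goal", "vote", "election", "market", "price"]  # hypothetical
n_topics = 3
alpha = [0.5] * n_topics    # assumed prior over topic mixtures
beta = [0.5] * len(vocab)   # assumed prior over topic word distributions

# One word distribution per topic.
topic_word = [sample_dirichlet(beta, rng) for _ in range(n_topics)]

theta, words = generate_document(topic_word, alpha, vocab, n_words=8, rng=rng)
print("topic mixture:", [round(t, 3) for t in theta])
print("document:", " ".join(words))
```

Fitting LDA reverses this process: given only the generated words, it estimates the per-document mixtures and the per-topic word distributions. With only eight words per document, as in a tweet-length text, each document provides few word observations from which to estimate its mixture.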