textPrep: A Text Preprocessing Toolkit for Topic Modeling on Social Media Data

Rob Churchill, Lisa Singh

2021

Abstract

With the rapid growth of social media in recent years, there has been considerable effort toward understanding the topics of online discussions. Unfortunately, state-of-the-art topic models tend to perform poorly on this new form of data because of its noisy and unstructured nature. Much research has focused on improving topic modeling algorithms, but very little on improving the quality of the data that goes into them. In this paper, we formalize the notion of preprocessing configurations and propose a standardized, modular toolkit and pipeline for preprocessing social media texts for use in topic models. We perform topic modeling on three different social media data sets and, in the process, show the importance of preprocessing and the usefulness of our pipeline across different social media data. We release our preprocessing toolkit code (textPrep) as a Python package for others to use in advancing research on data mining and machine learning on social media text data.
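
To make the idea of a preprocessing configuration concrete, the following is a minimal sketch that treats a configuration as an ordered list of token-level rules applied in sequence. It is an illustration only and does not reproduce the textPrep API: the rule functions, the `run_pipeline` helper, the tiny stopword list, and the sample post are all hypothetical names and data chosen for this example.

```python
import re
from typing import Callable, List

# A document is a list of tokens; a rule maps one token list to another.
Tokens = List[str]
Rule = Callable[[Tokens], Tokens]

# Hypothetical helpers for illustration; these are NOT textPrep functions.
URL_PATTERN = re.compile(r"^(https?://|www\.)\S+$")
STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "in"}  # tiny sample list


def lowercase(tokens: Tokens) -> Tokens:
    return [t.lower() for t in tokens]


def remove_urls(tokens: Tokens) -> Tokens:
    return [t for t in tokens if not URL_PATTERN.match(t)]


def strip_punctuation(tokens: Tokens) -> Tokens:
    # Keep word characters plus '#' and '@' so hashtags and mentions survive.
    cleaned = (re.sub(r"[^\w#@]", "", t) for t in tokens)
    return [t for t in cleaned if t]


def remove_stopwords(tokens: Tokens) -> Tokens:
    return [t for t in tokens if t not in STOPWORDS]


def run_pipeline(text: str, configuration: List[Rule]) -> Tokens:
    """Split on whitespace, then apply each rule of the configuration in order."""
    tokens = text.split()
    for rule in configuration:
        tokens = rule(tokens)
    return tokens


# One possible configuration: an ordered list of rules.
configuration = [lowercase, remove_urls, strip_punctuation, remove_stopwords]
result = run_pipeline("The vaccine is HERE!! https://example.com #health", configuration)
print(result)  # ['vaccine', 'here', '#health']
```

In this sketch the order of the rules matters: running `strip_punctuation` before `remove_urls` would turn the URL into an ordinary-looking token that the URL rule no longer matches, which is one reason to treat the whole configuration, rather than individual steps, as the unit of comparison.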

Paper Citation


in Harvard Style

Churchill R. and Singh L. (2021). textPrep: A Text Preprocessing Toolkit for Topic Modeling on Social Media Data. In Proceedings of the 10th International Conference on Data Science, Technology and Applications - Volume 1: DATA, ISBN 978-989-758-521-0, pages 60-70. DOI: 10.5220/0010559000600070


in Bibtex Style

@conference{data21,
author={Rob Churchill and Lisa Singh},
title={textPrep: A Text Preprocessing Toolkit for Topic Modeling on Social Media Data},
booktitle={Proceedings of the 10th International Conference on Data Science, Technology and Applications - Volume 1: DATA},
year={2021},
pages={60-70},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0010559000600070},
isbn={978-989-758-521-0},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 10th International Conference on Data Science, Technology and Applications - Volume 1: DATA
TI - textPrep: A Text Preprocessing Toolkit for Topic Modeling on Social Media Data
SN - 978-989-758-521-0
AU - Churchill R.
AU - Singh L.
PY - 2021
SP - 60
EP - 70
DO - 10.5220/0010559000600070