Authors: Rob Churchill and Lisa Singh

Affiliation: Georgetown University, U.S.A.

Keyword(s): Text Preprocessing, Topic Modeling, Data Science, Social Media, textPrep.

Abstract: With the rapid growth of social media in recent years, there has been considerable effort toward understanding the topics of online discussions. Unfortunately, state-of-the-art topic models tend to perform poorly on this new form of data because of its noisy and unstructured nature. Much research has focused on improving topic modeling algorithms, but very little has focused on improving the quality of the data that goes into the algorithms. In this paper, we formalize the notion of preprocessing configurations and propose a standardized, modular toolkit and pipeline for preprocessing social media texts for use in topic models. We perform topic modeling on three different social media data sets and, in the process, show the importance of preprocessing and the usefulness of our preprocessing pipeline when dealing with different social media data. We release our preprocessing toolkit code (textPrep) as a Python package for others to use in advancing research on data mining and machine learning on social media text data.
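
As a rough illustration of what a "preprocessing configuration" could look like, the sketch below treats a configuration as an ordered list of rule functions applied to each tokenized document. This is not the textPrep API: every name here (`run_pipeline`, `remove_urls`, `strip_punctuation`, the tiny `STOPWORDS` set) is hypothetical and chosen only to make the idea of a modular pipeline concrete.

```python
import re
from typing import Callable, List

Tokens = List[str]
Rule = Callable[[Tokens], Tokens]

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "an", "is", "to", "and", "of"}


def remove_urls(tokens: Tokens) -> Tokens:
    """Drop tokens that start with http:// or https://."""
    return [t for t in tokens if not re.match(r"https?://", t)]


def lowercase(tokens: Tokens) -> Tokens:
    """Lowercase every token."""
    return [t.lower() for t in tokens]


def strip_punctuation(tokens: Tokens) -> Tokens:
    """Remove punctuation but keep '#' and '@', which carry meaning on social media."""
    cleaned = (re.sub(r"[^\w#@]", "", t) for t in tokens)
    return [t for t in cleaned if t]


def remove_stopwords(tokens: Tokens) -> Tokens:
    """Drop tokens found in the illustrative stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def run_pipeline(docs: List[str], config: List[Rule]) -> List[Tokens]:
    """Whitespace-tokenize each document, then apply the rules in order."""
    processed = []
    for doc in docs:
        tokens = doc.split()
        for rule in config:
            tokens = rule(tokens)
        processed.append(tokens)
    return processed


# A configuration is just an ordered list of rules; order matters
# (URLs are removed before punctuation stripping would mangle them).
config = [remove_urls, lowercase, strip_punctuation, remove_stopwords]

docs = [
    "Check THIS out: https://example.com #Topic is great!",
    "The model and the data",
]
processed = run_pipeline(docs, config)
print(processed)
```

Swapping, reordering, or dropping rules in `config` yields a different configuration, which is the kind of variation one would compare when studying how preprocessing affects topic model quality.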

CC BY-NC-ND 4.0

Paper citation in several formats:
Churchill, R. and Singh, L. (2021). textPrep: A Text Preprocessing Toolkit for Topic Modeling on Social Media Data. In Proceedings of the 10th International Conference on Data Science, Technology and Applications - DATA; ISBN 978-989-758-521-0; ISSN 2184-285X, SciTePress, pages 60-70. DOI: 10.5220/0010559000600070

@conference{data21,
author={Rob Churchill and Lisa Singh},
title={textPrep: A Text Preprocessing Toolkit for Topic Modeling on Social Media Data},
booktitle={Proceedings of the 10th International Conference on Data Science, Technology and Applications - DATA},
year={2021},
pages={60-70},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0010559000600070},
isbn={978-989-758-521-0},
issn={2184-285X},
}

TY - CONF
JO - Proceedings of the 10th International Conference on Data Science, Technology and Applications - DATA
TI - textPrep: A Text Preprocessing Toolkit for Topic Modeling on Social Media Data
SN - 978-989-758-521-0
IS - 2184-285X
AU - Churchill, R.
AU - Singh, L.
PY - 2021
SP - 60
EP - 70
DO - 10.5220/0010559000600070
PB - SciTePress