Evaluation of Statistical Text Normalisation Techniques for Twitter

Phavanh Sosamphan, Veronica Liesaputra, Sira Yongchareon, Mahsa Mohaghegh

Abstract

One of the major challenges in the era of big data is how to ‘clean’ the vast amount of data, particularly data from micro-blogging websites such as Twitter. Twitter messages, called tweets, are commonly written in ill-formed ways, including abbreviations, repeated characters, and misspelled words. These ‘noisy tweets’ require text normalisation techniques to detect such forms and convert them into more accurate English sentences. Several existing techniques have been proposed to solve these issues; however, each technique possesses some limitations and therefore cannot achieve good overall results. This paper evaluates individual existing statistical normalisation methods and their possible combinations in order to find the best combination that can efficiently clean, at the character level, noisy tweets containing abbreviations, repeated letters and misspelled words. Tested on our Twitter sample dataset, the best combination achieves a Bilingual Evaluation Understudy (BLEU) score of 88% and a Word Error Rate (WER) of 7%, both of which are considered better than those of the baseline model.
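For readers unfamiliar with the second metric, WER is commonly defined as the word-level edit distance (substitutions, deletions and insertions) between a normalised tweet and its reference, divided by the number of words in the reference. The following is a minimal sketch of that common definition, not the authors' evaluation code; the function name `word_error_rate` and the whitespace tokenisation are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative helper (hypothetical name); tokens are split on whitespace,
    which is an assumption and may differ from the paper's tokenisation.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One misspelled word out of four reference words.
    print(word_error_rate("see you tomorrow night", "see you tomorow night"))
```

Under this definition a lower WER is better, whereas a higher BLEU score is better, which is why the two figures in the abstract point in opposite directions.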

Paper Citation


in Harvard Style

Sosamphan P., Liesaputra V., Yongchareon S. and Mohaghegh M. (2016). Evaluation of Statistical Text Normalisation Techniques for Twitter. In Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2016) ISBN 978-989-758-203-5, pages 413-418. DOI: 10.5220/0006083004130418


in Bibtex Style

@conference{kdir16,
author={Phavanh Sosamphan and Veronica Liesaputra and Sira Yongchareon and Mahsa Mohaghegh},
title={Evaluation of Statistical Text Normalisation Techniques for Twitter},
booktitle={Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2016)},
year={2016},
pages={413-418},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006083004130418},
isbn={978-989-758-203-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2016)
TI - Evaluation of Statistical Text Normalisation Techniques for Twitter
SN - 978-989-758-203-5
AU - Sosamphan P.
AU - Liesaputra V.
AU - Yongchareon S.
AU - Mohaghegh M.
PY - 2016
SP - 413
EP - 418
DO - 10.5220/0006083004130418