Evaluation of Statistical Text Normalisation Techniques for Twitter

Phavanh Sosamphan; Veronica Liesaputra; Sira Yongchareon; Mahsa Mohaghegh

Topics: Information Extraction; Mining Text and Semi-Structured Data; Pre-Processing and Post-Processing for Data Mining; Web Mining

In Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2016) - KDIR, pages 413-418, 2016, Porto, Portugal

Authors: Phavanh Sosamphan¹; Veronica Liesaputra¹; Sira Yongchareon² and Mahsa Mohaghegh¹

Affiliations: ¹ Unitec Institute of Technology, New Zealand; ² AUT, New Zealand

Keyword(s): Text Mining, Social Media, Text Normalisation, Twitter, Statistical Language Models, Lexical Normalisation.

Related Ontology Subjects/Areas/Topics: Artificial Intelligence ; Information Extraction ; Knowledge Discovery and Information Retrieval ; Knowledge-Based Systems ; Mining Text and Semi-Structured Data ; Pre-Processing and Post-Processing for Data Mining ; Soft Computing ; Symbolic Systems ; Web Mining

Abstract: One of the major challenges in the era of big data use is how to ‘clean’ the vast amount of data, particularly from micro-blog websites like Twitter. Twitter messages, called tweets, are commonly written in ill-formed language, including abbreviations, repeated characters, and misspelled words. These ‘noisy tweets’ require text normalisation techniques to detect and convert them into more accurate English sentences. Several existing techniques have been proposed to solve these issues; however, each technique possesses some limitations and therefore cannot achieve good overall results. This paper aims to evaluate individual existing statistical normalisation methods and their possible combinations in order to find the best combination that can efficiently clean noisy tweets at the character level, that is, tweets containing abbreviations, repeated letters and misspelled words. Tested on our Twitter sample dataset, the best combination achieves a Bilingual Evaluation Understudy (BLEU) score of 88% and a Word Error Rate (WER) of 7%, both of which are considered better than the baseline model.
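For readers unfamiliar with the WER metric named in the abstract, the following is a minimal sketch of its standard definition: the word-level edit distance (substitutions, insertions and deletions) between a normalised tweet and its reference, divided by the number of reference words. It is not the authors' evaluation code; the function name `word_error_rate` and the example sentences are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Hypothetical helper for illustration; tokenisation is plain
    whitespace splitting, which is an assumption, not the paper's method.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Two of three reference words are still un-normalised in the hypothesis.
    print(word_error_rate("see you tomorrow", "c u tomorrow"))
```

A lower WER is better: a fully normalised tweet that matches its reference word for word scores 0.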

CC BY-NC-ND 4.0

Paper citation in several formats:

Sosamphan, P.; Liesaputra, V.; Yongchareon, S. and Mohaghegh, M. (2016). Evaluation of Statistical Text Normalisation Techniques for Twitter. In Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2016) - KDIR; ISBN 978-989-758-203-5; ISSN 2184-3228, SciTePress, pages 413-418. DOI: 10.5220/0006083004130418

@conference{kdir16,
author={Phavanh Sosamphan and Veronica Liesaputra and Sira Yongchareon and Mahsa Mohaghegh},
title={Evaluation of Statistical Text Normalisation Techniques for Twitter},
booktitle={Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2016) - KDIR},
year={2016},
pages={413-418},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006083004130418},
isbn={978-989-758-203-5},
issn={2184-3228},
}

TY - CONF
JO - Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2016) - KDIR
TI - Evaluation of Statistical Text Normalisation Techniques for Twitter
SN - 978-989-758-203-5
IS - 2184-3228
AU - Sosamphan, P.
AU - Liesaputra, V.
AU - Yongchareon, S.
AU - Mohaghegh, M.
PY - 2016
SP - 413
EP - 418
DO - 10.5220/0006083004130418
PB - SciTePress