Cluster Analysis of Twitter Data: A Review of Algorithms

Noufa Alnajran, Keeley Crockett, David McLean, Annabel Latham

Abstract

Twitter, a microblogging online social network (OSN), has quickly gained prominence as it provides people with the opportunity to communicate and share posts and topics. Tremendous value lies in automatically analysing and reasoning about such data to derive meaningful insights, which offers potential opportunities for businesses, users, and consumers. However, the sheer volume, noise, and dynamism of Twitter impose challenges that hinder the efficacy of observing clusters with high intra-cluster similarity (i.e. minimum variance) and low inter-cluster similarity. This review focuses on research that has used various clustering algorithms to analyse Twitter data streams and identify hidden patterns in tweets, where text is highly unstructured. This paper performs a comparative analysis of unsupervised learning approaches in order to determine whether empirical findings support the enhancement of decision support and pattern recognition applications. A review of the literature identified 13 studies that implemented different clustering methods. The studies were compared in terms of clustering methods, algorithms, number of clusters, dataset size, distance measure, clustering features, evaluation methods, and results. The conclusion reports that the use of unsupervised learning in mining social media data has several weaknesses. Success criteria and future directions for research and practice are discussed for the research community.



Paper Citation


in Harvard Style

Alnajran N., Crockett K., McLean D. and Latham A. (2017). Cluster Analysis of Twitter Data: A Review of Algorithms. In Proceedings of the 9th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART, ISBN 978-989-758-220-2, pages 239-249. DOI: 10.5220/0006202802390249


in Bibtex Style

@conference{icaart17,
author={Noufa Alnajran and Keeley Crockett and David McLean and Annabel Latham},
title={Cluster Analysis of Twitter Data: A Review of Algorithms},
booktitle={Proceedings of the 9th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART},
year={2017},
pages={239-249},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006202802390249},
isbn={978-989-758-220-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 9th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART
TI - Cluster Analysis of Twitter Data: A Review of Algorithms
SN - 978-989-758-220-2
AU - Alnajran N.
AU - Crockett K.
AU - McLean D.
AU - Latham A.
PY - 2017
SP - 239
EP - 249
DO - 10.5220/0006202802390249