Enhanced Information Access to Social Streams Through Word Clouds with Entity Grouping

Martin Leginus, Leon Derczynski, Peter Dolog

2015

Abstract

Intuitive and effective access to large volumes of information is increasingly important. As social media grows as a source of information, methods are needed to access these large volumes of user-generated content. Word clouds are an effective information access tool. However, those generated over social media data often contain redundant and mis-ranked entries, which limits users’ ability to browse and explore datasets. This paper proposes a method for improving word cloud generation over social streams. Named entity expressions in tweets are detected, disambiguated and aggregated into entity clusters. A word cloud is then generated from terms that represent the most relevant entity clusters. We find that word clouds with grouped named entities attain significantly broader coverage and significantly less content duplication, and improve access to relevant entries in the collection. In an extrinsic crowdsourced user evaluation, word clouds with grouped named entities were rated as significantly more relevant and more diverse than the baseline. In addition, word clouds with higher Mean Average Precision (MAP) are more likely to be rated by users as relevant to the concepts reflected. Critically, this supports MAP as a tool for predicting word cloud quality without requiring a human in the loop.
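
Since the abstract relies on Mean Average Precision as a proxy for word cloud quality, a minimal sketch of the standard MAP computation may help. The abstract does not say how word cloud terms are mapped to ranked results or relevance judgments, so the inputs here (a ranked list of item ids and a set of relevant ids per query) and the function names are illustrative assumptions, not the authors' evaluation code.

```python
from typing import Hashable, Iterable, Sequence, Set, Tuple


def average_precision(ranking: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Average precision of one ranked list against a set of relevant items.

    Assumes the ranking contains no duplicate items.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranking, start=1):
        if item in relevant:
            hits += 1
            # Precision at this rank, accumulated only at relevant positions.
            precision_sum += hits / rank
    # Normalise by the number of relevant items, so missed items lower the score.
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Iterable[Tuple[Sequence[Hashable], Set[Hashable]]]
) -> float:
    """Mean of average precision over (ranking, relevant set) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical example: two queries with made-up tweet ids.
example_runs = [
    (["a", "b", "c", "d"], {"a", "c"}),
    (["x", "y"], {"y"}),
]
map_score = mean_average_precision(example_runs)
```

In this sketch each query could stand for one term in a word cloud and each ranking for the tweets retrieved by selecting that term, but that mapping is an assumption made for illustration.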



Paper Citation


in Harvard Style

Leginus M., Derczynski L. and Dolog P. (2015). Enhanced Information Access to Social Streams Through Word Clouds with Entity Grouping. In Proceedings of the 11th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 978-989-758-106-9, pages 183-193. DOI: 10.5220/0005403101830193


in Bibtex Style

@conference{webist15,
author={Martin Leginus and Leon Derczynski and Peter Dolog},
title={Enhanced Information Access to Social Streams Through Word Clouds with Entity Grouping},
booktitle={Proceedings of the 11th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2015},
pages={183-193},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005403101830193},
isbn={978-989-758-106-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 11th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - Enhanced Information Access to Social Streams Through Word Clouds with Entity Grouping
SN - 978-989-758-106-9
AU - Leginus M.
AU - Derczynski L.
AU - Dolog P.
PY - 2015
SP - 183
EP - 193
DO - 10.5220/0005403101830193