A Novel Method for Unsupervised and Supervised Conversational Message Thread Detection

Giacomo Domeniconi, Konstantinos Semertzidis, Vanessa Lopez, Elizabeth M. Daly, Spyros Kotoulas, Gianluca Moro

2016

Abstract

Efficiently detecting conversation threads in a pool of messages, such as social network chats, emails, comments on posts and news, is relevant to various applications, including Web Marketing, Information Retrieval and Digital Forensics. Existing approaches focus on text similarity, using keywords as features that depend strongly on the dataset. Dealing with a new corpus therefore requires further costly analysis by experts to identify new relevant features. This paper introduces a novel method to detect threads in any type of conversational text, removing the need to determine specific features for each dataset in advance. To determine the relevant features of messages automatically, we map each message into a three-dimensional representation based on its semantic content, its social interactions in terms of sender and recipients, and its timestamp; clustering is then used to detect conversation threads. In addition, we propose a supervised approach that builds a classification model combining the features extracted above to predict whether a pair of messages belongs to the same thread. Our model uses the distance between a message and a cluster representing a thread to capture the probability that the message is part of that thread. We present experimental results on seven datasets covering different types of messages and demonstrate the effectiveness of our method in detecting conversation threads, clearly outperforming the state of the art with an improvement of up to 19%.
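To make the three-dimensional representation concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not specify the individual measures, so everything here is an illustrative assumption: bag-of-words cosine similarity stands in for semantic content, Jaccard overlap of participants for social interaction, an inverse time gap for the timestamp dimension, and a thresholded single-link grouping for the clustering step. The names `Message`, `pair_features`, `detect_threads`, the one-hour `scale` and the `threshold` value are all hypothetical.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    text: str
    participants: frozenset  # sender plus recipients
    timestamp: float         # seconds since some reference point


def content_similarity(a, b):
    """Cosine similarity over raw term counts (assumed stand-in for semantic content)."""
    ca, cb = Counter(a.text.lower().split()), Counter(b.text.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def social_similarity(a, b):
    """Jaccard overlap of the two participant sets (assumed social measure)."""
    union = a.participants | b.participants
    return len(a.participants & b.participants) / len(union) if union else 0.0


def time_similarity(a, b, scale=3600.0):
    """Decays from 1 towards 0 as the time gap grows; scale is an assumed constant."""
    return 1.0 / (1.0 + abs(a.timestamp - b.timestamp) / scale)


def pair_features(a, b):
    """Three-dimensional similarity vector for a pair of messages."""
    return (content_similarity(a, b), social_similarity(a, b), time_similarity(a, b))


def detect_threads(messages, threshold=0.5):
    """Single-link grouping: merge pairs whose mean similarity reaches the threshold.

    Returns one label per message; equal labels mean the same thread.
    """
    parent = list(range(len(messages)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(messages)):
        for j in range(i + 1, len(messages)):
            if sum(pair_features(messages[i], messages[j])) / 3 >= threshold:
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(messages))]
```

Under the same assumptions, the vector returned by `pair_features` could also serve as the input to a binary classifier that predicts whether two messages share a thread, which is the shape of the supervised variant described above.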



Paper Citation


in Harvard Style

Domeniconi G., Semertzidis K., Lopez V., Daly E., Kotoulas S. and Moro G. (2016). A Novel Method for Unsupervised and Supervised Conversational Message Thread Detection. In Proceedings of the 5th International Conference on Data Management Technologies and Applications - Volume 1: DATA, ISBN 978-989-758-193-9, pages 43-54. DOI: 10.5220/0006001100430054


in Bibtex Style

@conference{data16,
author={Giacomo Domeniconi and Konstantinos Semertzidis and Vanessa Lopez and Elizabeth M. Daly and Spyros Kotoulas and Gianluca Moro},
title={A Novel Method for Unsupervised and Supervised Conversational Message Thread Detection},
booktitle={Proceedings of the 5th International Conference on Data Management Technologies and Applications - Volume 1: DATA},
year={2016},
pages={43-54},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006001100430054},
isbn={978-989-758-193-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 5th International Conference on Data Management Technologies and Applications - Volume 1: DATA
TI - A Novel Method for Unsupervised and Supervised Conversational Message Thread Detection
SN - 978-989-758-193-9
AU - Domeniconi G.
AU - Semertzidis K.
AU - Lopez V.
AU - Daly E.
AU - Kotoulas S.
AU - Moro G.
PY - 2016
SP - 43
EP - 54
DO - 10.5220/0006001100430054