AUTHOR ATTRIBUTION EVALUATION WITH NOVEL TOPIC CROSS-VALIDATION

Andrew I. Schein, Johnnie F. Caver, Randale J. Honaker, Craig H. Martell

2010

Abstract

The practice of using statistical models to predict authorship (so-called author attribution models) is long established. Several recent authorship attribution studies have indicated that topic-specific cues affect author attribution machine learning models. The arrival of new topics should be anticipated rather than ignored in an author attribution evaluation methodology; a model that relies heavily on topic cues will be problematic in deployment settings where novel topics are common. We develop a protocol and test bed for measuring sensitivity to topic cues using a methodology called novel topic cross-validation. Our methodology performs a cross-validation in which only topics unseen in the training data are used in the test portion. Analysis of the testing framework suggests that corpora with large numbers of topics lead to more powerful hypothesis testing in novel topic evaluation studies. To implement the evaluation metric, we developed two subsets of the New York Times Annotated Corpus, including one with 15 authors and 23 topics. We evaluated a maximum entropy classifier under both standard and novel topic cross-validation in order to compare the mechanics of the two procedures. Our novel topic evaluation framework supports automatic learning of stylometric cues that are topic neutral, and our test bed is reproducible using document identifiers available from the authors.
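
The splitting rule described in the abstract (test folds contain only topics absent from the training data) can be sketched as follows. This is a minimal illustration, not the authors' protocol: the function name `novel_topic_folds`, the one-label-per-document input format, and the round-robin assignment of topics to folds are all assumptions made for the example.

```python
import random


def novel_topic_folds(topics, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs whose test topics never occur in training.

    `topics` holds one topic label per document (a hypothetical input format).
    """
    unique_topics = sorted(set(topics))
    if not 2 <= n_folds <= len(unique_topics):
        raise ValueError("n_folds must be between 2 and the number of topics")
    # Shuffle the topics reproducibly, then deal them round-robin into folds
    # so that every fold receives at least one whole topic.
    random.Random(seed).shuffle(unique_topics)
    fold_of_topic = {t: i % n_folds for i, t in enumerate(unique_topics)}
    for fold in range(n_folds):
        test_idx = [i for i, t in enumerate(topics) if fold_of_topic[t] == fold]
        train_idx = [i for i, t in enumerate(topics) if fold_of_topic[t] != fold]
        yield train_idx, test_idx


# Toy example: six documents over four topics, split into three folds.
doc_topics = ["sports", "sports", "politics", "arts", "arts", "science"]
folds = list(novel_topic_folds(doc_topics, n_folds=3))
for train_idx, test_idx in folds:
    train_topics = {doc_topics[i] for i in train_idx}
    test_topics = {doc_topics[i] for i in test_idx}
    # No topic is shared between the training and test portions of a fold.
    assert train_topics.isdisjoint(test_topics)
```

Because whole topics, rather than individual documents, are assigned to folds, the number of usable folds is bounded by the number of distinct topics. The sketch does not address cases such as an author whose documents all fall in a single topic; how the paper treats such cases is not stated in the abstract.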



Paper Citation


in Harvard Style

Schein, A. I., Caver, J. F., Honaker, R. J. and Martell, C. H. (2010). AUTHOR ATTRIBUTION EVALUATION WITH NOVEL TOPIC CROSS-VALIDATION. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010) ISBN 978-989-8425-28-7, pages 206-215. DOI: 10.5220/0003088402060215


in Bibtex Style

@conference{kdir10,
author={Andrew I. Schein and Johnnie F. Caver and Randale J. Honaker and Craig H. Martell},
title={AUTHOR ATTRIBUTION EVALUATION WITH NOVEL TOPIC CROSS-VALIDATION},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)},
year={2010},
pages={206-215},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003088402060215},
isbn={978-989-8425-28-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)
TI - AUTHOR ATTRIBUTION EVALUATION WITH NOVEL TOPIC CROSS-VALIDATION
SN - 978-989-8425-28-7
AU - Schein, Andrew I.
AU - Caver, Johnnie F.
AU - Honaker, Randale J.
AU - Martell, Craig H.
PY - 2010
SP - 206
EP - 215
DO - 10.5220/0003088402060215