Authorship Attribution using Variable Length Part-of-Speech Patterns

Yao Jean Marc Pokou, Philippe Fournier-Viger, Chadia Moghrabi


Identifying the author of a book or document is a research topic with numerous real-life applications. A number of algorithms have been proposed for the automatic authorship attribution of texts. However, finding distinct and quantifiable features that accurately identify, or at least narrow down, the likely authors of a text remains an important challenge. In this paper, we propose a novel approach to authorship attribution that relies on the discovery of variable-length sequential patterns of parts of speech to build signatures representing each author’s writing style. An experimental evaluation was carried out using 30 books by 10 authors from Project Gutenberg, totaling 2,615,856 words. Results show that the proposed approach can accurately classify texts most of the time using a very small number of variable-length patterns. The approach is also shown to perform better with variable-length patterns than with fixed-length patterns (bigrams or trigrams).
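To make the idea of a part-of-speech signature concrete, the following Python fragment is a minimal sketch and not the method evaluated in this paper. It assumes that texts have already been converted to sequences of part-of-speech tags, that patterns are contiguous tag sequences of length 1 to 3, that a signature is simply the k most frequent patterns, and that signatures are compared by Jaccard overlap. The function names (pos_patterns, signature, similarity, attribute) and all parameter values are hypothetical and chosen for illustration only.

```python
from collections import Counter


def pos_patterns(tags, min_len=1, max_len=3):
    """Count every contiguous tag pattern with length in [min_len, max_len].

    Assumption: patterns are contiguous; a sequential pattern miner
    could also allow gaps between tags.
    """
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(tags) - n + 1):
            counts[tuple(tags[i:i + n])] += 1
    return counts


def signature(tags, k=5, min_len=1, max_len=3):
    """Hypothetical signature: the k most frequent patterns, mapped to
    their relative frequency among all counted patterns."""
    counts = pos_patterns(tags, min_len, max_len)
    total = sum(counts.values())
    return {p: c / total for p, c in counts.most_common(k)}


def similarity(sig_a, sig_b):
    """Jaccard overlap between the pattern sets of two signatures
    (an illustrative choice of measure)."""
    a, b = set(sig_a), set(sig_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def attribute(unknown_tags, author_tags, k=5):
    """Return the author whose signature overlaps most with the unknown text."""
    sig = signature(unknown_tags, k)
    return max(
        author_tags,
        key=lambda author: similarity(sig, signature(author_tags[author], k)),
    )


if __name__ == "__main__":
    # Toy tag sequences, invented for illustration only.
    authors = {
        "A": ["DT", "NN"] * 10,
        "B": ["VB", "RB"] * 10,
    }
    print(attribute(["DT", "NN"] * 5, authors))
```

Mixing pattern lengths in a single ranking lets short and long patterns compete for the k slots of a signature, which is one simple way to contrast variable-length patterns with a bigram-only or trigram-only baseline (obtained here by setting min_len equal to max_len).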


