
 
new document by finding the k nearest neighbours among the training documents. The
resulting classification is, in effect, a majority vote over the categories of these neighbours
[4]. Support vector machines try to find a model that minimizes the true error (the
probability of making a classification error) and are based on the structural risk
minimization principle [1]. Machine learning techniques and shallow parsing have
been used in a methodology for authorship attribution by Luyckx and Daelemans [7].
All of the above methods, except the statistical tests, are called semi-parametric
classification models, as they model the underlying distribution with a potentially infinite
number of parameters, selected in such a way that prediction becomes optimal.
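To make the nearest-neighbour vote concrete, the following sketch classifies a disputed document by the majority category of its k closest training documents. It assumes documents have already been mapped to numeric feature vectors; the vectorisation step and all identifiers here are illustrative, not part of the method of [4].

```python
from collections import Counter

import numpy as np


def knn_classify(query_vec, train_vecs, train_labels, k=5):
    """Majority vote of the k nearest training documents."""
    # Euclidean distance from the disputed document to every training one.
    dists = np.linalg.norm(train_vecs - query_vec, axis=1)
    # Categories of the k closest documents, tallied into a vote.
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
```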
The above authorship attribution systems have several disadvantages. First, they
invariably perform their analysis at the word level. Although word-level analysis seems
intuitive, it ignores various morphological features that can be very important to the
identification problem. The systems are therefore language dependent, and techniques
that work for one language are usually not applicable to others. The difficulty of word
segmentation in many Asian languages deserves particular emphasis. These systems also
usually involve a feature elimination process that reduces the dimensionality of the
feature space by setting thresholds to eliminate uninformative features [8]. This step is
subtle: although rare features contribute less information than common features, they can
still have an important cumulative effect [9].
To avoid these problems, many researchers have proposed approaches that operate at
the character level [13], [14]. Fuchun et al. [14] have shown that state-of-the-art
performance in authorship attribution can be achieved by building N-gram language
models of the text produced by an author. These models play the role of author profiles,
and the standard perplexity measure is then used as the similarity measure between a
disputed text and a profile. Although these methods are language independent and
require no text pre-processing, they still rely on a training phase during which the system
builds each author's profile from a set of optimal N-grams. This can be computationally
intensive and costly, especially when larger N-grams are used.
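As a rough illustration of this profile-and-perplexity scheme, the sketch below builds an author profile as the counts of character N-grams in the author's corpus and scores a disputed text by its perplexity under a simple unigram model over those N-grams with add-alpha smoothing. This is a deliberately simplified stand-in for the full N-gram language models of [14]; all names and the smoothing choice are ours.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """All overlapping character n-grams of a text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_profile(corpus, n=3):
    """Author profile: counts of the character n-grams in the corpus."""
    return Counter(char_ngrams(corpus, n))


def perplexity(text, profile, n=3, alpha=1.0):
    """Perplexity of a disputed text under a unigram model over the
    profile's n-grams, with add-alpha smoothing for unseen n-grams.
    Lower perplexity means the profile predicts the text better."""
    total = sum(profile.values())
    vocab = len(profile) + 1        # crude vocabulary size for smoothing
    grams = char_ngrams(text, n)
    log_sum = sum(math.log((profile[g] + alpha) / (total + alpha * vocab))
                  for g in grams)
    return math.exp(-log_sum / len(grams))
```

Under such a scheme, the candidate author whose profile yields the lowest perplexity on the disputed text would be selected.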
In this paper, we apply an alternative non-parametric approach to the authorship
identification problem, using N-grams at the character level (N consecutive characters).
We compare simple N-gram distributions with the normal distribution, thus avoiding the
extra computational burden of building author profiles. For a text of unknown
authorship, we calculate the distributions of all its possible N-grams in each author's
collection of writings. These distributions are then compared to the normal distribution
using the Kolmogorov-Smirnov test. The author whose derived distribution behaves
most abnormally is selected as the answer to the authorship identification problem. We
expect the N-grams of the disputed text to be biased toward the correct author, and
therefore to be distributed more abnormally in the correct author's collection of writings
than in the other authors' writings. Such an abnormality is captured by the
Kolmogorov-Smirnov test.
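The following sketch shows one possible realization of this test, under the assumption that each author's derived distribution is the vector of frequencies, within that author's corpus, of the N-grams occurring in the disputed text, standardized before the test. The representation and standardization here are illustrative, not a precise specification of our experimental procedure.

```python
from collections import Counter

import numpy as np
from scipy import stats


def char_ngrams(text, n=3):
    """All overlapping character n-grams of a text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ks_abnormality(disputed, author_corpus, n=3):
    """Kolmogorov-Smirnov statistic between the author-corpus frequencies
    of the disputed text's n-grams and the normal distribution; a larger
    statistic means a more abnormal distribution."""
    counts = Counter(char_ngrams(author_corpus, n))
    freqs = np.array([counts[g] for g in set(char_ngrams(disputed, n))],
                     dtype=float)
    # Standardize so the sample is comparable to the standard normal.
    z = (freqs - freqs.mean()) / (freqs.std() or 1.0)
    return stats.kstest(z, 'norm').statistic


def attribute(disputed, corpora, n=3):
    """Select the author whose derived distribution is most abnormal."""
    return max(corpora, key=lambda a: ks_abnormality(disputed, corpora[a], n))
```

Here corpora maps each candidate author to the concatenation of his or her known writings, so attribution requires no training phase beyond counting N-grams.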
Our method is language independent and requires no word segmentation for languages
such as Chinese or Thai. There is no need for text pre-processing or higher-level
processing, so we avoid taggers, parsers, feature selection strategies, and other
language-dependent NLP tools. Our method is also simple and non-parametric, with no
need to build author profiles from training data.