A Machine Learning based Study on Classical Arabic Authorship Identification

Mohamed-Amine Boukhaled

Department of Computer Science, ESIEE-IT Engineering School, Pontoise, France

Keywords: Classical Arabic, Machine Learning, Authorship Identification, Style Marker, Syntactic Features,

Diachronic Corpus.

Abstract: Arabic is a widely spoken language with a rich and long written tradition spanning more than 14 centuries.

Due to its very peculiar linguistic properties, it constitutes a difficult challenge to some natural language

processing applications such as authorship identification, especially in its classical form. Authorship

identification works done on Arabic have mainly focused on the investigation of style markers derived from

either lexical or structural properties of the studied texts. Despite being effective to a certain degree, these

types of style markers have been shown to be unreliable in addressing authorship problems for such a

language. In this contribution, we present a machine learning-based study on using different types of style

markers for classical Arabic. Our aim is to compare the effectiveness of machine learning authorship

identification using style markers that do not rely primarily on the lexical or structural dimension of

language. We used three types of style markers relying mostly on the syntactic information. By way of

illustration, we conducted a study and reported results of experiments done on a corpus of 700 books written

by 20 eminent classical Arabic authors.

1 INTRODUCTION

Arabic is a Semitic language with a rich and long

written tradition spanning more than 14 centuries.

Two different forms of Arabic have diachronically emerged and co-exist nowadays. Classical Arabic (CA) is the historical form of the Arabic language, used in literary texts and mainly in formal academic and religious domains. Modern Standard Arabic (MSA), on the other hand, is the form used in contemporary written works as well as in the media.

MSA does not essentially differ from CA in its basic linguistic foundations (morphology or syntax). However, most researchers in Arabic Natural Language Processing (NLP) have concentrated their efforts on MSA. Classical Arabic, being much richer and more complex in its stylistic, syntactic and lexical usages, is an interesting area of linguistics research, as well as a challenging form of language for existing NLP applications because of its peculiar characteristics.

One of the NLP applications that have received

considerable attention lately is authorship

identification. The authorship identification problem

is the task of identifying the author of a given

document. This problem (also known as authorship attribution or authorship recognition) can typically be formulated as follows: given a set of candidate authors for whom samples of written text are available, the task is to assign a new, unseen text of unknown authorship to one of these candidate authors (Stamatatos, 2009).

This task has been addressed traditionally in the

literature as a text categorization problem

(Sebastiani, 2002). Text categorization is indeed a

useful way to organize large document collections.

In this line of work, current authorship attribution

methods have two key steps. First, an indexing step

based on some style markers is performed on the

studied text using some natural language processing

techniques depending on the type of the desired style

features, such as tokenizing, tagging, parsing, and

morphological analysis; then an identification step is

applied subsequently using the indexed markers to

determine the most likely authorship.

Authorship identification and authorship attribution are

two terms used interchangeably in this document.

Boukhaled, M.

A Machine Learning based Study on Classical Arabic Authorship Identification.

DOI: 10.5220/0010969100003116

In Proceedings of the 14th International Conference on Agents and Artificial Intelligence (ICAART 2022) - Volume 1, pages 489-495

ISBN: 978-989-758-547-0; ISSN: 2184-433X


The identification step usually involves using

machine learning algorithms or some other kind of

statistical and numerical analysis.

Authorship attribution works done on Arabic

have mainly focused on the investigation of style

markers derived from either the lexical or the

structural properties of the studied texts (e.g.

frequency of word forms, discourse markers, type

and length of sentences). Despite being effective to a

certain degree, these types of style markers have

been shown to be unreliable in addressing authorship

problems in Arabic (Omar and Hamouda, 2020).

This can indeed be attributed to the peculiar linguistic properties of Arabic in general, and of CA in particular. Moreover, the majority of the work done on Arabic authorship identification used MSA texts as resources, mainly due to the dominance of MSA in journalistic and social media content.

In this contribution, we present a comparative

study on using different types of style markers for

classical Arabic based on a machine learning

approach for authorship identification. Our aim is to compare the effectiveness of using style markers that do not rely primarily on the lexical or structural dimensions of language, and hence are more likely to be topic-independent. We used three types of style markers relying mostly on the syntactic information contained in the structure of the text: Part-of-Speech-based features, Function Word features, and Character-based features.

By way of illustration, this study is done on a

corpus of 700 books written by 20 eminent classical

Arabic authors. To evaluate the effectiveness of our

approach, we conducted a machine learning

experiment based on three different algorithms

belonging to different statistical families and

reported their performances.

The rest of the paper is organized as follows: we

first give in section 2 a brief review of related works

concerned with authorship identification in general

and in Arabic language in particular, and then we

describe our experimental setup in section 3. The

results of the comparative study are presented in

section 4. Finally, section 5 concludes this

contribution and gives our main prospects.

2 RELATED WORKS

Authorship attribution is a relatively old research

field. A first scientific approach to the problem was

proposed in the late 19th century, in the work of

Mendenhall in 1887, who studied the authorship of

texts attributed to Bacon, Marlowe and Shakespeare.

More recently, the problem of authorship attribution

gained greater importance due to new applications in

forensic analysis and humanities scholarship, as well

as in contemporary society and industry (Kestemont

et al., 2019).

The identification process involves using

methods that fall mainly into two categories: the first

category includes methods that are based on

statistical analysis, such as principal component

analysis (Jamak, Savatić and Can, 2012) or linear

discriminant analysis (Chaski, 2005); the second

category includes machine learning techniques, such as

Support Vector Machines (SVM) (Stamatatos, 2008),

decision trees (Abbasi and Chen, 2005), K-Nearest

Neighbours (KNN) (Zamani et al., 2014) and neural

networks (Zheng et al., 2006).

To achieve high authorship attribution accuracy,

one should use features that are most likely to be

independent from the topic of the text. Many style

markers have been used for this task, from early

works based on simple features such as sentence

length and vocabulary richness (Yule, 1944), to

recent and relevant works based on function words (Zhao and Zobel, 2005; Boukhaled and Ganascia, 2015), punctuation marks (Martin-del-Campo-Rodriguez et al., 2019), Part-of-Speech (POS) tags

(Pokou, Fournier-Viger and Moghrabi, 2016), parse

trees (Gamon, 2004) and character-based features

(Sapkota et al., 2015).

There is a consensus among different researchers

that function words are a highly reliable indicator of

authorship (Kestemont, 2014). There are two main

reasons for using function words in lieu of other

markers. First, because of their high frequency in a

written text, function words are very difficult to

consciously control by the author, which minimizes

the risk of false attribution. Second, function words, unlike content words, are largely independent of the topic or genre of the text, so one should not expect large differences in their frequencies across texts written by the same author on different topics (Chung and Pennebaker, 2007).
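As an illustration of how function word frequencies can serve as a style marker, the following minimal sketch computes the relative frequency of each function word in a tokenized text. The short word list and the example sentence are hypothetical and chosen only for illustration; they are not the resources used in this study, and the text is assumed to be already split on whitespace.

```python
from collections import Counter

# Hypothetical, very small list of Arabic function words (prepositions and
# a conjunction), used only for illustration.
FUNCTION_WORDS = ["في", "من", "على", "إلى", "عن", "ثم"]


def function_word_profile(tokens, function_words=FUNCTION_WORDS):
    """Return the relative frequency of each function word in a token list."""
    counts = Counter(t for t in tokens if t in function_words)
    total = len(tokens)
    if total == 0:
        return {w: 0.0 for w in function_words}
    return {w: counts[w] / total for w in function_words}


# Assumed example sentence, tokenized on whitespace.
tokens = "ذهب الرجل إلى السوق ثم رجع من السوق".split()
profile = function_word_profile(tokens)
```

Because the profile ignores content words entirely, two texts by the same author on different topics would be expected to yield similar vectors, which is the intuition behind the second argument above.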

For the Arabic language, one can categorize

existing works into two categories based on the

extracted features (Al-Ayyoub, Alwajeeh and

Hmeidi, 2017). The first category includes the

lexical approach, where the feature vector for each

text is computed based on the occurrences of the

words within it. The second category is based on

more sophisticated style markers; it relies on

computing certain features by trying to capture more

relevant and deep linguistic traits. Finally, for a

more comprehensive coverage of the different works

NLPinAI 2022 - Special Session on Natural Language Processing in Artificial Intelligence


and issues related to the Arabic authorship identification problem, interested readers are referred to El Bakly, Darwish and Hefny (2020).

3 EXPERIMENTAL SETUP

3.1 Data Set

For this comparative study, we constructed our data

set collection from the OpenITI corpus (Belinkov et

al., 2019). This choice was motivated by our special

interest in studying classical Arabic literature, which has not received as much attention in the community as MSA literature.

The OpenITI corpus is a historical corpus of Arabic, containing some 6 thousand titles and approximately 1 billion words. The collection is based on edited manuscripts, and each title (book) is represented by its full text.

The corpus is cleaned and organized with metadata information. Author names and book titles are coded following a simplified version of the Library of Congress scheme. Moreover, the entire corpus is processed with state-of-the-art Arabic NLP tools (tokenizers and morpho-syntactic analysers, among other tools). The result is a full analysis per word, including tokenization, lemmatization, part-of-speech tagging, and various morphological features, which is very helpful for extracting the style markers considered in our analysis. The corpus contains the majority of the

famous titles in Arabic culture, and almost all genres

that played an important role in the development of

the Arabic written tradition are represented.

For our experiment, we collected books for the

twenty most represented authors in terms of works

in the OpenITI corpus, so that the total number of

books is 700. This author selection scheme helps us cover most of the classical Arabic time period, from the 9th to the 15th century CE (which corresponds to the 3rd and 9th centuries respectively in the Islamic Hijri calendar, AH) (Ali and Ali, 1987).
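The correspondence between Hijri (AH) and Common Era (CE) years can be approximated with simple arithmetic. The sketch below is an assumption-based approximation, not a calendar-exact conversion: it uses the mean length of a lunar year (about 354.367 days), the mean length of a Gregorian year (365.2425 days), and mid-622 CE as the start of the Hijri calendar. The function name is hypothetical, and the result may be off by about one year.

```python
LUNAR_YEAR_DAYS = 354.367    # mean length of a Hijri (lunar) year
SOLAR_YEAR_DAYS = 365.2425   # mean length of a Gregorian year
HIJRI_EPOCH_CE = 622.54      # 1 AH starts in mid-622 CE (approximation)


def hijri_to_ce(ah_year):
    """Approximate CE date (as a fractional year) at which a Hijri year begins."""
    elapsed_solar_years = (ah_year - 1) * LUNAR_YEAR_DAYS / SOLAR_YEAR_DAYS
    return HIJRI_EPOCH_CE + elapsed_solar_years


# Earliest and latest years of death listed in Table 1.
for ah in (303, 911):
    print(f"{ah} AH starts around {hijri_to_ce(ah):.0f} CE")
```

Since a lunar year is roughly 3% shorter than a solar year, the gap between the two calendars shrinks by about three years per century.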

Table 1: Statistics for the data set used in our experiment. The first column gives the year of death of each author, taken as a time period indicator.

Year (AH)  Author                # Words    # Sentences  # Books  # Texts
303        Nasai                 1655894    36925        17       76
385        Daruqutni             1355617    51345        22       105
413        ShaykhMufid           1186846    56091        44       121
430        AbuNucaymIsbahani     2981145    89544        28       180
456        IbnHazm               3060549    82807        27       165
458        Bayhaqi               5270192    135144       23       255
463        KhatibBaghdadi        3096101    49427        26       104
505        Ghazali               2103712    51758        22       104
571        IbnCasakir            6649073    167963       24       319
576        IbnMuhammadSilafi     402677     11412        16       31
597        IbnJawzi              5356768    240757       50       462
600        CabdGhaniMuqaddasi    608623     25059        21       56
643        DiyaDinMuqaddasi      1155397    48541        25       102
676        Nawawi                4182033    124590       21       263
728        IbnTaymiyya           9191977    183184       89       378
748        Dhahabi               7659298    505507       42       953
751        IbnQayyimJawziyya     4492335    97880        40       196
795        IbnRajabHanbali       2063206    94798        22       184
852        IbnHajarCasqalani     14962764   577075       50       1070
911        Suyuti                10479550   647426       91       1216

The next step was to divide these books into smaller pieces of text in order to have enough data instances to train the attribution algorithm.

Researchers working on authorship attribution of literary texts have used different segmentation strategies. For example, Hoover (2003)

decided to take just the first 10,000 words of each

book as a single text, while Argamon and Levitan

(2005) treated each chapter of each book as a

separate text.

As done in (Boukhaled and Ganascia, 2017) and

since we are considering sentence-split texts, in our

experiment we chose to slice books by the size of

the smallest one in the collection in terms of number

of sentences.
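The slicing strategy described above can be sketched as follows. This is a minimal illustration under stated assumptions: each book is assumed to be available as a list of sentences, the function name and toy data are hypothetical, and a trailing remainder shorter than the chunk size is simply dropped, since the handling of remainders is not specified here.

```python
def slice_book(sentences, chunk_size):
    """Split a book (a list of sentences) into consecutive texts of equal size.

    Assumption: a trailing remainder shorter than chunk_size is dropped.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    n_full = len(sentences) // chunk_size
    return [sentences[i * chunk_size:(i + 1) * chunk_size] for i in range(n_full)]


# Toy collection: the smallest book (3 sentences) fixes the size of every text.
books = {
    "book_a": [f"a{i}" for i in range(7)],
    "book_b": [f"b{i}" for i in range(3)],
    "book_c": [f"c{i}" for i in range(10)],
}
chunk_size = min(len(sentences) for sentences in books.values())
texts = {title: slice_book(sentences, chunk_size) for title, sentences in books.items()}
```

With this scheme every resulting text has the same number of sentences, so longer books contribute more training instances than shorter ones.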

More information about the data set used in the

experiment is presented in Table 1 above.

3.2 Style Markers (Features) and

Classification Scheme (Algorithms)

In our approach, we focus our comparative study on

using style markers that do not rely on the lexical or

structural dimension of classical Arabic. We used

three types of style markers relying mostly on the

syntactical information contained in the structure of

the text: Part of Speech features, Function Word

features, and Character-based features. More

precisely, each text in our data set is represented by a vector of normalized frequencies of occurrence of these three types of stylistic markers: part-of-speech tag n-grams, function word frequencies, and character-based n-grams (with n varying from 1 to 3).
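A minimal sketch of how such n-gram frequency profiles might be computed is given below. It is an illustration under assumptions rather than the exact extraction code of this study: the function name is hypothetical, the same routine is applied to a string (character n-grams) and to a list of part-of-speech tags (POS n-grams), and frequencies are L1-normalised so that they sum to 1.

```python
from collections import Counter


def ngram_profile(sequence, n):
    """L1-normalised frequencies of the n-grams of a sequence.

    `sequence` can be a string (character n-grams) or a list of
    part-of-speech tags (POS n-grams).
    """
    grams = [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    # An empty profile is returned when the sequence is shorter than n.
    return {gram: count / total for gram, count in counts.items()} if total else {}


# Toy inputs: a short string and a hypothetical POS tag sequence.
char_profile = ngram_profile("abab", 2)
pos_profile = ngram_profile(["NOUN", "VERB", "NOUN", "VERB", "NOUN"], 2)
```

To obtain fixed-length vectors for a classifier, the profiles of all texts would additionally have to be aligned on a shared n-gram vocabulary, with missing n-grams set to zero.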

Then we relied on a classification scheme based on three different machine learning algorithms, belonging to different statistical families, to derive a discriminative model for our represented authors. The three algorithms used in the experiments are the Logistic Regression classifier, the Gaussian Naïve Bayes classifier, and the K-Nearest-Neighbours (KNN) classifier.
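To make the KNN decision rule concrete, the following sketch predicts the author of a text by majority vote among its nearest training texts. It is a simplified illustration, not the implementation used in the experiments: the Euclidean distance and the toy two-dimensional feature vectors are assumptions, and the function name is hypothetical. The default of k = 3 follows the value of K reported for this study.

```python
from collections import Counter
import math


def knn_predict(train_vectors, train_labels, query, k=3):
    """Predict the author of `query` by majority vote among its k nearest texts."""
    # Euclidean distance is an assumption; other metrics could be used instead.
    distances = sorted(
        (math.dist(vector, query), label)
        for vector, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy training set: two hypothetical authors with 2-dimensional style vectors.
train_vectors = [(0.9, 0.1), (0.8, 0.2), (0.85, 0.15),
                 (0.2, 0.8), (0.1, 0.9), (0.15, 0.85)]
train_labels = ["author_a"] * 3 + ["author_b"] * 3
predicted = knn_predict(train_vectors, train_labels, (0.7, 0.3))
```

KNN is a lazy learner: there is no training phase beyond storing the vectors, and all the work is done at prediction time.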

To get a reasonable estimate of the expected attribution performance (generalization), we used 10-fold cross-validation and computed common classification metrics, namely Precision (P), Recall (R), and F1-score, defined as follows:

$$P = \frac{TP}{TP + FP} \tag{1}$$

$$R = \frac{TP}{TP + FN} \tag{2}$$

$$F_1 = \frac{2PR}{P + R} \tag{3}$$

where TP, TN, FN and FP are respectively the numbers of true positive, true negative, false negative, and false positive text-to-author attributions.

The normalization of the frequency vectors was done using the L1 normalization technique.
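The metrics above can be computed per author by treating each author in turn as the positive class. The sketch below is a minimal illustration with hypothetical names and toy labels; how the per-author scores are averaged into a single figure is not specified here and is therefore left out.

```python
def per_author_scores(true_authors, predicted_authors):
    """Precision, recall and F1 for each author, taken in turn as the positive class."""
    pairs = list(zip(true_authors, predicted_authors))
    scores = {}
    for author in sorted(set(true_authors) | set(predicted_authors)):
        tp = sum(t == author and p == author for t, p in pairs)
        fp = sum(t != author and p == author for t, p in pairs)
        fn = sum(t == author and p != author for t, p in pairs)
        # Scores default to 0.0 when a denominator is zero.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[author] = (precision, recall, f1)
    return scores


# Toy example with two hypothetical authors.
y_true = ["a", "a", "a", "b", "b"]
y_pred = ["a", "a", "b", "b", "a"]
scores = per_author_scores(y_true, y_pred)
```

Note that true negatives do not enter any of the three formulas, which makes these metrics insensitive to the large number of texts correctly not attributed to a given author.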

4 RESULTS AND ANALYSIS

The attribution performance obtained for the different feature sets presented in our experimental setup is summarized in Table 2. In general, these results show that character-based features, which reach the highest F1-score (91.5% with character 3-grams and the KNN classifier), perform better than part-of-speech and function word features.

Our study shows that the KNN classifier is by far the best performing model in our experiment. Combined with character n-gram features, it achieves a high attribution performance (F1 = 91.5% for character 3-grams). Up to a point, increasing the n-gram length improves the attribution performance. These comparative results suggest that a simple (lazy) model does better than more complex models such as the Logistic Regression classifier in our classification setting; we believe that this is due to the relatively small size of the training data.

Contrary to our expectations, function word features did not perform well on our corpus. In fact, they achieved at best a moderate performance (F1 = 83.7%) when used with the TF-IDF heuristic. We believe that this is due to some peculiar linguistic properties of classical Arabic that affect the attribution process. These properties, which need to be studied more deeply in future work, depend on the linguistic character of the text, such as the syntactic and lexical disparities between the different parts of one book, and the time period in which it was written.
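For readers unfamiliar with the TF-IDF heuristic, the sketch below re-weights raw function word counts by an inverse document frequency term. The exact formula used in the experiments is not given here, so the smoothed variant shown is an assumption, as are the function name and the toy counts.

```python
import math


def tf_idf(count_vectors):
    """Re-weight raw count vectors (one per text) by inverse document frequency."""
    n_texts = len(count_vectors)
    n_features = len(count_vectors[0])
    # Number of texts in which each feature occurs at least once.
    doc_freq = [sum(1 for vec in count_vectors if vec[j] > 0)
                for j in range(n_features)]
    # Smoothed IDF (an assumed variant): a feature present in every text
    # keeps a weight of exactly 1 instead of being zeroed out.
    idf = [math.log((1 + n_texts) / (1 + df)) + 1 for df in doc_freq]
    return [[count * weight for count, weight in zip(vec, idf)]
            for vec in count_vectors]


# Toy counts for three texts and three hypothetical function words.
counts = [[3, 0, 1], [2, 1, 0], [4, 0, 0]]
weighted = tf_idf(counts)
```

Since function words tend to occur in almost every text, their IDF weights stay close to 1 under this variant, which may help explain why TF-IDF changes the function word results only slightly.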

Although function word-based markers are state-of-the-art in many other languages, they rely on the bag-of-words assumption, which stipulates that a text is a set of independent words. This approach completely ignores the fact that there is a syntactic structure and latent sequential information in the text. In fact, De Roeck, Sarkar and Garthwaite (2004) have shown that frequent words, including function words, do not distribute homogeneously over a text.


Table 2: 10-fold cross-validation results for the three models based on different types of style features. Precision (P), Recall (R) and F1-score are shown in percentages; execution time is reported in seconds.

               LogisticRegression        GaussianNB                KNeighborsClassifier
Style Markers  P     R     F1    Time    P     R     F1    Time    P     R     F1    Time
CHAR_1_gram    25.9  37.4  26.1  16      63.1  55.4  56.2  6       86.6  86.2  86.1  12
CHAR_2_gram    25.1  38.1  26.9  189     67.2  62.0  62.4  27      91.5  91.0  90.9  171
CHAR_3_gram    25.3  34.9  22.8  979     81.1  80.3  79.9  118     91.0  91.6  91.5  684
FW             38.5  46.0  37.2  49      63.2  49.0  52.2  8       84.0  83.2  83.0  24
FW_TF-IDF      39.1  45.8  37.0  46      63.2  44.2  48.2  9       84.7  83.9  83.7  23
POS_1_gram     33.3  45.8  36.7  22      58.9  48.4  49.3  8       85.5  84.8  84.7  10
POS_2_gram     27.3  43.4  31.9  85      63.8  55.9  57.9  13      90.2  89.6  89.6  27
POS_3_gram     25.9  41.6  29.6  312     74.1  72.5  72.0  37      91.2  90.6  90.6  151

Table 3: Individual attribution results for each author in the data set, produced by the best performing model (KNN classifier with character 3-grams).

Year-Author               P     R     F1
0303-Nasai                0.89  1.00  0.94
0385-Daruqutni            1.00  0.69  0.81
0413-ShaykhMufid          0.88  1.00  0.93
0430-AbuNucaymIsbahani    1.00  0.95  0.97
0456-IbnHazm              1.00  0.94  0.97
0458-Bayhaqi              1.00  1.00  1.00
0463-KhatibBaghdadi       1.00  0.78  0.88
0505-Ghazali              0.78  1.00  0.88
0571-IbnCasakir           0.97  0.95  0.96
0576-IbnMuhammadSilafi    0.67  0.67  0.67
0597-IbnJawzi             0.83  0.98  0.90
0600-CabdGhaniMuqaddasi   0.20  0.25  0.22
0643-DiyaDinMuqaddasi     1.00  0.80  0.89
0676-Nawawi               0.79  0.90  0.84
0728-IbnTaymiyya          0.86  0.91  0.88
0748-Dhahabi              0.94  0.97  0.96
0751-IbnQayyimJawziyya    0.81  0.81  0.81
0795-IbnRajabHanbali      0.87  0.93  0.90
0852-IbnHajarCasqalani    0.97  0.95  0.96
0911-Suyuti               0.97  0.92  0.95

Therefore, these results provide evidence that the bag-of-words assumption is not valid for Classical Arabic either.

Looking at the individual performance for each author based on the best model (KNN classifier with character 3-grams, see Table 3), we notice that no clear pattern emerges. Some authors have a very distinguishable writing style, such as Bayhaqi, who has a perfect attribution performance, or IbnHazm, who has a high attribution performance (F1 = 97% and P = 100%); others have less distinguishable texts, such as CabdGhaniMuqaddasi (F1 = 22%). These individual results do not seem to show any correlation between the attribution performance and either the author's time period or the quantity of text that we had for each author.
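One simple way to examine such a relationship is to compute a correlation coefficient between the per-author F1-scores of Table 3 and the year of death or the number of texts per author from Table 1. The sketch below uses the Pearson coefficient, which is an assumption on our part: it only captures linear relationships, and with 20 authors any value should be interpreted with caution.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Values transcribed from Table 1 (year of death in AH, number of texts)
# and Table 3 (F1-score), in the same author order.
years = [303, 385, 413, 430, 456, 458, 463, 505, 571, 576,
         597, 600, 643, 676, 728, 748, 751, 795, 852, 911]
n_texts = [76, 105, 121, 180, 165, 255, 104, 104, 319, 31,
           462, 56, 102, 263, 378, 953, 196, 184, 1070, 1216]
f1 = [0.94, 0.81, 0.93, 0.97, 0.97, 1.00, 0.88, 0.88, 0.96, 0.67,
      0.90, 0.22, 0.89, 0.84, 0.88, 0.96, 0.81, 0.90, 0.96, 0.95]

r_year = pearson(years, f1)      # F1 versus time period
r_texts = pearson(n_texts, f1)   # F1 versus amount of text
```

A rank-based coefficient such as Spearman's would be a reasonable alternative, given the single very low F1-score in the data.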

5 CONCLUSIONS

This study addressed the authorship identification problem for classical Arabic using a machine learning approach. Work on Arabic has traditionally focused on style markers derived from either lexical or structural properties of the studied texts, even though such markers have been shown to be unreliable for authorship identification in this language. In light of this, we presented a comparative study of different types of style markers for classical Arabic. Our aim was to compare the effectiveness of style markers that do not rely primarily on the lexical or structural dimensions of language. We used three types of style markers based mostly on the syntactic information contained in the structure of the text: part-of-speech-based features, function word features and character-based features. To evaluate the effectiveness of these markers, we conducted an experiment on a diachronic classical Arabic corpus comprising 700 books. Our results show that these markers can indeed be very effective stylistic features, reaching an F1-score of up to 91.5% with character 3-grams and a KNN classifier.

Notes to Section 4: the KNN classifier was used with K=3. Bayhaqi lived 994 – 1066 CE, IbnHazm 994 – 1064 CE, and CabdGhaniMuqaddasi 1146 – 1203 CE.

REFERENCES

Abbasi, A. and Chen, H. (2005) ‘Applying authorship

analysis to extremist-group web forum messages’,

IEEE Intelligent Systems. IEEE, 20(5), pp. 67–75.

Al-Ayyoub, M., Alwajeeh, A. and Hmeidi, I. (2017) ‘An

extensive study of authorship authentication of arabic

articles’, International Journal of Web Information

Systems. Emerald Publishing Limited.

Ali, A. S. M. and Ali, A. S. (1987) A linguistic study of the

development of scientific vocabulary in Standard

Arabic. Routledge.

Argamon, S. and Levitan, S. (2005) ‘Measuring the

usefulness of function words for authorship

attribution’, in Proceedings of the Joint Conference of

the Association for Computers and the Humanities and

the Association for Literary and Linguistic Computing.

El Bakly, A. H., Darwish, N. R. and Hefny, H. A. (2020)

‘A Survey on Authorship Attribution Issues of Arabic

Text’, International Journal of Artificial Intelligent

Systems and Machine Learning, 2, pp. 86–92.

Belinkov, Y. et al. (2019) ‘Studying the history of the

Arabic language: language technology and a large-

scale historical corpus’, Language Resources and

Evaluation. Springer, 53(4), pp. 771–805.

Boukhaled, M. A. and Ganascia, J.-G. (2015) ‘Using

function words for authorship attribution: Bag-of-

words vs. sequential rules’, in Natural Language

Processing and Cognitive Science. De Gruyter, pp.

115–122.

Boukhaled, M. A. and Ganascia, J.-G. (2017) ‘Stylistic

Features Based on Sequential Rule Mining for

Authorship Attribution’, in Cognitive Approach to

Natural Language Processing. Elsevier, pp. 159–175.

Chaski, C. E. (2005) ‘Who’s at the keyboard? Authorship

attribution in digital evidence investigations’,

International journal of digital evidence. Citeseer,

4(1), pp. 1–13.

Chung, C. and Pennebaker, J. W. (2007) ‘The

psychological functions of function words’, Social

communication, pp. 343–359.

Gamon, M. (2004) ‘Linguistic correlates of style:

authorship classification with deep linguistic analysis

features’, in Proceedings of the 20th international

conference on Computational Linguistics, p. 611.

Hoover, D. L. (2003) ‘Frequent collocations and authorial

style’, Literary and Linguistic Computing. ALLC,

18(3), pp. 261–286.

Jamak, A., Savatić, A. and Can, M. (2012) ‘Principal

component analysis for authorship attribution’,

Business Systems Research: International journal of

the Society for Advancing Innovation and Research in

Economy. Udruga za promicanje poslovne

informatike, 3(2), pp. 49–56.

Kestemont, M. (2014) ‘Function words in authorship

attribution. From black magic to theory?’, in

Proceedings of the 3rd Workshop on Computational

Linguistics for Literature (CLFL), pp. 59–66.

Kestemont, M. et al. (2019) ‘Overview of the Cross-

domain Authorship Attribution Task at PAN 2019.’, in

CLEF (Working Notes).

Martin-del-Campo-Rodriguez, C. et al. (2019)

‘Authorship Attribution through Punctuation n-grams

and Averaged Combination of SVM’.

Omar, A. and Hamouda, W. I. (2020) ‘The effectiveness

of stemming in the stylometric authorship attribution

in arabic’,

International Journal of Advanced

Computer Science and Applications (IJACSA), 11(1),

pp. 116–121.

Pokou, Y. J. M., Fournier-Viger, P. and Moghrabi, C.

(2016) ‘Authorship Attribution using Variable Length

Part-of-Speech Patterns.’, in ICAART (2), pp. 354–

361.

De Roeck, A., Sarkar, A. and Garthwaite, P. H. (2004)

‘Defeating the homogeneity assumption’, in

Proceedings of 7th International Conference on the

Statistical Analysis of Textual Data (JADT), pp. 282–

294.

Sapkota, U. et al. (2015) ‘Not all character n-grams are

created equal: A study in authorship attribution’, in

Proceedings of the 2015 conference of the North

American chapter of the association for computational

linguistics: Human language technologies, pp. 93–

102.

Sebastiani, F. (2002) ‘Machine learning in automated text

categorization’, ACM computing surveys (CSUR).

ACM, 34(1), pp. 1–47.

Stamatatos, E. (2008) ‘Author identification: Using text

sampling to handle the class imbalance problem’,

Information Processing & Management. Elsevier,

44(2), pp. 790–799.

Stamatatos, E. (2009) ‘A survey of modern authorship

attribution methods’, Journal of the American Society

for information Science and Technology. Wiley Online

Library, 60(3), pp. 538–556.

Yule, G. U. (1944) The statistical study of literary

vocabulary. CUP Archive.

Zamani, H. et al. (2014) ‘Authorship identification using

dynamic selection of features from probabilistic

feature set’, in International Conference of the Cross-


Language Evaluation Forum for European

Languages, pp. 128–140.

Zhao, Y. and Zobel, J. (2005) ‘Effective and scalable

authorship attribution using function words’, in

Information Retrieval Technology. Springer, pp. 174–

189.

Zheng, R. et al. (2006) ‘A framework for authorship

identification of online messages: Writing-style

features and classification techniques’, Journal of the

American Society for Information Science and

Technology. Wiley Online Library, 57(3), pp. 378–

393.
