
 
for applying the lexicon/stemming algorithm to the 
corpus, showing Arabic morphology and describing 
other work that has contributed to this strategy. Section 
two describes the approach that was followed 
to solve the problem and the affixes used to enhance 
the performance of the stemmer. Section three 
explains the algorithm used to apply the new 
approach. Section four presents a statistical analysis 
of the new method. Finally, section five describes 
the planned future development and uses of the 
approach and presents some conclusions. 
1.1 Motivation 
Natural Language Processing (NLP) is the use of 
computer technologies for the creation, archiving, 
processing and retrieval of machine-processed 
language data and is a common research topic 
involving computer science and linguistics 
(Maynard et al., 2002). Research in the NLP of 
Arabic is very limited (AbdelRaouf et al., 2010). 
For instance, the Arabic language lacks a robust 
corpus. The creation of a well-established 
Arabic corpus encourages Arabic language research 
and enhances the development of Arabic OCR 
applications. 
This paper presents a new approach which 
extends and develops that reported in (AbdelRaouf 
et al., 2008, AbdelRaouf et al., 2010). An Arabic 
corpus of 6 million Arabic words containing 
282,593 unique words was constructed. In order to 
check the performance and accuracy of this corpus, a 
testing dataset of 69,158 words was also created. 
Upon searching, 89.8% of the testing dataset was 
found to exist in the corpus. We considered this 
accuracy too low. To improve it, the system was 
enhanced with a lexicon/stemming algorithm: a 
combination of stemming and lexicon lookup 
provides a list of alternatives for the missing 
words. 
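The coverage figure above can be illustrated with a minimal sketch. It assumes that coverage is counted per token of the testing dataset, which the text does not state explicitly. The function name `corpus_coverage` and the toy word lists are hypothetical and are not the corpus or testing dataset described here.

```python
def corpus_coverage(test_words, corpus_words):
    """Return the fraction of test tokens that also occur in the corpus."""
    vocabulary = set(corpus_words)   # unique corpus words for fast lookup
    test_words = list(test_words)
    if not test_words:
        return 0.0
    found = sum(1 for word in test_words if word in vocabulary)
    return found / len(test_words)


# Toy data only (assumption): four corpus words and four test tokens.
corpus = ["كتاب", "كتب", "مكتبة", "قلم"]
test = ["كتاب", "قلم", "كاتب", "كتب"]
coverage = corpus_coverage(test, corpus)
print(f"{coverage:.1%}")
```

Words of the testing dataset that are not found this way are the "missing words" for which the lexicon/stemming step proposes alternatives.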
We designed our stemmer to avoid two common 
errors. The first error occurs when the stemmer fails 
to find the relevant words (words derived from the 
same root word) and hence fails to increase the 
corpus accuracy. The second error occurs when the 
stemmer uses many affixes to create a very long list 
of alternative words, and hence detects irrelevant 
words (words not related in meaning to the original 
word). Such a long list also makes the stemmer slower. 
Our stemmer increases the accuracy of the 
corpus and simultaneously improves the reporting of 
relevant words. 
 
1.2  Arabic Language Morphology 
The Arabic language depends mainly on the root of 
a word. The root word can produce either a verb or a 
noun, for instance “” - a root word – can be a 
noun as in “   ” or a verb as in “   ”. 
Stemmers, in general, tend to extract the root of 
the word by removing affixes. English stemmers 
remove only suffixes, whereas Arabic stemmers 
mainly remove prefixes and suffixes; some of them 
also remove infixes. 
Lexica, on the other hand, create a list of 
alternative words that can be produced from a given root 
(Al-Shalabi and Evens, 1998, Jomma et al., 2006). 
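The two ideas, stripping affixes to reach a stem and listing alternative words built from it, can be sketched as follows. This is an illustrative sketch only and not the stemmer of this paper: the affix lists `PREFIXES` and `SUFFIXES`, the minimum stem length and the function names are assumptions, and infixes are not handled.

```python
# Illustrative affixes only (assumption); a real affix list would be larger.
PREFIXES = ["ال", "و"]         # definite article "al-", conjunction "wa-"
SUFFIXES = ["ات", "ون", "ة"]   # feminine plural, masculine plural, ta marbuta


def strip_affixes(word, min_stem_len=3):
    """Remove at most one prefix and one suffix, keeping a minimum stem length."""
    stem = word
    for prefix in PREFIXES:
        if stem.startswith(prefix) and len(stem) - len(prefix) >= min_stem_len:
            stem = stem[len(prefix):]
            break
    for suffix in SUFFIXES:
        if stem.endswith(suffix) and len(stem) - len(suffix) >= min_stem_len:
            stem = stem[:-len(suffix)]
            break
    return stem


def alternatives(word, vocabulary):
    """List vocabulary words that share the stripped stem of `word`."""
    stem = strip_affixes(word)
    candidates = {stem}
    candidates.update(p + stem for p in PREFIXES)
    candidates.update(stem + s for s in SUFFIXES)
    candidates.update(p + stem + s for p in PREFIXES for s in SUFFIXES)
    # Keep only candidates that are attested in the vocabulary.
    return sorted(c for c in candidates if c in vocabulary and c != word)


vocabulary = {"كتاب", "الكتاب", "كتابات", "قلم"}
print(alternatives("الكتاب", vocabulary))
```

Filtering the candidates against the vocabulary keeps the list short. A longer affix list would produce more candidates, which is the trade-off between missing relevant words and reporting irrelevant ones.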
Arabic words change according to the following 
variables (Al-Shalabi and Evens, 1998, Al-Shalabi 
and Kanaan, 2004): 
  Gender: Masculine or feminine, as in (   ). 
  Tense (verbs only): Past, present or future, as 
in (    ). 
  Number: Singular, dual or plural, as in (   
  ). 
  Person: First, second or third, as in (   
 ). 
  Imperative verb: as in (   ). 
  Definiteness: Definite or indefinite, as in ( 
). 
The Arabic language, in addition to verbs and nouns, 
contains prepositions, adverbs, pronouns and so on. 
1.3 Related Work 
The Arabic language is rich and has a large variety 
of grammar rules. Research in Arabic linguistics is 
varied and can be categorized into four main types. 
1.3.1  Manually Constructed Dictionaries 
In this approach, a custom Arabic retrieval system 
is built on a list of roots and generates lists of 
alternative words from those roots. This method is 
limited by the number of roots collected 
(Al-Kharashi and Evens, 1994). 
1.3.2 Morphological Analysis 
Morphological analysis is an important topic in 
natural language processing. It is mainly concerned 
with identifying roots and stems, and relates more to 
the grammar of the word and its position (Al-Shalabi 
and Evens, 1998, Al-Shalabi and Kanaan, 2004, 
Jomma et al., 2006). 
 
ICPRAM 2013 - International Conference on Pattern Recognition Applications and Methods