Zerrouki. It was manually rewritten and
vocalized by volunteers, and covers about 75
million vocalized words (Zerrouki, 2017).
  Al Khaleej Corpus: collected from the online
newspaper “Akhbar El Khaleej” by M. Abbas.
It contains more than five hundred articles,
distributed into three categories (international
and local news, economy, and sports), and covers
about 3 million words (Abbas, 2005).
  King Abdulaziz City for Science and
Technology Arabic Corpus: collected from a
variety of published media by Al-Thubaity
et al. It contains more than 869,800 files,
distributed into several categories
(manuscripts, newspapers, books, magazines,
scientific periodicals, etc.), and covers more
than 700 million words, 7,464,396 of which
are unique (Al-Thubaity, 2015).
  Contemporary Arabic Corpus: collected
between 1990 and 2004 from newspapers,
emails and websites by Al-Sulaiti and Atwell.
It is tagged in XML and covers more than
842,684 words (Al-Sulaiti, 2005).
  Kalimat Corpus: collected from the Arabic
newspaper Alwatan by El-Haj and Koulali,
summarized into 2,057 multi-document system
summaries, NER annotated, POS tagged and
fully morphologically analyzed. It contains
more than 20,291 articles, distributed into six
categories (culture, economy, international
news, local news, religion and sports), and
covers about 18,167,183 words (El
Haj, 2013).
  SACS Corpus: collected from the proceedings 
of  the  Saudi  Arabian  National  Computer 
Science  Conference  by  Abu  Salem.  It  covers 
46,968  words  tagged  with  title,  authors, 
sources and abstract (Abu Salem). 
  The International Corpus of Arabic: collected
from electronic books, academic research
papers, and articles from newspaper websites by
Alansary. It contains 70,022 articles,
distributed into eleven categories (strategic,
national and social sciences, sports, religion,
literature, bibliography and others), and covers
more than 80 million words, 1,272,766 of
which are unique (Alansary, 2014).
  Al-Raya Corpus: collected from the articles of
the Al-Raya newspaper by Hasnah. It contains
about 187 articles and 219,978 words, more than
30,096 of which are unique (Hasnah,
1996).
  Arabic Modern Standard Corpus: collected
from newspaper articles from various Arab
countries by Abdelali. It covers 102,134
articles with about 113 million words
(Abdelali, 2005).
  University of Jordan Arabic Corpus: collected
from 15 Arabic newspapers and other
resources from 19 Arab countries by
researchers from the University of Jordan. It is
tagged in XML, and contains 61,037 articles
with 7,522,941 words, over 707,385 of
which are unique (Hammo, 2013).
3.2  Commercially Available Arabic Corpora
The five monolingual text and annotated corpora
listed below are commercially available Arabic
corpora that cover the news domain.
  LDC Corpus (Arabic Newswire): collected
from the articles of the Agence France-Presse
newswire published between 1994 and 2000
by Graff and Walker at the University of
Pennsylvania’s LDC. It covers more than 76
million words, 666,094 of which are unique,
distributed into 383,872 files (Graff, 2001).
  An-Nahar Newspaper Text Corpus: collected
from the An-Nahar newspaper from 1995 to 2000
and stored as HyperText Markup Language
(HTML) files. It covers about 4,500
articles and 24 million words (ELRA, 2001).
  Al-Hayat Arabic Corpus: collected from the
Al-Hayat Arabic newspaper. It contains 42,591
articles, distributed into seven categories
(general, car, computer, news, economics,
science and sport), and covers
18,639,264 unique words
(University Essex, 2001).
  Nemlar Corpus: collected from 13 different
categories (political news, Islamic text,
phrases of common words, broadcast news,
business, Arabic literature, general news,
interviews, scientific press, sports press,
dictionary entries explanation and legal
domain text) by the Nemlar project. It is provided
in four versions (raw, fully vowelized, with
Arabic lexical analysis, and with Arabic
POS tags) and covers more than 500,000 words
(ALP team, 2003).
  Arabic Gigaword Corpus: collected from four
distinct Arabic newswire sources (Agence
France-Presse, Al-Hayat, An-Nahar and Xinhua
News Agency) by Graff. It is encoded in UTF-8
and marked up in SGML, and covers about
1,256,719 articles with 391,619 words (Graff,
2003).