Arabic Corpus Enhancement using a New Lexicon/Stemming Algorithm

Ashraf AbdelRaouf, Colin A. Higgins, Tony Pridmore, Mahmoud I. Khalil

2013

Abstract

Optical Character Recognition (OCR) is an important technology with many advantages for storing the information in both old and new documents. Compared with Roman scripts, the Arabic language lacks both a variety of OCR systems and depth of research. An authoritative corpus is beneficial in the design and construction of any OCR system, and lexicon and stemming tools are essential for enhancing corpus retrieval and performance in an OCR context. A new lexicon/stemming algorithm is presented, based on the Viterbi path method and using a light stemmer approach. Lexicon and stemming lookups are combined to obtain a list of alternatives for uncertain words. The list is built by removing affixes (prefixes or suffixes) if the word has any; otherwise affixes are added. Finally, every word in the list of alternatives is verified by searching the original corpus. The lexicon/stemming algorithm also ensures the continuous updating of the contents of the corpus presented by (AbdelRaouf et al., 2010), which copes with the innovative needs of Arabic OCR research.
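
The alternative-generation step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the affix lists, the minimum stem length, and the function names (`strip_affixes`, `add_affixes`, `alternatives`) are assumptions made for illustration, and the Viterbi path component is omitted entirely. The sketch only shows the stated rule: remove affixes if the word has any, otherwise add them, then keep only the candidates found in the corpus.

```python
# Illustrative affix lists (assumption): a small subset in the spirit of
# Arabic light stemmers, not the lists used in the paper.
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "و")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي")
MIN_STEM_LEN = 2  # assumption: discard stems shorter than two letters


def strip_affixes(word):
    """Return candidates made by removing a known prefix and/or suffix."""
    prefix_options = [""] + [p for p in PREFIXES if word.startswith(p)]
    suffix_options = [""] + [s for s in SUFFIXES if word.endswith(s)]
    candidates = set()
    for p in prefix_options:
        for s in suffix_options:
            if not (p or s):
                continue  # nothing removed, so this is not a new candidate
            stem = word[len(p):len(word) - len(s)]
            if len(stem) >= MIN_STEM_LEN:
                candidates.add(stem)
    return candidates


def add_affixes(word):
    """Return candidates made by attaching one known prefix or suffix."""
    candidates = {p + word for p in PREFIXES}
    candidates.update(word + s for s in SUFFIXES)
    return candidates


def alternatives(word, corpus_words):
    """List corpus-verified alternatives for an uncertain word."""
    stripped = strip_affixes(word)
    # Remove affixes if the word has any; otherwise try adding them.
    candidates = stripped if stripped else add_affixes(word)
    # Final step: keep only candidates that occur in the corpus.
    return sorted(c for c in candidates if c in corpus_words)


if __name__ == "__main__":
    corpus = {"كتاب", "الكتاب"}  # toy corpus (assumption)
    print(alternatives("الكتاب", corpus))
    print(alternatives("كتاب", corpus))
```

With the toy corpus above, a word carrying the definite article is reduced to its bare form, while a bare word is expanded with candidate affixes; in both cases only forms present in the corpus survive the verification step.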

References

  1. AbdelRaouf, A., C. Higgins and M. Khalil (2008). A Database for Arabic printed character recognition. The International Conference on Image Analysis and Recognition-ICIAR2008, Póvoa de Varzim, Portugal, Springer Lecture Notes in Computer Science (LNCS) series.
  2. AbdelRaouf, A., C. Higgins, T. Pridmore and M. Khalil (2010). "Building a Multi-Modal Arabic Corpus (MMAC)." The International Journal of Document Analysis and Recognition (IJDAR) 13(4): 285-302.
  3. Al-Kharashi, I. A. and M. W. Evens (1994). "Comparing words, stems, and roots as index terms in an Arabic Information Retrieval System." Journal of the American Society for Information Science 45(8): 548-560.
  4. Al-Shalabi, R. and M. Evens (1998). A Computational Morphology System for Arabic. Workshop on Computational Approaches to Semitic Languages COLING-ACL98, Montreal.
  5. Al-Shalabi, R. and G. Kanaan (2004). "Constructing An Automatic Lexicon for Arabic Language." International Journal of Computing & Information Sciences 2(2): 114-128.
  6. Aljlayl, M. and O. Frieder (2002). On Arabic search: improving the retrieval effectiveness via a light stemming approach. The Eleventh International Conference on Information and knowledge Management, McLean, Virginia, USA, Conference on Information and Knowledge Management archive.
  7. The British National Corpus (2007). The British National Corpus (XML Edition).
  8. Jomma, H. D., M. A. Ismail and M. I. El-Adawy (2006). An Efficient Arabic Morphology Analysis Algorithm. The Sixth Conference on Language Engineering, Cairo, Egypt, The Egyptian Society of Language Engineering.
  9. Kucera, H. and W. N. Francis (1967). "Computational Analysis of Present-Day American English." International Journal of American Linguistics 35(1): 71-75.
  10. Larkey, L. S., L. Ballesteros and M. E. Connell (2002). Improving stemming for Arabic information retrieval: Light stemming and co-occurrence analysis. 25th International Conference on Research and Development in Information Retrieval (SIGIR).
  11. Maynard, D., V. Tablan, H. Cunningham, C. Ursu, H. Saggion, K. Bontcheva and Y. Wilks (2002). "Architectural Elements of Language Engineering Robustness." Journal of Natural Language Engineering - Special Issue on Robust Methods in Analysis of Natural Language Data: 1-20.
  12. Rogati, M., S. McCarley and Y. Yang (2003). Unsupervised learning of Arabic stemming using a parallel corpus. The 41st Annual Meeting of the Association for Computational Linguistics (ACL), Sapporo, Japan.
  13. Time. (2008). "Time Archive 1923 to present." from http://www.time.com/time/archive/.


Paper Citation


in Harvard Style

AbdelRaouf A., Higgins C. A., Pridmore T. and Khalil M. I. (2013). Arabic Corpus Enhancement using a New Lexicon/Stemming Algorithm. In Proceedings of the 2nd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-8565-41-9, pages 435-440. DOI: 10.5220/0004260704350440


in Bibtex Style

@conference{icpram13,
author={Ashraf AbdelRaouf and Colin A. Higgins and Tony Pridmore and Mahmoud I. Khalil},
title={Arabic Corpus Enhancement using a New Lexicon/Stemming Algorithm},
booktitle={Proceedings of the 2nd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2013},
pages={435-440},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004260704350440},
isbn={978-989-8565-41-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 2nd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - Arabic Corpus Enhancement using a New Lexicon/Stemming Algorithm
SN - 978-989-8565-41-9
AU - AbdelRaouf, A.
AU - Higgins, C. A.
AU - Pridmore, T.
AU - Khalil, M. I.
PY - 2013
SP - 435
EP - 440
DO - 10.5220/0004260704350440