Authors:
Imen Ben Cheikh and Zeineb Zouaoui
Affiliation:
LaTICE Research lab, University of Tunis and ESSTT, Tunisia
Keyword(s):
Natural Language Processing, Arabic Writing Recognition, Large Vocabulary, Hidden Markov Models, Canonical Vocabulary, Linguistic Properties, Viterbi Algorithm.
Related Ontology Subjects/Areas/Topics:
Applications; Artificial Intelligence; Classification; Information Retrieval and Learning; Knowledge Engineering and Ontology Development; Knowledge-Based Systems; Natural Language Processing; Pattern Recognition; Stochastic Methods; Symbolic Systems; Theory and Methods
Abstract:
The complexity of the recognition process depends strongly on the language, the type of writing, and the vocabulary size. Our work contributes to a system for recognizing a large canonical Arabic vocabulary of decomposable words derived from tri-consonantal roots. The system relies on three collaborating morphological classifiers, specialized in recognizing roots, schemes, and conjugations, respectively. Our work deals with the first of these: we propose a root classifier based on 101 Hidden Markov Models, used to classify 101 tri-consonantal roots. All models share the same architecture, which embeds Arabic linguistic knowledge. The proposed system currently handles a vocabulary of 5757 words. It was trained and then tested on a total of more than 17000 samples of printed words. The results are satisfactory: the top-2 recognition rate reached 96%.
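The general idea of ranking candidate roots with a bank of HMMs and the Viterbi algorithm can be sketched as follows. This is only a minimal illustration, not the authors' implementation: it assumes one discrete-observation HMM per root, and the abstract does not specify the features, the number of states, or the model parameters. The names `viterbi_log_score`, `top_k_roots`, and the three toy models `root_A`, `root_B`, `root_C` are hypothetical.

```python
import math

NEG_INF = float("-inf")


def _log(p):
    # Safe logarithm: zero probability maps to -inf instead of raising.
    return math.log(p) if p > 0 else NEG_INF


def viterbi_log_score(obs, start, trans, emit):
    """Log-probability of the single best state path (Viterbi) for a discrete HMM.

    obs   : list of observation symbol indices
    start : start[s]    = P(first state is s)
    trans : trans[p][s] = P(next state is s | current state is p)
    emit  : emit[s][o]  = P(symbol o | state s)
    """
    n = len(start)
    delta = [_log(start[s]) + _log(emit[s][obs[0]]) for s in range(n)]
    for o in obs[1:]:
        delta = [
            max(delta[p] + _log(trans[p][s]) for p in range(n)) + _log(emit[s][o])
            for s in range(n)
        ]
    return max(delta)


def top_k_roots(obs, models, k=2):
    """Rank root models by Viterbi score and return the k best root names."""
    ranked = sorted(
        models,
        key=lambda name: viterbi_log_score(obs, *models[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy stand-ins for the root models (assumed values, two states, two symbols).
# Every model shares the same left-to-right architecture; only emissions differ.
START = [1.0, 0.0]
TRANS = [[0.5, 0.5], [0.0, 1.0]]
models = {
    "root_A": (START, TRANS, [[0.9, 0.1], [0.1, 0.9]]),
    "root_B": (START, TRANS, [[0.1, 0.9], [0.9, 0.1]]),
    "root_C": (START, TRANS, [[0.5, 0.5], [0.5, 0.5]]),
}

observation = [0, 0, 1, 1]
best_two = top_k_roots(observation, models, k=2)
```

Under this kind of scheme, a top-2 recognition rate would count a sample as correct when the true root appears among the two highest-scoring models.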