Unsupervised Grammatical Pattern Discovery from Arabic Extra Large Corpora

Adelle Abdallah, Hussein Awdeh, Youssef Zaki, Gilles Bernard, Mohammad Hajjar

2021

Abstract

Many methods have been applied to the automatic construction or expansion of lexical semantic resources. Most follow the distributional hypothesis applied to the lexical context of words and discard the grammatical context (stopwords). This paper shows that the grammatical context can yield information about the semantic properties of words, provided the corpus is large enough. To this end, we present an unsupervised pattern-based model, devised for resource-poor languages, that builds semantic word categories from large corpora. We divide the vocabulary into high-frequency and lower-frequency items and explore the patterns formed by high-frequency items in the neighborhood of lower-frequency words. Word categories are then created by clustering. The model is applied to a very large Arabic corpus and, for comparison, to a large English corpus; results are evaluated with direct and indirect evaluation methods. We compare the results with state-of-the-art lexical models in terms of performance and computation time.
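The pipeline outlined in the abstract (frequency split, patterns of high-frequency items around lower-frequency words, clustering) can be illustrated with a minimal Python sketch. It is not the authors' model: the count threshold `min_high_count`, the one-neighbor-on-each-side pattern shape, the cosine measure, the greedy clustering, and the toy English corpus are all assumptions chosen for brevity.

```python
from collections import Counter, defaultdict
from math import sqrt


def pattern_profiles(sentences, min_high_count):
    """Return (high-frequency set, {word: Counter of (left, right) patterns}).

    Assumption: a word is "high-frequency" if it occurs at least
    min_high_count times; a pattern is the pair of immediate neighbors,
    with "*" standing in for any neighbor that is not high-frequency.
    """
    freq = Counter(tok for sent in sentences for tok in sent)
    high = {w for w, c in freq.items() if c >= min_high_count}
    profiles = defaultdict(Counter)
    for sent in sentences:
        for i, tok in enumerate(sent):
            if tok in high:
                continue  # only lower-frequency words get a profile
            left = sent[i - 1] if i > 0 and sent[i - 1] in high else "*"
            right = sent[i + 1] if i + 1 < len(sent) and sent[i + 1] in high else "*"
            if (left, right) != ("*", "*"):
                profiles[tok][(left, right)] += 1
    return high, dict(profiles)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def greedy_clusters(profiles, threshold):
    """Single-pass clustering (an assumption, not the paper's algorithm):
    a word joins the first cluster whose first member is similar enough."""
    clusters = []
    for word in sorted(profiles):
        for members in clusters:
            if cosine(profiles[word], profiles[members[0]]) >= threshold:
                members.append(word)
                break
        else:
            clusters.append([word])
    return clusters


# Toy corpus, invented for illustration only.
corpus = [s.split() for s in (
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat ran to the door",
    "a dog ran to the gate",
    "the bird sat on a branch",
    "a bird ran to a tree",
)]
high, profiles = pattern_profiles(corpus, min_high_count=4)
clusters = greedy_clusters(profiles, threshold=0.9)
print(sorted(high))
print(clusters)
```

On this toy input, words that share the same surrounding high-frequency items end up with near-identical profiles, which is the intuition the abstract relies on. A realistic run would need a far larger corpus, and the Arabic setting would additionally need tokenization suited to Arabic, which this sketch does not attempt.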

Paper Citation


in Harvard Style

Abdallah A., Awdeh H., Zaki Y., Bernard G. and Hajjar M. (2021). Unsupervised Grammatical Pattern Discovery from Arabic Extra Large Corpora. In Proceedings of the 13th International Joint Conference on Computational Intelligence (IJCCI 2021) - Volume 1: NCTA; ISBN 978-989-758-534-0, SciTePress, pages 211-220. DOI: 10.5220/0010651700003063


in Bibtex Style

@conference{ncta21,
author={Adelle Abdallah and Hussein Awdeh and Youssef Zaki and Gilles Bernard and Mohammad Hajjar},
title={Unsupervised Grammatical Pattern Discovery from Arabic Extra Large Corpora},
booktitle={Proceedings of the 13th International Joint Conference on Computational Intelligence (IJCCI 2021) - Volume 1: NCTA},
year={2021},
pages={211-220},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0010651700003063},
isbn={978-989-758-534-0},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 13th International Joint Conference on Computational Intelligence (IJCCI 2021) - Volume 1: NCTA
TI - Unsupervised Grammatical Pattern Discovery from Arabic Extra Large Corpora
SN - 978-989-758-534-0
AU - Abdallah A.
AU - Awdeh H.
AU - Zaki Y.
AU - Bernard G.
AU - Hajjar M.
PY - 2021
SP - 211
EP - 220
DO - 10.5220/0010651700003063
PB - SciTePress