Unsupervised Grammatical Pattern Discovery from Arabic Extra Large Corpora

Adelle Abdallah 1,a, Hussein Awdeh 1,b, Youssef Zaki 1, Gilles Bernard 1,c and Mohammad Hajjar 2

1 LIASD Lab, Paris 8 University, 2 rue de la Liberté, 93526 Saint-Denis Cedex, France
2 Faculty of Technology, Lebanese University, Hisbeh Street, Saida, Lebanon
a https://orcid.org/0000-0001-5837-8688, b https://orcid.org/0000-0002-2805-4444, c https://orcid.org/0000-0002-4587-4209
Keywords:
Arabic Language, Arabic Natural Language Processing, Validation, Information Retrieval, Silver Standard Corpus.
Abstract:
Many methods have been applied to the automatic construction or expansion of lexical semantic resources. Most follow the distributional hypothesis applied to the lexical context of words, eliminating grammatical context (stopwords). This paper shows that the grammatical context can yield information about the semantic properties of words, provided the corpus is large enough. To this end, we present an unsupervised pattern-based model that builds semantic word categories from large corpora, devised for resource-poor languages. We divide the vocabulary between high-frequency and lower-frequency items, and explore the patterns formed by high-frequency items in the neighborhood of lower-frequency words. Word categories are then created by clustering. This is done on a very large Arabic corpus and, for comparison, on a large English corpus; results are evaluated with direct and indirect evaluation methods. We compare the results with state-of-the-art lexical models for performance and computation time.
1 INTRODUCTION
Lexical semantic resources are essential for a wide range of Natural Language Processing (NLP) tasks. Such resources are usually built automatically from large corpora by deducing semantic similarities between linguistic items from their distribution. The distributional hypothesis says that "words that are utilized and happen in the same contexts tend to purport similar meanings" (Harris, 1954), in other words "a word is characterized by the company it keeps" (Firth, 1957).
This hypothesis has been applied to measuring semantic similarity, to word clustering, to the automated creation of thesauri and bilingual dictionaries, etc. But nearly every work has restricted the context to lexical items (hence the use of stopword lists). This restriction is not part of the initial hypothesis, as seen in the work of Harris' disciples on the links between lexicon and grammar (Gross, 1994).
We try here a new approach, which is to identify
the semantic properties that can be discovered from
the grammatical patterns a word can be used in. The idea is to build a word vector based on grammatical patterns and then compare it to vectors produced by state-of-the-art methods based on lexical context. This approach, inspired by previous works (Bernard, 1997; Lebboss et al., 2017), is especially aimed at resource-poor languages such as Arabic, as it is mostly built from knowledge extracted from big corpora.
Detection of grammatical patterns without any knowledge of the language, not even stopword lists, is done by dividing the corpus vocabulary between high-frequency items and low-frequency items. We then compute all possible patterns of high-frequency items in the neighborhood of a word, considered as features of this word. The resulting vector records the number of occurrences of each feature.
Evaluation of our results will be twofold: on one side, we will compare the semantic properties deduced from grammatical patterns with those deduced by state-of-the-art methods from lexical data; on the other side, we will compare the results obtained from Arabic corpora with those obtained from an English corpus. We will use the few semantic gold standards available for Arabic and English, and apply an indirect evaluation in vector clustering, using WordNets for both languages.
Section 2 gives a short survey of related works. Section 3 describes our method. Section 4 presents our results, before we conclude.
2 RELATED WORKS
Unsupervised methods for extracting semantic knowledge from corpora are usually divided into two types: statistical and embedding methods. According to the distributional hypothesis, in both types of methods a context must be defined, and each word is characterized by the frequency of its occurrence in each of these contexts. The contexts are usually defined as neighboring words in some window. The window can be a document, a paragraph, a sentence, or a fixed-length window.
Most of these methods follow a bag-of-words approach, where the order of items is not taken into account (HAL (Lund et al., 1995) is among the rare exceptions); nearly all also follow a lexical context approach, where functional words are not taken into account.
The statistical methods, whose origins date back to (Salton, 1971), are still very active. The principle of these methods is simply to produce a vector where each column contains the number of occurrences of some context in the vicinity of the word. With a high number of contexts, as in big corpora, this produces a huge sparse vector, with a high computational and memory cost, so these methods use dimensionality reduction techniques such as singular value decomposition or other kinds of matrix factorization, random mapping, etc.
These vectors are generated by various algorithms
such as Latent Dirichlet Analysis (Blei et al., 2003),
Probabilistic HAL (Azzopardi et al., 2005), Gener-
alized Vector Space (Wong et al., 1985), Latent Se-
mantic Analysis (Dumais, 2004), Rocchio Classifica-
tion (Schutze et al., 2008), Random Indexing (Kan-
erva et al., 2000). Some methods enhance the vectors by pre- or post-processing techniques (Turney and Pantel, 2010; Erk, 2012; Clark, 2012), for instance weighting values with TF-IDF or PPMI.
To overcome the sparsity problem, word embed-
ders have been developed. They represent a word
in a dense low-dimensional vector using predictive
models, usually neural network ones (Bengio et al.,
2003; Collobert and Weston, 2008; Mnih and Hin-
ton, 2008; Collobert et al., 2011; Dhillon et al., 2011;
Mikolov et al., 2013a; Mnih and Kavukcuoglu, 2013;
Collobert, 2014; Pennington et al., 2014). The prin-
ciple here is to learn a weight matrix able to predict
some neighboring word(s) of a given word. Components of a word vector do not directly represent the number of occurrences of some context, even if the number of times each context occurs influences the learning model.
Besides capturing semantic information, word embeddings have been successfully employed in several downstream NLP applications such as named-entity recognition, semantic role labeling, sentiment analysis, machine translation and dependency parsing.
Despite their empirical success, recent papers shed light on the limitations of word embedders: (Levy et al., 2015b) targets their limitations in extracting hyponymy and entailment relations, while (Rubinstein et al., 2015) points at their failure in capturing attributive properties. (Levy et al., 2015a) shows that the advantage of word-embedding methods over the count-based methods (Baroni et al., 2014) can be overridden by a good choice of hyper-parameters, which have more influence on the results than the chosen method. On the other hand, (Levy et al., 2015b) unveils that word-embedding methods outperform the count-based models in diverse semantic tasks, such as word association, synonym detection and word clustering. (Mikolov et al., 2013b) use the offset method to solve analogical questions such as "man is to king as woman is to. . . ?" by addition of vectors.
Whatever the method, the bag-of-words approach disregards grammar and word order and yields very little information on the syntactico-semantic relations between words. To overcome this limitation, some solutions have been recommended. The first proposes injecting lexico-syntactic knowledge, such as dictionary, ontology or thesaurus information, into the embeddings, either in the objective function (Kiela et al., 2015; Lazaridou et al., 2015; Liu et al., 2015) or at the post-processing step (Faruqui et al., 2014; Mrkšić et al., 2016).
Others suggest a lexical pattern-based approach, replacing the bag-of-words contexts with other context types that take syntactic items into account. These methods were pioneered by (Hearst, 1992; Hearst, 1998), who manually coded patterns to capture hyponymy relations. Hand-crafted patterns were used for many applications, such as extracting hyperonymy/hyponymy (Rydin, 2002; Tjong Kim Sang and Hofmann, 2007), meronymy (Berland and Charniak, 1999), antonymy (Lin et al., 2003), and noun categories (Widdows and Dorow, 2002). Some methods based on Hearst patterns implement weakly supervised bootstrap techniques (Riloff and Shepherd, 1997), while others refined these techniques using syntactic structure (Roark and Charniak, 1998; Widdows and Dorow, 2002; Phillips and Riloff, 2002; Pantel and Ravichan-
dran, 2004; Tanev and Magnini, 2006). Other
attempts use lexical-syntactic contextual patterns
(Riloff et al., 1999; Thelen and Riloff, 2002), one
even uses only grammatical information (Bernard,
1997).
(Caraballo, 1999) was the first to use conjunctions and appositive features to build a hypernym-labeled noun hierarchy similar to WordNet. (Alfonseca and Manandhar, 2002) use Hearst patterns in an unsupervised manner to extend an ontology and compare it with WordNet. In this sense, Hearst patterns were applied to increase recall in Information Extraction systems such as KNOWITALL (Etzioni et al., 2004; Ritter et al.,
2009), to extract web-based semantic relations (Pasca,
2004), and in various other studies (Sang, 2007; Som-
batsrisomboon et al., 2003; Sumida and Torisawa,
2008).
Other works take into account the dependency re-
lations (Lin, 1998; Padó and Lapata, 2007; Murphy
et al., 2012; Levy and Goldberg, 2014). Some even
use clusters of lexical-syntactic patterns in which the
word occurs as context to represent it (Baroni et al.,
2010). (Bollegala et al., 2015) replaces bag-of-words
contexts with various patterns (lexical, POS and de-
pendency). Another work (Yatbaz et al., 2012) uses
paradigmatic representations of word context by re-
placing the bag-of-word contexts with substitute vec-
tors, which include the potential words that could re-
place the target word given its neighboring words.
All these methods make use of knowledge about the language. Alternatively, flexible approaches aim to extract patterns in an unsupervised manner. Such methods have been used in various NLP tasks such as constructing noun categories (Davidov and Rappoport, 2006), analogy question answering (Biçici and Yuret, 2006), extracting semantic relationships (Turney, 2008a; Bollegala et al., 2009; Davidov et al., 2007), detecting synonyms (Turney, 2008b), disambiguation of nominal compound relations (Davidov and Rappoport, 2008), sentiment analysis (Davidov et al., 2010) and detection of sarcasm (Tsur et al., 2010) or irony with probabilistic LDA (Nozza et al., 2016).
In a similar way, (Lebboss et al., 2017; Lebboss
et al., 2019), building on (Bernard, 1997), cluster
word vectors based on grammatical patterns. Oth-
ers use statistical graph-based approaches for auto-
matic extraction of semantic relations (Vossen, 1998;
Widdows and Dorow, 2002; Shinzato and Torisawa,
2004).
3 OUR METHOD
In order to assess the quality of the semantic information that can be extracted from automatically constructed grammatical patterns, the flexible approach is the most convenient. Very few tests have been made on languages other than English, and only one (Lebboss et al., 2017) has been done on Arabic.
In this last work, the corpus vocabulary was cut into two parts, high-frequency elements and lower-frequency elements. Then patterns were generated from the high-frequency elements in the vicinity of words, forming vectors to be clustered and compared for evaluation with Arabic WordNet. There are drawbacks: the method explores only some of the possible patterns; it cannot work on a really big corpus; the evaluation was partial and only done on Arabic data. But on the other side, the method was not language-dependent, was based on grammatical information (in a big enough corpus, high-frequency words are, as is well known, mostly functional ones), and had very few parameters, relatively easy to tune.
The method presented here corrects all these drawbacks, adding different testing protocols (every protocol available for Arabic), with a much larger corpus (one billion instead of six million words), while reducing the computation time and minimizing the occupied space.
This is done in a modular and multi-threaded framework (https://gitlab.com/Data-Liasd/Interface), used by various projects of our laboratory since 2014, which collects tools for extracting language structure from extra-large corpora. It is mainly written in C++, with parts in Python and Java, and relies on the Qt5 library, a PostgreSQL database and CMake. Figure 1 presents the general operation of our system; its three main stages are presented in the next subsections.
3.1 Word Trie Preparation
A Trie, or Prefix Tree, is one of the best data structures for word indexing. It stores words as nodes; each node has children and siblings, and contains one Unicode letter. Every common prefix is represented by one node; a node branches off when the letters diverge from the other prefixes in the Trie. A word is represented by concatenating the characters from the root node to its end node. The structure is compact as it stores each common prefix only once. Unlike hash tables or binary trees, where the search time of a word depends on the number of stored elements, search time here is proportional to the length of the word, which is much smaller than the number of words.
Figure 1: Our system.
It is also more efficient than hashing, since it requires neither computing a hash function nor handling collisions. The only drawback is that it can consume a lot of space if the words are heterogeneous, with few common prefixes (which is not the case here).
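For illustration, here is a minimal word-trie sketch in Python (the class and field names are ours, not those of the C++ framework described above); it shows that insertion and lookup cost depends on the length of the word rather than on the vocabulary size, and that occurrence counts can be accumulated at the terminal node:

```python
class TrieNode:
    """One Unicode character per node; children share common prefixes."""
    __slots__ = ("children", "count", "word_id")

    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.count = 0       # occurrences of the word ending at this node
        self.word_id = None  # assigned later, in frequency order

class WordTrie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, word):
        """Insert a word (or count one more occurrence); O(len(word))."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1
        return node

    def lookup(self, word):
        """Return the occurrence count of a word (0 if absent); O(len(word))."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return 0
        return node.count
```

Frequency-based word identifiers (mentioned below) can then be obtained by sorting the terminal nodes by decreasing count.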
We store the Trie in two ways: globally as a JSON file, or node by node in the database. It is worth noting that the word-trie is built before applying any preprocessing, in order to generate various tests without having to re-read the corpus. For reasons that will appear later, word identifiers are frequency-based.
Word preprocessing integrates the Khoja stemmer (Khoja and Garside, 1999), in the Java implementation by Motaz Saad (https://github.com/motazsaad/khoja-stemmer-command-line), and the Arabic IBM normalization rules implemented by the Stanford NLP group (https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/trees/international/arabic/ArabicUtils.java), among others. In the normalization step, we also replace all numbers by '1' and all words in foreign scripts by '0', in order to keep track of these as potential context. After every preprocessing step, a mapping to the initial word-trie is stored in the database.
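A sketch of the number and foreign-script normalization described above; the regular expressions are our simplification (the real pipeline also applies the Khoja stemmer and the IBM normalization rules), and the Arabic-block test only illustrates the idea for the Arabic corpora:

```python
import re

NUMBER = re.compile(r"^\d+([.,]\d+)?$")     # also matches Arabic-Indic digits
ARABIC = re.compile(r"^[\u0600-\u06FF]+$")  # basic Arabic Unicode block (simplified)

def normalize_token(token):
    """Replace numbers by '1' and foreign-script tokens by '0',
    so that they remain available as potential context."""
    if NUMBER.match(token):
        return "1"
    if not ARABIC.match(token):
        return "0"
    return token

# e.g. normalize_token("2019") -> "1", normalize_token("Wikipedia") -> "0"
```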
3.2 Phrase Segmentation Preparation
Before building patterns, we first segment the corpus into small textual units that will serve as contexts. The segments are delimited by punctuation (including carriage return and line feed). The type of punctuation used for boundaries has an impact on the pattern results. For Arabic, cutting at paired punctuation marks (parentheses, brackets) or at quotes gave us the lowest-quality results and induced bad grammar rules. For
instance, prepositions were attached to the preceding word and not to the following one. So we removed paired punctuation marks and quotes from the list of delimiters, and obtained much better results.
Segments are stored as arrays of preprocessed word identifiers (the Trie structure is not convenient here), with file swapping for multithreading. We then discard the segments that would not be useful for context determination: those reduced to one word, or containing only digits, only foreign words, or both.
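A sketch of the segmentation and cleaning steps, under our simplifying assumptions (the exact delimiter set is not listed in full above, whitespace tokenization is assumed, and tokens are supposed to be already normalized as in the previous sketch); note that paired punctuation and quotes are deliberately absent from the delimiters:

```python
import re

# Boundary punctuation, Arabic and Latin, plus line breaks; parentheses,
# brackets and quotes are intentionally NOT used as segment boundaries.
DELIMITERS = re.compile(r"[.,!?;:،؛؟\n\r]+")

def segment(text):
    """Split raw text into candidate segments (lists of tokens)."""
    return [part.split() for part in DELIMITERS.split(text) if part.strip()]

def clean(segments, word_id):
    """Discard segments useless for context determination: single-token
    segments and segments made only of '1' (digits) and/or '0' (foreign)."""
    kept = []
    for seg in segments:
        if len(seg) < 2 or all(tok in ("0", "1") for tok in seg):
            continue
        kept.append([word_id[tok] for tok in seg])  # store word identifiers
    return kept
```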
3.3 Pattern Discovery
The first step here is to split the vocabulary between high-frequency and lower-frequency items. This is done by selecting a threshold, the first parameter of our system; though we have set it by hand, this could be automated by looking at the Zipf curve. We term the high-frequency items "markers"; most of them are grammatical words; among the rare exceptions, "ibn" (literally "son of") marks proper nouns, and "he said" marks a hadith (as "once upon a time" marks a tale in English). Lower-frequency items will simply be called words. Testing whether a form is a marker is done by comparing the form identifier with the identifier of the last defined marker. If it is greater, it is a marker, otherwise it is a word. We have two basic grammar rules:
1. A pattern contains at least one word,
2. A pattern contains at most JL consecutive words.
The JL parameter (for JokerLength) mentioned in the last rule is our second parameter; it is the maximum number of consecutive words that can be attached to a pattern. We define a joker (a wildcard) as a sequence of words whose length is between 1 and JL. A pattern is then a sequence of markers, written m_i in the following examples, and jokers. Patterns are most appropriately stored in a Trie; every node in the pattern-trie contains (a minimal sketch of such a node is given after this list):
- a label (the marker or joker identifier),
- the number of attested outputs,
- a map from word identifiers to their numbers of occurrences in the pattern.
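A minimal Python sketch of the pattern-trie node just described (the field names are ours; the actual implementation is part of the C++ framework):

```python
from collections import Counter

class PatternNode:
    """One element of a pattern: a marker identifier or a joker label."""
    __slots__ = ("label", "children", "outputs", "words")

    def __init__(self, label=None):
        self.label = label     # marker id, or a joker label such as ("J", length)
        self.children = {}     # next pattern element -> PatternNode
        self.outputs = 0       # number of attested outputs of the pattern ending here
        self.words = Counter() # word id -> occurrences inside this pattern

    def child(self, label):
        """Follow (or create) the branch for the next pattern element."""
        return self.children.setdefault(label, PatternNode(label))
```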
The second step is to cut the segments into pieces that correspond to patterns of maximal length (according to rules 1 and 2 above). For instance, if words are written x, the segment x m_1 x x x_i x x m_2 x, with JL = 3, must be cut in two: x m_1 x x x_i and x_i x x m_2 x (note that x_i belongs to both parts).
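The cutting step can be sketched as follows; the one-word overlap at each cut reproduces the example above, and the way much longer word runs are handled is our assumption:

```python
def cut_segment(tokens, is_marker, jl):
    """Cut a segment into maximal pieces containing at most `jl` consecutive
    words (rule 2). Pieces overlap on one word at each cut, as in the example:
    [x, m1, x, x, xi, x, x, m2, x] with jl = 3 gives
    [x, m1, x, x, xi] and [xi, x, x, m2, x]."""
    pieces, start, run = [], 0, 0
    for i, tok in enumerate(tokens):
        if is_marker(tok):
            run = 0
            continue
        run += 1
        if run > jl:                        # the (jl+1)-th consecutive word
            pieces.append(tokens[start:i])  # close the piece before it
            start = i - 1                   # reopen at the previous word (overlap)
            run = 2                         # that word plus the current one
    pieces.append(tokens[start:])
    # rule 1: a piece must contain at least one word
    return [p for p in pieces if any(not is_marker(t) for t in p)]
```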
The third step is to generate, for each part of a segment, every pattern that is compatible with it (this third step did not exist in (Lebboss et al., 2017)), to
store the pattern in the pattern-trie, and to store in the
map of the pattern every word that has occurred in it,
updating its number of occurrences. The pattern-trie
is constructed in multithreaded master-slave mode.
We keep only the patterns that appear in the corpus more often than some frequency threshold: this threshold is our last parameter.
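The bookkeeping of this third step can be sketched with a plain dictionary standing in for the pattern-trie. For brevity, the sketch below derives only the single fully abstracted pattern of a piece (each maximal run of words becomes one joker), whereas the actual system enumerates every compatible pattern; the frequency threshold just mentioned is applied in prune:

```python
from collections import Counter, defaultdict

def abstract(piece, is_marker):
    """Turn a piece into (pattern, words): markers are kept, each maximal
    run of words is replaced by a joker labelled with the run length."""
    pattern, words, run = [], [], 0
    for tok in piece:
        if is_marker(tok):
            if run:
                pattern.append(("J", run))
                run = 0
            pattern.append(tok)
        else:
            words.append(tok)
            run += 1
    if run:
        pattern.append(("J", run))
    return tuple(pattern), words

def count_patterns(pieces, is_marker):
    """Accumulate pattern occurrences and the words seen in their jokers."""
    occurrences, words_per_pattern = Counter(), defaultdict(Counter)
    for piece in pieces:
        pattern, words = abstract(piece, is_marker)
        occurrences[pattern] += 1
        words_per_pattern[pattern].update(words)
    return occurrences, words_per_pattern

def prune(occurrences, words_per_pattern, min_freq):
    """Keep only patterns attested more than `min_freq` times (our last parameter)."""
    return {p: words_per_pattern[p]
            for p, n in occurrences.items() if n > min_freq}
```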
The last step is to generate the words × patterns matrix. We developed three methods to generate the matrix, depending on the corpus size, in order to optimize performance. The matrix is stored in a compressed format: only the non-zero components are recorded, each with its word identifier, pattern identifier and number of occurrences (a sketch of this compressed representation is given after the list below).
1. For small corpora: after merging the pattern-tries, we parse the merged trie and build the words × patterns matrix (as a map) during the parse.
2. For medium corpora (> 70 million segments, 6 million unique words): after merging the pattern-tries, each component of the matrix is stored in a table, and SQL queries are then used to group all patterns for each word.
3. For large corpora: at the end of each local Trie, the partial components of the matrix are stored in a table. Since the same pattern may have received different identifiers, we merge these identifiers in the database (by SQL queries) to obtain the components of the matrix; then we proceed as for medium corpora.
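A sketch of the compressed words × patterns matrix, assuming the non-zero components have already been collected as (word id, pattern id, count) triples; scipy's COO/CSR formats stand in here for the database-backed storage described above:

```python
import numpy as np
from scipy.sparse import coo_matrix

def build_matrix(triples, n_words, n_patterns):
    """Build the sparse words x patterns matrix from non-zero components.
    triples: iterable of (word_id, pattern_id, count)."""
    rows, cols, data = zip(*triples)
    matrix = coo_matrix((np.asarray(data, dtype=np.float64),
                         (np.asarray(rows), np.asarray(cols))),
                        shape=(n_words, n_patterns))
    return matrix.tocsr()  # CSR gives fast access to each row (word vector)

# e.g. build_matrix([(0, 3, 7), (2, 3, 1)], n_words=5, n_patterns=1500)
```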
3.4 Evaluation
We conduct two types of evaluation. The first one is based on the few semantic similarity gold standards available for Arabic (RG65, MC30 and WordSim353): dot products between the vectors of the words included in the gold standard are computed (we also tried Euclidean distance, but the dot product gives better results); then correlations (Pearson, Spearman) between these and the human scores are computed. Both correlations give in general close results, so we only report Pearson here.
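A sketch of this direct evaluation, assuming the gold standard is given as (word, word, human score) triples and that word vectors are the rows of the sparse matrix above:

```python
import numpy as np
from scipy.stats import pearsonr

def direct_evaluation(matrix, word_index, gold_pairs):
    """Correlate dot-product similarities with human judgements.
    gold_pairs: list of (word_i, word_j, human_score)."""
    model_scores, human_scores = [], []
    for w1, w2, human in gold_pairs:
        if w1 not in word_index or w2 not in word_index:
            continue                              # pair not covered by the corpus
        v1 = matrix[word_index[w1]].toarray().ravel()
        v2 = matrix[word_index[w2]].toarray().ravel()
        model_scores.append(float(np.dot(v1, v2)))
        human_scores.append(human)
    r, _ = pearsonr(model_scores, human_scores)   # Pearson correlation
    return r
```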
The second evaluation is more complex but better suited to Arabic and resource-poor languages. It is based on WordNet. The idea is to compare similarities in WordNet (graph vicinity) with the result of a clustering. Of course, WordNet similarities compare concepts (synsets), not words, so they cannot be used directly. Following (Aliane, 2019), the word similarity we chose is derived from the wup similarity (Wu and Palmer, 1994) as follows:
$$\mathrm{sim}(m_k, m_j) = \max_{s \in \mathrm{synsets}(m_k),\; t \in \mathrm{synsets}(m_j)} \mathrm{wup\_similarity}(s, t) \qquad (1)$$
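Equation (1) can be sketched with NLTK's WordNet interface (our choice of tooling, shown here for English; the Arabic evaluation queries Arabic WordNet, which is not bundled with NLTK):

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def word_similarity(w1, w2):
    """Equation (1): maximum Wu-Palmer similarity over all synset pairs
    of the two words; 0.0 if either word is absent from WordNet."""
    best = 0.0
    for s in wn.synsets(w1):
        for t in wn.synsets(w2):
            sim = s.wup_similarity(t)
            if sim is not None and sim > best:
                best = sim
    return best
```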
Among the numerous clustering models available, we chose SOM (Kohonen, 1982); the main reason for this choice is that in later stages we aim to explore cluster topology and analyse metaclusters in the map; another reason was the existence of a very efficient implementation for sparse vectors (Melka and Mariage, 2017), which we used for our method, while regular SOM with dense vectors was used for the other methods. The quality of each cluster is computed on the basis of the similarity between the words in the cluster, according to:
$$\mathrm{Qlt}(C_i) = \frac{\sum_{k=1}^{|C_i|-1} \; \sum_{j=k+1}^{|C_i|} \mathrm{sim}(m_k, m_j)}{|C_i| \cdot (|C_i| - 1)/2} \qquad (2)$$
The global quality is the average over all k clusters:
$$\mathrm{Qlt} = \frac{1}{k} \sum_{i=1}^{k} \mathrm{Qlt}(C_i) \qquad (3)$$
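Equations (2) and (3) translate directly into code; the sketch below takes the clusters as lists of words and reuses the pairwise similarity defined in equation (1):

```python
def cluster_quality(cluster, sim):
    """Equation (2): mean pairwise similarity inside one cluster."""
    n = len(cluster)
    if n < 2:
        return 0.0
    total = sum(sim(cluster[k], cluster[j])
                for k in range(n - 1)
                for j in range(k + 1, n))
    return total / (n * (n - 1) / 2)

def global_quality(clusters, sim):
    """Equation (3): average quality over all k clusters."""
    return sum(cluster_quality(c, sim) for c in clusters) / len(clusters)
```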
We compare the results of our method, based on grammatical information, to three state-of-the-art methods based on lexical information, CBoW, SkipGram and GloVe (Mikolov et al., 2013a; Pennington et al., 2014), both on Arabic and on English.
The corpora used are given in table 1. For Arabic, we used three corpora of different sizes, in order to study the influence of corpus size on the results: a big extract of a very large corpus created by (Lebboss, 2016), hereafter ABC (for Arabic Big Corpus), the reduced corpus on which his method was tested (ARC), and Arabic Wikipedia 2019 (AW). For English, we used a single large corpus, Wikipedia 2019 (EW). The support is the number of words common to the corpus and WordNet (Arabic WordNet and Princeton WordNet respectively), used for the wup evaluation. Uniques is the number of unique normalized words.
Table 1: Corpora.
Corpus Words Uniques Support
EW 2,250,594,333 9,257,846 47,118
ABC 1,021,006,238 6,084,342 4,159
AW 119,435,658 2,700,803 3,566
ARC 8,573,345 544,796 822
4 EXPERIMENTS
Our experiments were carried out on a 64-bit server with 72 CPUs and 125 GB of memory. For each method tested, we empirically determined the parameter values giving the best results. For CBoW and
SkipGram, we used negative sampling and a four-word window. For GloVe, we used a ten-word window. For all three methods, we adopted 15 iterations and a vector dimension of 300. For our method, based on the observations below, we chose 150 markers for Arabic and 75 for English, with JL = 4. The frequency threshold for the patterns (which depends on the corpus) was set so as to obtain vectors of 1500 dimensions. For the clustering, the best results were obtained with the SOM architecture and operation recommended by (Kohonen, 1982) (hexagonal neighborhood, Euclidean distance, and learning parameter computed from the number of samples in two stages), and with 550 clusters.
We begin with some preliminary observations, then give the results of the two types of evaluation.
4.1 Observations
As expected, the frequency distribution of words follows a Zipf law. In English, 50 markers cover more than 40% of the occurrences, while Arabic needs more markers to cover the same percentage (150 with the ABC corpus). This implies that Arabic has a greater structural diversity of patterns.
More interestingly, as shown in figure 2, the distribution of the segments, while being a long-tail distribution (with 4,357,913 segments having only one occurrence), does not obey any kind of Zipf distribution. The black line in the figures represents the Zipf curve closest to the data. On the contrary, the distribution of the patterns (figure 3) strictly follows a Zipf law (with only two humps, compared to the word distributions). This is independent of the value of the parameters and of the language.
The average length of the segments is 8.67 in English and 9.22 in Arabic (for the biggest corpus). The average length of the patterns is 7.1 in English and 7.75 in Arabic. The average number of markers in a pattern is 5.
Studying the influence of the number of markers and of the JL value on the number and length of patterns yields the following observations:
- when the number of markers grows, so does the number of patterns, but their average length diminishes;
- the length and number of patterns grow with JL until a plateau is reached; the bigger the corpus, the sooner the plateau is reached (JL = 3 or 4).
4.2 Semantic Similarity
The results for WordSim353 (table 2, with Pearson correlation) are not good for any method, but they are very bad for our method. For Arabic, they get worse as the corpus size grows.
Figure 2: Log-log segment frequency distribution. (a) Arabic Big Corpus; (b) English Wikipedia.
Table 2: Results for Direct Evaluation with WordSim353.
Corpus CBoW SkipG. GloVe Ours
EW 0.63 0.66 0.66 0.03
ABC 0.36 0.39 0.40 0.1
AW 0.46 0.48 0.49 0.07
ARC 0.52 0.45 0.46 -0.06
The results for RG-65 and MC-30 are very similar, so we give only the former (table 3); the Pearson correlation is good with all state-of-the-art methods (Spearman, not given, is somewhat lower) and not good with our method, especially bad for English. For Arabic, the same holds as for WordSim353: the correlation worsens as the corpus size grows.
Table 3: Results for Direct Evaluation with RG-65.
Corpus CBoW SkipG. GloVe Ours
EW 0.78 0.78 0.74 0.02
ABC 0.88 0.89 0.92 0.41
AW 0.90 0.92 0.92 0.47
ARC 0.99 0.99 0.99 0.59
Figure 3: Log-log pattern frequency distribution. (a) Arabic Big Corpus; (b) English Wikipedia.
It is thus clear that our method, based only on grammatical information, does not capture this type of semantic similarity. Still, the results are better for Arabic and for RG-65 (or MC-30).
4.3 WordNet-based Similarity
The situation is not at all the same with WordNet similarity in the indirect evaluation by clustering. As shown in table 4, our method can give better Qlt results for Arabic than the state-of-the-art ones; the condition seems to be that the corpus is large. For English, even if the results are not as good as the state-of-the-art ones, they are much better than those for semantic similarity.
Table 4: Results for WordNet similarity.
Corp. CBoW SkipG. GloVe Ours
EW 0.563 0.560 0.518 0.399
ABC 0.419 0.402 0.398 0.504
AW 0.425 0.422 0.411 0.459
ARC 0.455 0.446 0.450 0.390
The difference between English and Arabic here is perhaps correlated with the more diverse pattern structure observed above for Arabic, which could indicate that grammatical structure conveys more semantic information in Arabic.
5 CONCLUSION
The most notable (and unexpected) observation, never made before to the best of our knowledge, is that patterns, like words but unlike segments, follow a Zipf law, at least in English and Arabic. The cause of the Zipf law in word distributions has not been settled; if this phenomenon is confirmed in other languages, it would mean that patterns are structurally similar to words and not to segments or sentences, and it could shed new light on the debate. The "monkey" model for the Zipf law (Miller, 1957) aims to explain it as a random effect; but why would it apply to words and patterns and not to segments or sentences?
Second, considering the issue we have addressed here, our results show that at least in some cases semantic information can be deduced from grammatical patterns, at times even better than from lexical patterns. It seems clear that this holds more for languages with a more diverse pattern structure, such as Arabic, than for English. Of course, this does not mean that one should not look at lexical patterns; rather, both types of information should be combined.
It is hard to assess what exactly is at stake between
semantic similarity gold standards and WordNet clus-
ter similarity. First, one may note that those gold stan-
dards are very small (hundreds of word pairs at most),
while WordNet, even for Arabic, is much more popu-
lated (with millions of potential word pairs). Second,
those gold standards do not distinguish between types
of similarity, while WordNet is by design much more
precise.
Whatever the case, we hope to have convinced the
reader that the contribution of grammatical structure
to semantic characterization of words deserves to be
explored.
Our strategy for discovering grammatical structure runs at an acceptable computational cost, as shown in table 5, at least for Arabic. While Lebboss' program took 125 minutes for the ARC corpus (and did not compute all patterns), our program takes only 2 minutes. We have optimized our algorithm only on the basis of Arabic data, and a comparable optimization is probably needed for English in order to shorten the computation time. The most costly step in the process is pattern discovery (more than 90% of the total time).
Table 5: Computation times.
Method ABC AW ARC EW
GloVe 31 min 7 min 1 min 3.88 h
CBoW 1.68 h 28 min 2 min 9.8 h
SkipGram 7.15 h 128 min 5 min 45.56 h
Ours 7.49 h 36 min 2 min 122.9 h
For the time being, human intervention is necessary for splitting the vocabulary, but this could be automated by taking into account the higher hump of the Zipf curve. Absolutely no knowledge of the language nor of its grammar is needed here.
We aim to develop several aspects of this work further: analysing the relations between clusters in the SOM map produced; studying the distributional properties of grammatical patterns in other languages; looking more closely at the English features that could explain the much longer computation times; and, last but not least, elaborating a method integrating lexical and grammatical patterns in order to categorize words, especially for resource-poor languages.
ACKNOWLEDGEMENTS
This work has been done as part of the project "Analyses sémantiques de textes arabes utilisant l'ontologie et WordNet" (semantic analyses of Arabic texts using ontology and WordNet), supported by Paris 8 University and the Lebanese University.
REFERENCES
Alfonseca, E. and Manandhar, S. (2002). Improving an
ontology refinement method with hyponymy patterns.
Language Resources and Evaluation. Las Palmas:
LREC.
Aliane, N. (2019). Évaluation des représentations vectorielles de mots. PhD thesis, Paris 8.
Azzopardi, L., Girolami, M., and Crowe, M. (2005). Proba-
bilistic hyperspace analogue to language. In Proceed-
ings of the 28th annual international ACM SIGIR con-
ference on Research and development in information
retrieval, pages 575–576.
Baroni, M., Dinu, G., and Kruszewski, G. (2014). Don’t
count, predict! a systematic comparison of context-
counting vs. context-predicting semantic vectors. In
Proceedings of the 52nd Annual Meeting of the As-
sociation for Computational Linguistics (Volume 1:
Long Papers), pages 238–247.
Baroni, M., Murphy, B., Barbu, E., and Poesio, M. (2010).
Strudel: A corpus-based semantic model based on
properties and types. Cognitive science, 34(2):222–
254.
Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C.
(2003). A neural probabilistic language model. Jour-
nal of machine learning research, 3(Feb):1137–1155.
Berland, M. and Charniak, E. (1999). Finding parts in very
large corpora. In Proceedings of the 37th annual meet-
ing of the Association for Computational Linguistics,
pages 57–64.
Bernard, G. (1997). Experiments on distributional catego-
rization of lexical items with Self Organizing Maps.
In Proceedings of WSOM, volume 97, pages 4–6.
Biçici, E. and Yuret, D. (2006). Clustering word pairs to
answer analogy questions. In Proceedings of the Fif-
teenth Turkish Symposium on Artificial Intelligence
and Neural Networks (TAINN 2006), Akyaka, Mugla,
Turkey.
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent
dirichlet allocation. Journal of machine Learning re-
search, 3(Jan):993–1022.
Bollegala, D., Maehara, T., Yoshida, Y., and
Kawarabayashi, K.-i. (2015). Learning word
representations from relational graphs. In Proceed-
ings of the AAAI Conference on Artificial Intelligence,
volume 29.
Bollegala, D. T., Matsuo, Y., and Ishizuka, M. (2009). Mea-
suring the similarity between implicit semantic rela-
tions from the web. In Proceedings of the 18th inter-
national conference on World wide web, pages 651–
660.
Caraballo, S. A. (1999). Automatic construction of a
hypernym-labeled noun hierarchy from text. In Pro-
ceedings of the 37th annual meeting of the Association
for Computational Linguistics, pages 120–126.
Clark, S. (2012). Vector space models of lexical mean-
ing. Handbook of Contemporary Semantics–second
edition. Wiley-Blackwell, page 8.
Collobert, R. (2014). Word embeddings through Hellinger
PCA. In Proceedings of the 14th Conference of the
European Chapter of the Association for Computa-
tional Linguistics. Citeseer.
Collobert, R. and Weston, J. (2008). A unified architec-
ture for natural language processing: Deep neural net-
works with multitask learning. In Proceedings of the
25th international conference on Machine learning,
pages 160–167.
Collobert, R., Weston, J., Bottou, L., Karlen, M.,
Kavukcuoglu, K., and Kuksa, P. (2011). Natural lan-
guage processing (almost) from scratch. Journal of
machine learning research, 12:2493–2537.
Davidov, D. and Rappoport, A. (2006). Efficient unsuper-
vised discovery of word categories using symmetric
patterns and high frequency words. In Proceedings of
the 21st International Conference on Computational
Linguistics and 44th Annual Meeting of the Associa-
tion for Computational Linguistics, pages 297–304.
Davidov, D. and Rappoport, A. (2008). Classification of se-
mantic relationships between nominals using pattern
clusters. In Proceedings of ACL-08: HLT, pages 227–
235.
Davidov, D., Rappoport, A., and Koppel, M. (2007). Fully
unsupervised discovery of concept-specific relation-
ships by web mining. In Proceedings of the 45th
Annual Meeting of the Association of Computational
Linguistics, pages 232–239.
Davidov, D., Tsur, O., and Rappoport, A. (2010). Enhanced
sentiment learning using twitter hashtags and smileys.
In Coling 2010: Posters, pages 241–249.
Dhillon, P. S., Foster, D., and Ungar, L. (2011). Multi-view
learning of word embeddings via cca. Proc. Of NIPS.
Dumais, S. T. (2004). Latent semantic analysis. An-
nual review of information science and technology,
38(1):188–230.
Erk, K. (2012). Vector space models of word meaning and
phrase meaning: A survey. Language and Linguistics
Compass, 6(10):635–653.
Etzioni, O., Cafarella, M., Downey, D., Kok, S., Popescu,
A.-M., Shaked, T., Soderland, S., Weld, D. S., and
Yates, A. (2004). Web-scale information extraction
in knowitall: (preliminary results). In Proceedings of
the 13th international conference on World Wide Web,
pages 100–110.
Faruqui, M., Dodge, J., Jauhar, S. K., Dyer, C., Hovy, E.,
and Smith, N. A. (2014). Retrofitting word vectors to
semantic lexicons. arXiv preprint arXiv:1411.4166.
Firth, J. R. (1957). A synopsis of linguistic theory 1930-55.
Studies in Linguistic Analysis, Special vol:1–32.
Gross, M. (1994). Computational approaches to the lexicon.
chapter Constructing Lexicon-Grammars, pages 213–
263. Oxford University Press.
Harris, Z. S. (1954). Distributional structure. Word, 10(2-
3):146–162.
Hearst, M. (1998). Wordnet: An electronic lexical database
and some of its applications. Automated Discovery of
WordNet Relations.
Hearst, M. A. (1992). Automatic acquisition of hyponyms
from large text corpora. In Coling 1992 volume 2:
The 15th international conference on computational
linguistics.
Kanerva, P., Kristoferson, J., and Holst, A. (2000). Random
indexing of text samples for latent semantic analysis.
In Proceedings of the Annual Meeting of the Cognitive
Science Society, volume 22.
Khoja, S. and Garside, R. (1999). Stemming arabic text.
Lancaster, UK, Computing Department, Lancaster
University.
Kiela, D., Hill, F., and Clark, S. (2015). Specializing word
embeddings for similarity or relatedness. In Proceed-
ings of the 2015 Conference on Empirical Methods in
Natural Language Processing, pages 2044–2048.
Kohonen, T. (1982). Self-organized formation of topolog-
ically correct feature maps. Biological cybernetics,
43(1):59–69.
Lazaridou, A., Baroni, M., et al. (2015). A multitask objec-
tive to inject lexical contrast into distributional seman-
tics. In Proceedings of the 53rd Annual Meeting of the
Association for Computational Linguistics and the 7th
International Joint Conference on Natural Language
Processing (Volume 2: Short Papers), pages 21–26.
Lebboss, G. (2016). Contribution à l'analyse sémantique des textes arabes. PhD thesis, Paris 8.
Lebboss, G., Bernard, G., Aliane, N., Abdallah, A., and
Hajjar, M. (2019). Evaluating methods for building
arabic semantic resources with big corpora. In Stud-
ies in Computational Intelligence, volume 829, pages
179–197. Springer International Publishing.
Lebboss, G., Bernard, G., Aliane, N., and Hajjar, M. (2017).
Towards the enrichment of Arabic WordNet with big
corpora. In IJCCI, pages 101–109.
Levy, O. and Goldberg, Y. (2014). Dependency-based word
embeddings. In Proceedings of the 52nd Annual Meet-
ing of the Association for Computational Linguistics
(Volume 2: Short Papers), pages 302–308.
Levy, O., Goldberg, Y., and Dagan, I. (2015a). Improv-
ing distributional similarity with lessons learned from
word embeddings. Transactions of the Association for
Computational Linguistics, 3:211–225.
Levy, O., Remus, S., Biemann, C., and Dagan, I. (2015b).
Do supervised distributional methods really learn lex-
ical inference relations? In Proceedings of the 2015
Conference of the North American Chapter of the As-
sociation for Computational Linguistics: Human Lan-
guage Technologies, pages 970–976.
Lin, D. (1998). Automatic retrieval and clustering of simi-
lar words. In 36th Annual Meeting of the Association
for Computational Linguistics and 17th International
Conference on Computational Linguistics, Volume 2,
pages 768–774.
Lin, D., Zhao, S., Qin, L., and Zhou, M. (2003). Identifying
synonyms among distributionally similar words. In
IJCAI, volume 3, pages 1492–1493. Citeseer.
Liu, Q., Jiang, H., Wei, S., Ling, Z.-H., and Hu, Y. (2015).
Learning semantic word embeddings based on ordi-
nal knowledge constraints. In Proceedings of the 53rd
Annual Meeting of the Association for Computational
Linguistics and the 7th International Joint Conference
on Natural Language Processing (Volume 1: Long Pa-
pers), pages 1501–1511.
Lund, K., Burgess, C., and Atchley, R. A. (1995). Semantic
and associative priming in high-dimensional semantic
space. In Proceedings of the 17th annual conference
of the Cognitive Science Society, volume 17, pages
660–665.
Melka, J. and Mariage, J.-J. (2017). Efficient implementa-
tion of self-organizing map for sparse input data. In
IJCCI, pages 54–63.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a).
Efficient estimation of word representations in vector
space. arXiv preprint arXiv:1301.3781.
Mikolov, T., Yih, W.-t., and Zweig, G. (2013b). Linguis-
tic regularities in continuous space word representa-
tions. In Proceedings of the 2013 conference of the
north american chapter of the association for com-
putational linguistics: Human language technologies,
pages 746–751.
Miller, G. A. (1957). Some effects of intermittent silence.
The American Journal of Psychology, 70(2):311.
Mnih, A. and Hinton, G. E. (2008). A scalable hierarchi-
cal distributed language model. Advances in neural
information processing systems, 21:1081–1088.
Mnih, A. and Kavukcuoglu, K. (2013). Learning word em-
beddings efficiently with noise-contrastive estimation.
Advances in neural information processing systems,
26:2265–2273.
Mrkšić, N., Séaghdha, D. O., Thomson, B., Gašić, M., Rojas-Barahona, L., Su, P.-H., Vandyke, D., Wen, T.-H., and Young, S. (2016). Counter-fitting word vectors to linguistic constraints. arXiv preprint arXiv:1603.00892.
Murphy, B., Talukdar, P., and Mitchell, T. (2012). Learn-
ing effective and interpretable semantic models using
non-negative sparse embedding. In Proceedings of
COLING 2012, pages 1933–1950.
Nozza, D., Fersini, E., and Messina, E. (2016). Unsuper-
vised irony detection: a probabilistic model with word
embeddings. In International Conference on Knowl-
edge Discovery and Information Retrieval, volume 2,
pages 68–76. SCITEPRESS.
Padó, S. and Lapata, M. (2007). Dependency-based construction of semantic space models. Computational Linguistics, 33(2):161–199.
Pantel, P. and Ravichandran, D. (2004). Automatically la-
beling semantic classes. In Proceedings of the Human
Language Technology Conference of the North Amer-
ican Chapter of the Association for Computational
Linguistics: HLT-NAACL 2004, pages 321–328.
Pasca, M. (2004). Acquisition of categorized named entities
for web search. In Proceedings of the thirteenth ACM
international conference on Information and knowl-
edge management, pages 137–145.
Pennington, J., Socher, R., and Manning, C. D. (2014).
Glove: Global vectors for word representation. In
Proceedings of the 2014 conference on empirical
methods in natural language processing (EMNLP),
pages 1532–1543.
Phillips, W. and Riloff, E. (2002). Exploiting strong syn-
tactic heuristics and co-training to learn semantic lex-
icons. In Proceedings of the 2002 Conference on
Empirical Methods in Natural Language Processing
(EMNLP 2002), pages 125–132.
Riloff, E., Jones, R., et al. (1999). Learning dictionaries for
information extraction by multi-level bootstrapping.
In AAAI/IAAI, pages 474–479.
Riloff, E. and Shepherd, J. (1997). A corpus-based ap-
proach for building semantic lexicons. arXiv preprint
cmp-lg/9706013.
Ritter, A., Soderland, S., and Etzioni, O. (2009). What
is this, anyway: Automatic hypernym discovery. In
AAAI Spring Symposium: Learning by Reading and
Learning to Read, pages 88–93.
Roark, B. and Charniak, E. (1998). Noun-phrase co-
occurrence statistics for semi-automatic semantic lex-
icon construction. In COLING 1998 Volume 2: The
17th International Conference on Computational Lin-
guistics.
Rubinstein, D., Levi, E., Schwartz, R., and Rappoport, A.
(2015). How well do distributional models capture
different types of semantic knowledge? In Proceed-
ings of the 53rd Annual Meeting of the Association
for Computational Linguistics and the 7th Interna-
tional Joint Conference on Natural Language Pro-
cessing (Volume 2: Short Papers), pages 726–730.
Rydin, S. (2002). Building a hyponymy lexicon with hi-
erarchical structure. In Proceedings of the ACL-02
workshop on Unsupervised lexical acquisition, pages
26–33.
Salton, G. (1971). The smart system. Retrieval Results and
Future Plans.
Sang, E. T. K. (2007). Extracting hypernym pairs from the
web. In Proceedings of the 45th Annual Meeting of
the Association for Computational Linguistics Com-
panion Volume Proceedings of the Demo and Poster
Sessions, pages 165–168.
Schutze, H., Manning, C. D., and Raghavan, P. (2008). In-
troduction to information retrieval, volume 39. Cam-
bridge University Press Cambridge.
Shinzato, K. and Torisawa, K. (2004). Acquiring hyponymy
relations from web documents. In Proceedings of
the Human Language Technology Conference of the
North American Chapter of the Association for Com-
putational Linguistics: HLT-NAACL 2004, pages 73–
80.
Sombatsrisomboon, R., Matsuo, Y., and Ishizuka, M.
(2003). Acquisition of hypernyms and hyponyms
from the www. In Proceedings of the 2nd Interna-
tional Workshop on Active Mining.
Sumida, A. and Torisawa, K. (2008). Hacking wikipedia
for hyponymy relation acquisition. In Proceedings of
the Third International Joint Conference on Natural
Language Processing: Volume-II.
Tanev, H. and Magnini, B. (2006). Weakly supervised ap-
proaches for ontology population. In 11th Conference
of the European Chapter of the Association for Com-
putational Linguistics.
Thelen, M. and Riloff, E. (2002). A bootstrapping method
for learning semantic lexicons using extraction pat-
tern contexts. In Proceedings of the 2002 conference
on empirical methods in natural language processing
(EMNLP 2002), pages 214–221.
Tjong Kim Sang, E. and Hofmann, K. (2007). Automatic
extraction of dutch hypernym-hyponym pairs. LOT
Occasional Series, 7:163–174.
Tsur, O., Davidov, D., and Rappoport, A. (2010). Icwsm—a
great catchy name: Semi-supervised recognition of
sarcastic sentences in online product reviews. In Pro-
ceedings of the International AAAI Conference on
Web and Social Media, volume 4.
Turney, P. D. (2008a). The latent relation mapping engine:
Algorithm and experiments. Journal of Artificial In-
telligence Research, 33:615–655.
Turney, P. D. (2008b). A uniform approach to analogies,
synonyms, antonyms, and associations. arXiv preprint
arXiv:0809.0124.
Turney, P. D. and Pantel, P. (2010). From frequency to
meaning: Vector space models of semantics. Journal
of artificial intelligence research, 37:141–188.
Vossen, P. (1998). A multilingual database with lexical se-
mantic networks. Dordrecht: Kluwer Academic Pub-
lishers. doi, 10:978–94.
Widdows, D. and Dorow, B. (2002). A graph model for
unsupervised lexical acquisition. In COLING 2002:
The 19th International Conference on Computational
Linguistics.
Wong, S. M., Ziarko, W., and Wong, P. C. (1985). Gen-
eralized vector spaces model in information retrieval.
In Proceedings of the 8th annual international ACM
SIGIR conference on Research and development in in-
formation retrieval, pages 18–25.
Wu, Z. and Palmer, M. (1994). Verb semantics and lexical
selection. arXiv preprint cmp-lg/9406033.
Yatbaz, M. A., Sert, E., and Yuret, D. (2012). Learning syn-
tactic categories using paradigmatic representations
of word context. In Proceedings of the 2012 Joint
Conference on Empirical Methods in Natural Lan-
guage Processing and Computational Natural Lan-
guage Learning, pages 940–951.