Modeling Syntactic Knowledge With Neuro-Symbolic Computation

Hilton Alers-Valentín, Sandiway Fong and J. Fernando Vega-Riveros

Linguistics and Cognitive Science, University of Puerto Rico-Mayagüez, Puerto Rico

Department of Linguistics, University of Arizona-Tucson, U.S.A.

Electrical and Computer Engineering, University of Puerto Rico-Mayagüez, Puerto Rico

Keywords:

Minimalist Syntax, Parser, Lexicon, Structural Ambiguity, Cognitive Modeling, Computational Linguistics,

Natural Language Processing, Symbolic computation, Neural Networks, Explainable Artiﬁcial Intelligence.

Abstract:

To overcome the limitations of prevailing NLP methods, a Hybrid-Architecture Symbolic Parser and Neural

Lexicon system is proposed to detect structural ambiguity by producing as many syntactic representations

as there are interpretations for an utterance. HASPNeL comprises a symbolic AI, feature-uniﬁcation parser,

a lexicon generated using manual classiﬁcation and machine learning, and a neural network encoder which

tags each lexical item in a synthetic corpus and estimates likelihoods for each utterance’s interpretation with

respect to the corpus. Language variation is accounted for by lexical adjustments in feature speciﬁcations and

minimal parameter settings. Contrary to purely probabilistic systems, HASPNeL’s neuro-symbolic architecture

will perform grammaticality judgements of utterances that do not correspond to rankings of probabilistic sys-

tems; have a greater degree of system stability as it is not susceptible to perturbations in the training data;

detect lexical and structural ambiguity by producing all possible grammatical representations regardless of

their presence in the training data; eliminate the effects of diminishing returns, as it does not require massive

amounts of annotated data, unavailable for underrepresented languages; avoid overparameterization and po-

tential overﬁtting; test current syntactic theory by implementing a Minimalist grammar formalism; and model

human language competence by satisfying conditions of learnability, evolvability, and universality.

1 INTRODUCTION

The human language faculty allows speakers to as-

sociate thoughts and concepts into mental linguistic

representations, which are subsequently externalized

as speech, text or signs. These mental representations

are hierarchical in nature, but because of constraints

of nature, the externalization is linear. Therefore,

speech and text consist only of strings of words as

leaves or terminal nodes of the whole syntactic struc-

ture, and so information about constituents, classes

and categories is literally lost in externalization. An

important consequence of this fundamental property

of language is structural ambiguity, or the fact that a

single utterance or string of words can be interpreted

in more than one way by our mental grammars. For

example, the utterance ‘They can ﬁsh’ can be inter-

preted in two different ways: as meaning that they are

able to ﬁsh or that they put ﬁsh in cans. This sentence

is ambiguous because our internal language system

can assign two different structure representations to

the same string. Ambiguity may be problematic for

efﬁcient communication as it leads to misunderstand-

ings, yet it is pervasive in language use.

To address the linguistic problem of ambiguity,

we propose a Hybrid-Architecture Symbolic Parser

with Neural Lexicon (HASPNeL) system that com-

bines the effectiveness of probabilistic systems with

the accuracy of syntactic representation of symbolic

parsing. By encoding the syntactic rules from natu-

ral language to create a generalizable tagging system,

this interdisciplinary approach represents a paradig-

matic departure from traditional attempts to iden-

tify ambiguity in natural language, such as statisti-

cal methods based on machine learning and applica-

tions following machine-learning-guided rule-based

derivations (Petkevič, 2014). HASPNeL would be

able to not only parse grammatically acceptable novel

strings and represent structural and lexical ambiguity,

but would also be able to identify those strings that

are not grammatically acceptable, effectively approx-

imating the performative effectiveness of the gram-

maticality judgments of native speakers of a given

language: flexible enough to accept novel input, yet strict enough to be able to identify the acceptability of such input, even when this input is novel.

Alers-Valentín, H., Fong, S. and Vega-Riveros, J.
Modeling Syntactic Knowledge With Neuro-Symbolic Computation.
DOI: 10.5220/0011718500003393
In Proceedings of the 15th International Conference on Agents and Artificial Intelligence (ICAART 2023) - Volume 3, pages 608-616
ISBN: 978-989-758-623-1; ISSN: 2184-433X
© 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)

2 ARCHITECTURE

HASPNeL uses a hybrid architecture consisting of

three major components: (1) a probabilistic encoder,

(2) a symbolic decoder, and (3) a lexicon.

2.1 Probabilistic Encoder

This component can be implemented using either a recurrent neural network (RNN) or a self-attentive neural network, two strong alternatives for this stage. Neural parsers can be visualized as composed of two stages: an encoder and a decoder. The encoder takes an input string and assigns

syntactic categories to the words in that string. The

decoder takes the tagged lexical items and builds the

constituent structures of the sentence. This architec-

ture is denoted encoder-decoder (Aggarwal, 2018).

The output vectors should provide information about

the potential categories of the lexical items together

with their probabilities inferred from the learning al-

gorithm. The syntactic categories produced by the en-

coder component can be a vector or set of vectors.

These vectors correspond to the syntactic categories

of the lexical items in the input string. Lexical items

can have more than one syntactic category, though

(e.g., ‘ﬁsh’ can be a noun or a verb). The assignment

of category depends on the context where the word is

used. The decoder uses these vectors to incrementally

build up a labeled parse tree (Kitaev and Klein, 2018).

Encoders have been built using ﬁxed-window-size

feed-forward NNs (Durrett and Klein, 2015), but have

been displaced by Recurrent Neural Networks (RNN)

in part due to RNNs’ ability to capture global con-

text in sentences with variable lengths. Nevertheless,

RNNs have one major limitation: the long-term memory problem (Aggarwal, 2018), i.e., RNNs are not able to memorize data for a long time and begin to forget their previous inputs as learning proceeds. Two

implementations that compensate for the long-term

memory problem of RNNs are the Long-Short-Term

Memory networks (LSTM) (Hochreiter and Schmid-

huber, 1997), and the Gated Recurrent Unit (GRU)

(Cho et al., 2014). (Kitaev and Klein, 2018) propose

the use of a self-attention encoder which makes ex-

plicit the manner in which information is transferred

between different locations in the sentence. They use

this approach to study the relative importance of dif-

ferent kinds of context to the parsing task. The lo-

cations in the sentence attend to each other based on

their positions, but also based on their contents.
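As a minimal sketch of the kind of output described above, the following Python fragment turns raw per-category scores for one token into a probability distribution over syntactic categories. The tag inventory and the score values are assumptions made for illustration; they do not come from an actual encoder, and the softmax step is only one plausible way to obtain the probabilities.

```python
import math

# Hypothetical tag inventory; the proposal does not fix one.
TAGS = ["MD", "VB", "NN"]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def tag_distribution(scores):
    """Pair each tag with its probability, most likely tag first."""
    probs = softmax(scores)
    return sorted(zip(TAGS, probs), key=lambda pair: pair[1], reverse=True)


# Made-up scores standing in for the encoder output for the token "can".
can_scores = [2.0, 0.1, 0.8]
can_dist = tag_distribution(can_scores)
```

The decoder would then receive every (tag, probability) pair rather than only the top-ranked tag, so that less likely categories remain available to the symbolic component.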

2.2 Symbolic Decoder

The decoder will produce the different structural analyses based on the syntactic categories produced by the

neural encoder. Symbolic systems are characterized

by (i) the use of a set of symbols as knowledge repre-

sentations, (ii) a speciﬁc formal code (metalanguage)

to formulate the symbol-handling system, and (iii) au-

tonomy between the syntactic component (which sets

the conditions for structural well-formedness) and the

semantic component (which computes meaning from

well-formed expressions).

The symbol-handling component of our proposed

system encodes the formalisms of Minimalist Grammars (MGs) (Stabler, 1997; Stabler, 2011; Collins and Stabler, 2016) as a formalization of Minimalist syntax (Chomsky, 1995; Chomsky, 2001; Chomsky, 2008). The mathematical rigor makes it possible to address questions about the generative power

and explanatory adequacy of this formalism for nat-

ural language (Graf, 2021). Moreover, by putting

Minimalism on a mathematical foundation, it can be

linked to existing work on parsing and learnability.

This approach not only strengthens the connection be-

tween theoretical syntax and psycholinguistics, but it

also opens up the gate to large-scale applications in

modern language technology. As Graf points out, if

Minimalist ideas can be shown to be useful for prac-

tical applications, that is mutually beneﬁcial for all

involved ﬁelds (Graf, 2021).

Linguistic theories are generative models of the

human language faculty. Broadly speaking, two fac-

tors of generative models pertain to the construction

of parsing models: 1) Binary Merge, the primitive op-

eration at the heart of modern theories, is a bottom-up

operation that constructs larger phrases from smaller ones. However, parsing models generally operate from left to right (this is termed online), which results in structure being filled in incrementally as parsing proceeds.

Therefore, it is a research challenge to re-interpret

Merge as predictive parsing. 2) Merge is word-order

free, in other words, core operations of grammar con-

struct dependencies, e.g. agreement, binding, con-

trol or movement chain, between phrases based on

hierarchical structure only. Syntactic objects built by

Merge must be linearized during Externalization. It

is a challenge to reconstruct or reverse this process

during parsing.
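The word-order-free character of Merge can be illustrated with a small sketch in which Merge is modeled as unordered set formation, Merge(X, Y) = {X, Y}. The function names below are illustrative assumptions, and the fragment ignores labels, features and movement entirely.

```python
def merge(x, y):
    """Binary Merge as unordered set formation: Merge(X, Y) = {X, Y}."""
    return frozenset((x, y))


def leaves(obj):
    """Collect the lexical items contained in a syntactic object."""
    if isinstance(obj, frozenset):
        found = set()
        for part in obj:
            found |= leaves(part)
        return found
    return {obj}


# Bottom-up construction: the smaller phrase is built first.
verb_phrase = merge("can", "fish")
clause = merge("they", verb_phrase)
```

Because the result is a set, `merge("can", "fish")` and `merge("fish", "can")` are the same object: the hierarchy is recorded, but the linear order of the string is not, which is why a parser has to reconstruct it.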

Recent work in the Minimalist Program has high-

lighted the role of locally deterministic computations

in the construction of syntactic representation as part

of a shift in the structure of linguistic theories of

narrow syntax from abstract systems of declarative

rules and principles, (Chomsky, 1981), to systems


where design speciﬁcations call for efﬁcient com-

putation within the human language faculty (Fong,

2005). Case agreement is reanalyzed in terms of a

system of probes, e.g., functional heads that target and

agree with goals, e.g., referential and expletive nomi-

nals, within their c-command domain. In this system,

probe-goal agreement can be long-distance and need

not trigger movement.

The proposed system represents a development of

Fong’s implementation of the probe-goal account. He

also maintains that “efficient assembly, i.e., locally deterministic computation, from a generative perspective with respect to (bottom-up) MERGE does not

guarantee that parsing with probes and goals will

also be similarly efﬁcient. By locally deterministic

computation, we mean that the choice of operation

to apply to properly continue the derivation is clear

and apparent at each step of the computation” (Fong,

2005). Therefore, following Fong’s system, instead

of MERGE and MOVE as the primitive combinatory

operations for the assembly of phrase structure, the

proposed system will also be driven by elementary

tree composition with respect to a range of heads in

the extended verb projection (v*, V, c, and T). Ele-

mentary tree composition is an operation that is a ba-

sic component of Tree-Adjoining Grammars (TAG)

(Joshi and Schabes, 1997). The system will also be

on-line in the sense that once an input element has

fulfilled its function, it is discarded, i.e., no longer referenced. To minimize search, there is neither lookahead nor lookback in the sense of being able to examine or search the derivational history. Instead, the system relies on Fong’s two novel devices with well-defined properties: a Move

Box that encodes the residual properties of CHAINs

and theta theory, and a single or current Probe Box

to encode structural Case assignment and to approxi-

mate the notion of (strong) Phase boundaries. In par-

ticular, the restriction to a single Probe Box means

that probes cannot “see” past another probe; thereby

emulating the Phase Impenetrability Condition (PIC).

Limiting the Move Box to operate as a stack will al-

low nesting but not overlapping movement. A con-

sequence of this is that extraction through the edge

of a strong Phase is no longer possible. Examples of

parses will be used to illustrate the empirical prop-

erties of these computational elements. The system

is also incremental in the sense that a partial parse is

available at all stages of processing (Fong, 2005).
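The stack discipline attributed to the Move Box can be sketched as follows. This is a simplified illustration under stated assumptions, not Fong’s implementation: the class name, the event encoding and the `licensed` helper are hypothetical, and the sketch models only the nesting restriction, not theta theory or Case.

```python
class MoveBox:
    """Stack of displaced items that are still waiting for their gap."""

    def __init__(self):
        self._stack = []

    def push(self, item):
        """Store a displaced item when it is encountered in the input."""
        self._stack.append(item)

    def fill_gap(self, item):
        """Fill a gap; only the most recently stored item is accessible."""
        if not self._stack or self._stack[-1] != item:
            return False  # overlapping (crossing) dependency is rejected
        self._stack.pop()
        return True

    def is_empty(self):
        return not self._stack


def licensed(events):
    """events: ("move", x) or ("gap", x) pairs in left-to-right order."""
    box = MoveBox()
    for kind, item in events:
        if kind == "move":
            box.push(item)
        elif not box.fill_gap(item):
            return False
    return box.is_empty()  # every displaced item must have found a gap


nested = [("move", "A"), ("move", "B"), ("gap", "B"), ("gap", "A")]
crossing = [("move", "A"), ("move", "B"), ("gap", "A"), ("gap", "B")]
```

With this encoding, `licensed(nested)` succeeds while `licensed(crossing)` fails, which mirrors the claim that a stack allows nesting but not overlapping movement.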

In more recent work, e.g. (Fong and Ginsburg,

2019), many dependency relations and phenomena

across different languages (English, Arabic, Japanese

and Persian) have been directly implemented in the

generative framework using a Minimalist Machine.

Our plan is to adapt this machinery for parsing.

2.3 Lexicon

The lexicon is the module that contains the grammati-

cal information about all lexical items in the sentences

to be analyzed by the parser. Following the Chomsky-

Borer hypothesis, MGs situate all language-speciﬁc

variation in the lexicon. Hence every MG is just a ﬁ-

nite set of lexical items. Each lexical item takes the

form A :: α, where A is the item’s phonetic exponent

and α its string of features (Graf, 2021).

As “the heart of the implemented system”, the

lexicon will contain every fully inﬂected word-form

appearing in a corpus of 2000 manually-tagged sen-

tences that were constructed for validation purposes.

Lexical items are entered as a string of literals, and

features are indicated by means of different data

types. All lexical items are labeled with a syntactic

category; additionally, each category requires a spe-

ciﬁc subset of valued features and lexical properties,

which at least contains the syntactic category, sub-

categorization frames and relevant grammatical fea-

tures (such as case, c-selectional and phi-features)

for each lexical item. Since it is necessary to deter-

mine if a certain combination of words is licensed or

grammatical in the language, the lexicon should include every possible entry for each ambiguous lexical item (Alers-Valentín et al., 2019). The properties of selection and uninterpretable feature matching will drive the parsing process. In the course of computation, uninterpretable features belonging to analyzed constituents will be eliminated through probe-goal agreement. A (valid) parse is a phrase structure

that obeys the selectional properties of the individual

lexical items, covers the entire input, and has all un-

interpretable features properly valued (Fong, 2005).
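One way to picture such a lexicon is as a mapping from each word form to a list of entries, one per category or feature bundle. The sketch below is an assumption-laden illustration: the class, the helper functions and the feature names and values are invented for the example and are not the feature system of the proposed lexicon.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalItem:
    form: str             # phonetic exponent (the A in A :: α)
    category: str         # syntactic category label
    features: tuple = ()  # feature string α as (name, value) pairs


LEXICON = {}


def add_entry(form, category, **features):
    """Register one entry; ambiguous forms accumulate several entries."""
    item = LexicalItem(form, category, tuple(sorted(features.items())))
    LEXICON.setdefault(form, []).append(item)
    return item


def categories(form):
    """List the categories recorded for a word form, in insertion order."""
    return [item.category for item in LEXICON.get(form, [])]


# Hypothetical entries and feature values for two ambiguous items.
add_entry("fish", "N", number="sg_or_pl")
add_entry("fish", "V", selects="NP_optional", tense="nonpast")
add_entry("can", "Modal", selects="VP")
add_entry("can", "V", selects="NP", tense="nonpast")
add_entry("can", "N", number="sg")
```

Keeping every entry for an ambiguous form, instead of only the most frequent one, is what lets the parser later test each candidate against selection and agreement.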

3 ENCODING AND ESTIMATING

AMBIGUITY

Lexically ambiguous items will have as many lexical

entries as meanings and/or feature bundles are identi-

ﬁed and tagged for that item in the corpus. In those

cases, the RNN encoder will produce as many outputs as there are entries for said item. For example, for the word “can” there should be (at least) three outputs: (MD 0.7 can), (VB 0.1 can), and (NN 0.2 can), where the number n, 0 ≤ n ≤ 1, corresponds to the likelihood of each category. The likelihood of a category is calculated within the corpus. The likelihoods of all categories of an item should sum to exactly 1. On the other hand, if the item were “cans”, the output would include at least options like (VBZ 0.2 cans) and (NNS 0.8 cans). The lexical verb “cans” has 3rd person singular phi- (gender/person/number) features and a non-preterit tense feature. So, an utterance like “I can fish” is ambiguous, yet “He cans fish” is not. For this last utterance, the symbolic component should be able to discard the NNS category, in spite of its higher likelihood.
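The interaction between likelihoods and the symbolic check can be sketched with a toy example. The likelihoods for “can” and “cans” are the ones given in the text; everything else is an assumption for illustration: the values for “fish” and the pronouns, the three-word clause patterns in `well_formed` (a stand-in for the real parser), and the choice to score an analysis by multiplying the likelihoods of its tags.

```python
from itertools import product

# word -> list of (tag, likelihood); values for "fish" and pronouns are made up.
CANDIDATES = {
    "i": [("PRP", 1.0)],
    "he": [("PRP", 1.0)],
    "can": [("MD", 0.7), ("VB", 0.1), ("NN", 0.2)],
    "cans": [("VBZ", 0.2), ("NNS", 0.8)],
    "fish": [("NN", 0.6), ("VB", 0.4)],
}
THIRD_SINGULAR = {"he"}


def well_formed(words, tags):
    """Toy stand-in for the symbolic parser: three-word clauses only."""
    if len(tags) != 3 or tags[0] != "PRP":
        return False
    subject_3sg = words[0] in THIRD_SINGULAR
    if tags[1] == "MD":   # modal selects a bare verb
        return tags[2] == "VB"
    if tags[1] == "VBZ":  # 3rd person singular verb with an object
        return subject_3sg and tags[2] in ("NN", "NNS")
    if tags[1] == "VB":   # non-3rd-singular finite verb with an object
        return not subject_3sg and tags[2] in ("NN", "NNS")
    return False


def analyses(sentence):
    """Return every licensed tag sequence with its score, best first."""
    words = sentence.lower().split()
    results = []
    for combo in product(*(CANDIDATES[w] for w in words)):
        tags = [tag for tag, _ in combo]
        if well_formed(words, tags):
            score = 1.0
            for _, likelihood in combo:
                score *= likelihood
            results.append((tags, score))
    return sorted(results, key=lambda r: r[1], reverse=True)
```

Under these assumptions, `analyses("I can fish")` keeps two tag sequences (modal plus verb, and verb plus noun), while `analyses("He cans fish")` keeps only the VBZ reading even though NNS has the higher likelihood, since no licensed clause pattern contains NNS in second position.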

4 PROBLEMS WITH CURRENT

PARSING TRENDS

Probabilistic parsers produce the most likely parse for

a given word string, regardless of the acceptability or

ambiguity of the string. They use a statistical model

of the syntactic structure of a language, e.g., a probabilistic context-free grammar (PCFG). Although probabilistic parsers are widely used in NLP applications,

they always output a structure even if the word se-

quence is ungrammatical or unacceptable for native

speakers. They also require a manually annotated

corpus and a statistical learning algorithm. Although

these parsers are particularly good at identifying syntactic categories and have a desirable cost-benefit relation between accuracy and speed, they have been found rather ineffective in the representation of sentences containing long-distance relations among constituents (Alers-Valentín et al., 2019).

(Bernardy and Chatzikyriakidis, 2019) point out

that symbolic systems can be very precise, yet they

break easily in the presence of new data. Symbolic systems in NLP tasks have been criticized for their “brittleness”, i.e., their tendency to break down once they are moved to open domains. Neural Network (NN) models are currently the most used in all sorts of NLP applications since these systems, at first sight, do not seem to suffer from the brittleness problem that characterizes symbolic approaches. In spite of their apparent success, (Bernardy and Chatzikyriakidis, 2019) recognize that recent studies show that NLP applications

of NN, such as state-of-the-art natural language infer-

ence (NLI) systems, are rather brittle in the sense that

they “fail to generalize outside individual datasets and

are, furthermore, unable to capture certain NLI pat-

terns at all” and therefore argue that, with respect to

symbolic approaches, “the NLP community has been

probably too hasty in dismissing them.”

There is recent literature regarding hybrid parsing systems (Gaddy et al., 2018; Stanojević and Stabler, 2018; Torr et al., 2019), like the A* neural parser developed by a research team at the University of Edinburgh. This particular system is an implementation of a minimalist grammar that uses the

A* search algorithm. This system produces accu-

rate syntactic representation in many cases, including

complex structures as in across-the-board movement;

however, the results are not always consistent, and the

system does not account for any kind of ambiguity.

4.1 Grammaticality Judgements

In our hybrid approach, a neural lexicon handles

the multiple syntactic categories of words and lexi-

cal items, while the rule-based component attempts

to match those syntactic categories with well-formed

phrases according to a set of grammar rules. If the

rule-based component cannot fit the syntactic categories into a well-formed structure, the string is deemed ungrammatical. Machine learning-based

parsers are trained with sentences from a corpus. This

approach infers rules from a limited set of examples,

however large the set may be. To logically infer a

rule describing every member of a set, the system

must have information about every member of that

set. According to the No Free Lunch Theorem for ma-

chine learning, every classiﬁcation algorithm, when

averaged over all possible data-generating distribu-

tions, has the same error rate when classifying pre-

viously unobserved points (Wolpert and Macready,

1997). “In some sense, no machine learning algo-

rithm is universally any better than any other” (Good-

fellow et al., 2016). Moreover, most ML-based

parsers are trained with grammatical sentences. Even if an ML-based parser were trained on a corpus that includes ungrammatical sentences, it would be biased by the proportion of ungrammatical utterances in the corpus. During testing, both false-positive and false-negative ungrammaticality results get buried together with other types of parsing errors.

Probabilistic systems do not appear to show any

correlation between grammaticality and the rankings

in a list of examples (Fong, 2022). In experiments

performed by (Pereira, 2002) and (Fong, 2022), gram-

matical examples are not ranked highly enough to

make an appearance within the 10-best list. In fact,

the grammatical example “colorless green ideas sleep furiously” only ranks 23rd out of the list of the 120

possible permutations of those ﬁve words. The re-

sults show a clear lack of discrimination between

the grammatical and the ungrammatical, and Chom-

sky’s observation still holds: “there is no signiﬁ-

cant correlation between order of approximation and

grammaticalness. If we order the strings of a given

length in terms of order of approximation to English,

we shall ﬁnd both grammatical and ungrammatical

strings scattered throughout the list, from top to bot-

tom. Hence the notion of statistical approximation ap-

pears to be irrelevant to grammar” (Chomsky, 1956).
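The ranking setup behind these experiments can be sketched as follows: the five words admit 5! = 120 orderings, each ordering receives a score, and the rank of the grammatical sentence is read off the sorted list. The scorer below is a deliberately meaningless placeholder and an assumption of this sketch; it is not the language model used in the cited experiments, and it does not reproduce their rankings.

```python
from itertools import permutations

SENTENCE = "colorless green ideas sleep furiously"


def rank_of(sentence, score):
    """Return (1-based rank, number of orderings) of `sentence` under `score`."""
    target = tuple(sentence.split())
    orderings = sorted(permutations(target), key=score, reverse=True)
    return orderings.index(target) + 1, len(orderings)


def toy_score(words):
    """Placeholder scorer: counts adjacent pairs in alphabetical order."""
    return sum(a < b for a, b in zip(words, words[1:]))


rank, total = rank_of(SENTENCE, toy_score)
```

Any probabilistic model can be plugged in as `score`; the question raised in the text is whether the grammatical ordering then ends up near the top of the 120 candidates.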


4.2 System Stability

The experiment by (Fong, 2022) also shows that sta-

tistical systems are not as stable as could be pre-

sumed. For example, the grammatical sentence color-

less green ideas sleep furiously ranks higher than the

ungrammatical *furiously sleep ideas green colorless

when 34,000-40,000 treebank sentences are used in

training. However, when trained with about 15,000-

32,000 sentences, not only does the ungrammatical

sentence rank higher, but it achieves a top-10 score

for this interval, a score not achieved by the gram-

matical sentence at any stage of the experiment. This

calls into question the stability of the statistical sys-

tem (Fong, 2022). Further experimentation conﬁrms

the observed instability.

The probabilistic context-free grammar (PCFG)

system is also surprisingly sensitive to perturbation in

the training data. Another experiment by (Fong and

Berwick, 2008) conﬁrms this problem, despite the

many thousands of treebank sentences available for

training. Prepositional phrase (PP) attachment am-

biguity is an important task for any syntactic parser,

with either high attachment to the VP or low attach-

ment to the NP, as in [Herman [VP [VP mixed [NP the

milk] ] [PP with the water] ] ] (PP high-attachment)

versus [Herman [VP drink [NP the milk [PP with the

water] ] ] ] (PP low-attachment).

For this sentence, the system produced a low at-

tachment representation. A single training example

was enough to account for the low attachment. To

conﬁrm this, the relevant PP was deleted from the

training example and the parser was retrained, result-

ing in an output with high attachment in both sen-

tences (Fong and Berwick, 2008). The reason for

this extreme sensitivity to perturbation in the train-

ing data is that there are millions of parameters that

need to be estimated, and this particular parser makes

use of nearly every statistical event (recorded during

training), even if those events occur only once (Fong,

2022). Since symbolic systems do not make use of

any statistical event, they cannot experience any de-

gree of perturbation.

4.3 Ambiguity Detection

In contrast to probabilistic parsers, symbolic pars-

ing systems perform very well in handling syntactic

ambiguity, as they do not depend on training data

(Alers-Valentín et al., 2019). It is enough to specify in the lexicon the categorial selection of lexical

items like drink (one NP internal argument) and mix

(one NP and one PP internal arguments). In this case,

the symbolic system will always produce a PP high-

attachment in clauses whose predicate requires a PP

internal argument (e.g. with mix as main verb), but

both PP high- and low-attachment in clauses whose

predicate does not require it (as with the verb drink).
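The role of categorial selection in PP attachment can be sketched in a few lines. The frame table and function names are assumptions for illustration, and the bracketings simply follow the notation of the Herman examples given earlier; a real parser would derive them from feature checking rather than string templates.

```python
# Hypothetical subcategorization frames: internal arguments per verb.
FRAMES = {
    "mix": ("NP", "PP"),  # the PP is a required internal argument
    "drink": ("NP",),     # no PP is selected
}


def pp_attachments(verb):
    """Attachment sites licensed for a 'V NP PP' string."""
    if "PP" in FRAMES[verb]:
        return ["high"]
    return ["high", "low"]


def bracketings(subject, verb, np, pp):
    """Spell out one bracketed structure per licensed attachment site."""
    out = []
    for site in pp_attachments(verb):
        if site == "high":
            out.append(f"[{subject} [VP [VP {verb} [NP {np}]] [PP {pp}]]]")
        else:
            out.append(f"[{subject} [VP {verb} [NP {np} [PP {pp}]]]]")
    return out
```

For `mix` the sketch returns a single high-attachment structure, while for `drink` it returns both structures, so the ambiguity is reported rather than resolved by frequency.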

4.4 Data Requirements

To produce structural representations, symbolic sys-

tems require a (manually) annotated lexicon contain-

ing an array of lexical items with the grammatical fea-

tures and properties used by the parser. The size of the

lexicon is determined by the number of lexical entries

required to characterize the target language, which is

ﬁnite by nature, probably to a maximum in the order

of 10

entries. On the other hand, current probabilistic parsing systems require massive amounts of good-quality data. Machine-generated data is of low quality and leads to poor performance, while good-quality data must be manually annotated, which makes it extremely expensive. In machine learning approaches,

a fraction of the instances is used to build and tune

the training model. The remaining instances, referred

to as the held-out instances, are used for testing. The

accuracy of predicting the labels of the held-out in-

stances is then reported as the accuracy of the model.

The fraction used to build the model is further divided

in two sets: training and validation. Strictly speaking,

the validation data is also a part of the training data,

because it inﬂuences the ﬁnal model. For very large

labeled data sets, only a modest number of examples is needed to estimate accuracy. There are two options for training the model. One is to hold out the validation set. The other is to use cross-validation, which can closely estimate the true accuracy under certain circumstances. However, cross-validation can be computationally expensive (Aggarwal, 2018).
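The partition described above can be sketched as a three-way split into training, validation and held-out instances. The fractions, the seed and the function name are assumptions chosen for the example; the text does not prescribe particular proportions.

```python
import random


def split_instances(instances, train=0.7, validation=0.15, seed=0):
    """Shuffle and split into (training, validation, held-out) lists."""
    items = list(instances)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    n_train = int(len(items) * train)
    n_val = int(len(items) * validation)
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],  # held-out instances, used only for testing
    )


train_set, validation_set, held_out = split_instances(range(1000))
```

The training and validation lists together correspond to the fraction used to build and tune the model, and accuracy would be reported only on the held-out list.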

Besides, huge amounts of data for general-purpose NLP tasks, albeit of low quality, are available for only a relatively small number of languages. For example, GPT-2 was trained on the WebText corpus,

containing about 40 GB of text data. In the case

of English, 40 GB is not particularly burdensome,

but in the case of under-represented languages, large

amounts of training data may never become available

(Fong, 2022). Diminishing returns are another (ex-

pected) negative factor: “to halve the error rate, you

can expect to need more than 500 times the com-

putational resources” (Thomson et al., 2021). The

enormous resources required, both in terms of energy and exposure to large amounts of data, mean that

these probabilistic systems, independent of their po-

tential achievements or promise of their biologically-

inspired architecture, cannot possibly meet the aus-

tere learning conditions met by nature (Fong, 2022).


4.5 Model Size and Parameters

There exists a serious number-of-degrees-of-freedom

problem (Fong, 2022) with (probabilistic) context-free grammars (CFGs), the most commonly implemented parsing formalism, as they are too unconstrained; in principle, all combinations of phrases,

both exo- and endo-centric, are possible in this frame-

work. From the point of view of empirical coverage,

CFGs are too broad. At the same time, CFGs are a

poor choice of formalism for encoding many types of

structurally-determined relations, e.g., displacement,

control, long-distance agreement or pronominal bind-

ing. Bikel observes that “...it may come as a surprise

that the [parser] needs to access more than 219 million

probabilities during the course of parsing the 1,917

sentences of Section 00 [of the Penn Treebank: SF]”

(Bikel, 2004).

Recently, general-purpose deep neural networks

have been adopted that contain vastly more parame-

ters than the statistical CFG models. For example, the

well-known GPT-2 neural net model has 1.5B param-

eters, and the next-generation GPT-3 model has 175B

parameters (Brown et al., 2020). However, it is not

clear whether these systems do anything more with

the upscaled parameter size other than simply mem-

orize more. A substantial downside of these scaled-

up systems is in terms of the computational resources

required to perform the training (Fong, 2022). Large

model sizes (more than 100 million parameters) make

it computationally expensive to train separate models

for each language (Kitaev et al., 2018). In fact, GPT-

3 is reputed to have cost around 4.6 million dollars to

train. This has resulted in a curious admission in the

case of GPT-3 (by the authors): “unfortunately, a bug

resulted in only partial removal of all detected over-

laps from the training data. Due to the cost of train-

ing, it wasn’t feasible to retrain the model” (Brown

et al., 2020).

4.6 Explanatory Adequacy

Unlike current probabilistic models, HASPNeL aims to model a generative grammar, i.e., a

theory that seeks to explain the properties of the I-

language and the system of externalization possessed

by the language user. At a deeper level, the theory of

the shared language faculty, Universal Grammar (UG)

in modern terms, is concerned with the innate fac-

tors that make language acquisition possible — fac-

tors that distinguish humans from all other organisms.

One achieves a genuine explanation of some linguistic

phenomenon only if it keeps to mechanisms that sat-

isfy the joint conditions of learnability, evolvability,

and universality, which appear to be at odds (Chom-

sky, 2021).

Models based on information-theoretic and

machine-learning ideas have been successful in a

variety of language processing tasks in which what is

sought is a decision among a ﬁnite set of alternatives,

or a ranking of alternatives (Pereira, 2002). In

each case, the task can be formalized as learning a

mapping from spoken or written material to a choice

or ranking among alternatives. However, a potential

weakness of such task-directed learning procedures

is that they ignore regularities that are not relevant

to the task, even though those regularities may be

highly informative about other questions. This is in

sharp contrast with human learners who are general

learners and as such sensitive to regularities observed

beyond those relevant to a speciﬁc task. “Further-

more, one may reasonably argue that a task-oriented

learner does not really ‘understand’ language, since

it can accurately decide just one question, while

our intuitions about understanding suggest that a

competent language user can accurately decide many

questions pertaining to any discourse it processes.

For instance, a competent language user should be

able to reliably answer ‘who did what to whom’

questions pertaining to each clause in the discourse”

(Pereira, 2002). We do not claim that HASPNeL will

‘understand’ language, yet it may resemble Searle’s

(1980) Chinese room, able to efﬁciently perform

operations on symbolic representations to produce

correct descriptions without having to choose or rank

among alternatives.

4.7 Cognitive Plausibility

CFGs also pose an acquisition problem that contrasts

with the human experience. Unlike the case of the

cognitively-unrealistic treebank containing already-

parsed sentences, hierarchical structure is not ex-

plicitly represented in primary linguistic data (Fong,

2022). General-purpose systems (GPS) are attractive

to the engineering community; advantages include

ﬂexibility across problem sets and (non-language) do-

mains. There is also an intuitive appeal in assuming

setup simplicity in the language domain as if “nothing

necessarily particular to language is hardcoded ahead

of time” (Fong, 2022). One can regard these GPS

as a continuation of the behaviorist conception, as in

Bloomﬁeld’s description of language as “a matter of

training and habit” (Chomsky, 2021). “However, with

so many parameters, the chief downsides are that a

lot of training data is required, much more than what

seems to be cognitively plausible, and that there are

burdensome requirements in terms of computational


resources (for training). The term overparameteriza-

tion is used when a model has many more param-

eters than data points, like in GPT-2’s case, poten-

tially leading to overﬁtting, i.e., memorization of the

training data, rather than true generalization” (Fong,

2022).

“While typically task-agnostic in architecture, this

method still requires task-speciﬁc ﬁne-tuning datasets

of thousands or tens of thousands of examples. By

contrast, humans can generally perform a new lan-

guage task from only a few examples [...] —some-

thing which current NLP systems still largely struggle

to do” (Brown et al., 2020).

We agree with Pereira’s conclusion that “although

statistical learning theory and its computational ex-

tensions can help us ask better questions and rule

out seductive non sequiturs, their quantitative re-

sults are still too coarse to narrow signiﬁcantly the

ﬁeld of possible acquisition mechanisms. However,

some of the most successful recent advances in ma-

chine learning arose from theoretical analysis (Cortes

& Vapnik, 1995; Freund & Schapire, 1997), and

theory is also helping to sharpen our understanding

of the power and limitations of informally-designed

learning algorithms” (Pereira, 2002). On the other

hand, information-theoretic and computational ideas

are also playing an increasing role in the scientiﬁc un-

derstanding of language. We envision our proposed

hybrid system as a step towards bringing together the

best of these seemingly irreconcilable perspectives of

formal linguistics and information theory.

5 EVALUATION AND

ASSESSMENT

Some evaluation methods commonly used in the

NLP community are not suitable for HASPNeL. That

does not mean that the system cannot be evaluated,

but rather that evaluation must be grounded in linguis-

tic principles and formal computational methods.

Since we claim that the HASPNeL system over-

comes some of the disadvantages of statistical parsers,

it would seem reasonable at first to attempt to evaluate

our system’s output against that of statistical

systems like the Stanford Parser. However, there

are two reasons why this attempt would be futile. In

the first place, it is not possible to compare the representations

produced by a symbolic system with those

of a probabilistic one: by design, parses produced

by symbolic systems have to be grammatical,

whereas parses produced by probabilistic systems carry no

guarantee or presumption of grammaticality. Symbolic

parsers such as HASPNeL only produce trees that

can be generated by the grammar procedures and restrictions

(external and internal Merge, unification,

locality constraints) that are modeled by the system.

Probabilistic systems, on the other hand, always parse

any string of words, regardless of its grammaticality.

The Stanford Parser documentation (Group, 2022) states

that “this parser is in the space of modern statistical

parsers whose goal is to give the most likely sentence

analysis to a list of words. It does not attempt to determine

grammaticality, though it will normally prefer

a ‘grammatical’ parse for a sentence if one exists.”

In answering why a parse tree assigned to a sentence

may be wrong, they give as a possible explanation

that “it may be because the parser made a mistake.

While our goal is to improve the parser when we

can, we can’t fix individual examples. The parser is

just choosing the highest probability analysis according

to its grammar.” Comparing the grammaticality of

HASPNeL’s output with that of a parser like Stanford’s

would be advantageous to our assessment, but in

the end it would not say much about our system.
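The contrast drawn above can be illustrated with a minimal sketch. It is not HASPNeL's implementation: a tiny binary context-free grammar stands in for the Minimalist grammar purely for brevity, and the lexicon, the rules, and the function name `all_parses` are hypothetical. The sketch only shows the behavior attributed to symbolic parsers: every tree licensed by the grammar is returned, and an unlicensed string receives no tree at all.

```python
from functools import lru_cache

# Hypothetical toy lexicon and binary rules, for illustration only.
LEXICON = {
    "I": {"NP"}, "saw": {"V"}, "the": {"D"},
    "man": {"N"}, "telescope": {"N"}, "with": {"P"},
}
RULES = {  # (left child, right child) -> possible parent categories
    ("NP", "VP"): {"S"},
    ("V", "NP"): {"VP"},
    ("VP", "PP"): {"VP"},
    ("D", "N"): {"NP"},
    ("NP", "PP"): {"NP"},
    ("P", "NP"): {"PP"},
}


def all_parses(words, goal="S"):
    """Return every tree of category `goal` that the toy grammar licenses."""
    words = tuple(words)

    @lru_cache(maxsize=None)
    def span(i, j):
        # All (category, tree) pairs that cover words[i:j].
        if j - i == 1:
            cats = sorted(LEXICON.get(words[i], ()))
            return tuple((cat, (cat, words[i])) for cat in cats)
        found = []
        for k in range(i + 1, j):  # try every binary split point
            for lcat, ltree in span(i, k):
                for rcat, rtree in span(k, j):
                    for parent in sorted(RULES.get((lcat, rcat), ())):
                        found.append((parent, (parent, ltree, rtree)))
        return tuple(found)

    return [tree for cat, tree in span(0, len(words)) if cat == goal]


if __name__ == "__main__":
    ambiguous = "I saw the man with the telescope".split()
    scrambled = "saw I the man".split()
    print(len(all_parses(ambiguous)))  # 2 (high and low PP attachment)
    print(all_parses(scrambled))       # [] (no licensed tree)
```

Under this toy grammar the ambiguous sentence receives two trees and the scrambled string receives none, whereas a parser of the kind described in the quoted documentation would return a single highest-probability analysis in either case.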

Another problem with comparing HASPNeL

parse trees against those of a statistical system

is that there is no match between the structural descriptions

produced by the two systems. At

the core of the HASPNeL system there is a minimalist

grammar, following contemporary linguistic theory.

Among many other things, minimalist trees are

strictly binary and endocentric (every phrase or projection

has to have a head of the same category), while

statistical systems still use PCFGs, with unrestricted

rules that allow for multiply branching nodes and exocentric

representations. The labeling conventions also

differ too much to be compared directly. Evaluation

methods sometimes applied to probabilistic parsers,

such as measuring the accuracy of a structural description

by counting and comparing nodes and labels

in trees, are not linguistically plausible. Structural

representations are grammar-dependent, so they do

not have an absolute or “fixed” number of nodes and

branches. Likewise, trees may have the same number

of the same labels even though they describe

different structures. These kinds of comparisons may

be somewhat useful between systems using the same

grammar, but otherwise they do not produce a valid

assessment. Because statistical parsers typically

choose only “the highest probability

analysis according to its grammar”, they are, unlike

HASPNeL, not particularly well suited to detecting

ambiguity, either lexical or structural.
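The point about label counting can be made concrete with a small sketch. The tuple encoding of trees, the example compound, and the helper names `labels` and `spans` are assumptions introduced here for illustration; they are not part of HASPNeL. The two trees below are both binary and endocentric and contain exactly the same labels the same number of times, yet they bracket the string differently.

```python
from collections import Counter


def labels(tree):
    """Multiset of node labels in a tree encoded as (label, child, ...)."""
    if isinstance(tree, str):  # a bare string is a word, not a node
        return Counter()
    label, *children = tree
    total = Counter([label])
    for child in children:
        total += labels(child)
    return total


def spans(tree, start=0):
    """Return (set of (label, start, end) constituents, end position)."""
    label, *children = tree
    pos = start
    found = set()
    for child in children:
        if isinstance(child, str):
            pos += 1  # each word occupies one position
        else:
            sub, pos = spans(child, pos)
            found |= sub
    found.add((label, start, pos))
    return found, pos


# "toy car factory": a factory for toy cars vs. a car factory that is a toy.
left = ("N", ("N", ("N", "toy"), ("N", "car")), ("N", "factory"))
right = ("N", ("N", "toy"), ("N", ("N", "car"), ("N", "factory")))

if __name__ == "__main__":
    print(labels(left) == labels(right))      # True: identical label counts
    print(spans(left)[0] == spans(right)[0])  # False: different constituents
```

A metric based only on node and label counts would score these two analyses as identical, although they correspond to different interpretations.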

From the arguments outlined above, we conclude

that the best evaluation of the results obtained from a

symbolic, knowledge-based system can only be done

by experts who, in this case, have to be human. A

ICAART 2023 - 15th International Conference on Agents and Artiﬁcial Intelligence

descriptive adequacy assessment methodology similar

to that presented by Gomez-Marco (2015) can serve

as a suitable benchmark to assess the correctness

of the HASPNeL parsing model. Expert evaluators

will be given a number of sentences with their

corresponding structural representations produced by

the system. Each evaluator assesses syntactic criteria

per sentence, such as (1) the clause’s immediate

constituents, (2) each constituent’s internal structure,

(3) argument structure, (4) identification of categories

and projections, and (5) detection of structural ambiguities

by representations that succeed in criteria 1-4.

Each criterion can be assessed as a Boolean or on

a scale. Evaluators may write comments about their

judgements and observations, which shall be used to

fix bugs in the system’s theory modeling.
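One possible way to record and aggregate such expert judgements is sketched below. The record layout, the field names, and the `pass_rates` helper are hypothetical; the sketch assumes the Boolean variant of the assessment and simply reports, per criterion, the fraction of judgements that passed.

```python
from dataclasses import dataclass

# Field names are illustrative labels for criteria (1)-(5) listed above.
CRITERIA = (
    "immediate_constituents",      # criterion 1
    "internal_structure",          # criterion 2
    "argument_structure",          # criterion 3
    "categories_and_projections",  # criterion 4
    "ambiguity_detected",          # criterion 5
)


@dataclass(frozen=True)
class Judgement:
    """One evaluator's Boolean assessment of one sentence's representation."""
    immediate_constituents: bool
    internal_structure: bool
    argument_structure: bool
    categories_and_projections: bool
    ambiguity_detected: bool
    comment: str = ""  # free-text observations from the evaluator


def pass_rates(judgements):
    """Fraction of judgements that pass each criterion."""
    n = len(judgements)
    if n == 0:
        raise ValueError("no judgements to aggregate")
    return {c: sum(getattr(j, c) for j in judgements) / n for c in CRITERIA}


if __name__ == "__main__":
    sample = [
        Judgement(True, True, True, True, False, "low-attachment reading missing"),
        Judgement(True, True, False, True, False),
    ]
    for criterion, rate in pass_rates(sample).items():
        print(f"{criterion}: {rate:.2f}")
```

A scaled assessment could replace the Boolean fields with integers and report means instead of pass rates; the comments would be reviewed separately.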

Since we are working with a synthetic corpus of

a manageable size, at a later stage of the project, we

may be able to measure by hand the cases of lexical

ambiguity in the annotated synthetic corpus and calculate

the likelihood of structural ambiguity in sentences

with those lexical units that are ambiguous

with respect to the corpus. To assess the system’s ambiguity

estimation, these measurements may be compared

with both the system’s detection of possible

ambiguity and its likelihood estimates

for each interpretation in those cases.
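The hand measurement of lexical ambiguity could be cross-checked with a simple count, sketched below. The corpus format (a list of word and category pairs), the category labels, and both function names are assumptions made for illustration, and plain relative frequency is used here only as a reference figure; it is not the estimate produced by the system's neural encoder.

```python
from collections import Counter, defaultdict


def tag_distribution(tagged_corpus):
    """Relative frequency of each category per word form in the corpus."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word.lower()][tag] += 1
    return {
        word: {tag: n / sum(tags.values()) for tag, n in tags.items()}
        for word, tags in counts.items()
    }


def ambiguous_items(distribution):
    """Word forms attested with more than one category in the corpus."""
    return {w: d for w, d in distribution.items() if len(d) > 1}


if __name__ == "__main__":
    # Hypothetical annotated fragment.
    corpus = [
        ("the", "D"), ("duck", "N"), ("swims", "V"),
        ("they", "NP"), ("duck", "V"),
        ("a", "D"), ("duck", "N"),
    ]
    for word, dist in ambiguous_items(tag_distribution(corpus)).items():
        print(word, dist)
```

Figures obtained this way could then be set beside the system's own ambiguity detection and likelihood estimates for the same items.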

6 CONCLUSIONS

Although machine learning systems have the advantage

of relatively fast and easy training, they fail

to acquire the capacity to detect structural ambigu-

ity that gives rise to semantic ambiguity. Symbolic

systems, on the other hand, do account for structural

ambiguities and are suitable for the construction of a

knowledge base as a model of human language cognition.

The system we propose exploits the advantages

of both strategies, as current literature suggests

that NLP implementations are improved by combining

resources from both probabilistic and symbolic

AI, each performing the specific tasks for which it is

best suited. Syntactic formalisms of minimalist grammars

and tree-adjoining grammars will be implemented in

the system, which can be used as a computational

model of language knowledge and acquisition, as well

as to test current syntactic theory. This system may

also serve as a foundation for applications in education,

text editing, and the development of other human lan-

guage technologies, particularly for underrepresented

languages which cannot beneﬁt from big data ap-

proaches.

ACKNOWLEDGEMENTS

This material is based upon work supported by the

National Science Foundation (NSF) under Grant Nos.

2219712 and 2219713. Any opinions, ﬁndings, and

conclusions or recommendations expressed in this

material are those of the authors and do not neces-

sarily reﬂect the views of the NSF.

REFERENCES

Aggarwal, C. C. (2018). Neural Networks and Deep Learn-

ing. Springer.

Alers-Valentín, H., Rivera-Velázquez, C. G., Vega-Riveros,

J. F., and Santiago, N. G. (2019). Towards a princi-

pled computational system of syntactic ambiguity de-

tection and representation. In Proceedings of the 11th

International Conference on Agents and Artificial Intelligence

- NLPinAI, volume 2, pages 980–987.

INSTICC, SciTePress.

Bernardy, J.-P. and Chatzikyriakidis, S. (2019). What kind

of natural language inference are NLP systems learning:

Is this enough? In Proceedings of the 11th International

Conference on Agents and Artificial Intelligence

- NLPinAI, volume 2, pages 919–931.

INSTICC, SciTePress.

Bikel, D. M. (2004). Intricacies of Collins’ parsing model.

Computational Linguistics, 30:479–511.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.,

Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G.,

Askell, A., Agarwal, S., Herbert-Voss, A., Krueger,

G., Henighan, T., Child, R., Ramesh, A., Ziegler,

D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler,

E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner,

C., McCandlish, S., Radford, A., Sutskever, I., and

Amodei, D. (2020). Language models are few-shot

learners.

Cho, K., van Merriënboer, B., Bahdanau, D., and Bengio, Y.

(2014). On the properties of neural machine transla-

tion: Encoder–decoder approaches. In Proceedings of

SSST-8, Eighth Workshop on Syntax, Semantics and

Structure in Statistical Translation, pages 103–111,

Doha, Qatar. ACL.

Chomsky, N. (1956). Three models for the description of

language. IRE Transactions on Information Theory,

2:113–124.

Chomsky, N. (1981). Lectures on Government and Binding.

Number 9 in Studies in generative grammar. Foris,

Dordrecht.

Chomsky, N. (1995). The minimalist program. MIT Press.

Chomsky, N. (2001). Derivation by phase (mitopl 18). In

Ken Hale: A Life in Language, pages 1–52. MIT Press.

Chomsky, N. (2008). On phases. In Foundational Issues

in Linguistic Theory: Essays in Honor of Jean-Roger

Vergnaud, pages 133–166. MIT Press.

Chomsky, N. (2021). Minimalism: Where are we now, and

where can we hope to go. Gengo Kenkyu, 160:1–41.

Collins, C. and Stabler, E. (2016). A formalization of mini-

malist syntax. Syntax, 19(1):43–78.

Durrett, G. and Klein, D. (2015). Neural CRF parsing. In

Proceedings of the 53rd Annual Meeting of the As-

sociation for Computational Linguistics and the 7th

International Joint Conference on Natural Language

Processing (Volume 1: Long Papers), pages 302–312,

Beijing, China. ACL.

Fong, S. (2005). Computation with probes and goals. In UG

and External Systems: Language, Brain and Compu-

tation, pages 311–334. John Benjamins, Amsterdam.

Fong, S. (2022). Simple models: Computational and lin-

guistic perspectives. Journal of the Institute for Re-

search in English Language and Literature, 46:1–48.

Fong, S. and Berwick, R. (2008). Treebank parsing and

knowledge of language: A cognitive perspective. In

Proceedings of the Annual Conference of the Cogni-

tive Science Society, volume 30.

Fong, S. and Ginsburg, J. (2019). Towards a minimalist

machine. In Minimalist Parsing, pages 16–38. Oxford

University Press.

Gaddy, D., Stern, M., and Klein, D. (2018). What’s go-

ing on in neural constituency parsers? an analysis.

In Proceedings of the Conference of the North Amer-

ican Chapter of the Association for Computational

Linguistics: Human Language Technologies (NAACL

HLT) 2018, volume 1, pages 999–1010.

Gomez-Marco, O. (2015). Towards an X-bar Parser: a

Model of English Syntactic Performance. PhD thesis,

University of Puerto Rico Mayagüez.

Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep

Learning. MIT, Cambridge, MA.

Graf, T. (2021). Minimalism and computational linguistics.

Group, S. N. L. P. (2022). Stanford parser faq.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term

memory. Neural Computation, 9(8):1735–1780.

Joshi, A. and Shabes, Y. (1997). Tree-adjoining grammars.

In Rozenberg and Salomaa, editors, Handbook of For-

mal Languages. Springer-Verlag, Berlin.

Kitaev, N., Cao, S., and Klein, D. (2018). Multilingual con-

stituency parsing with self-attention and pre-training.

Kitaev, N. and Klein, D. (2018). Constituency parsing with

a self-attentive encoder. In Proceedings of the 56th

Annual Meeting of the Association for Computational

Linguistics (Volume 1: Long Papers), pages 2676–

2686, Melbourne. ACL.

Pereira, F. (2002). Formal grammar and information theory:

Together again? In Nevin, B. E. and Johnson, S. B.,

editors, The Legacy of Zellig Harris: Language and

Information into the 21st Century. Volume 2: Mathe-

matics and Computability of Language, pages 13–32.

John Benjamins, Amsterdam.

Petkevič, V. (2014). Ambiguity, language structures and

corpora. La linguistique, 50(2):63–82.

Stabler, E. (1997). Derivational minimalism. In Retoré, C.,

editor, Logical Aspects of Computational Linguistics,

pages 68–95, Berlin, Heidelberg. Springer Berlin Hei-

delberg.

Stabler, E. (2011). Computational perspectives on mini-

malism. In Boeckx, C., editor, Oxford Handbook of

Linguistic Minimalism, pages 617–643. Oxford Uni-

versity Press.

Stanojević, M. and Stabler, E. (2018). A sound and complete

left-corner parsing for Minimalist Grammars. In

Proceedings of the Eighth Workshop on Cognitive Aspects

of Computational Language Learning and Processing,

pages 65–74, Melbourne. ACL.

Thomson, N. C., Greenwald, K., Lee, K., and Manso, G. F.

(2021). Deep learning’s diminishing returns: The

cost of improvement is becoming unsustainable. IEEE

Spectrum.

Torr, J., Stanojević, M., Steedman, M., and Cohen, S. B.

(2019). Wide-coverage neural A* parsing for Min-

imalist Grammars. In Proceedings of the 57th An-

nual Meeting of the Association for Computational

Linguistics, pages 2486–2505, Florence, Italy. ACL.

Wolpert, D. and Macready, W. (1997). No free lunch theo-

rems for optimization. IEEE Transactions on Evolu-

tionary Computation, 1:67.
