Object Parsing Grammars with Composition

Stefan Sobernig

Institute for Information Systems and New Media, WU Vienna, Welthandelsplatz 1, A-1020 Vienna, Austria

Keywords:

Language-product Line, Parsing Expression, Object Graph, Grammar Reuse, Grammar Transformation.

Abstract:

An Object Parsing-Expression Grammar (OPEG) is an extension of parsing expression grammars (PEG) in-

cluding generator expressions to directly produce object graphs from parsed text. This avoids typical abstrac-

tion mismatches of intermediate parse representations (e.g., decomposition mismatches). To develop language

families via extension, uniﬁcation, and extension compositions, OPEGs can be composed—without preplan-

ning and with unmodiﬁed reuse. Composability is established by supporting both forming basic grammar

unions and performing grammar transformations between two or more OPEGs (e.g., rule extraction, sym-

bol rewriting). These transformation operators assist developers in mitigating the consequences of the non-

disjointness under composition of parsing expressions (e.g., language hiding). An implementation of OPEGs

is available as part of the multi-DSL development system DjDSL.

1 INTRODUCTION

Developing variable software languages shifts em-

phasis from developing and analysing a single lan-

guage to developing and to analysing composable de-

velopment artefacts for a language family. This am-

bition gave rise to approaches to language-product

line engineering (Méndez-Acuña et al., 2016; Kühn et al., 2015; Jézéquel et al., 2015; Liebig et al., 2013)

and their supporting multi-language development sys-

tems. Their shared goals are to minimise preplanning

effort as well as, at the same time, to reuse develop-

ment artefacts and language tooling in an unmodiﬁed

manner.

The composability of language-deﬁnition artefacts

(e.g., deﬁnitions of abstract and concrete syntaxes,

context conditions, behaviour, and test cases) is a pre-

requisite for a variable design and implementation of

a language (Erdweg et al., 2015). Composability is

also required to adhere to the conditions of a modular

de- and re-composition, for example, preserving mod-

ular comprehensibility of grammars and grammar ex-

tensions (Johnstone et al., 2014). Language compo-

sition must be tackled at different levels of language

and processing (abstract syntax, context conditions,

behaviour implementation). The emphasis in this pa-

per is on composition of concrete-syntax deﬁnitions.

Syntax-level composition involves two or more

concrete-syntax deﬁnitions (e.g., production or pars-

ing grammars) to become combined. Text writ-

ten using the combined syntax must be processed—

by grammar-based parser generators, grammar inter-

preters, or parser combinators—as if it were pro-

duced from or recognised by distinct syntax deﬁni-

tions. Each syntax piece is thought of as conform-

ing to one of the source syntaxes, or a mix of source-

syntax fragments. The deﬁnition of a resulting parser

(by a composed grammar or by a parser combina-

tor) is ideally formed by referencing the source def-

initions, rather than cloning them. This has the ben-

eﬁt of tracking any modiﬁcation in the source deﬁ-

nitions without further intervention by the developer

(e.g., providing patch code to the generated parser).

Reusing the source deﬁnitions without modiﬁcation

is the key objective.

Rendering syntax deﬁnitions composable presents

important challenges (Degueule, 2016, Section 3.3).

The main challenge is ambiguity under composition.

Ambiguity can arise as an unwanted consequence

of a composition: Two unambiguous grammars may

enter a composition and turn into an ambiguous com-

posed grammar (Diekmann and Tratt, 2014; van der

Storm et al., 2014). Ambiguous parsing is particularly

critical when each parse presents different and possi-

bly contradicting interpretations in terms of the under-

lying semantics. But also earlier, when constructing

higher-level parse representations such as an abstract-

syntax graph (ASG), each valid parse may translate

into a different ASG (van der Storm et al., 2014, Sec-

tion 2.2).

In parsing expression grammars (PEG), ambiguity under composition takes a characteristic form: language hiding. Parsing expressions from two or more grammars, when combined, may result in a composed recogniser that hides a language's syntax unintentionally.

    E ← `Event` ON name:<alnum>+ ;
    void: ON ← WS 'on' WS;

Listing 1: An excerpt from an Object Parsing-Expression Grammar (OPEG), showing two parsing rules in EBNF-like notation. The first rule exemplifies the inline mapping of concrete-syntax elements to object-classes (Event) and their properties (name).

Sobernig, S.
Object Parsing Grammars with Composition.
DOI: 10.5220/0010558303730385
In Proceedings of the 16th International Conference on Software Technologies (ICSOFT 2021), pages 373-385
ISBN: 978-989-758-523-4

This paper documents ﬁne-grained grammar

transformations to prevent unintended language hid-

ing from happening when composing Object Parsing-

Expression Grammars (OPEG). OPEGs are a general

PEG extension for deﬁning, in one, a concrete syn-

tax as well as a mapping between concrete syntax and

an object-oriented primary abstract syntax (ASG). For

production grammars, these became known as object

grammars (van der Storm et al., 2014).

The ﬁne-grained grammar transformations intro-

duced in this paper include rule extractions with and

without symbol rewriting, transitive symbol rewriting

as well as rule removals (see Section 3).

A proof-of-concept implementation, the running examples as well as the code listings in this paper are available from a supplemental Web site as an executable tutorial.¹

2 PARSING TO OBJECT GRAPHS

A parsing expression grammar (PEG; Ford 2004) is defined as a 4-tuple $G = (N, T, R, e_S)$. $N$ denotes the finite set of non-terminals, $T$ is the finite set of terminals, $R$ is the finite set of rules, and $e_S$ is the start expression. Each rule $r \in R$ is a pair $(A, e)$ typically written as a maplet A ← e, with $A \in N$ and $e$ being a parsing expression.

A parsing expression deﬁnes a pattern to match

(recognise) and, if matched, to consume a speciﬁed

fragment of input. A parsing expression is deﬁned

using the empty string (ε), the sets of terminals and

non-terminals (N, T ), as well as operator expressions

summarised in Table 1 (1–12).

The meaning of a PEG is given by a recognition program (Grune and Jacobs, 2010, Section 15.7). A recognition program is a program for recognising and structuring (incl. tokenising, parsing) a string. The (operational) meaning of a PEG-based recognition program can be thought of as a character-level interpreter of some input that works left to right, top-down to recognise and, if recognised, to consume the matched input. The interpreter always consumes the longest possible matched prefix of some input. A given parsing expression is said to succeed when it consumes what it has recognised; if an expression fails (i.e., it does not recognise anything), it consumes nothing from the input. This is even so when some of its sub-expressions have succeeded.

¹ https://github.com/mrcalvin/djdsl

Table 1: Overview of the operators available for OPEG/PT parsing expressions. The operators correspond to those of other parsing-expression language implementations, e.g., Rats! (Grimm, 2006), APEG (Reis et al., 2015), Arpeggio (Dejanović et al., 2016). Note that ε (epsilon) stands for matching the empty string. OPEG extensions are generators (13–15; see Section 2.2). Other extensions are inherited from the reused parsing virtual machine (PT): rule modifiers for syntax-tree generation (e.g., void, leaf, value); character classes (e.g., <alnum>, <digit>, <xdigit>).

     #   op           description                     desugared
     1   e1 e2        sequence                        e1 e2
     2   e1 / e2      prioritised (ordered) choice    e1 / e2
     3   'd'          literal character               'd'
     4   'abc'        literal string                  'a' 'b' 'c'
     5   [A-z0-9]     character ranges                [A-z] / [0-9]
     6   .            any character                   .
     7   (e)          sub-expression (group)          (e)
     8   e?           optional expression             e / ε
     9   e*           inclusive-or (zero-or-more)     e*
    10   e+           inclusive-or (one-or-more)      e e*
    11   !e           not predicate                   !e
    12   &e           and predicate                   !(!e)
    13   `c` e        instantiation generator         `c` e
    14   f:e          assignment generator            f ← e
    15   f:(`q` e)    query generator                 f ← `q` e

Parsing expressions can contain operator expres-

sions and operator behaviours not available in other

parsing approaches. Most importantly, for a given

expression, alternate subexpressions are tried in their

order of deﬁnition. The ﬁrst one to succeed wins, any

remaining ones are discarded. This is referred to as

a prioritised or ordered choice (see operator 2 in Ta-

ble 1). Prioritised or ordered choice has been doc-

umented as the key discriminator between PEG and

CFG (Mascarenhas et al., 2014). On top, the choice

operator gives rise to all difﬁculties associated with

PEG regarding composition: ambiguity handling and

language hiding.


2.1 Language Hiding

Parsing expressions build on ordered choices and un-

limited lookahead. These preclude the possibility of

ambiguous parses, but incur the risk of unwanted lan-

guage hiding. Language hiding is a practical conse-

quence of the absence of general semi-disjointness of

a choice expression (Schmitz, 2006) for the scope of

the language matched by a PEG.

Consider appending a parsing expression $e_2$ as an alternate to an existing parsing rule S ← $e_1$, yielding S ← $e_1 / e_2$ as a result of composing two PEGs. This ordered choice is commutative only if $e_1$ and $e_2$ are semi-disjoint expressions, that is, they succeed in consuming input from two languages that are semi-disjoint. Otherwise, the choice is not commutative and the order of composition becomes essential for the parsing result.

Intuitively, $e_1$ and $e_2$ are semi-disjoint if $e_1$ does not overlap with any prefix also recognised by $e_2$ (Schmitz, 2006, Section 4). Disjointness must also hold for any super-expression that contains $e_1 / e_2$. Consider the parsing rule S ← ('aa'/'a') 'a', with 'aa' as parsing expression $e_1$ and 'a' as $e_2$. This rule will successfully consume exactly one input: aaa. Input aa will be rejected because $e_1$ (aa) is tested first, rejecting any input that lacks a third a. When flipping the order between $e_1$ and $e_2$ from ('aa'/'a') to ('a'/'aa'), only aa will be consumed and aaa is now rejected.

Covering all input, i.e., the language {aa, aaa}, in this one example with a single expression requires an informed re-arrangement. First, the sub-expression ($e_1 / e_2$) must be moved to the right, yielding $e_3\,(e_1 / e_2)$, where $e_3$ denotes the remaining 'a'. This is equivalent to writing $(e_3\,e_1) / (e_3\,e_2)$ according to the distributive property of the ordered choice (Ford, 2004, Section 3.7). This way, $e_1$ cannot fail the super-expression unconditionally: the second alternate will be tested when $e_1$ fails the first one. Second, within the sub-expression, care must be taken that the expression consuming more of the input (the longer prefix) on success is tested first. This re-arrangement yields S ← 'a' ('aa'/'a').

To summarise: Language hiding occurs when a

(greedy) alternate of a choice expression prevents a

later alternate from being applied to inputs that it

could otherwise succeed on. This is also called a pre-

emptive preﬁx capture (Redziejowski, 2008). See also

Section 3.7 in Ford 2004.
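The three variants of rule S discussed in this section can be replayed with a few parser combinators. The following is a minimal sketch in Python, not the OPEG implementation: the helper names lit, seq, choice, and accepts are assumptions made for this illustration, and a rule is taken to accept an input only if it consumes it entirely.

```python
from typing import Callable, Optional

# A parser maps (text, position) to the position after the match,
# or to None if it fails (in which case nothing is consumed).
Parser = Callable[[str, int], Optional[int]]

def lit(s: str) -> Parser:
    """Literal string."""
    def parse(text: str, pos: int) -> Optional[int]:
        return pos + len(s) if text.startswith(s, pos) else None
    return parse

def seq(*parsers: Parser) -> Parser:
    """Sequence: every sub-expression must succeed in turn."""
    def parse(text: str, pos: int) -> Optional[int]:
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse

def choice(*parsers: Parser) -> Parser:
    """Ordered choice: commit to the first alternate that succeeds."""
    def parse(text: str, pos: int) -> Optional[int]:
        for p in parsers:
            end = p(text, pos)
            if end is not None:
                return end  # later alternates are never revisited
        return None
    return parse

def accepts(rule: Parser, text: str) -> bool:
    """True if the rule consumes the whole input."""
    return rule(text, 0) == len(text)

hiding = seq(choice(lit("aa"), lit("a")), lit("a"))   # S <- ('aa'/'a') 'a'
flipped = seq(choice(lit("a"), lit("aa")), lit("a"))  # S <- ('a'/'aa') 'a'
fixed = seq(lit("a"), choice(lit("aa"), lit("a")))    # S <- 'a' ('aa'/'a')
```

Because choice never revisits a later alternate once an earlier one has succeeded, hiding and flipped each cover only one of the two inputs, while fixed covers both aa and aaa.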

2 start idle

4 state idle

5 on doorClosed go active

7 state active

8 on lightOn go waitingForDrawer

9 on drawerOpened go waitingForLight

11 state waitingForDrawer

12 on drawerOpened go unlockedPanel

14 state unlockedPanel

15 go idle on panelClosed

17 state waitingForLight

Listing 2: Miss Grant is told to maintain a secret compart-

ment in her bedroom. This compartment requires a partic-

ular sequence of actions from her side to become unlocked

for her to open. The corresponding state-machine models

the modal behaviour of the software-based compartment

controller, reacting to Miss Grant’s input actions (Fowler,

2010, Section 1.1.1).

2.2 Object Parsing Expressions

A parsing grammar can contain extended parsing expressions to process the consumed syntactic structure into an object graph. This way, an OPEG definition serves two purposes at once: (a) input recognition and (b) mapping the recognised input onto objects, their fields, and non-hierarchical relationships between the mapped objects. This is in the spirit of object (production) grammars (van der Storm et al., 2014).

In what follows, the extensions to parsing gram-

mars are highlighted by referring to the running ex-

ample of modelling the state machine driving “Miss

Grant’s Controller”. In later sections, this is referred

to as the State-Machine Deﬁnition Language (SMDL)

notation. Listing 2 depicts the concrete-syntax snip-

pet of a state-machine deﬁnition.

Object parsing expressions are only covered to the

extent necessary to provide a general background and

to render the contributions on grammar transforma-

tions in Section 3 understandable (object generation,

alternates, associations, and references). For a com-

prehensive coverage including details on multi-valued

properties and non-positional parsing, see Sobernig

(2020).

Object Generation. Parsing rules in parsing grammars can contain special-purpose expressions at their RHS that compute one or several instantiations of object-classes when their rule is applied. These expressions are referred to as instantiation generators (see Table 1, operator 13). Listing 1 shows a grammar excerpt with two rules E and ON, with WS handling and discarding whitespace characters (the WS rule not being shown). Rule E consumes trigger-event definitions for state machine transitions of the form on doorClosed (line 5, Listing 2). It features the rule element Event enclosed by single grave accents (`...`). This is an instantiation generator that will translate into an instantiation call for an object-class Event.

    T ←
      `Transition` trigger:E GO target:<alnum>+ /
      `Transition` GO target:<alnum>+ trigger:E;

Listing 3.

To become useful, a parsing rule can be extended

to include assignment generators (see Table 1, op-

erator 14). These generators mark recognised and

consumed values from the processed input as values

to become assigned to the properties of objects cre-

ated by an instantiation generator. Listing 1 shows

the example of an assignment generator for a prop-

erty name. The so-generated assignment binds any

value returned from applying the parsing expression

<alnum>+, that is, a string of at least one alphanumerical character. In the example, this value denotes the event's name.

Alternates. Each alternate at a RHS of a parsing

rule, i.e., the operand parsing expressions of an or-

dered choice, can deﬁne an instantiation generator.

The instantiation generators can point to the same or

different object-classes. Listing 3 demonstrates how

two alternative writing styles for transitions (i.e., on-

go vs. go-on) could be deﬁned as alternates.

In accordance with the semantics of ordered

choices in parsing grammars, only the generator as

part of the matching choice branch will be evaluated.

For all but the transition deﬁnition on line 15, List-

ing 2, the ﬁrst alternate applies; the second alternate

applies then to the input on line 15.
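The effect of placing one instantiation generator in each alternate can be approximated outside of OPEG. The sketch below is a rough Python analogue under stated assumptions: regular expressions stand in for the parsing expressions of Listings 1 and 3, the classes Event and Transition mirror the object-classes named there, and parse_transition is a hypothetical helper introduced only for this illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Event:
    name: str

@dataclass
class Transition:
    trigger: Event
    target: str

# Stand-in for: E <- `Event` ON name:<alnum>+
EVENT = r"on\s+(?P<name>[A-Za-z0-9]+)"
# Stand-in for: GO target:<alnum>+
GO = r"go\s+(?P<target>[A-Za-z0-9]+)"

# The two alternates of rule T, in their order of definition.
ALTERNATES = [
    re.compile(rf"\s*{EVENT}\s+{GO}\s*"),  # on-go style
    re.compile(rf"\s*{GO}\s+{EVENT}\s*"),  # go-on style
]

def parse_transition(line: str) -> Optional[Transition]:
    """Try the alternates in order; only the matching branch creates objects."""
    for alternate in ALTERNATES:
        match = alternate.fullmatch(line)
        if match:
            return Transition(trigger=Event(match["name"]),
                              target=match["target"])
    return None
```

Both writing styles yield the same kind of object: the on-go line 5 and the go-on line 15 of Listing 2 each produce a Transition whose trigger is an Event, while a line such as state idle matches neither alternate.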

Associations and References. Assignment genera-

tors allow a developer to relate objects, as deﬁned by

instantiation generators, in two ways: First, an assign-

ment generator refers to a bare parsing expression.

The result computed by this parsing expression will

be bound as value of an assignment. Given the hi-

erarchical relationship between parsing expressions,

objects are therefore related in a manner reﬂecting

the parsing hierarchy. A StateMachine references its

State instantiations, each State maintains Transi-

tions that, again, reference a trigger Event. This web

of relations corresponds to the parsing procedure.

Second, assignment generators can be used to relate objects independently from the parse. This is required because an abstract-syntax graph typically involves some form of circular initialisation (Servetto et al., 2013). This refers to associations (references) established between objects beyond those induced by the parse, i.e., at different times of a parse. To be fully resolved, circularity requires that all objects entering circular relationships have been fully initialised beforehand.

    M ← `StateMachine` START
          start:(`$root states $0` <alnum>+)
          states:S+ ;

Listing 4.

Lines 2 and 4 in Listing 2 exemplify a circular dependency between two declaration statements. Setting the start state to idle is in the preamble of the definition (line 2). The state, however, is only defined later (line 4). To defer the assignment to a moment at which the remainder of the object graph, with all states including idle, has been constructed, an assignment generator can be extended into a second form. This second form nests a parsing expression with a query generator (see Table 1, operator 15). In Listing 4, the assignment generator for the property start is assigned a parsing expression that contains such a query expression: $root states $0.

A query expression allows for navigating and for accessing the object graph under construction. The first word of a query (the command) roots the query in the object graph: $root refers to the top-level object corresponding to the root of the parse tree. $parent refers to the ancestor object according to the parse tree. $self is the self-reference to the receiver of the assignment. In addition, a query generator can refer to the parse matches of the surrounding parsing expression in a positional manner. For example, in Listing 4, $0 will bind the value computed by the first sub-expression <alnum>+ at position 0 of the sequence expression.

The result of evaluating the query expression in an environment that provides values for the predefined variables (e.g., root, parent, 0) is then assigned to the property denoted by the assignment generator. The generated assignment, however, is deferred to a moment when all objects are guaranteed to exist, according to the underlying parse.
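The deferral of query-based assignments can be pictured as a two-phase construction. The following Python sketch is an illustration only and assumes a much simplified line-based input: build, State, and StateMachine are hypothetical names, and the closure stored in deferred plays the role of the query `$root states $0` from Listing 4.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

@dataclass
class State:
    name: str

@dataclass
class StateMachine:
    states: Dict[str, State] = field(default_factory=dict)
    start: Optional[State] = None

def build(lines: List[str]) -> StateMachine:
    """Phase 1 creates objects; phase 2 runs the deferred assignments."""
    root = StateMachine()
    deferred: List[Callable[[], None]] = []
    for line in lines:
        words = line.split()
        if not words:
            continue
        if words[0] == "start":
            # The referenced state may not exist yet, so only record
            # the lookup (root.states[name]) and evaluate it later.
            deferred.append(
                lambda name=words[1]: setattr(root, "start", root.states[name])
            )
        elif words[0] == "state":
            root.states[words[1]] = State(words[1])
    for assign in deferred:  # all states are guaranteed to exist now
        assign()
    return root
```

With the input of Listing 2, start idle is read before state idle, yet the start property ends up referencing the very State object that was created later.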

3 COMPOSING PARSING EXPRESSIONS

A grammar composition relates a receiving grammar

and one or more composed grammars, with grammars

as deﬁned in Section 2. A composition produces a


resulting grammar (Johnstone et al., 2014). The fun-

damental unit of composition is a grammar rule as a

pair of a non-terminal as the rule’s LHS and a pars-

ing expression as the RHS. A collection of rules hav-

ing the identical LHS, but different RHS are said to

be alternates. The multi-sets of all rules (R) of a re-

ceiving grammar and one or more composed gram-

mars enter the composition. Composition operations

on the rules sets fall into three coarse-grained groups:

overriding, combination, and restriction (Johnstone

et al., 2014). Concrete variations of these operation

types then propagate between rules and into the sub-

expression level (e.g., alternate selection).

In Section 3.1, concrete composition operations

for OPEGs are introduced. This also highlights

speciﬁcs to PEG (as opposed to production gram-

mars). The ﬁt of the composition operations for dif-

ferent types of language compositions is elaborated

on in Sections 3.2 through 3.4.

3.1 Merges and Transforms

Object parsing grammars can be connected using a

merges relationship. A receiving grammar can merge

one or several composed grammars to obtain a result-

ing grammar, with and without intermittent grammar

transformations between the merged ones. Compo-

sition starts from a disjoint union of the rules set of

the receiving and composed grammars. To obtain a

disjoint union, all symbols at the RHS and the LHS

of the rules are qualiﬁed by their origin grammars.

This union of all rules is the basis for transformations

(extracts, rewrites) that yield the resulting OPEG. As

a default, if no transformations have been deﬁned, a

simple union with override is performed.

The grammar deﬁnition in Listing 5 deﬁnes a

merge relationship between two grammars: G1 acts

as the receiving, G0 as the composed one.

OPEGs allow for multiple levels of defining merges relationships. Before computing a resulting grammar, the collection of composed definitions is turned into an unambiguous linear order. This linear order preserves local-precedence orders. Violations (e.g., circular merges) under linearisation are signalled at definition time.² This linearisation is then used to resolve dependencies between rules and as the basis for the subsequent transformations. The merges relationship does not directly determine which kind of composition operation is to be performed between receiving and composed grammars. This is achieved in a separate step.

² Our OPEG implementation represents parsing grammars as object-classes: The merges relationship is therefore derived from an ordinary subclass-superclass relationship. This way, OPEGs can leverage the built-in C3 linearisation (Barrett et al., 1996).

    # G0 (composed grammar)
    Grammar create ::G0 -start S

    G0 loadRules {
      S <- A B / 'a';
      A <- 'a' A;
      B <- 'b';
    }

    # G1 (receiving grammar)
    Grammar create ::G1 -merges G0 -start A {
      A <- ('a' / 'A') A / D;
      D <- 'd';
    } {
      # transformations
      G0::B ==> ; # rule deletion
    }

Listing 5.
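The linearisation of merges relationships can be tried out in any language whose class system uses C3. Python's method resolution order is computed with C3, so the sketch below models grammars as plain classes and merges as subclassing. This is an analogy and not the DjDSL API: G2 and G3 are hypothetical grammars added for illustration, and linearise and can_merge are helper names assumed here.

```python
class G0:
    """Composed grammar."""

class G1(G0):
    """Receiving grammar: G1 merges G0."""

class G2(G0):
    """Hypothetical second grammar that also merges G0."""

class G3(G1, G2):
    """Hypothetical grammar merging G1 first, then G2."""

def linearise(grammar: type) -> list:
    """Names of the grammars in C3 order, most specific first."""
    return [cls.__name__ for cls in grammar.__mro__ if cls is not object]

def can_merge(*composed: type) -> bool:
    """True if a grammar merging the given ones has a consistent order."""
    try:
        type("Probe", composed, {})  # fails at definition time if inconsistent
        return True
    except TypeError:
        return False
```

Here linearise(G3) places G1 before G2 and both before G0, preserving the local precedence order, whereas can_merge(G0, G1) is rejected at definition time because G0 would have to precede its own receiver G1.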

In addition to establishing a merges relation-

ship, the receiving grammar can also deﬁne a script

of grammar transformations to implement different

composition operations. These include simple union

with override in the absence of transformations, as

well as different variants of extraction and of restric-

tion in the presence of transformations. Transforma-

tions are deﬁned as a dedicated section of a Grammar

deﬁnition (see Listing 5). The transformations can

be called repeatedly, causing a ﬂush of the resulting

grammar and a rerun of any transformations.

The composition behaviour in presence of transformations follows the procedure illustrated in an informal manner in Figure 1 (steps a–d). First, the sets of rules of the input grammars (G0, G1) are processed to turn the non-terminal names into qualified names (a): A qualified non-terminal is a non-terminal whose name is prefixed by the name of the owning grammar. For example, non-terminal A becomes qualified as G0::A. Non-terminal B thus becomes G0::B.³ Second, a union operation is performed with precedence for rules from the receiving grammar over those of the composed one (b). Due to prior name qualification, this effectively represents a disjoint-union operation. This has the consequence that the original sets of rules enter the intermediate set of rules unaltered and complete. Third, the defined transformations (e.g., extraction, restriction) are performed on this intermediate set of rules (c). In Figure 1, the example refers to the removal of the rule with the LHS non-terminal G0::B. Fourth, after completion, standard grammar cleaning is performed, most importantly: unrealisable (including undefined) non-terminals are dropped, then unreachable non-terminals are removed (d).

³ Johnstone et al. (2014) refer to this auxiliary transformation as introducing name hygiene.

[Figure 1, diagram not reproduced: the rules B <- 'b' A and A <- 'a' A pass through steps (a)–(d); they are qualified as G0::B <- 'b' G0::A and G1::A <- 'a' G1::A, united, and reduced to G1::A <- 'a' G1::A in the resulting grammar G1'.]

Figure 1: A procedural overview of creating a resulting grammar including transforms in four steps (a–d): (a) narrow: Non-terminals in the input rules-sets are turned into qualified symbols; (b) compose: the (disjoint) union of the input rules-sets is formed; (c) modify: the transformation operations (e.g., append, removal) are performed; (d) clean: cleaning operations on unrealisable and unreachable non-terminals are performed.
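The four steps can be sketched over a toy rule representation. The following Python code is a simplified illustration and not the OPEG implementation: a rules set is assumed to be a dictionary from LHS names to lists of alternates, terminals are assumed to be the symbols written in single quotes, and the cleaning step only handles undefined and unreachable non-terminals rather than unrealisable ones in general. The example rules are those shown in Figure 1.

```python
from typing import Dict, List

# LHS non-terminal -> alternates -> symbols of each alternate.
Rules = Dict[str, List[List[str]]]

def is_terminal(symbol: str) -> bool:
    return symbol.startswith("'")

def narrow(grammar: str, rules: Rules) -> Rules:
    """Step (a): prefix every non-terminal with its owning grammar."""
    def qualify(symbol: str) -> str:
        return symbol if is_terminal(symbol) else f"{grammar}::{symbol}"
    return {qualify(lhs): [[qualify(s) for s in alt] for alt in alts]
            for lhs, alts in rules.items()}

def compose(receiving: Rules, composed: Rules) -> Rules:
    """Step (b): union; receiving rules win on a name clash."""
    return {**composed, **receiving}

def remove(rules: Rules, lhs: str) -> Rules:
    """Step (c), removal only: drop the rule with the given LHS."""
    return {name: alts for name, alts in rules.items() if name != lhs}

def clean(rules: Rules, start: str) -> Rules:
    """Step (d), simplified: drop alternates mentioning undefined
    non-terminals, then drop rules unreachable from the start symbol."""
    rules = {lhs: [list(alt) for alt in alts] for lhs, alts in rules.items()}
    changed = True
    while changed:
        changed = False
        for lhs in list(rules):
            kept = [alt for alt in rules[lhs]
                    if all(is_terminal(s) or s in rules for s in alt)]
            if len(kept) != len(rules[lhs]):
                changed = True
            if kept:
                rules[lhs] = kept
            else:
                del rules[lhs]
    reachable, todo = set(), [start]
    while todo:
        name = todo.pop()
        if name in reachable or name not in rules:
            continue
        reachable.add(name)
        todo.extend(s for alt in rules[name] for s in alt if not is_terminal(s))
    return {lhs: alts for lhs, alts in rules.items() if lhs in reachable}

g0 = narrow("G0", {"B": [["'b'", "A"]]})      # composed grammar
g1 = narrow("G1", {"A": [["'a'", "A"]]})      # receiving grammar
union = compose(g1, g0)                       # two rules, no clash
result = clean(remove(union, "G0::B"), start="G1::A")
```

After qualification the two rules sets share no names, so the union keeps both rules; removing G0::B and cleaning leaves only the rule for G1::A.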

As for the actual grammar transformations supported in step (c) of Figure 1, an overview of the available operators is presented in Table 2.

Table 2.

    #      op      type     description                  example
    1      ⇐       binary   extract w/o rewrite          A ⇐ G0::A
    2      ⇔       binary   extract w/ rewrite           A ⇔ G0::A
    3      ⇐⇒∗     binary   transitive extract w/ rw     A ⇐⇒∗ G0::A
    4      ⇒       unary    remove                       G0::B ⇒
    5      ←       binary   op. 1 w/o generators         A ← G0::A
    6      ↔       binary   op. 2 w/o generators         A ↔ G0::A
           none    n/a      union with override          G1 merges set G0

An (1) extract w/o rewrite (⇐) selects the RHS expression of the referenced rule (e.g., G0::A) and introduces it into the receiving rules set. Introduction refers to either creating a new rule G1::A with the extracted RHS or adding the selected RHS as an additional alternate to an existing rule. An (2) extract w/ rewrite (⇔) proceeds as the extract. Additionally, it renames any non-terminals reachable at the extracted RHS expression using the prefix of the receiving grammar (e.g., substituting prefix G1::* for G0::*). The (3) transitive variant of extract w/ rewrite (⇐⇒∗) additionally imports any rules providing definitions for the extracted and renamed RHS non-terminals. These rule dependencies are satisfied from the pool of linearised composed grammars. Finally, (4) resulting grammars can be restricted by using the removal operator (⇒). A removal affects an entire rule or a rule alternate. To selectively insert an extracted RHS as a new alternate, the operators 1–3 allow for defining an insertion position, e.g.: A ⇐ G0::A 0. The extracted RHS expression becomes inserted at the first (zero-based) position. That is, it is prepended as a new alternate to a rule, if existing. The position qualifier defaults to prepending extracted expressions.⁴
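The difference between extracting with and without rewriting, and the role of the insertion position, can be sketched as follows. This Python code is an illustration under assumptions, not the OPEG operators themselves: rules sets are dictionaries from qualified LHS names to lists of alternates, the function extract and its parameters are hypothetical, a multi-alternate source RHS is inserted alternate by alternate, and the example rules are made up for the purpose.

```python
from typing import Dict, List

# Qualified LHS non-terminal -> alternates -> symbols of each alternate.
Rules = Dict[str, List[List[str]]]

def extract(rules: Rules, target: str, source: str,
            position: int = 0, rewrite: bool = False) -> Rules:
    """Introduce the RHS of `source` as alternate(s) of `target`.

    With rewrite=True, non-terminals carrying the source grammar's
    prefix are renamed to the prefix of the target grammar."""
    src_ns = source.split("::")[0] + "::"
    dst_ns = target.split("::")[0] + "::"

    def rename(symbol: str) -> str:
        if rewrite and symbol.startswith(src_ns):
            return dst_ns + symbol[len(src_ns):]
        return symbol

    extracted = [[rename(s) for s in alt] for alt in rules[source]]
    result = {lhs: [list(alt) for alt in alts] for lhs, alts in rules.items()}
    alternates = result.setdefault(target, [])  # new rule if none exists
    alternates[position:position] = extracted   # position 0 prepends
    return result

rules = {
    "G0::A": [["'a'", "G0::A"]],
    "G1::A": [["G1::D"]],
    "G1::D": [["'d'"]],
}
plain = extract(rules, "G1::A", "G0::A")                  # like A ⇐ G0::A
rewritten = extract(rules, "G1::A", "G0::A", rewrite=True)  # like A ⇔ G0::A
appended = extract(rules, "G1::A", "G0::A", position=1, rewrite=True)
```

Since alternates form an ordered choice, the position matters: in plain and rewritten the extracted alternate is tried before the existing one, in appended only after it has failed.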

Generators. The generators for instantiations, as-

signments, and queries as part of object parsing ex-

pressions are integral parts of the parsing expressions

also under transformation. Generators become com-

bined, extracted, and removed with the surrounding

parsing expressions or sub-expressions (alternates)

according to the stipulated behaviour of the ﬁrst four

operators (1–4). This is particularly important for

choice expressions. At their top level, generators

are elements of each alternate and can become in-

serted or removed during a transformation affecting

the respective alternate. However, to reference and to

reuse parsing (sub-)expressions without their genera-

tors (e.g., to attach matches to an alternative genera-

tor), there are two transform operators that operate on

the plain expressions, without generators (see opera-

tors 5 and 6 in Table 2). An (5) extract w/o rewrite

w/o generators (←) selects the RHS expression of the

referenced rule (e.g., G0::A), omitting any genera-

tors, and introduces it into the receiving rules set (see

also operator 1). An (6) extract w/ rewrite w/o generators (↔) performs the extraction/introduction and

patches the namespace preﬁxes (see also operator 2),

again, omitting any generators. See Sections 3.2–3.4

for concrete applications of these two generator-free

operators.

As already explained, in absence of transforma-

tions, a merges relationship defaults to a union with

override. When forming the union, the receiving rules

take precedence over the composed ones.

The operand values consumed by the six operators listed in Table 2 must be qualified (G0::A) or unqualified non-terminal names (A). Unqualified names, both on the left-hand side and on the right-hand side, will be narrowed by automatically prepending the enclosing grammar's name. The transformations are executed in their order of definition, without any particular precedence of one operator over the other (see, e.g., Listing 5). Varying transforms can be applied, repeatedly or consecutively, to obtain different resulting grammars.

⁴ This default is a consequence of the issue of language hiding in PEG.

     2  graph {
     3    // node definitions
     4    "1st Edition";
     5    "2nd Edition";
     6    "3rd Edition";
     7    // edge definitions
     8    "1st Edition" -- "2nd Edition"
     9      [weight = 5];
    10    "2nd Edition" -- "3rd Edition"
    11      [weight = 10];
    12  }

Listing 6: A definition of an undirected and weighted graph using DOT notation.

    G ← `Graph` GRAPH OBRACKET
          StmtList CBRACKET;
    StmtList ← (Stmt SCOLON)*;
    Stmt ← edges:EdgeStmt / NodeStmt;
    EdgeStmt ← `Edge`
          a:(`$root nodes $0` NodeID)
          EDGEOP
          b:(`$root nodes $0` NodeID);
    NodeStmt ← `Node` name:NodeID;
    NodeID ← QUOTE Id QUOTE;
    Id ← !QUOTE (<space>/<alnum>)+;

Listing 7: A base notation for graphs w/o weight attributes. Some definitions are omitted for brevity; OBRACKET: "{", CBRACKET: "}", SCOLON: ";", QUOTE: """, GRAPH: "graph", EDGEOP: "--".

In the following subsections, the application of these parsing grammar compositions (union, extraction, and restriction) is exemplified in the context of the recurring types of language composition (Erdweg et al., 2012): extension, unification, and extension composition. The running composition examples are common tutorial examples in the literature: DOT/GPL, SMDL (Miss Grant's Controller), and BCEL (see Figure 2). They highlight the capabilities or restrictions of external syntax composition, in general, and for object parsing grammars, in particular.

3.2 Syntax Extension

A language developer composes a base language, e.g., for defining graphs using DOT in Listing 6, with a language extension. A language extension is an incomplete language fragment which depends directly on the base language for completion (in terms of the concrete syntax, the abstract syntax, and the behaviours; Erdweg et al. 2012). For example, graph definitions using DOT node and edge statements should be extended to model edge-weighted graphs (see Listing 6). For this, the DOT notation must be extended accordingly to support attribute statements, so that edge weights can be captured as attributes (see Figure 2a).

    # a) receiving rules
    EdgeStmt ← `Edge` CoreEdge WeightAttr;
    WeightAttr ← OSQBRACKET WEIGHT EQ
          weight:Weight CSQBRACKET;
    Weight ← `Weight` value:<digit>+;
    # b) transforms
    CoreEdge ↔ Dot::EdgeStmt
    ⇐⇒∗ Dot::G
    {EdgeStmt end} ⇒

Listing 8: The rules set of an extension grammar (a) and explicit transformations (b) to produce a resulting (extended) grammar from the base grammar in Listing 7. Auxiliary, attribute-specific rule definitions (WEIGHT, EQ) are not depicted for clarity.

Listing 7 shows an object parsing grammar

incl. generators for instantiations, assignments, and

queries recognising graph deﬁnitions w/o weight at-

tributes: the base notation. The object parsing gram-

mar establishes three correspondences between pars-

ing expressions (their matches) and a class model of

graphs:

1. Matches obtained by NodeStmt map to instantia-

tions of the Node class.

2. Matches obtained by EdgeStmt map to instantia-

tions of the Edge class.

3. Matches obtained by the top-level or start rule G

map to instantiations of the Graph class.

In addition, the Node instantiations must be initialised

to the provided node names. The Edge instantiations

must obtain references to the Node instantiations iden-

tiﬁed by the node names given in DOT edge state-

ments. Finally, all Edge instantiations must be as-

signed to the edges property of the Graph.

To allow for edge statements to carry attribute

statements in-between brackets deﬁning edge weights

(see lines 7 and 8 in Listing 6), among others, the

base grammar can be composed with a grammar ex-

tension. This extension can be realised in different

manners using OPEGs. Options include a straightfor-

ward union between two OPEGs or a grammar trans-

formation. In the following, OPEG transforms are ex-

hibited.

As for extra rules (the receiving grammar’s rules in Listing 8a), the rules WeightAttr and Weight define the actual extension (attribute) syntax. The rule for EdgeStmt links the former with the base syntax of edge statements. This is achieved by referencing the corresponding base rule for edge statements as-is via a non-terminal CoreEdge, and amending it in subsequent steps.

[Figure 2: three panels of merges relationships. (a) nodes (DOT) and (DOT, weighted); (b) nodes (SMDL), (BCEL), and (SMDL + BCEL); (c) nodes (weighted), (weighted, coloured), (DOT), and (product).]

Figure 2: A structural overview of the merges relationships for (a) syntax extension, (b) syntax unification, and (c) syntax-extension composition. DOT is the graph-definition notation as available from the Graphviz toolchain, on top of a Graph Product Line (GPL; Lopez-Herrejon and Batory 2001). The State Machine Definition Language notation (SMDL) is inspired by Miss Grant’s Controller (Fowler, 2010, pp. 4), modelling a software-based controller for a secret compartment. The Boolean and Comparison Expression Language (BCEL) exemplifies both a notation of an Expression Product Line (Liebig et al., 2013) and an embeddable notation in terms of a kernel expression language (Völter, 2018), made to be combined with a second language.

Consider the transforms shown in Listing 8b, one

step (line) at a time: Recall that in presence of trans-

forms, merging produces a disjoint union of two sets

of rules, with all rules and non-terminals being preﬁx-

qualiﬁed by their originating grammars (see Figure 1,

step b). Therefore, at the start, there will be effec-

tively two EdgeStmt rules in the intermediate set of

rules: Dot::EdgeStmt from the base grammar and

ExtDot::EdgeStmt from the extension grammar (see

Listing 8a).

Based on this intermediate set, the transforms can

be used to extract the RHS of Dot::EdgeStmt, park

it in a helper rule for ExtDot::CoreEdge, and refer-

ence this helper from the RHS of ExtDot::EdgeStmt.

This corresponds to what is achieved by the ↔ transform of Listing 8. Then, to integrate this revised EdgeStmt with the remainder of the composed grammar, the transitive-extraction transform (⇐∗⇒) is used

to draw the entirety of the composed grammar into the

namespace of the receiving grammar. This starts from

the start symbol Dot::G. The operation will pick up

the previously deﬁned, revised ExtDot::EdgeStmt.

This, in turn, activates the added syntax for weight

attributes via ExtDot::CoreEdge.
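The sequence of steps described above (prefix-qualified disjoint union, rewrite into a helper rule, transitive extraction from the start symbol) can be mimicked with plain dictionaries. The following Python sketch is a toy model under strong assumptions, not the DjDSL implementation: rules have a single sequence as RHS, generators are left out, the base rules (`Stmt`, `NodeID`) are invented stand-ins for Listing 7, and the functions `qualify`, `rewrite`, and `transitive_extract` are hypothetical names.

```python
def is_nonterminal(symbol):
    # Convention of this sketch: non-terminals start with an upper-case letter.
    return symbol[:1].isupper()


def local_name(qualified):
    return qualified.split("::", 1)[1]


def qualify(prefix, rules):
    """Prefix rule names and non-terminal references with the grammar name."""
    return {
        f"{prefix}::{name}": [f"{prefix}::{s}" if is_nonterminal(s) else s for s in rhs]
        for name, rhs in rules.items()
    }


def rewrite(rules, target, source):
    """Park a copy of the RHS of `source` under the helper name `target`."""
    rules[target] = list(rules[source])


def transitive_extract(rules, start, into):
    """Collect all rules reachable from `start` into the namespace `into`."""
    result, pending = {}, [start]
    while pending:
        name = pending.pop()
        local = f"{into}::{local_name(name)}"
        if local in result:
            continue
        # A definition already present in the receiving namespace wins.
        rhs = rules[local] if local in rules else rules[name]
        renamed = []
        for s in rhs:
            if is_nonterminal(s):
                pending.append(s)
                s = f"{into}::{local_name(s)}"
            renamed.append(s)
        result[local] = renamed
    return result


# Hypothetical, heavily simplified base and extension rule sets.
dot = {
    "G": ["'graph'", "'{'", "Stmt", "'}'"],
    "Stmt": ["EdgeStmt"],
    "EdgeStmt": ["NodeID", "'--'", "NodeID"],
    "NodeID": ["<quoted>"],
}
ext = {
    "EdgeStmt": ["CoreEdge", "WeightAttr"],
    "WeightAttr": ["'['", "'weight'", "'='", "Weight", "']'"],
    "Weight": ["<digit>+"],
}

merged = {**qualify("Dot", dot), **qualify("ExtDot", ext)}  # disjoint union
rewrite(merged, "ExtDot::CoreEdge", "Dot::EdgeStmt")
result = transitive_extract(merged, "Dot::G", into="ExtDot")
for name, rhs in sorted(result.items()):
    print(name, "<-", " ".join(rhs))
```

In this model, the walk starting at `Dot::G` reaches `EdgeStmt`, finds the revised definition in the receiving namespace, and from there pulls in `CoreEdge` and the weight-attribute rules.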

Syntax extensions using explicit OPEG transformations, rather than a union with override (also supported), avoid duplicated rules. Besides, all changes

to the base (composed) grammar will be automati-

cally tracked by the extended (resulting) one. In addi-

tion, accidental overrides are avoided by maintaining

the merged sets of rules in separate namespaces.

state active
  on lightOn go waitingForDrawer
  on drawerOpened go waitingForLight [ counter > 3 ]

Listing 9: One guarded transition for Miss Grant’s Controller.

3.3 Syntax Uniﬁcation

Composing two or more, otherwise free-standing and

independent languages and their syntaxes has been re-

ferred to as language uniﬁcation. Uniﬁcation of the

two or more languages mandates that they maintain

their standalone functional properties, and still, when

composed, interoperate without modifying their im-

plementation. A uniﬁcation must preserve the com-

posed syntaxes, leaving them intact and unmodiﬁed.

In this setting, the same basic composition operations

provided between two or more OPEGs apply, as for

language extension or extension compositions. Im-

portant differences arise from the fact that unintended

or accidental overrides, for instance, are much more

likely. This is because one syntax deﬁnition might

contend with the other’s symbol names, whitespace

conventions, and reserved literals (e.g., keywords).
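The risk of accidental overrides can be made concrete with a small Python sketch. The rule sets below are hypothetical (only the shared name `WS` matters), and `qualify` is an illustrative helper that prefixes rule names only, not references inside the right-hand sides.

```python
def qualify(prefix, rules):
    """Prefix every rule name with its originating grammar."""
    return {f"{prefix}::{name}": rhs for name, rhs in rules.items()}


# Hypothetical rule sets of two independently developed languages.
bcel = {"Expression": "Term (OP Term)*", "WS": "' '*"}
smdl = {"M": "State+", "T": "'on' Event 'go' Target", "WS": "(' ' / '\\n')*"}

# A plain union with override silently replaces one WS definition.
plain_union = {**bcel, **smdl}

# Keeping the rule sets in separate namespaces preserves both definitions.
qualified = {**qualify("BCEL", bcel), **qualify("MissGrants2", smdl)}

print(len(plain_union), len(qualified))
print(sorted(n for n in qualified if n.endswith("::WS")))
```

The plain union ends up with four rules because the two whitespace conventions collide, whereas the qualified union keeps all five.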

Consider two separately developed languages: a Boolean and comparison expression language (BCEL) and a state-machine-definition language (SMDL). See also Figure 2. The BCEL is a candidate for a functional kernel language (Völter, 2018) to be unified with SMDL to implement guarded transitions. A guarded transition is a transition that is annotated by a guard expression and whose firing is controlled by the prior evaluation of the attached guard expression. If the guard evaluates to true at that time, the transition is enabled; otherwise, it is disabled and will not fire. Listing 9 shows two transitions, one with and the other without a guard expression.
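The firing rule for guarded transitions can be sketched in a few lines of Python. This is an illustration of the semantics only, under the assumption that a guard is a predicate over some evaluation environment; the class `Transition`, the function `fire`, and the `env` dictionary are hypothetical and not part of the paper's language model.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Transition:
    trigger: str
    target: str
    guard: Optional[Callable[[dict], bool]] = None

    def enabled(self, env):
        # An unguarded transition is always enabled.
        return self.guard is None or bool(self.guard(env))


def fire(transitions, event, env):
    """Return the target state of the first enabled transition, if any."""
    for t in transitions:
        if t.trigger == event and t.enabled(env):
            return t.target
    return None


# The two transitions of state "active", mirroring Listing 9.
active = [
    Transition("lightOn", "waitingForDrawer"),
    Transition("drawerOpened", "waitingForLight", guard=lambda env: env["counter"] > 3),
]

print(fire(active, "drawerOpened", {"counter": 5}))
print(fire(active, "drawerOpened", {"counter": 2}))
```

With a counter of 5 the guarded transition fires; with a counter of 2 it is disabled and no target state is returned.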

ICSOFT 2021 - 16th International Conference on Software Technologies


# a) receiving rules
T ← `Transition` OrigT OBRACKET guard:Expression CBRACKET;
void: OBRACKET ← WS '\[' WS;
void: CBRACKET ← WS '\]' WS;

# b) transforms
OrigT ↔ MissGrants2::T
Expression ⇐ BCEL::Expression
⇐∗⇒ MissGrants2::M

Listing 10: A minimal unifying grammar that integrates the state-machine language with the BCEL language.

A uniﬁcation is marked by two or more com-

posed grammars being merged by a receiving (unify-

ing) grammar. The running example requires the de-

veloper to deﬁne a receiving grammar (e.g., Guarded-

MGC) that merges the BCEL’s grammar and the SMDL

grammar. The deﬁnitional content of the unifying

grammar is documented by Listing 10. Guard expres-

sions are attached to the Transition instantiations.

Intuitively, the uniﬁcation is achieved in three

transformational steps (see Listing 10b): First, a re-

vised rule deﬁnition for transition deﬁnitions is pro-

vided. This rule derives from the original deﬁnition

via the rewrite transform named OrigT. Note that gen-

erators are excluded from the rewrite transform. This

is to avoid duplication of the instantiation generator

for Transition in the resulting grammar. The re-

vamped rule T becomes extended by an assignment

generator guard. This will establish the guard refer-

ence between an instance of Transition and an in-

stance of Expression. Second, the assignment gen-

erator is related to the Expression rule of the BCEL

grammar. This rule becomes referenced on the second

to last line of Listing 10. Third, and ﬁnally, the entire

SMDL rules set is dragged into the resulting grammar

(on the last line).

These transforms resemble closely the ones for a

syntax extension, regarding the provision of an ex-

tension point in the state-machine language (OrigT).

The main difference comes with the referencing of a

syntax element from the second composed language

(extract w/o rewrite): BCEL::Expression. As this

happens to be the start symbol of BCEL, the entire

BCEL rules set is effectively incorporated into the re-

sulting grammar. This is achieved in a way that avoids

conﬂicts with the state-machine rules set (thanks to

preﬁxing of non-terminals).

To summarise, OPEGs with transforms allow for

the unanticipated, the unmodiﬁed, and the controlled

reuse of two independently developed syntaxes to

form a uniﬁed syntax.

graph {
  // node definitions
  "1st Edition";
  "2nd Edition";
  "3rd Edition";
  // edge definitions
  "1st Edition" -- "2nd Edition" [weight = 5];
  "2nd Edition" -- "3rd Edition" [colour = #000];
}

Listing 11: A definition of an undirected graph with weight or colour attributes using DOT notation.

3.4 Syntax-extension Composition

Extension composition captures situations in which

two or more syntax extensions can be composed with

one another or can be co-present as an extension to

a base syntax. Two or more extensions may be com-

posed incrementally (step-wise) into a base language,

one at a time. Alternatively, the extensions can be

composed ﬁrst, and the resulting grammar becomes

merged once with a base grammar. The ﬁrst is some-

times referred to as incremental extension composi-

tion, the second as extension uniﬁcation. OPEGs sup-

port both variants of syntax-extension composition,

with incremental compositions being a ﬂavour of syn-

tax extensions as described in Section 3.2. In what

follows, the emphasis is on extension uniﬁcation. As

the name implies, this combines aspects of syntax

extension (Section 3.2) and syntax uniﬁcation (Sec-

tion 3.3).

In an extension uniﬁcation, the points of depar-

ture are the extension grammars per se. First, the ex-

tensions become composed, then, as a last step the

uniﬁed (resulting) grammar is merged into the base

grammar (see Figure 2c). Like its merged extension grammars, the unified extension grammar is abstract, that is, it must be completed by merging a base grammar.

Consider the example of a second syntax exten-

sion to the DOT-like graph-modelling language intro-

duced in Section 3.2. This extension’s aim is to add a

colour attribute to edge deﬁnitions. Colour attributes

carry 3-digit hex codes of colours as part of edge def-

initions, as depicted in Listing 11.
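To make the two attribute shapes tangible, the following Python sketch recognises bracketed weight and colour attributes with a regular expression. It is only an approximation of the extension syntax: the pattern `ATTR` and the function `parse_attr` are hypothetical, and the character classes `[0-9]` and `[0-9A-Fa-f]` are assumed stand-ins for the grammar's digit and hex-digit classes.

```python
import re

# Either "[weight = <digits>]" or "[colour = #<3 hex digits>]".
ATTR = re.compile(
    r"\[\s*(?:weight\s*=\s*(?P<weight>[0-9]+)"
    r"|colour\s*=\s*(?P<colour>#[0-9A-Fa-f]{3}))\s*\]"
)


def parse_attr(text):
    """Return ('weight', int) or ('colour', str), or None if not recognised."""
    m = ATTR.fullmatch(text.strip())
    if m is None:
        return None
    if m.group("weight") is not None:
        return ("weight", int(m.group("weight")))
    return ("colour", m.group("colour"))


print(parse_attr("[weight = 5]"))
print(parse_attr("[colour = #000]"))
print(parse_attr("[colour = #00]"))
```

The last call returns `None` because only 3-digit hex codes are accepted in this sketch.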

The basic ﬂow of an extension uniﬁcation is exem-

pliﬁed in Listing 12 for a colour- and weight-enabled

graph syntax. The two main steps are identiﬁed as

steps (2) and (3). In step (2), a uniﬁed extension is

created by composing the two grammars documented

(as excerpts) in Listing 13 (for weight attributes) and in Listing 14 (for colour attributes). The actual unification of the underlying rules sets is accomplished by

 2 # 1) weighted extension
 3 Grammar create WeightedExtGrm \
 4   -start EdgeStmt $weightedGrmStr
 6 # 2) unified (coloured+weighted) extension
 7 Grammar create ColouredWeightedExtGrm \
 8   -start EdgeStmt \
 9   -merges [WeightedExtGrm] $colouredGrmStr {
10   EdgeStmt <*> WeightedExtGrm::EdgeStmt
11 }
13 # 3) base + unified extension
14 Grammar create FinalGrm \
15   -start G \
16   -merges [list [ColouredWeightedExtGrm resulting] $dotGrammar] {} {
17   # transforms
18   ColouredWeightedExtGrm::WS ==>
19   ColouredWeightedExtGrm::CoreEdge ==>
20   CoreEdge <-> Dot::EdgeStmt
21   EdgeStmt <*> ColouredWeightedExtGrm::EdgeStmt
22   G <*> Dot::G
23 }

Listing 12: An actual extension unification implementation. First, two Grammar instances embody the extension grammars: WeightedExtGrm, ColouredWeightedExtGrm. The latter reifies the unified grammar of the two extensions. Finally, the FinalGrm becomes composed from the unified extensions and the original DOT grammar (whose definition is not shown here).

# rules
EdgeStmt ← `Edge` CoreEdge WeightAttr;
WeightAttr ← OSQBRACKET WEIGHT EQ weight:Weight CSQBRACKET;
Weight ← `Weight` value:<digit>+;
# deferred
CoreEdge ← '';
void: WS ← '';

Listing 13: Excerpt from the extension grammar for the weighted feature, as used for extension unification.

a single, fetch-all transform on line 10 of Listing 12:

EdgeStmt <*> WeightedExtGrm::EdgeStmt

The result of this transform is a uniﬁed extension,

without dependence on a base (i.e., the DOT) gram-

mar. To render the uniﬁed extension independent

from a base grammar, the two extension gram-

mars must be deﬁned in a self-sufﬁcient manner.

Most importantly, the start symbols (EdgeStmt) must

be deﬁned. Any deferred non-terminals must be

matched by placeholder deﬁnitions (ε-expressions).

# rules
EdgeStmt ← `Edge` CoreEdge ColourAttr;
ColourAttr ← OSQBRACKET COLOUR EQ colour:Colour CSQBRACKET;
Colour ← `Colour` value:('#' <xdigit> <xdigit> <xdigit>);
# deferred
CoreEdge ← '';
void: WS ← '';

Listing 14: Excerpt from the extension grammar for the coloured feature, as used for extension unification.

See WS and CoreEdge in Listings 13 and 14 for

examples. In step (3), the base (DOT) grammar is merged together with the unified extension (ColouredWeightedExtGrm) into a completed

and operative grammar (FinalGrm). From this ﬁnal

grammar, a parser can be derived.

Two details are noteworthy about this ﬁnal com-

positional step: First, this ﬁnal receiving grammar

does not introduce any new rules. Second, it is the

resulting grammar of the uniﬁed extension becoming

merged into the ﬁnal grammar, and not the receiving

grammar of the uniﬁcation itself. See line 16 of List-

ing 12.

As for the ﬁrst detail: There are no dedicated rules

for the ﬁnal grammar because its rules set is popu-

lated purely from running grammar transformations

(see lines 18–22 in Listing 12). This is not only per-

mitted in OPEGs, but also corresponds to the nature

of an extension uniﬁcation. The ﬁrst three transforms

(lines 18–20) provide actual deﬁnitions for the de-

ferred non-terminals coming with the uniﬁed exten-

sions (WS and CoreEdge). Without the upfront re-

moval of the ε-placeholders, the resulting grammar

would remain dysfunctional. Recall that deﬁnitions

present in the grammars under composition are turned

into alternates of a combined rule. The subsequent

two lines 21 and 22 load the sets of rules of the

two composed grammars into the ﬁnal resulting one.

This is equivalent to the use of the transitive extract/

rewrite transformation, as applied for syntax exten-

sion and for syntax uniﬁcation.
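Why the ε-placeholders must be removed first can be seen from a small ordered-choice experiment. The Python sketch below assumes, as recalled above, that co-present definitions end up as alternates of one combined rule; it restricts alternates to literals, and the names `ordered_choice`, `edge_stmt`, `CORE`, and `ATTR` are hypothetical.

```python
CORE = '"a"--"b"'    # stands in for the real CoreEdge definition
ATTR = "[weight=5]"  # stands in for the attribute part of EdgeStmt


def ordered_choice(alternates, text, pos):
    """Return the end position of the first matching literal alternate."""
    for literal in alternates:
        if text.startswith(literal, pos):
            return pos + len(literal)
    return None


def edge_stmt(core_edge_alternates, text):
    """EdgeStmt as CoreEdge followed by ATTR, consuming the whole text."""
    pos = ordered_choice(core_edge_alternates, text, 0)
    return pos is not None and text[pos:] == ATTR


stmt = CORE + ATTR
print(edge_stmt(["", CORE], stmt))   # placeholder kept as first alternate
print(edge_stmt([CORE], stmt))       # placeholder removed beforehand
print(edge_stmt([CORE, ""], ATTR))   # placeholder kept as last alternate
```

With the empty alternate first, it always succeeds without consuming input and hides the real definition, so the statement is rejected. With the empty alternate last, the statement is accepted, but so is an attribute without any edge. Only removing the placeholder yields the intended rule in this model.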

To recap, extension unification differs from step-wise syntax extensions in that at the time of composing the extensions, they are treated in isolation from any base grammar. Most importantly, any undefined or deferred non-terminal definitions must be provided either by the extension grammars themselves or by any intermediate (unified) grammar (at least in terms of placeholders). This complicates a unification, as compared to repeated syntax extensions. However, unification also presents immediate benefits. One upside is that the extension unification is defined under a closed-world assumption: The start symbols point to parsing rules introduced by the extension grammars; any deferred non-terminals are clearly marked as such. Another consequence is that an extension unification is symmetric as opposed to incremental composition. Extensions are composed as peers, without one taking precedence over the other. In addition, extension unification provides more control over a syntax composition (see derivatives in Section 4).

4 DISCUSSION

Language Hiding Revisited. Language hiding

(a.k.a. pre-emptive preﬁx capture) is a practical con-

sequence of the absence of general semi-disjointness

of a choice expression for the scope of the language

matched by a PEG (see Section 2.1). This runs counter to the otherwise practical, advantageous consequence of PEGs precluding ambiguity. Language hiding has implications for composition operations as introduced

in Section 3.1, in that alternates become automati-

cally (combination) or selectively added (extraction

w/ and w/o insertion position). Any added alternate

may unintentionally hide others, and, therefore, im-

portant fragments of the matched language.

The implementation of OPEGs tries to minimize

unintended language hiding for two or more com-

posed grammars, by applying precautionary defaults:

For example, alternates introduced by DSL exten-

sions are prepended to those of the receiving gram-

mar. This follows from the assumption that, in exten-

sions, the aim is to capture longer preﬁxes. Beyond

that point, manual inspection (Redziejowski, 2018)

and ﬁne-grained control during composition are sup-

ported (explicit alternate positioning).

Grammar Cleaning. In Section 3.1, it was established that techniques for reducing (“cleaning”) object parsing grammars from unrealisable and unreachable non-terminals are a building block for modelling and implementing certain composition operations (see Figure 1). For production grammars (CFG), this is a matter of static analysis (Aho and Ullman, 1972, Section 2.4.2). For parsing grammars, in the general case, finding useless (non-recognising, undefined, and unused) non-terminals is known to be undecidable (see Grune and Jacobs 2010, p. 507 and Ford 2004, Section 3.5). This is, again, due to the issue of non-disjointness of the ordered-choice operator (see Section 2.1): The evaluation of some alternate is conditional on the success or failure of its preceding alternates.

Footnote (Section 3.4, on deferred non-terminals being clearly marked as such): These are not only matters of definitional clarity, but also an implementation-level requirement: Once composed, the resulting grammar will be cleaned from useless non-terminals. To prevent this from happening, there must be placeholder definitions.

The parsing expression with two alternates 'a' /

'ab' is pathological because it will only recognise

a in all inputs preﬁxed by a single a (e.g., aa, ab).

The second alternate is realisable (i.e., it recognises a

literal string) but is effectively shadowed by the ﬁrst

alternate (’a’). Therefore, even if an alternate expres-

sion can be statically marked as realisable (i.e., it does

recognise and possibly consume at least one terminal

on the input stream), it may be actually unreachable

in the order of any evaluation of a given choice ex-

pression.
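The shadowing effect can be reproduced with two minimal parser combinators. The Python sketch below is not a full PEG engine: `lit` and `choice` are hypothetical helpers that model only literals and ordered choice, and a parser returns the end position of its match or `None`.

```python
def lit(s):
    """Match the literal s at the given position."""
    def parse(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return parse


def choice(*alternates):
    """Ordered choice: commit to the first alternate that succeeds."""
    def parse(text, pos):
        for alternate in alternates:
            end = alternate(text, pos)
            if end is not None:
                return end  # later alternates are never tried
        return None
    return parse


hiding = choice(lit("a"), lit("ab"))     # 'a' / 'ab'
reordered = choice(lit("ab"), lit("a"))  # 'ab' / 'a'

for text in ("ab", "aa"):
    print(text, hiding(text, 0), reordered(text, 0))
```

For the input `ab`, the first expression stops after one character while the reordered one consumes both, which illustrates why placing the alternate with the longer prefix first avoids the hiding.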

Practical workarounds include tool-supported manual inspection (Redziejowski, 2018) or deriving the PEG from a higher-level CFG, such that the resulting PEG is well-behaved in that its choice expressions only contain alternates known to be disjoint (Mascarenhas et al., 2014). None of these apply to automated grammar cleaning, however. For the scope of this work, a conservative approximation is applied. An approximative cleaning of parsing grammars will lead to false negatives. A false negative is a non-terminal marked as realisable that may still turn out unreachable, conditionally. Therefore, for an OPEG, it is not possible to obtain fully reduced grammars and fully optimised parsers. However, the approximation is sufficient for cleaning resulting grammars from composition artefacts, such as non-terminals becoming undefined.
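A conservative cleaning pass of this kind can be sketched as follows. The Python code is an assumption-laden illustration, not the algorithm used by DjDSL: it drops alternates that reference undefined non-terminals, removes rules left without alternates, and then keeps only rules reachable from the start symbol. It does not reason about ordered choice, so shadowed alternates such as the second one in 'a' / 'ab' survive. The function `clean` and the example rules are hypothetical.

```python
def is_nonterminal(symbol):
    # Convention of this sketch: non-terminals start with an upper-case letter.
    return symbol[:1].isupper()


def clean(rules, start):
    """rules maps a name to a list of alternates (each a list of symbols)."""
    rules = {n: [list(a) for a in alts] for n, alts in rules.items()}
    changed = True
    while changed:  # iterate until no more undefined references remain
        changed = False
        for name in list(rules):
            kept = [a for a in rules[name]
                    if all(s in rules for s in a if is_nonterminal(s))]
            if len(kept) != len(rules[name]):
                changed = True
            if kept:
                rules[name] = kept
            else:
                del rules[name]
    reachable, pending = set(), [start]
    while pending:  # plain graph reachability from the start symbol
        name = pending.pop()
        if name in reachable or name not in rules:
            continue
        reachable.add(name)
        for alt in rules[name]:
            pending.extend(s for s in alt if is_nonterminal(s))
    return {n: alts for n, alts in rules.items() if n in reachable}


rules = {
    "G": [["'graph'", "Stmt"]],
    "Stmt": [["EdgeStmt"], ["NodeStmt"]],
    "EdgeStmt": [["CoreEdge", "WeightAttr"]],  # CoreEdge is deferred
    "WeightAttr": [["'['", "Weight", "']'"]],
    "Weight": [["<digit>+"]],
    "NodeStmt": [["<quoted>"]],
    "Orphan": [["'x'"]],
}
print(sorted(clean(rules, "G")))
print(sorted(clean({**rules, "CoreEdge": [[]]}, "G")))
```

Without a definition for `CoreEdge`, the edge-related rules are cleaned away entirely; adding an empty placeholder alternate keeps them in the result, which matches the role of the placeholder definitions in Listings 13 and 14.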

Additional Composition Types. Beyond the basic

types covered in Section 3, one can realise language

restrictions and higher-order extension uniﬁcations

(derivatives).

By default, and to maintain closure under com-

position, rules are combined as alternates. This ef-

fectively widens the matching space of the resulting

grammar as compared to the composed one. If the re-

sulting syntax should be restricted to disallow previ-

ously allowed syntax elements, one must restrict the

resulting grammar by removing alternates explicitly.

This restriction can be achieved by employing the re-

move operator (⇒; see Table 2, operator 4).

In syntax-extension composition, a developer

must realise a dual goal. A developer must (a) pro-

vide for coordination code to accommodate the two

co-present syntax extensions. In addition, the developer must (b) implement the coordination code in

a way that both syntax extensions remain deployable

in isolation from each other. OPEGs help achieve this

double goal by derivative extension composition. Co-

ordination code can be provided as a dedicated gram-

mar (derivative grammar; Liu et al. 2006). This gram-

mar provides for extra rules and transforms to resolve

unwanted interactions such as syntax failures of two

or more syntax extensions (composed grammars).

5 RELATED WORK

The relevant context is set by approaches to compos-

able and modular grammars. Grammar (deﬁnition)

reuse without modiﬁcation (Erdweg et al., 2012) is

referred to, but mainly with respect to the limitations

perceived at the time. These include conﬂicting lexers

(token scanners) vs. scannerless parsing and parser

generators being limited to single and closed gram-

mar deﬁnitions.

To overcome these limitations, ﬁrst contributions

included syntax modules of the series of Syntax Deﬁ-

nition Formalisms (SDF, SDF2, SDF3; Visser 1997),

the grammar-inclusion mechanism by TXL (Cordy,

2006) and grammar imports by ANTLR (Parr, 2013,

pp. 257). These approaches turned grammars into

open deﬁnitions. SDF starting with version 2 intro-

duced parametrised syntax modules. Modules can

import from each other. When an imported module

exposes non-terminals or terminals as named module

parameters, they can be bound under different names

in the importing module. The same can be achieved in

SDF using explicit renaming, without formal parameters. SDF is included in a number of language development systems, including Spoofax and RascalMPL. SDF also addressed composability issues by operating on scannerless and generalised parsing (i.e., scannerless GLR). It is noteworthy that the support for

parametrised modules has been discontinued starting

from SDF3.

TXL allowed a developer to place rules over dif-

ferent ﬁles, rooted under one start symbol though. In

addition, TXL provided for a refine to replace or add

a new alternate to a given rule. ANTLR, as elaborated

on in this section, applies a union-with-override tech-

nique, with particularities regarding different types of

deﬁnition artefacts. As ANTLR serves as the parsing

infrastructure for several language development sys-

tems such as Xtext (Bettini, 2013), MontiCore (Krahn

et al., 2010), MetaDepth (Meyers et al., 2012), gram-

mar imports have seen uptake.

Throughout Section 3, the PEG-based system

Rats! was referred to, mainly because Rats! provides

for basic grammar composition on the basis of rules

and alternates using dedicated transformations (add,

delete, append). In addition, Grimm (2006) highlights

important barriers to composing PEG-based syntax

deﬁnitions (e.g., ordering).

6 CONCLUDING REMARKS

This paper departs from the foundations of ad-

vanced parsing expression grammars (PEG) and de-

livers object parsing-expression grammars (OPEGs)

to deﬁne—in one—a concrete syntax and the map-

ping to an object-oriented primary abstract syntax

(language model). The double aim is to avoid com-

mon abstraction mismatches of parse representations

(e.g., decomposition mismatches) and to render the

extended parsing grammars composable. The ex-

tended parsing grammars support different compo-

sition techniques under the umbrella of a uniform

framework: simple grammar unions and ﬁne-grained

grammar transformations. The latter transformations

are backed by well-deﬁned operators taking as input

the sets of parsing rules of two or more OPEGs at

different levels (e.g., rule-wise, alternates) to form

a valid OPEG as their output. The transformation

procedure and the provided operators enable devel-

opers to mitigate the consequences of unwanted lan-

guage hiding by PEGs under composition. This cov-

erage of robust composition techniques is shown to

be necessary to enable a developer to implement the

different grammar compositions relevant for realising

language-product lines: extensions, uniﬁcation, ex-

tension composition, and derivative grammars.

REFERENCES

Aho, A. V. and Ullman, J. D. (1972). The Theory of Pars-

ing, Translation, and Compiling: Parsing, volume I.

Prentice Hall.

Barrett, K., Cassels, B., Haahr, P., Moon, D. A., Playford,

K., and Withington, P. T. (1996). A monotonic super-

class linearization for dylan. In Proc. 11th ACM SIG-

PLAN Conference on Object-oriented Programming,

Systems, Languages, and Applications (OOPSLA’96),

pages 69–82. ACM.

Bettini, L. (2013). Implementing Domain-Speciﬁc Lan-

guages with Xtext and Xtend. Packt Publishing, 2nd

edition.

Cordy, J. R. (2006). The TXL source transforma-

tion language. Science of Computer Programming,

61(3):190–210.

Degueule, T. (2016). Composition and Interoperability for External Domain-Specific Language Engineering. Theses, Université de Rennes 1 [UR1].

Dejanović, I., Milosavljević, G., and Vaderna, R. (2016). Arpeggio: A flexible peg parser for python. Knowledge-Based Systems, 95:71–74.

Diekmann, L. and Tratt, L. (2014). Eco: A language com-

position editor. In Proc. 7th International Conference

on Software Language Engineering (SLE’14), volume

8706 of LNCS, pages 82–101. Springer.

Erdweg, S., Giarrusso, P. G., and Rendel, T. (2012). Lan-

guage composition untangled. In Proc. Twelfth Work-

shop on Language Descriptions, Tools, and Applica-

tions (LDTA’12), pages 7:1–7:8. ACM.

Erdweg, S., van der Storm, T., Völter, M., Tratt, L.,

Bosman, R., Cook, W. R., Gerritsen, A., Hulshout,

A., Kelly, S., Loh, A., Konat, G., Molina, P. J., Palat-

nik, M., Pohjonen, R., Schindler, E., Schindler, K.,

Solmi, R., Vergu, V., Visser, E., van der Vlist, K.,

Wachsmuth, G., and van der Woning, J. (2015). Eval-

uating and comparing language workbenches: Exist-

ing results and benchmarks for the future. Computer

Languages, Systems & Structures, 44(Part A):24–47.

Ford, B. (2004). Parsing expression grammars: A

recognition-based syntactic foundation. In Proc. 31st

ACM SIGPLAN-SIGACT Symposium on Principles of

Programming Languages (POPL’04), pages 111–122.

ACM.

Fowler, M. (2010). Domain Speciﬁc Languages. Addison-

Wesley, 1st edition.

Grimm, R. (2006). Better extensibility through modular

syntax. In Proc. 27th ACM SIGPLAN Conference on

Programming Language Design and Implementation

(PLDI’06), pages 38–51. ACM.

Grune, D. and Jacobs, C. J. (2010). Parsing Techniques: A

Practical Guide. Springer, 2nd edition.

Jézéquel, J.-M., Méndez-Acuña, D., Degueule, T., Combemale, B., and Barais, O. (2015). When systems engineering meets software language engineering. In

Proc. Fifth International Conference on Complex Sys-

tems Design & Management (CSD&M’14), pages 1–

13. Springer.

Johnstone, A., Scott, E., and van den Brand, M. (2014).

Modular grammar speciﬁcation. Science of Computer

Programming, 87:23–43.

Krahn, H., Rumpe, B., and Völkel, S. (2010). Monticore: a

framework for compositional development of domain

speciﬁc languages. International Journal on Software

Tools for Technology Transfer, 12(5):353–372.

Kühn, T., Cazzola, W., and Olivares, D. M. (2015). Choosy

and picky: Conﬁguration of language product lines.

In Proc. 19th International Conference on Software

Product Line (SPLC’15), pages 71–80. ACM.

Liebig, J., Daniel, R., and Apel, S. (2013). Feature-

oriented language families: A case study. In Proc. 7th

International Workshop on Variability Modelling of

Software-intensive Systems (VaMoS’13), pages 11:1–

11:8. ACM.

Liu, J., Batory, D., and Lengauer, C. (2006). Feature ori-

ented refactoring of legacy applications. In Proc.

28th International Conference on Software Engineer-

ing (ICSE’06), pages 112–121. ACM.

Lopez-Herrejon, R. E. and Batory, D. S. (2001). A standard

problem for evaluating product-line methodologies.

In Proc. 3rd Int. Conf. Generative and Component-

Based Softw. Eng., pages 10–24. Springer.

Mascarenhas, F., Medeiros, S., and Ierusalimschy, R.

(2014). On the relation between context-free gram-

mars and parsing expression grammars. Science of

Computer Programming, 89:235–250.

Méndez-Acuña, D., Galindo, J. A., Degueule, T., Combemale, B., and Baudry, B. (2016). Leveraging software

product lines engineering in the development of exter-

nal DSLs: A systematic literature review. Computer

Languages, Systems & Structures, 46:206–235.

Meyers, B., Cicchetti, A., Guerra, E., and de Lara, J.

(2012). Composing textual modelling languages in

practice. In Proc. 6th International Workshop on

Multi-Paradigm Modeling (MPM’12), pages 31–36.

ACM.

Parr, T. (2013). The Deﬁnitive ANTLR 4 Reference. Prag-

matic Bookshelf, 2nd edition.

Redziejowski, R. R. (2008). Some aspects of parsing ex-

pression grammar. Fundamenta Informaticae, 85(1-

4):441–454.

Redziejowski, R. R. (2018). Trying to understand PEG.

Fundamenta Informaticae, 157(4):463–475.

Reis, L. V., Iorio, V. O. D., and Bigonha, R. S. (2015). An

on-the-ﬂy grammar modiﬁcation mechanism for com-

posing and deﬁning extensible languages. Computer

Languages, Systems & Structures, 42:46–59.

Schmitz, S. (2006). Modular syntax demands verification. Technical Report I3S/RR-2006-32-FR, Laboratoire I3S, Université de Nice-Sophia Antipolis.

Servetto, M., Mackay, J., Potanin, A., and Noble, J. (2013).

The billion-dollar ﬁx. In Proc. 27th Europ. Con-

ference Object-Oriented Programming (ECOOP’13),

volume 7920 of LNCS, pages 205–229. Springer.

Sobernig, S. (2020). Variable Domain-speciﬁc Software

Languages with DjDSL. Springer.

van der Storm, T., Cook, W. R., and Loh, A. (2014). The de-

sign and implementation of object grammars. Science

of Computer Programming, 96:460–487.

Visser, E. (1997). Syntax Deﬁnition for Language Prototyp-

ing. PhD thesis, University of Amsterdam.

Völter, M. (2018). The design, evolution, and use of kernelf. In Proc. 11th International Conference on Model

Transformation (ICMT’18), volume 10888 of LNCS,

pages 3–55. Springer.
