On Representing Natural Languages and Bio-molecular
Structures using Matrix Insertion-deletion Systems
and its Computational Completeness
Lakshmanan Kuppusamy
1
, Anand Mahendran
1
and Shankara Narayanan Krishna
2
1
School of Computing Science and Engineering, VIT University, Vellore 632 014, India
2
Department of Computer Science and Engineering, IIT Bombay, Powai - 400 076, India
Abstract. Insertion and deletion are considered to be the basic operations in Bi-
ology, more specifically in DNA processing and RNA editing. Based on these
evolutionary transformations, a computing model insertion-deletion system has
been proposed in formal language theory. Recently, in [14], a new computing
model named Matrix insertion-deletion system has been introduced to model var-
ious bio-molecular structures. In this paper, we represent some natural language
constraints such as triple agreement, crossed dependency, copy language using
Matrix insertion-deletion systems and discuss how these constraints resemble
some bio-molecular structures. Next, we analyze the computational completeness
result for Matrix insertion-deletion system where the importance is given for not
using any contexts when deletion rule takes place. We see that when the insertion-
deletion system is combined with matrix grammar, the universality result is ob-
tained with weight just 3 (1, 1; 1, 0) whereas for insertion-deletion systems, the
universality result is available with weight 4 (1, 1; 1, 1).
1 Introduction
Linguistics agreed in late 1960s that many natural languages including English, are
not context-free [4]. This initiated the idea of thinking grammars beyond the scope of
context-free. As membership problem is tough to handle for context-sensitive gram-
mars, they are not a good model for describing natural languages. To overcome these
difficulties and to give a syntactical representation for natural languages, a notion called
mildly context-sensitive (MCS) grammar formalisms has been defined by (majority of)
computational linguistics people and later many attempts have been made to find MCS
grammar formalisms [3], [5], [15]. A grammar formalism is said to be MCS formalism
if it satisfies the following properties.
1. The class of its languages contain all context-free languages
2. The class of its languages contain the following three basic non-context-free lan-
guages:
triple agreements: L
ta
= {a
n
b
n
c
n
|n 1},
This work was partially supported by the project SR/S3/EECE/054/2010, Department of Science
and Technology (DST), New Delhi, India.
Kuppusamy L., Mahendran A. and Krishna Narayanan S..
On Representing Natural Languages and Bio-molecular Structures using Matrix Insertion-deletion Systems and its Computational Completeness.
DOI: 10.5220/0003308900470056
In Proceedings of the 1st International Workshop on AI Methods for Interdisciplinary Research in Language and Biology (BILC-2011), pages 47-56
ISBN: 978-989-8425-42-3
Copyright
c
2011 SCITEPRESS (Science and Technology Publications, Lda.)
crossed dependencies: L
cd
= {a
n
b
m
c
n
d
m
|n, m 1}, and
copy language L
cp
= {ww|w {a, b}
}.
3. All languages in the class are parsable in polynomial time.
4. All languages in the class are semilinear or at least satisfy the bounded growth
property.
In the last three decades, biology played a great role in the field of formal languages
by being the root for the development of various biologically inspired computing mod-
els such as sticker systems, splicing systems, Watson-Crick automata, insertion-deletion
systems, p systems [6], [11], [12]. Since, most of the language generating devices are
based on the operation of rewriting systems, the insertion-deletion systems opened a
particular attention in the field of formal languages. Informally, the insertion and dele-
tion operations of an insertion-deletion systems are defined as follows: If a string α is
inserted between two parts w
1
and w
2
of a string w
1
w
2
to get w
1
αw
2
, we call the oper-
ation as insertion, whereas if a substring β is deleted from a string w
1
βw
2
to get w
1
w
2
,
we call the operation as deletion.
DNA molecules may be considered as strings over alphabet consisting of four sym-
bols namely a, t, g and c (denoted as Σ
DN A
). Similarly, RNA molecules may be con-
sidered as strings over alphabet consisting of four symbols namely a, u, g and c (de-
noted as Σ
RNA
). We discuss below in brief some of the important structures seen in
bio-molecules such as protein, DNA and RNA. Fig.1. shows the structures (a) stem,
(b) cloverleaf and (c) dumbbell. Since the bio-molecular structures can be defined in
terms of sequence of symbols (i.e., strings) there exists a correlation between formal
grammars and bio-molecular structures. The following example witnesses this corre-
lation. Consider a context-free language {ww
R
| w {a, b }
} (where w
R
is the
reverse of w) and a gene sequence cta tc gcgatag. As the complements are ¯a = t,
¯
t = a, ¯g = c and ¯c = g, the above gene sequence resembles the palindrome lan-
guage {w ¯w
R
| w Σ
DN A
} where w = ctatcg and ¯w
R
= cgatag. Like this, the
structures mentioned in Fig.1. can be represented by context-free languages.
However, there are some more structures that are predominantly available in bio-
molecules which cannot be modelled by context-free grammars. Fig.2. represents such
structures (a) pseudoknot and (b) attenuator. A closer look at these bio-molecular struc-
tures showsa resemblance with well knownnatural language constructs, such as crossed
dependencies: {a
n
b
m
c
n
d
m
| n, m 1} and copy language: {ww | w {a, b}
}, see
[8,9]. Therefore, if a formal grammar is capable of generating the context-free and non-
context-free languages, then that grammar is also suitable to represent bio-molecular
structures and the vice versa is also true. Next, we discuss some attempts made in for-
mal grammars to represent the above mentioned bio-molecular structures.
In the last two decades or so, many attempts have been made to establish the lin-
guistic behaviour of biological sequences by defining new grammar formalisms like cut
grammars [7], crossed-interaction grammar [10], simple linear tree adjoining gram-
mars and extended simple linear tree adjoining grammars [20] which are capable of
generating some of the biological structures mentioned above. However, there was
no unique grammar system that encapsulate all essential and important bio-molecular
structures. For example double copy language cannot be modelled by a simple lin-
ear tree adjoining grammar [20]. Very recently, a new biologically inspired computing
model namely Matrix insertion-deletion system has been introduced in [14] by com-
48
A
B
C
u u
_ R
a g a t c t t g a t c a
(a) (b) (c)
S
u a
S
a u
S
u a
S
c g
u a u c g a u a
Fig.1. Bio-molecular structures: (a) stem (b) cloverleaf (c) dumbbell.
u a
g c
u a
c g
u g c u c a a g
(a) (b)
a u
g c
a u
a g a
a g a
u a
c g
u a
a g a u c u a g a
Fig.2. Bio-molecular structures: (a) pseudoknot structure (b) attenuator.
bining insertion-deletion system and matrix grammars. This system represents all the
above discussed bio-molecular structures and also the other structures such as hairpin,
non-ideal attenuator, ideal strings, orthodox string. In this paper, we first briefly re-
call this system and using this system we generate a few natural language constraints
and we discuss their coherences in bio-molecular structures. Thus we aim to show that
Matrix insertion-deletion grammar system might be a unique model that is suitable for
both natural language and bio-molecular representations.
Given an insertion-deletion system, the weight of the system is based on the max-
imal length of insertion, maximal length of the context used for insertion, maximal
length of deletion, maximal length of the context used for deletion and they are (respec-
tively) denoted as (n, m; p, q). The total weight is defined as the sum of n, m, p, q. There
have been many attempts to characterize the recursively enumerable languages (i.e.,
computationalcompleteness) using insertion-deletion systems with less weights. In [16]
the universality results were obtained with weight 5 (of the combinations (1, 2; 1, 1),
(2, 1; 2 , 0), (1, 2; 2, 0 )). In [1] and [11], this result was improved with weight 4 (of
the combinations (1, 1; 1, 1) and (1, 1; 2, 0) respectively). In [17], a variant of insertion-
deletion systems called context-free insertion-deletion systems were introduced by con-
sidering no context in insertion and deletion rules. The universality of context-free
insertion-deletion systems were proved initially with weight (, 0 ; , 0) and reduced
to weight 6 (3, 0; 3, 0). Further, in the same paper, the result was improved to weight 5
49
(of the combinations (3, 0; 2, 0) and (2, 0; 3, 0)). As Matrix insertion-deletion system is
a variation of insertion-deletion systems and is more powerful than insertion-deletion
systems, the universality result of this system must be obtained with less weight than 4.
In this paper, we prove the universality result of Matrix insertion-deletion systems with
just weight 3 (1, 1; 1, 0) where no contexts is considered for deletion thus the deletion
operation works in a context-free manner.
2 Preliminaries
We assume that the readers are familiar with the notions of formal language theory.
However, we recall the basic notions which are used in the paper. A finite non-empty
set V or Σ is called an alphabet. We denote by V
or Σ
, the free monoid generated
by V or Σ, by λ its identity or the empty string, and by V
+
or Σ
+
the set V
{λ} or
Σ
{λ}. The elements of V
or Σ
are called words or strings. For any word w V
or Σ
, we denote the length of w by |w|. For more details on formal language theory,
we refer to [13].
Next, we look into the basic definitions of insertion-deletion systems. Given an
insertion-deletion system γ = (V, T, A, R), where V is an alphabet, T V , A is a
finite language over V , R is a finite set triples of the form (u, β/α, v), where (u, v)
V
× V
, (α, β) (V
+
× {λ}) ({λ} × V
+
). The pair (u, v) is called as contexts.
Insertion rule will be of the form (u, λ/α, v) which means that α is inserted between
u and v. Deletion rule will be of the form (u, β/λ, v), which means that β is deleted
between u and v. In other words, (u, λ/α, v) corresponds to the rewriting rule uv
uαv, and (u, β/λ, v) corresponds to the rewriting rule v uv.
Consequently, for x, y V
we can write x =
y, if y can be obtained from x
by using either an insertion rule or a deletion rule which is given as follows: (the down
arrow indicates the position where the string is inserted, the down arrow indicates
the position where the string is deleted and the underlined string indicates the string
inserted/deleted)
1. x = x
1
u
vx
2
, y = x
1
uαvx
2
, for some x
1
, x
2
V
and (u, λ/α, v) R.
2. x = x
1
vx
2
, y = x
1
u
vx
2
, for some x
1
, x
2
V
and (u, β/λ, v) R.
The language generated by γ is defined by
L(γ) = {w T
| x =
w, for some x A}
where =
is the reflexive and transitive closure of the relation =.
Next, we discuss about the weight of the insertion-deletion system. An insertion-
deletion system γ = (V, T, A, R) is of weight(n, m; p, q) if
n = max{|α| | (u, λ/α, v) R}
m = max{|u| | (u, λ/α, v) R or (v, λ/α, u) R}
p = max{|β| | (u, β/λ, v) R}
q = max{|u| | (u, β/λ, v) R or (v, β/λ, u) R}
We denote by INS
m
n
DEL
q
p
for n, m, p, q 0, the family of languages L(γ) gener-
ated by insertion-deletion systems of weight (n
, m
; p
, q
) such that n
n, m
m,
50
p
p, q
q. The total weight of γ is n + m + p + q.
Next, we will look into the definition of matrix grammars. A matrix grammar is
an ordered quadruple G = (N, T, S, M) where N is a set of non-terminals, T is a
set of terminals, S is the start symbol and M is a finite set of nonempty sequences
whose elements are ordered pairs (P, Q). The pairs are referred to as productions and
written in the form P Q. The sequences are referred to as matrices and written m =
[P
1
Q
1
, ..., P
r
Q
r
], r 1. The above matrix grammar is without appearance
checking. The language generated by the matrix grammar is defined by L(G) = { w
T
| S =
w}. A matrix grammar with appearance checking is defined as G =
(N, T, S, M, F ) where F is a set of occurrences of rules in the matrices of M. While
deriving, a rule may be exempted to apply if the rule is in F . The language generated
by the matrix grammar with appearance checking is defined as L
ac
(G, F ) = {w
T
| S =
w}. The family of languages generated by matrix grammars without/with
appearance checking is denoted by M AT
λ
, M AT
λ
ac
(the λ on the upper index indicates
where the rule P λ is allowed). For more details on matrix grammars, we refer to
[2], [18].
3 Matrix Insertion-deletion Systems
In this section, we describe Matrix insertion-deletion systems. A Matrix insertion-deletion
system is a construct Υ = (V, T, A, R) where V is an alphabet, T V , A is a finite lan-
guage over V , R is a finite set triples of the form in matrix format [(u
1
, β
1
1
, v
1
), . . . ,
(u
n
, β
n
n
, v
n
)], where (u
k
, v
k
) V
× V
, and (α
k
, β
k
) (V
+
× {λ}) ({λ} ×
V
+
), with (u
k
, β
k
k
, v
k
) R
I
i
R
D
j
R
I
i
/D
j
, for 1 k n. Here R
I
i
denotes the
matrix which consists of only insertion rules, R
D
j
denotes the matrix which consists of
only deletion rules and R
I
i
/D
j
denotes the matrix which consists of both insertion and
deletion rules.
Consequently, for x, y V
we can write x = x
= x
′′
= . . . = y, if y can
be obtained from x by using a matrix consisting of insertion rules (R
I
i
) or deletion rules
(R
D
j
) or insertion and deletion rules (R
I
i
/D
j
). In a derivation step the rules in a matrix
are applied sequentially one after other in order and no rule is in appearance checking
(note that the rules in a matrix are not applied in parallel). The language generated by
Υ is defined by L(Υ ) = {w T
| x =
R
χ
w, for some x A}, where χ
{I
i
, D
j
, I
i
/D
j
} where =
is the reflexive and transitive closure of the relation =.
Note that the string w is collected after applying all the rules in a matrix and also
w T
only. The family of languages generated by Matrix insertion-deletion systems
with weights (n, m : p, q) is given as MAT INS
m
n
DEL
q
p
.
Example 1. Consider the mix language L
ml
= {n
a
(w) = n
b
(w) = n
c
(w) | w
{a, b, c}
}. The language L
ml
can be generated by Υ
ml
= ({a, b, c}, {a, b, c}, {λ}, R)
where R is given as R
I
1
= [(λ, λ/a, λ), (λ, λ/b, λ), (λ, λ/c, λ)]. As no context is used
in insertion rules a, b, c can be inserted anywhere in the derivation and the number of
a, b, c are equal. It is easy to see that Υ
ml
generates L
ml
.
Remark. Note that L
ml
is generated with weight just 2 (1, 1; 0, 0).
51
3.1 Modelling Natural Language Constructs and Bio-molecular Structures
In this section, we represent some of the natural language constraints such as crossed
dependency, copy language, triple agreement, quadruple agreement, center-embedded
structure and discuss their coherences with bio-molecular structures.
In the next lemma, we represent crossed dependency structure in natural languages
which models pseudoknot structure in bio-molecules.
Lemma 1. L
cd
= {a
n
b
m
c
n
d
m
| m, n 1} Υ
cd
.
Proof. The language L
cd
can be generated by the Matrix insertion-deletion system
Υ
cd
= ({a, b, c, d}, {a, b, c, d}, {abcd}, {R
I
1
= [(a, λ/a, λ), (c, λ/c, λ)],
R
I
2
= [(b, λ/b, λ), (d, λ/d, λ)] }). A sample derivation can be given as follows: a
bc
d
=
R
I
1
a
abc
cd =
R
I
1
aaab
cccd
=
R
I
2
aaabbcccdd =
R
I
1
/I
2
a
n
b
m
c
n
d
m
.
The language L
cd
resembles the pseudoknot structure language L
ps
= {uv¯u
R
¯v
R
|
u, v Σ
DN A
} in gene sequences as shown in Fig. 2(a). The language L
ps
can be gener-
ated by the Matrix insertion-deletionsystem Υ
ps
= ({b,
¯
b,
1
,
2
,
3
,
4
}, {b,
¯
b}, {λ,
1
2
3
4
}, R), where b {a, t, g, c},
¯
b is complement of b and R is given as: R
I
1
=
[(λ, λ/b,
1
), (λ, λ/
¯
b,
3
)], R
I
2
= [(λ, λ/b,
2
), (λ, λ/
¯
b,
4
)],
R
D
1
= [(λ,
1
/λ, λ), (λ,
3
/λ, λ)], R
D
2
= [(λ,
2
/λ, λ), (λ,
4
/λ, λ)]
A sample derivation is given as follows:
1
2
3
4
=
R
I
1
a
1
2
t
3
4
=
R
I
2
a
1
g
2
t
3
c
4
=
R
I
2
a
1
ga
2
t
3
ct
4
=
R
D
1
a
ga
2
t
ct
4
=
R
D
2
aga
tct
2
In the next lemma, we model copy language in natural languages which resembles
attenuator structure in gene sequences.
Lemma 2. L
cp
= {ww | w {a, b}
} Υ
cp
.
Proof. The language L
cp
can be generated by the Matrix insertion-deletion system
Υ
cp
= ({a, b,
1
,
2
}, {a, b}, {λ,
1
2
}, {R
I
1
= [(λ, λ/a,
1
), (λ, λ/a,
2
)],
R
I
2
= [(λ, λ/b,
1
), (λ, λ/b,
2
)], R
D
1
= [(λ,
1
/λ, λ), (λ,
2
/λ, λ)] }). A sample
derivation is given as follows:
1
2
=
R
I
1
a
1
a
2
=
R
I
1
aa
1
aa
2
=
R
I
2
aab
1
aab
2
=
R
D
1
aab
aab
.
The language L
cp
resembles the attenuator language L
an
= {u¯u
R
u¯u
R
| u
Σ
DN A
} in gene sequences as shown in Fig.2(b). The language L
an
can be generated by
the Matrix insertion-deletion system Υ
an
= ({a, t, g, c,
1
,
2
}, {a, t , g, c}, {λ,
1
2
}, R),
where R is given as: R
I
1
= [(λ, λ/a,
1
), (
1
, λ/t, λ), (λ, λ/a,
2
), (
2
, λ/t, λ)], R
I
2
=
[(λ, λ/t,
1
), (
1
, λ/a, λ), (λ, λ/t,
2
), (
2
, λ/a, λ)], R
I
3
= [(λ, λ/c,
1
), (
1
, λ/g, λ),
(λ, λ/c,
2
), (
2
, λ/g, λ)], R
I
4
= [(λ, λ/g,
1
), (
1
, λ/c, λ), (λ, λ/g,
2
), (
2
, λ/c, λ)],
R
D
1
= [(λ,
1
/λ, λ), (λ,
2
/λ, λ)]
A sample derivation is given as follows:
1
2
=
R
I
1
a
1
t a
2
t =
R
I
2
at
1
atat
2
at =
R
I
3
atc
1
gatatc
2
gat =
R
I
4
atcg
1
cgatatcg
2
cgat =
R
D
1
atcg
cgatatcg
cgat 2
In the next lemma, we show the relevance between triple, quadruple agreement lan-
guage to their corresponding structure in bio-molecules.
52
Lemma 3. L
ta
= {a
n
b
n
c
n
| n 1} Υ
ta
.
Proof. The language L
ta
can be generated by the Matrix insertion-deletion system
Υ
ta
= ({a, b, c}, {a, b, c}, {abc}, {R
I
1
= [(a, λ/ab, b ), (b, λ/c, c)]}). A sample deriva-
tion can be given as follows: a
b
c =
R
I
1
aa
bb
cc =
R
I
1
aaabbbccc =
R
I
1
a
n
b
n
c
n
. The triple agreement language has relevances to triple-stranded DNA (triple
helix structure) [9]. If we include d in V , T , replace the axiom as a bcd and R
I
1
as
[(a, λ/ab, b), (c, λ/ cd, d)] in Υ
ta
we can see that Υ
ta
generates quadruple agreement
language {a
n
b
n
c
n
d
n
| n 1}. It is mentioned in [9] that the quadruple agreement
language can be regarded as a quadruple-stranded DNA (quadruple helix structure). 2
In the next lemma, we show that center-embedded structure found in natural languages
[19] can be viewed as a RNA stem in bio-molecules.
Lemma 4. The center-embedded language L
ce
= {a
n
b
m
b
m
a
n
| n, m 0} can be
represented by Matrix insertion-deletion system.
Proof. The language L
ce
can be generated by the Matrix insertion-deletion system
Υ
ce
= ({a, b,
1
,
2
,
3
,
4
}, {a, b}, {λ,
1
2
3
4
}, R), where R is given as: R
I
1
=
[(
1
, λ/a, λ), (
4
, λ/a, λ)] R
I
2
= [(
2
, λ/b, λ), (
3
, λ/b, λ)], R
D
1
= [(λ,
1
/λ, λ),
(λ,
4
/λ, λ)] R
D
2
= [(λ,
2
/λ, λ), (λ,
3
/λ, λ)]. It is easy to see that Υ
ce
generates L
ce
.
The center-embedded language resembles the stem construct {u
1
u
2
u
3
¯u
3
R
¯u
2
R
¯u
1
R
|
u
1
, u
2
, u
3
Σ
DN A
} in gene sequences as shown in Fig.1(a). 2
Note. The dumbbell language L
db
= {u¯u
R
v¯v
R
| u, v Σ
DN A
} (refer to Fig.1(c).) is
found relevance with the natural language L = {a
n
b
n
c
m
d
m
| n, m 0}. It is easy to
generate this context-free language using Matrix insertion-deletion system.
4 Computational Completeness
In this section, we prove that the family of recursively enumerable languages can be
characterized by Matrix insertion-deletion grammars with weight 3 (1, 1; 1 , 0). This
universalityresult is achievedby simulating matrix grammars with appearancechecking
in strong binary normal form (SBNF).
Theorem 1. M AT INS
1
1
DEL
0
1
= RE.
Proof. The part MAT IN S
1
1
DEL
0
1
RE is obvious. We now prove the other part.
For each language L RE there is a matrix grammar G with appearance checking in
SBNF, such that L(G) = L. A matrix grammar G is given as (N, T, S, M, F ) where
N is a set of non-terminals, T is set of terminals, S is the start symbol, M is a finite set
of matrices and F is a set of rules (used for appearance checking). A matrix grammar
G is said to be in SBNF if N = N
1
N
2
{$, #} where these three sets are mutually
disjoint and the matrices in M are in one of the following forms:
1. [S XA] X N
1
, A N
2
2. [X Y, A x] X, Y N
1
, A N
2
, x (N
2
T )
, |x| 2; x = x
1
x
2
3. [X Y, A #] X, Y N
1
, A N
2
53
4. [X λ, A x] X N
1
, A N
2
, x T
, |x| 2
There is only one matrix of type 1 and F consists exactly of all rules A # appearing
in matrices of type 3, # is a trap symbol once used in the derivation it cannot be removed
and the rules of matrix 4 are applied only in the last step of the derivation. Observe
that there can be at most only one non-terminal symbol from N
1
at any point of the
derivation.
Next, we construct a Matrix insertion-deletion system Υ which will simulate the
matrix grammars in strong binary normal form with appearance checking. Let Υ =
(V, T, A, R), where V = N
1
N
2
{S, ]
XA
, ]
Y
, [
x
1
x
2
, ]
x
1
x
2
, [
#
, ]
#
, # | X, Y
N
1
, A N
2
, x
1
, x
2
(N
2
T )}, T = T (terminal set in matrix grammars), A = S,
and R is given in simulation process.
The Matrix of type 1 can be simulated by the following Matrix insertion-deletion
system rules R
I
1
/D
1
= [(S, λ/]
XA
, λ), (λ, S/λ, λ), (]
XA
, λ/A, λ), (]
XA
, λ/
X, A), (λ, ]
XA
/λ, λ) | [S XA], X N
1
, A N
2
]. Here after, we omit the corre-
sponding condition matrix rule (here [S XA]) as we are giving the simulation for
each type of the matrix at the appropriate place. The simulation of type 1 is given as
follows:
S
= S]
XA
=
]
XA
=]
XA
A =]
XA
XA = XA
The Matrix of type 2 can be simulated by the following Matrix insertion-deletion
system rules R
I
2
/D
2
= [(X, λ/]
Y
, λ), (λ, X/λ, λ), (]
Y
, λ/Y, λ), (λ, ]
Y
/λ, λ),
(λ, λ/[
x
1
x
2
, A), (A, λ/]
x
1
x
2
, λ), (λ, A/λ, λ), ([
x
1
x
2
, λ/x
2
, ]
x
1
x
2
), ([
x
1
x
2
, λ/x
1
, x
2
),
(λ, [
x
1
x
2
/λ, λ), (λ, ]
x
1
x
2
/λ, λ)]
The simulation is given as follows:
X
= X]
Y
=
]
Y
=]
Y
Y =
Y
A = [
x
1
x
2
A
= [
x
1
x
2
A]
x
1
x
2
= [
x
1
x
2
]
x
1
x
2
= [
x
1
x
2
x
2
]
x
1
x
2
=
[
x
1
x
2
x
1
x
2
]
x
1
x
2
=
x
1
x
2
]
x
1
x
2
= x
1
x
2
Since there is only one symbol from N
1
at any derivation, the rule (λ, X/λ, λ) will
delete the used X N
1
correctly. Note that we introduce another Y N
1
only af-
ter deleting X. Similarly, the rule (λ, A/λ, λ) can delete any A N
2
, the (next) rule
([
x
1
x
2
, λ/x
2
, ]
x
1
x
2
) makes sure that the used A N
2
is deleted. If any other A is
deleted, the rule ([
x
1
x
2
, λ/x
2
, ]
x
1
x
2
) cannot be applied as the used A will be in the
middle of [
x
1
x
2
and ]
x
1
x
2
and the derivation stops.
The Matrix of type 3 can be simulated by the following Matrix insertion-deletion
system rules R
I
3
/D
3
= [(X, λ/]
Y
, λ), (λ, X/λ, λ), (]
Y
, λ/Y, λ), (λ, ]
Y
/λ, λ),
(λ, λ/[
#
, A), (A, λ/]
#
, λ), (λ, A/λ, λ), ([
#
, λ/#, ]
#
), (λ, [
#
/λ, λ), (λ, ]
#
/λ, λ)]
The simulation is given as follows:
X
= X]
Y
=
]
Y
=]
Y
Y =
Y
A = [
#
A
= [
#
A]
#
= [
#
]
#
= [
#
#]
#
=
#]
#
= #
Note that the appearance checking symbol # is not deleted in the derivation.
54
The Matrix of type 4 can be simulated by the following Matrix insertion-deletion
system rules R
I
4
/D
4
[(λ, X/λ, λ), (λ, λ/[
x
1
x
2
, A), (A, λ/]
x
1
x
2
, λ),
(λ, A/λ, λ), ([
x
1
x
2
, λ/x
2
, ]
x
1
x
2
), ([
x
1
x
2
, λ/x
1
, x
2
), (λ, [
x
1
x
2
/λ, λ), (λ, ]
x
1
x
2
/λ, λ)]
The simulation is given as follows:
X = λ
A = [
x
1
x
2
A
= [
x
1
x
2
A]
x
1
x
2
= [
x
1
x
2
]
x
1
x
2
= [
x
1
x
2
x
2
]
x
1
x
2
=
[
x
1
x
2
x
1
x
2
]
x
1
x
2
=
x
1
x
2
]
x
1
x
2
= x
1
x
2
Note that [
x
1
x
2
, ]
x
1
x
2
, ]
Y
, [
#
, ]
#
are considered to be single non-terminals. As the
maximal length of inserted string is 1 (i.e., n = 1), the maximal length of the context
used in insertion rules is 1 (i.e., m = 1), the maximum length of deleted string is 1 (i.e.,
p = 1) and no context is used in deletion rules (i.e., q = 0), the matrix grammars in
SBNF can be simulated by Matrix insertion-deletion systems with a total weight of 3.
This ends the proof. 2
5 Conclusions
In this paper, we discussed the Matrix insertion-deletion grammar systems and using
the system we have generated some context-free and non-context-free languages which
are having some structural relevances with bio-molecules. Thus we identified a promis-
ing grammar system which can suit for both natural language representation and bio-
molecular structural modelling. We have proved that Matrix insertion-deletion systems
with weight 3 (1, 1; 1, 0) can characterize all recursively enumerable languages and the
system uses no contexts for deletion operation. As the insertion rules are context de-
pendent (as we use context for insertion), they are more like context-sensitive and since
deletions are done without looking any context, they are more like context-free. There-
fore, this system uses the nature of both context-sensitiveness and context-freeness, it
seems to be a promising model for representing natural languages. Thus analyzing this
model more towards the properties of MCS formalisms is necessary and is left as a
future work.
As the family of recursively enumerable languages is recognized with a total weight
of 3, the context-free languages should be characterized with weight less than 3 in ma-
trix insertion-deletion systems. As non-terminals are used in the context-free grammars,
the simulated matrix insertion-deletion system must use deletion rules to delete the in-
troduced non-terminals, it looks not possible to characterize context-free languages by
Matrix insertion-deletion systems with weight less than 3. The same holds true for even
regular languages, thus we reached to an hierarchical collapse. To avoid this hierar-
chical collapse, we can introduce two more new weights in Matrix insertion-deletion
systems, namely s, t such that s denotes the total number of matrices and t denotes
the maximum number of rules among all matrices. With the new weights the Matrix
insertion-deletion systems can be represented as MAT
t
s
INS
m
n
DEL
q
p
and we believe
that regular and context-free languages can be characterized with less weights counting
the above said matrix measures. This leads to an interesting future work from generative
power perspective.
55
Acknowledgements
The authors would like to thank Prof. Kamala Krithivasan for motivating us to work in
insertion-deletion systems.
References
1. Akihiro Takaharai., Takashi Yokomori.: On the computational power of insertion-deletion
systems, Natural Computing 2, pp. 321–36: Kluwer Academic Publishers, (2003)
2. Arto Salomaa.: Formal languages, New York and London: Academic Press, (1973)
3. Aravind K. Joshi.: An introduction to tree adjoining grammars. In Manaster-Ramer (ed.),
Mathematics of Languages, John Benjamins, Amsterdam, Philadelphia, pp. 87–114, (1988)
4. Bar-Hillel., Shamir E.: Finite state languages: formal representations and adequacy prob-
lems. In: Bar Hillel (ed.), Lang. and Infn.: Addison-Wesley, Reading, MA, pp. 87–98, (1964)
5. Boullier P.: Range concatenation grammars. Proceedings of Sixth International Workshop on
Parsing Technologies (IWPT), pp. 53–64 (2000)
6. Cristian S. Calude., Gheorghe Pa˘un.: Computing with cells and atoms, An introduction to
Quantum, DNA and Membrane Computing, London: Taylor and Francis, (2001)
7. David B. Searls.: Representing genetic information with formal grammars. In: Proceedings
of the National Conference on Artificial Intelligence. pp. 386–391, (1988)
8. David B. Searls.: The linguistics of DNA. American Scientist. pp. 579–591, (1992)
9. David B. Searls.: The computational linguistics of biological sequences. In: Hunter, L.(ed.)
Artificial Intelligence and Molecular Biology, AAAI Press, pp. 47–120, (1993)
10. Elena Rivas., Sean R. Reddy.: The language of RNA: A formal grammar that includes pseu-
doknots, Bioinformatics, vol 16. pp. 334–340, (2000)
11. Gheorghe P˘aun., Grzegorz Rozenberg., Arto Salomaa.: DNA Computing, New Computing
Paradigms. Springer, (1998)
12. Gheorghe P˘aun.: Membrane Computing-An introduction. Springer, (2002)
13. John E. Hopcroft., Rajeev Motwani., Jeffrey D. Ullman.: Introduction to Automata Theory,
Languages and Computation: Addison-Wesley, (2006)
14. Lakshmanan Kuppusamy., Anand Mahendran., Krishna S.: Matrix Insertion-Deletion Sys-
tems for Bio-molecular Structures, Submitted (to ICDCIT-2011), Aug (2010)
15. Lakshmanan Kuppusamy., Krishna S.N., Rama R.: Internal contextual grammars for mildly
context sensitive languages, Research on Language and Computation, 5, pp. 181-197, (2007)
16. Lila Kari., Gheorghe P˘aun., G. Thierrin., S. Yu.: At the crossroads of DNA computing and
formal languages: Characterizing RE using insertion-deletion systems. Proc. of 3rd DIMACS
Workshop on DNA Based Computing, Philadelphia, pp. 318333, (1997)
17. Maurice Margenstern., Gheorghe P˘aun., Yurii Rogozhin., Sergey Verlan.: Context-free in-
sertiondeletion systems, Theoretical Computer Science, 330, pp.339-348,(2005)
18. Rozenberg., Arto Salomaa.: Handbook of formal languages, Vol 1,2,3, Springer, (1997)
19. Solomon Marcus., Carlos Martin-Vide., Gheorghe P˘aun.: Contextual grammars as generative
models of natural languages, Volume 24 , Issue 2, pp. 245–274, (1998)
20. Yasuo Uemura., Aki Hasegawa., Satoshi Kobayashi., Takashi Yokomori.: Tree adjoining
grammars for RNA structure prediction. Theoretical Computer Science, 210: pp. 277–303,
(1999)
56