Parallel Tree Kernel Computation
Souad Taouti¹, Hadda Cherroun¹ and Djelloul Ziadi²
¹LIM, Université UATL Laghouat, Algeria
²Groupe de Recherche Rouennais en Informatique Fondamentale, Université de Rouen Normandie, France
Keywords: Kernel Methods, Structured Data Kernels, Tree Kernels, Tree Series, Root Weighted Tree Automata, MapReduce, Spark, Parallel Automata Intersection.
Abstract: Tree kernels are fundamental tools that have been leveraged in many applications, particularly machine-learning-based Natural Language Processing tasks. In this paper, we devise a parallel implementation of the sequential algorithm for the computation of some tree kernels of two finite sets of trees (Ouali-Sebti, 2015). Our comparison is restricted to a sequential implementation of SubTree kernel computation, which mainly reduces to an intersection of weighted tree automata. Our approach exploits the data parallelism inherent in this computation by deploying both the MapReduce paradigm and the Spark framework. One of the key benefits of our approach is its versatility: it is adaptable to a wide range of substructure tree kernel-based learning methods. To evaluate the efficacy of our parallel approach, we conducted a series of experiments comparing it against the sequential version on a diverse set of synthetic tree-language datasets crafted for this analysis. The results clearly demonstrate that the proposed parallel algorithm outperforms the sequential one in terms of latency.
1 INTRODUCTION
Trees are basic data structures that are naturally used in real-world applications to represent a wide range of objects in structured form, such as XML documents (Maneth et al., 2008), molecular structures in chemistry (Gordon and Ross-Murphy, 1975) and parse trees in natural language processing (Shatnawi and Belkhouche, 2012).
In (Haussler, 1999), Haussler provides a framework based on convolution kernels, which measure the similarity between two structures by summing the similarities of their substructures. Many convolution kernels for trees have been designed on this principle and effectively applied to a wide range of data types and applications.
Tree kernels, initially presented in (Collins and Duffy, 2001; Collins and Duffy, 2002) as specific convolution kernels, have been shown to be interesting approaches for modeling many real-world applications, mainly those related to Natural Language Processing tasks, e.g. named entity recognition and relation extraction (Nasar et al., 2021), text syntactic-semantic similarity (Alian and Awajan, 2023), detection of text plagiarism (Thom, 2018), topic-to-question generation (Chali and Hasan, 2015), source code plagiarism detection (Fu et al., 2017), and linguistic pattern-aware dependency tree kernels that capture chemical-protein interaction patterns within biomedical literature (Warikoo et al., 2018).
The SubTree (ST) kernel (Vishwanathan and Smola, 2002) and the SubSet Tree (SST) kernel (Collins and Duffy, 2001) were the first kernels introduced in the context of trees. The segments compared by the ST kernel are subtrees: a node together with its entire descendancy. In the SST kernel, the considered segments are subset trees: a node together with a partial descendancy.
The principle of tree kernels, as initially presented, is to compute the number of shared substructures (subtrees and subset trees) between two trees $t_1$ and $t_2$ with $m$ and $n$ nodes, respectively. It may be computed recursively as follows:

$$K(t_1,t_2) = \sum_{(n_1,n_2) \in N_{t_1} \times N_{t_2}} \Delta(n_1,n_2) \qquad (1)$$

where $N_{t_1}$ and $N_{t_2}$ are the sets of nodes of $t_1$ and $t_2$ respectively, and $\Delta(n_1,n_2) = \sum_{i=1}^{|S|} I_i(n_1) \cdot I_i(n_2)$ for some finite set of subtrees $S = \{s_1, s_2, \ldots\}$, where $I_i(n)$ is an indicator function which is equal to $1$ if the subtree $s_i$ is rooted at node $n$ and to $0$ otherwise. For instance, for $t_1 = f(h(a),b)$ and $t_2 = f(h(a),g(b))$, the shared subtrees are $a$, $b$ and $h(a)$, so the subtree kernel yields $K(t_1,t_2) = 3$.
An approach for computing the tree kernels of two
finite sets of trees was proposed in (Mignot et al.,
2015). It makes use of Root Weighted Tree Automata (RWTAs), a class of weighted tree automata. The SubTree, SubSet Tree, and Rooted Tree kernels can all be computed using a general intersection of the RWTAs associated with the two finite sets of trees, followed by the computation of the weights on the resulting automaton.
In this paper, we narrow our study to the parallel implementation, using the MapReduce paradigm and the Spark framework, of the construction of the RWTA from a finite set of trees (as this step is a common base for the other tree kernels), and to the comparison with the sequential linear algorithm proposed in (Mignot et al., 2023) for SubTree kernel computation. The main motivation behind this proposal is that, despite the linear complexity of the sequential version, it remains costly when we consider the computational requirements of Machine Learning. We begin by defining RWTAs. Next, we present the proposed sequential algorithms. Then we provide our parallel implementation based on the MapReduce and Spark programming models.
The rest of the paper is organized as follows: Section 2 introduces tree kernels and automata while presenting the proposed sequential algorithms. Section 3 details the parallel implementation of the tree kernel computation based on MapReduce and Spark. Experimental results and evaluations are reported in Section 4. Finally, conclusions and perspectives are presented in Section 5.
2 TREE KERNELS AND AUTOMATA
Let $\Sigma$ be a graded alphabet. A tree $t$ over $\Sigma$ is defined inductively as $t = f(t_1,\ldots,t_k)$, where $k$ is any non-negative integer, $f$ is any symbol in $\Sigma_k$, and $t_1,\ldots,t_k$ are any $k$ trees over $\Sigma$. The set of trees over $\Sigma$ is denoted by $T_\Sigma$. A tree language over $\Sigma$ is a subset of $T_\Sigma$.
Let $\mathbb{M} = (M,+)$ be a monoid with identity $0$. A formal tree series $P$ (Collins and Duffy, 2002; Ésik and Kuich, 2002) over a set $S$ is a mapping from $T_\Sigma$ to $S$; its support is the set $\mathrm{Support}(P) = \{t \in T_\Sigma \mid (P,t) \neq 0\}$. Any formal tree series corresponds to a formal sum $P = \sum_{t \in T_\Sigma} (P,t)\, t$, which is in this case both associative and commutative.
Weighted tree automata can realize formal tree series. In this paper, we employ specific automata in which the weights only indicate the finality of states. As a result, the automata we use form a particular subclass of weighted tree automata.
2.1 Root Weighted Tree Automata
Definition 1. Let $\mathbb{M} = (M,+)$ be a commutative monoid. An $\mathbb{M}$-Root Weighted Tree Automaton ($\mathbb{M}$-RWTA) is a 4-tuple $(\Sigma, Q, \mu, \delta)$ with the following properties:
• $\Sigma = \bigcup_{k \in \mathbb{N}} \Sigma_k$: a graded alphabet,
• $Q$: a finite set of states,
• $\mu$: the root weight function, a function from $Q$ to $M$,
• $\delta$: the transition set, a subset of $Q \times \Sigma_k \times Q^k$.
An $\mathbb{M}$-RWTA is simply referred to as an RWTA if there is no ambiguity.
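For concreteness, here is a minimal Java sketch of one possible in-memory representation of an $\mathbb{N}$-RWTA, the kind of automaton used throughout this paper. The class and member names are our own illustrative choices and do not come from the authors' implementation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// A minimal N-RWTA over the monoid (N, +): weights are natural numbers and
// each state is identified by the canonical writing of the subtree it
// recognizes, e.g. "f(h(a),b)".
public class Rwta {
    // Root weight function mu: state -> weight (a missing state has weight 0).
    final Map<String, Integer> mu = new HashMap<>();
    // Transition set delta: a tuple (q, f, q1, ..., qk) is stored as the
    // list [q, f, q1, ..., qk].
    final Set<List<String>> delta = new HashSet<>();

    // Adds w to the root weight of state q, creating the state if needed.
    void addWeight(String q, int w) {
        mu.merge(q, w, Integer::sum);
    }
}
```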
The root weight function $\mu$ is extended to $2^Q \to M$ by setting, for any subset $S$ of $Q$, $\mu(S) = \sum_{s \in S} \mu(s)$. The function $\mu$ is equivalent to the finite subset of $Q \times M$ defined for any couple $(q,m)$ in $Q \times M$ by $(q,m) \in \mu \Leftrightarrow \mu(q) = m$.
The transition set $\delta$ corresponds to the function from $\Sigma_k \times Q^k$ to $2^Q$ defined for any symbol $f$ in $\Sigma_k$ and for any $k$-tuple $(q_1,\ldots,q_k)$ in $Q^k$ by

$$q \in \delta(f, q_1, \ldots, q_k) \Leftrightarrow (q, f, q_1, \ldots, q_k) \in \delta.$$
The function $\delta$ is extended to $\Sigma_k \times (2^Q)^k \to 2^Q$ as follows: for any symbol $f$ in $\Sigma_k$ and for any $k$-tuple $(Q_1,\ldots,Q_k)$ of subsets of $Q$,

$$\delta(f, Q_1, \ldots, Q_k) = \bigcup_{(q_1,\ldots,q_k) \in Q_1 \times \cdots \times Q_k} \delta(f, q_1, \ldots, q_k).$$
Finally, the function $\Delta$ is the function from $T_\Sigma$ to $2^Q$ defined for any tree $t = f(t_1,\ldots,t_k)$ in $T_\Sigma$ by

$$\Delta(t) = \delta(f, \Delta(t_1), \ldots, \Delta(t_k)).$$

The weight of a tree $t$ in an $\mathbb{M}$-RWTA $A$ is $\mu(\Delta(t))$. The formal tree series realized by $A$ is the formal tree series over $\mathbb{M}$ denoted by $P_A$ and defined by $P_A = \sum_{t \in T_\Sigma} \mu(\Delta(t))\, t$, with $\mu(\emptyset) = 0$ where $0$ is the identity of $\mathbb{M}$.
Example 1. Let us consider the graded alphabet $\Sigma$ defined by $\Sigma_0 = \{a,b\}$, $\Sigma_1 = \{g,h\}$ and $\Sigma_2 = \{f\}$. Let $\mathbb{M} = (\mathbb{N},+)$. The RWTA $A = (\Sigma, Q, \mu, \delta)$ is defined by:
$Q = \{1, 2, 3, 4, 5, 6\}$,
$\mu = \{(1,1), (2,4), (3,3), (4,1), (5,2), (6,3)\}$,
$\delta = \{(1,a), (3,b), (2,h,1), (4,g,3), (5,g,2), (6,f,5,4)\}$.
This RWTA is represented in Figure 1. It realizes the following tree series: $P_A = 1a + 3b + 4h(a) + 1g(b) + 2g(h(a)) + 3f(g(h(a)),g(b))$.
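As a sanity check (a worked computation of ours, using the transitions listed above), the weight of the tree $g(h(a))$ in $A$ is obtained by evaluating $\Delta$ bottom-up:

$$\Delta(a) = \{1\}, \qquad \Delta(h(a)) = \delta(h, \{1\}) = \{2\}, \qquad \Delta(g(h(a))) = \delta(g, \{2\}) = \{5\}, \qquad \mu(\{5\}) = 2,$$

which matches the coefficient of $g(h(a))$ in $P_A$.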
Figure 1: The RWTA A.
An RWTA can be seen as the analogue, for trees, of the prefix tree constructed in the context of words: a finite set of trees can be represented by this compact structure. In addition, many tree substructures can be computed from this compact structure. In what follows, we introduce subtrees, rooted trees and subset trees together with their related automata (Ouali-Sebti, 2015). However, in this paper, we narrow our kernel computation proposal to the SubTree case.
2.2 Subtree
Definition 2. Let $\Sigma$ be a graded alphabet and $t = f(t_1,\ldots,t_k)$ a tree in $T_\Sigma$. The set $\mathrm{SubTree}(t)$ is the set defined inductively by:

$$\mathrm{SubTree}(t) = \{t\} \cup \bigcup_{1 \leq j \leq k} \mathrm{SubTree}(t_j).$$

Let $L$ be a tree language over $\Sigma$. The set $\mathrm{SubTreeSet}(L)$ is the set defined by:

$$\mathrm{SubTreeSet}(L) = \bigcup_{t \in L} \mathrm{SubTree}(t).$$

The formal tree series $\mathrm{SubTreeSeries}(t)$ is the tree series over $\mathbb{N}$ inductively defined by:

$$\mathrm{SubTreeSeries}(t) = t + \sum_{1 \leq j \leq k} \mathrm{SubTreeSeries}(t_j).$$

If $L$ is finite, the rational series $\mathrm{SubTreeSeries}(L)$ is the tree series over $\mathbb{N}$ defined by:

$$\mathrm{SubTreeSeries}(L) = \sum_{t \in L} \mathrm{SubTreeSeries}(t).$$
Example 2. Let $\Sigma$ be the graded alphabet defined by $\Sigma_0 = \{a,b\}$, $\Sigma_1 = \{g,h\}$ and $\Sigma_2 = \{f\}$. Let $t$ be the tree $f(h(a),g(b))$. We have:
$\mathrm{SubTree}(t) = \{a, b, h(a), g(b), f(h(a),g(b))\}$
$\mathrm{SubTreeSeries}(t) = t + h(a) + g(b) + a + b$
Definition 3. Let $\Sigma$ be a graded alphabet and let $t$ be a tree in $T_\Sigma$. The subtree automaton associated with $t$ is the RWTA $A_t = (\Sigma, Q, \mu, \delta)$ defined by:
• $Q = \mathrm{SubTreeSet}(t)$,
• $\forall s \in Q$, $\mu(s) = (\mathrm{SubTreeSeries}(t), s)$,
• $\forall f \in \Sigma$, $\forall s_1, \ldots, s_{k+1} \in Q$, $s_{k+1} \in \delta(f, s_1, \ldots, s_k) \Leftrightarrow s_{k+1} = f(s_1, \ldots, s_k)$.
This RWTA requires little storage space because its states are exactly the subtrees of $t$.
Example 3. Consider the tree $t = f(f(h(a),b), g(b))$. The RWTA $A_t$ associated with $t$ is represented in Figure 2.
Figure 2: The RWTA $A_t$ associated with the tree $t = f(f(h(a),b),g(b))$.
2.3 Rooted Tree
Definition 4. Let $\Sigma$ be a graded alphabet and $t = f(t_1,\ldots,t_k)$ a tree in $T_\Sigma$. We denote by $\Sigma_\bot$ the set $\Sigma \cup \{\bot\}$, where $\bot$ is a symbol of arity $0$ that does not belong to $\Sigma$. $\mathrm{PrefixSet}(t)$ is the set of trees in $T_{\Sigma_\bot}$ inductively defined by:

$$\mathrm{PrefixSet}(t) = \{t\} \cup \{f(\bot,\ldots,\bot)\} \cup f(\mathrm{PrefixSet}(t_1), \ldots, \mathrm{PrefixSet}(t_k)),$$

where $f$ applied to sets of trees denotes the set of all trees $f(t'_1,\ldots,t'_k)$ with $t'_j \in \mathrm{PrefixSet}(t_j)$. It should be noted that $\bot$ itself is not a prefix of $t$.
Let $L$ be a tree language over $\Sigma$. The set $\mathrm{PrefixSet}(L)$ is the set defined by:

$$\mathrm{PrefixSet}(L) = \bigcup_{t \in L} \mathrm{PrefixSet}(t).$$

The formal tree series $\mathrm{PrefixSeries}(t)$ is the tree series over $\mathbb{N}$ inductively defined by:

$$\mathrm{PrefixSeries}(t) = \sum_{t' \in \mathrm{PrefixSet}(t)} t'.$$

If $L$ is finite, the series $\mathrm{PrefixSeries}(L)$ is the tree series over $\mathbb{N}$ defined by:

$$\mathrm{PrefixSeries}(L) = \sum_{t' \in L} \mathrm{PrefixSeries}(t').$$
If $L$ is not finite, then since $\Sigma$ is a finite set of symbols, there exists a symbol $f$ in $\Sigma_k$ such that $f(\bot,\ldots,\bot)$ occurs as a prefix an infinite number of times in $L$. Therefore, $\mathrm{PrefixSeries}(L)$ is a tree series over $\mathbb{N} \cup \{+\infty\}$.
Definition 5. Let $\Sigma$ be a graded alphabet and let $t$ be a tree in $T_\Sigma$. The automaton of prefixes associated with $t$ is the RWTA $A_t = (\Sigma_\bot, Q, \mu, \delta)$ defined by:
• $Q = \mathrm{SubTreeSet}(t) \cup \{\bot\}$,
• $\forall t' \in Q$, $\mu(t') = 1$ if $t' = t$, and $\mu(t') = 0$ otherwise,
• $\forall t' = f(t_1,\ldots,t_k) \in Q$, $\delta(f, t_1, \ldots, t_k) = t'$,
• $\forall f \in \Sigma_k$, $\delta(f, \bot, \ldots, \bot) = \{f(t_1,\ldots,t_k) \in Q\}$.
2.4 SubSet Tree (SST)
Definition 6. Let $\Sigma$ be a graded alphabet and $t = f(t_1,\ldots,t_k)$ a tree in $T_\Sigma$. As above, $\Sigma_\bot$ denotes the set $\Sigma \cup \{\bot\}$, where $\bot$ is a symbol of arity $0$ that does not belong to $\Sigma$. The set $\mathrm{SSTSet}(t)$ is the set of trees in $T_{\Sigma_\bot}$ defined by:

$$\mathrm{SSTSet}(t) = \mathrm{PrefixSet}(\mathrm{SubTreeSet}(t)).$$

The formal tree series $\mathrm{SSTSeries}(t)$ is the tree series over $\mathbb{N}$ defined by:

$$\mathrm{SSTSeries}(t) = \mathrm{SubTreeSeries}(\mathrm{PrefixSet}(t)).$$

Let $L$ be a tree language over $\Sigma$. The series $\mathrm{SSTSeries}(L)$ is defined by:

$$\mathrm{SSTSeries}(L) = \sum_{t' \in L} \mathrm{SSTSeries}(t').$$
Definition 7. Let $\Sigma$ be a graded alphabet and let $t$ be a tree in $T_\Sigma$. The SST automaton associated with $t$ is the RWTA $A_t = (\Sigma_\bot, Q, \mu, \delta)$ defined by:
• $Q = \mathrm{SubTreeSet}(t) \cup \{\bot\}$,
• $\forall t' \in Q$, $\mu(t') = 1$ if $t' = t$, and $\mu(t') = 0$ otherwise,
• $\forall t' = f(t_1,\ldots,t_k) \in Q$, $\delta(f, t_1, \ldots, t_k) = t'$,
• $\forall f \in \Sigma_k$, $\delta(f, \bot, \ldots, \bot) = \{f_j(t_1,\ldots,t_k) \in Q \mid h(f_j) = f\}$.
2.5 Sequential Kernel Computation
In order to compute the kernel of two finite tree languages $X$ and $Y$, we proceed in three steps:
1. First, we construct both RWTAs $A_X$ and $A_Y$;
2. Then, we compute the intersection of $A_X$ and $A_Y$;
3. Finally, the kernel is simply computed through a sum of all the root weights of the resulting RWTA.
One can easily observe that the set of states, denoted by $Q$, is equal to $\mathrm{SubTreeSet}(t)$ in the ST case, plus $\{\bot\}$ for both the Rooted Tree and SubSet Tree cases. However, their root weight functions $\mu$ differ. In addition, their transition functions $\delta$ are defined from the ST transition table. Consequently, the RWTA construction step remains the same for the three tree substructures, while the RWTA intersection step is distinct for each tree substructure. Let us recall that in this paper we narrow the RWTA intersection and kernel computation to the ST kernel case. We first introduce the sequential step-by-step procedure that enables us to efficiently calculate tree kernels by utilizing the intersection of tree automata.
2.5.1 RWTA Construction
In this section, we describe, through an algorithm, the construction of an RWTA from a finite set of trees. Consider the finite set of trees $X$: first, we extract all the prefixes of each tree in $X$ and sum their numbers of occurrences over $X$, which is equivalent to computing the sum of the subtree series:

$$\mathrm{SubTreeSeries}(X) = \sum_{t \in X} \mathrm{SubTreeSeries}(t).$$

Algorithm 1 constructs an RWTA from a finite set of trees.
Algorithm 1: Computation of the automaton $A_X$ from $X$.
Input: $X$: set of trees
Output: RWTA $A_X = (\Sigma, Q, \mu, \delta)$
  $Q_X \leftarrow \emptyset$;
  foreach $t \in X$ do
    if $t \notin Q_X$ then
      Add($Q_X$, $t$);
      $\mu_X(t) \leftarrow 1$;
    else
      $\mu_X(t) \leftarrow \mu_X(t) + 1$;
    end
  end
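To make this concrete, here is a minimal sequential Java sketch in the spirit of Algorithm 1. The representation is our own illustrative assumption, not the authors' code: trees are plain objects, and a state is identified by the canonical writing of its subtree, so the map returned by buildRwta plays the roles of both $Q_X$ and $\mu_X$.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RwtaConstruction {

    // A tree t = f(t1, ..., tk); a leaf is a node with no children.
    static class Tree {
        final String label;
        final List<Tree> children;

        Tree(String label, Tree... children) {
            this.label = label;
            this.children = List.of(children);
        }

        // Canonical writing, e.g. "f(h(a),b)", used as the state identifier.
        String canonical() {
            if (children.isEmpty()) return label;
            List<String> parts = new ArrayList<>();
            for (Tree c : children) parts.add(c.canonical());
            return label + "(" + String.join(",", parts) + ")";
        }
    }

    // Builds mu_X: every subtree of every tree in X becomes a state whose
    // weight is its total number of occurrences, i.e. SubTreeSeries(X).
    static Map<String, Integer> buildRwta(List<Tree> X) {
        Map<String, Integer> mu = new HashMap<>();
        for (Tree t : X) countSubtrees(t, mu);
        return mu;
    }

    private static void countSubtrees(Tree t, Map<String, Integer> mu) {
        mu.merge(t.canonical(), 1, Integer::sum);
        for (Tree c : t.children) countSubtrees(c, mu);
    }

    public static void main(String[] args) {
        Tree t = new Tree("f", new Tree("h", new Tree("a")), new Tree("b"));
        // Prints {a=1, b=1, h(a)=1, f(h(a),b)=1} (iteration order may vary).
        System.out.println(buildRwta(List.of(t)));
    }
}
```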
2.5.2 RWTA Intersection
Definition 8. Let $\Sigma$ be an alphabet and let $X$ and $Y$ be two finite tree languages over $\Sigma$. The tree series of $(X,Y)$ is defined by:

$$\mathrm{SubTreeSeries}((X,Y)) = \sum_{t \in T_\Sigma} (\mathrm{SubTreeSeries}(X), t) \times (\mathrm{SubTreeSeries}(Y), t)\; t.$$
Example 4. Let $\Sigma$ be the graded alphabet defined by $\Sigma_0 = \{a,b\}$, $\Sigma_1 = \{g,h\}$ and $\Sigma_2 = \{f\}$. Let us consider the three trees $t_1 = f(f(h(a),b), g(b))$, $t_2 = f(h(a),g(b))$ and $t_3 = f(f(h(a),b), f(h(a),g(b)))$. We have:
$\mathrm{SubTreeSeries}(t_1) = t_1 + f(h(a),b) + h(a) + g(b) + a + 2b$
$\mathrm{SubTreeSeries}(t_2) = t_2 + h(a) + g(b) + a + b$
$\mathrm{SubTreeSeries}(t_3) = t_3 + f(h(a),b) + t_2 + 2h(a) + g(b) + 2a + 2b$
$\mathrm{SubTreeSeries}(\{t_1,t_2\}) = t_1 + t_2 + f(h(a),b) + 2h(a) + 2g(b) + 2a + 3b$
$\mathrm{SubTreeSeries}((\{t_1,t_2\},\{t_3\})) = t_2 + f(h(a),b) + 4h(a) + 2g(b) + 4a + 6b$
By definition, any $t$ in $T_\Sigma$ is in $Q$ if and only if $t \in \mathrm{SubTreeSet}(X) \cap \mathrm{SubTreeSet}(Y)$. Moreover, by definition of $\mu$, for any tree $t$, $\mu(t) = \mu_X(t) \times \mu_Y(t)$, since for any tree $t$, if $t \notin \mathrm{SubTreeSet}(X)$ (resp. $t \notin \mathrm{SubTreeSet}(Y)$), then $\mu_X(t) = 0$ (resp. $\mu_Y(t) = 0$).
In order to compute the kernel from the two RWTAs, Algorithm 2 below starts from a copy of $A_Y$ and loops through its states: every state $q$ that is also present in $A_X$ receives the root weight $\mu_{(X,Y)}(q) = \mu_X(q) \times \mu_Y(q)$, while the weights of the remaining states are set to $0$.
Algorithm 2: Computation of the automaton $A_{(X,Y)} = A_X \times A_Y$.
Input: two RWTAs $A_X$ and $A_Y$
Output: an RWTA $A_{(X,Y)} = A_X \times A_Y$
  $A_{(X,Y)} \leftarrow A_Y$;
  foreach $s \in Q_{(X,Y)}$ do
    if $s \in Q_X$ then
      $\mu_{(X,Y)}(s) \leftarrow \mu_X(s) \times \mu_Y(s)$;
    else
      $\mu_{(X,Y)}(s) \leftarrow 0$;
    end
  end
2.5.3 Subtree Kernel Computation
Once the RWTAs of both tree sets and of their intersection have been constructed, the tree kernel in the subtree case can be computed. Given two finite tree languages $X$ and $Y$, let $Z$ be the accessible part of their intersection tree automaton $A_{(X,Y)}$. The kernel is then simply computed through the sum of the root weights:

$$\mathrm{TreeKernel}(X,Y) = \sum_{q \in Z} \mu(q).$$
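Under the same illustrative representation as before (weight tables indexed by canonical subtree strings, as in the sketch of Section 2.5.1), steps 2 and 3 collapse into a few lines: a subtree contributes to the kernel exactly when it occurs in both tables, with the product of its two root weights. The following is a sketch of ours, not the authors' implementation.

```java
import java.util.HashMap;
import java.util.Map;

public class SubtreeKernel {

    // TreeKernel(X, Y) from the two RWTA weight tables mu_X and mu_Y:
    // the sum, over the common states, of the products of root weights.
    static long treeKernel(Map<String, Integer> muX, Map<String, Integer> muY) {
        // Iterate over the smaller table; lookups do the intersection test.
        Map<String, Integer> small = muX.size() <= muY.size() ? muX : muY;
        Map<String, Integer> large = (small == muX) ? muY : muX;
        long kernel = 0;
        for (Map.Entry<String, Integer> e : small.entrySet()) {
            Integer w = large.get(e.getKey());
            if (w != null) kernel += e.getValue().longValue() * w; // muX(q) * muY(q)
        }
        return kernel;
    }

    public static void main(String[] args) {
        // Weight tables for X = {f(h(a),b)} and Y = {f(h(a),g(b))}.
        Map<String, Integer> muX = new HashMap<>(
                Map.of("a", 1, "b", 1, "h(a)", 1, "f(h(a),b)", 1));
        Map<String, Integer> muY = new HashMap<>(
                Map.of("a", 1, "b", 1, "h(a)", 1, "g(b)", 1, "f(h(a),g(b))", 1));
        System.out.println(treeKernel(muX, muY)); // prints 3 (a, b and h(a))
    }
}
```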
3 PARALLEL TREE KERNEL COMPUTATION
In light of the inherent data parallelism within tree kernel computation, we have developed and implemented a parallel adaptation of the sequential SubTree kernel computation described above, leveraging both the MapReduce and Spark paradigms. These frameworks facilitate parallel execution in a distributed environment and offer advanced features for distributed computing, eliminating the need for manual task coordination. By breaking down large tasks into smaller, concurrently executable chunks, they streamline job scheduling, bolster fault tolerance, enhance distributed aggregation, and simplify other management tasks. Kernel computations bear a striking resemblance to the Big Data paradigm, which poses significant challenges compared to traditional data processing methods. While numerous solutions have been proposed to address the computational and storage challenges of Big Data, the MapReduce and Spark frameworks stand out as prominent methods. Before delving into our parallel implementation, we explain the MapReduce and Spark frameworks in the following sections.
3.1 MapReduce Framework
MapReduce was created by Google as a parallel, distributed programming approach that works on a cluster of computers to handle large-scale data (Dayalan, 2004). It is termed parallel because tasks are executed by multiple dedicated processing units in a parallel environment, and distributed because the data are spread over distinct storage. Hadoop is among the most popular open-source MapReduce implementations, created primarily by the Apache Software Foundation. MapReduce is a popular framework for writing programs without infrastructure complexity, thanks to its stable, easy-to-use, abstract, and scalable environment; its programming paradigm is expressed through basic MapReduce jobs.
3.1.1 Map Function
A MapReduce task involves mapping input data to specific reducers: for each unit of data, the map function generates a list of <key, value> pairs according to a mapping schema. Pairs with the same key are collected by the same reducer. The mapping schema is the most crucial element, affecting precision, time complexity, and space complexity.
3.1.2 Reduce Function
The output of the mappers serves as input for the reduce function, which receives a key associated with a list of records. The output of each reducer is a set of <key, value> pairs, and multiple values may be produced for the same key. The reduce function is applied in parallel to the input lists, and its output is written to the distributed file system.
3.1.3 MapReduce Implementations
Hadoop Framework. Hadoop is an open-source software framework that enables the efficient storage and processing of large volumes of data across clusters of computers. It consists of several key components, such as the Hadoop Distributed File System (HDFS) and the MapReduce programming model, which enable parallel processing across the cluster. Hadoop also provides tools for data ingestion, processing, analysis, and resource management, making it a popular choice for big data analytics and machine learning applications.
Spark Framework. Apache Spark is an open-source distributed data processing framework, renowned for its speed, adaptability, and versatility in handling big data tasks. It supports various data processing tasks, including batch processing, real-time stream processing and machine learning. Spark's in-memory processing enhances computation speed.
3.2 Parallel RWTA Construction
Let $X$ be a finite tree language. To construct the RWTA $A_X$ from $X$ using the MapReduce paradigm, we have to determine the Map and Reduce jobs. The prefixes of $X$ are listed in a file which is used as input. First, the Map function (Algorithm 3) splits the prefix list into subtrees. Next, the Map function emits key-value pairs of the form <key, 1>, where the key represents the subtree and the value 1 records one occurrence, i.e. the presence of the subtree at this step. Then, the Map function sends the key-value pairs to the reducers by key, so if a subtree appears multiple times, all of its occurrences are sent to the same reducer. This principle is similar to the popular word count program, where the words are subtrees. Next, the reducer (Algorithm 4) sums the values, obtaining the number of occurrences, which is exactly the weight of the subtree. Finally, the weighted subtrees are merged into a tree series.
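As an illustration, here is a minimal Hadoop sketch of these two jobs. It assumes, purely for the example, that each input line carries the tab-separated canonical subtrees of one tree; the class names are ours, and the standard word-count-style job driver is omitted.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ParallelRwtaConstruction {

    // Map: emit <subtree, 1> for every subtree occurrence (Algorithm 3).
    public static class SubtreeMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text subtree = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String s : line.toString().split("\t")) {
                subtree.set(s);
                ctx.write(subtree, ONE);
            }
        }
    }

    // Reduce: sum the occurrences of each subtree, i.e. its root weight
    // in A_X (Algorithm 4).
    public static class WeightReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text subtree, Iterable<IntWritable> ones, Context ctx)
                throws IOException, InterruptedException {
            int weight = 0;
            for (IntWritable one : ones) weight += one.get();
            ctx.write(subtree, new IntWritable(weight));
        }
    }
}
```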
3.3 Parallel RWTA Intersection
Let $X$ and $Y$ be two finite tree languages, and let $A_X$ and $A_Y$ be their respective RWTAs. In order to compute the intersection of the automata $A_X$ and $A_Y$ using the MapReduce programming model, we have to identify the Map and Reduce jobs. Let us mention that the details of both RWTAs are saved, from the previous construction step, in an input file, each of them on a separate line. First, the Map function (Algorithm 5) splits the tree series of $X$ and $Y$ respectively into subtrees together with their weights. Then, the Map function emits key-value pairs of the form <subtree, (1, weight)>, where the key represents the subtree, and the value is composed of the weight of the subtree and the number 1, which acts as a Boolean indicating that the subtree is present in the corresponding RWTA. Next, the Map function distributes its output to the reducers by key, i.e. if a subtree is present in both RWTAs, both of its records are sent to the same reducer. After that, every reducer (Algorithm 6) sums the first component of the values, which indicates in how many RWTAs the subtree is present. If this presence count is equal to 2, the reducer multiplies the weights of the subtree received from $A_X$ and $A_Y$.
Algorithm 3: Map function for the construction of $A_X$ from $X$.
Input: ($X$: set of trees) file
  while file ≠ empty do
    Split(line, S);
    foreach $s \in S$ do
      Emit($s$, 1);
    end
  end
Algorithm 4: Reduce function for the construction of $A_X$ from $X$.
Input: mapped <s, 1>
Output: RWTA $A_X$
  Weight$_s$ ← 0;
  forall mapped $s$ do
    Weight$_s$ ← Weight$_s$ + 1;
  end
  Add($A_X$, ($s$, Weight$_s$));
Algorithm 5: Map function for the computation of the automaton $A_X \times A_Y$.
Input: (RWTA $A_X$, $A_Y$) file
  while file ≠ empty do
    Split(line, Q);
    foreach $s \in Q$ do
      Emit($s$, (1, $\mu(s)$));
    end
  end
Algorithm 6: Reduce function for the computation of the intersection automaton $A_X \times A_Y$.
Input: mapped <s, (1, $\mu(s)$)>
Output: RWTA $A_X \times A_Y$
  Presence$_s$ ← 0;
  Weight$_s$ ← 1;
  forall mapped $s$ do
    Presence$_s$ ← Presence$_s$ + 1;
    Weight$_s$ ← Weight$_s$ × $\mu(s)$;
  end
  if Presence$_s$ = 2 then
    Add($A_X \times A_Y$, ($s$, Weight$_s$));
  end
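For comparison, the whole pipeline is particularly compact in Spark. The following sketch is our own illustration, assuming input files of subtree<TAB>weight lines produced by the construction step: join retains exactly the subtrees present in both RWTAs (the presence = 2 test of Algorithm 6), after which the weights are multiplied and summed to yield the SubTree kernel.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkSubtreeKernel {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SubTreeKernel");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaPairRDD<String, Integer> muX = loadWeights(sc, args[0]);
            JavaPairRDD<String, Integer> muY = loadWeights(sc, args[1]);
            // join keeps exactly the subtrees present in both RWTAs; each
            // surviving subtree carries its two root weights, which are
            // multiplied and then summed.
            long kernel = muX.join(muY)
                             .mapValues(w -> w._1().longValue() * w._2())
                             .values()
                             .reduce(Long::sum);
            System.out.println("TreeKernel(X,Y) = " + kernel);
        }
    }

    // Each line of a weight file is assumed to be "subtree<TAB>weight".
    private static JavaPairRDD<String, Integer> loadWeights(
            JavaSparkContext sc, String path) {
        return sc.textFile(path).mapToPair(line -> {
            String[] parts = line.split("\t");
            return new Tuple2<>(parts[0], Integer.parseInt(parts[1]));
        });
    }
}
```

Delegating the grouping to join and reduceByKey keeps the computation distributed until the final reduce, which mirrors the role of the reducers in Algorithms 4 and 6.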
4 EXPERIMENTS AND RESULTS
To analyse our parallel RWTA-based SubTree kernel computation, we performed a batch of comparative experiments in order to measure the difference in latency between our parallel algorithm and the sequential one, using the MapReduce and Spark frameworks. Additionally, we use the absolute acceleration metric defined by $A_{abs} = T_{seq}/T_{par}$, where $T_{seq}$ and $T_{par}$ are the running times of the sequential and the parallel algorithms, respectively. Prior to presenting our findings, we first describe the benchmark we constructed and outline the implementation details.
4.1 Dataset
In order to perform the comparative study of both variants of the algorithm, we need a testbed of multiple datasets covering a wide variety of tree characteristics, allowing a deep analysis of the algorithms; this is not the case for real-world datasets, which are standard benchmarks for learning on relatively small trees. For that purpose, in our experiments, we were led to generate synthetic datasets. For this dataset-building task, we have mainly taken into account three criteria: i) the alphabet size (varying between 2 and 12), ii) the range of the maximal alphabet arity (between 1 and 5), and iii) the tree depth (TD), which we varied within the range of 10 to 50. The constructed tree datasets are divided into two batches according to the alphabet size in [2,12]. The first batch gathers the four datasets D1, D2, D3 and D4, each containing two tree sets generated using the above principle. Within each batch, the datasets have different sizes, which we have classified according to their average tree size into three classes (small: less than 2 GB; medium: less than 2.5 GB; large: more than 4 GB). Table 1 gives further details on the generated datasets.
Table 1: Details on the generated datasets.
      Trees   Σ        Arity    TD        Size (GB)
D1    500     [2,12]   [1,5]    [10,20]   1.5
D2    800     [2,12]   [1,5]    [10,50]   2.5
D3    3000    [2,12]   [2,5]    [10,50]   4.4
D4    4500    [2,12]   [2,5]    [10,50]   7
Figure 3: Performances of the parallel algorithm vs. the sequential one in terms of running time (minutes). [Bar chart; per-dataset running times over D1-D4 range from 97.2 to 431 minutes for the sequential version, from 0.87 to 8.67 minutes with Hadoop, and from 0.81 to 2.63 minutes with Spark.]
Both the sequential and parallel algorithms are implemented in Java 11. All experiments were performed on a server equipped with an Intel(R) Xeon(R) Silver 4216 CPU (2.10 GHz) with 32 cores and 128 GB of RAM, running Linux. The sequential and parallel codes, in addition to the generated datasets, are available on GitHub.
In this study, we have established a fixed cluster architecture, utilizing Docker containers for conducting all tests. Our cluster comprises one container designated as the Master and five containers serving as Slaves. It is important to note that this cluster configuration was chosen arbitrarily for the purposes of this research. Hadoop¹ V3.3.0 is installed as the MapReduce implementation platform on our cluster, and Spark² V3.4.1 serves as the infrastructure for running Spark jobs.
4.2 Results
Figure 3 reports the performance of our parallel SubTree kernel computation versus the sequential algorithm in terms of running time on the generated datasets. Let us mention that the sequential time is obtained on one node (container) of our cluster.
¹ https://hadoop.apache.org/
² https://spark.apache.org/
Our parallel computation is clearly significantly faster than the sequential version across all dataset instances. Furthermore, our analysis reveals that the average absolute acceleration achieved using MapReduce is 50.8 (Table 2), while the average absolute acceleration obtained with Spark is 94.3 (Table 3). This substantial acceleration reflects the effectiveness of our parallel computation approach.
Table 2: Acceleration of the parallel SubTree kernel computation on the different datasets using MapReduce.
          D1     D2     D3     D4     Average
A_abs     57.2   62.9   42.6   40.7   50.8
Table 3: Acceleration of the parallel SubTree kernel computation on the different datasets using Spark.
          D1     D2     D3     D4     Average
A_abs     61.4   95     86.8   134.3  94.3
5 CONCLUSION
The prefix tree automaton constitutes a common base for the computation of different tree kernels: the SubTree, RootedTree, and SubSequenceTree kernels (Ouali-Sebti, 2015). In this paper, we have presented a parallel algorithm that efficiently computes this common structure (the RWTA automaton), and we have used it for the computation of the SubTree kernel with the MapReduce and Spark frameworks. Our parallel implementation of the SubTree kernel computation has been tested on synthetic datasets with different parameters. The results showed that our parallel computation is by far faster than the sequential version for all dataset instances. Although this work has shown the efficiency of the parallel implementation compared to the sequential algorithms, three main future works are envisaged. Firstly, we will devise algorithms that generalise the computation to other kernels such as the RootedTree and SubSequenceTree kernels. Some of them will deploy tree automata intersection in addition to the associated weight computation; in fact, while the SubTree kernel is a simple summation of weights, the SubSequenceTree kernel needs more investigation into the weight computations using the resulting RWTA intersection. Secondly, larger datasets have to be generated and tested to confirm the output-sensitive behaviour of our solutions. Finally, one can investigate different cluster architectures in order to give more insights and recommendations on the tuning of cluster parameters.
REFERENCES
Alian, M. and Awajan, A. (2023). Syntactic-semantic similarity based on dependency tree kernel. Arabian Journal for Science and Engineering, pages 1-12.
Chali, Y. and Hasan, S. A. (2015). Towards topic-to-question generation. Computational Linguistics, 41(1):1-20.
Collins, M. and Duffy, N. (2001). Convolution kernels for natural language. In Dietterich, T., Becker, S., and Ghahramani, Z., editors, Advances in Neural Information Processing Systems, volume 14. MIT Press.
Collins, M. and Duffy, N. (2002). New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In Annual Meeting of the Association for Computational Linguistics.
Dayalan, M. (2004). MapReduce: Simplified data processing on large clusters. In CACM.
Ésik, Z. and Kuich, W. (2002). Formal tree series. BRICS Report Series, (21).
Fu, D., Xu, Y., Yu, H., and Yang, B. (2017). WASTK: A weighted abstract syntax tree kernel method for source code plagiarism detection. Scientific Programming, 2017.
Gordon, M. and Ross-Murphy, S. B. (1975). The structure and properties of molecular trees and networks. Pure and Applied Chemistry, 43(1-2):1-26.
Haussler, D. (1999). Convolution kernels on discrete structures. Technical report, University of California at Santa Cruz.
Maneth, S., Mihaylov, N., and Sakr, S. (2008). XML tree structure compression. In 2008 19th International Workshop on Database and Expert Systems Applications, pages 243-247.
Mignot, L., Ouardi, F., and Ziadi, D. (2023). New linear-time algorithm for subtree kernel computation based on root-weighted tree automata.
Mignot, L., Sebti, N. O., and Ziadi, D. (2015). Root-weighted tree automata and their applications to tree kernels. CoRR, abs/1501.03895.
Nasar, Z., Jaffry, S. W., and Malik, M. K. (2021). Named entity recognition and relation extraction: State-of-the-art. ACM Comput. Surv., 54(1).
Ouali-Sebti, N. (2015). Noyaux rationnels et automates d'arbres. PhD thesis, Université de Rouen.
Shatnawi, M. and Belkhouche, B. (2012). Parse trees of Arabic sentences using the natural language toolkit.
Thom, J. D. (2018). Combining tree kernels and text embeddings for plagiarism detection. PhD thesis, Stellenbosch University.
Vishwanathan, S. V. N. and Smola, A. (2002). Fast kernels for string and tree matching. In NIPS.
Warikoo, N., Chang, Y.-C., and Hsu, W.-L. (2018). LPTK: A linguistic pattern-aware dependency tree kernel approach for the BioCreative VI CHEMPROT task. Database, 2018.