Towards a Model Transformation Tool on the Top of
the OpenCL Framework
Tamás Fekete and Gergely Mezei
Budapest University of Technology and Economics, Budapest, Hungary
Keywords: Model Transformation, GPGPU, OpenCL, Pattern Matching, Graph Rewriting.
Abstract: Nowadays, applications must often handle large amounts of data and apply complex algorithms to them.
Parallel computation is a promising and popular way to meet the performance
requirements. Since GPUs are designed to perform highly parallel computations efficiently, the CPU+GPU
heterogeneous architecture has gained increasing popularity in computation-intensive applications.
Model-driven engineering (MDE) is a widely used software development methodology in the software
industry. MDE relies heavily on model transformations for converting and processing models. Graph
transformation-based model transformation is a popular technique in this field. It is based on finding isomorphic
subgraphs, which often requires considerable computing power. Currently, model transformation tools are
not capable of using the computation power of GPUs. Our research goal is to create a general model
matching and, later, a model transformation solution that can take advantage of the computation power
of GPUs. We are currently focusing on the pattern matching part of the transformations. We would like to create a
general solution that is independent of the hardware vendor; therefore, our method is based on the OpenCL
framework. The novelty of this paper is a GPGPU-based pattern matching tool and some acceleration
techniques to achieve faster computation. In this paper, we present an overview of the solution and test results
based on one of the biggest freely available movie databases (IMDb). The main properties, such as
performance and scalability, are discussed. The applied architecture and the steps towards the final solution
are also included in the paper.
1 INTRODUCTION
Nowadays, software applications process and handle
huge amounts of data. Therefore, the execution of
complex algorithms on this data often becomes a
very time-consuming operation. Using parallelized
algorithms is a promising way to improve the
performance of the applications. This requires a hardware
device that supports massive parallelism. GPUs
seem to be a perfect candidate, since nowadays they
tend to have thousands of computation units for
parallel evaluation. The CPU+GPU heterogeneous
architecture has enough computing power to run
extremely fast algorithms. The processing power of
GPUs is already widely applied in several fields, such as
image or audio processing. This trend can be
extended to new domains as well.
Numerous kinds of GPUs can be found on the
market. Using a vendor-specific or model-specific language
and framework for each of them would require a tremendous effort. To
avoid this, the OpenCL framework was created
in 2009. OpenCL is a platform-independent
framework that can be used to handle the most
widely used GPUs uniformly. OpenCL is an interface
defined by the Khronos Group (Khronos Group’s
website, 2015), and each product vendor has its own
implementation. Although these implementations
differ from each other, the common interface ensures
compatibility between the vendors.
Model-driven engineering (MDE) is a widely
applied software development methodology in the
software industry. Models are not only created for
presentation purposes, but they are also transformed,
processed and often used directly or indirectly as the
basis of code generation. Therefore, finding
efficient model transformation techniques is an
important and challenging part of MDE. Several
techniques exist; one of the most popular is
graph rewriting-based transformation, which is also referred
to as graph transformation. Graph transformation is
based on an NP-complete problem (subgraph
isomorphism) and may need a considerable amount of time
depending on the size of the input model. A solution
for the performance issue can be to use the
aforementioned CPU+GPU architecture.
Fekete, T. and Mezei, G.
Towards a Model Transformation Tool on the Top of the OpenCL Framework.
DOI: 10.5220/0005792903550360
In Proceedings of the 4th International Conference on Model-Driven Engineering and Software Development (MODELSWARD 2016), pages 355-360
ISBN: 978-989-758-168-7
Copyright © 2016 by SCITEPRESS Science and Technology Publications, Lda. All rights reserved
Our overall goal is to create a general model
transformation tool that uses the computing power of
GPUs. The tool is referred to as the GPGPU-based
Engine for Model Processing (GEMP). We have already taken
several steps towards creating GEMP (Fekete and Mezei,
2015). In that paper, we concluded that the usage of
GPUs in graph transformation tools is a promising
direction. In this paper, we follow this path by
introducing a general, GPU-based solution for
pattern matching.
The rest of the paper is organized as follows. In
Section 2, the features of the OpenCL framework are
collected and compared with other solutions;
a short overview of model transformation
tools is also given. In Section 3, we describe the input domain
model that is used for illustrating the
non-functional properties of GEMP. In Section 4,
an architectural overview is given. Sections 5 and 6
describe the mapping of the data to the
OpenCL framework and introduce the core
algorithms. In Section 7, the main properties of the
tool are discussed, namely the performance and the
scalability. Finally, in Section 8, we conclude and
give an outlook on possible research
directions.
2 RELATED WORK
There are several possibilities to realize
heterogeneous computation tasks, especially for
CPU+GPU-based platforms. One of them is the usage
of the popular OpenCL framework. Another widely
applied way of accessing GPUs is
CUDA (CUDA’s website, 2015). The question arises
which of them is better from
viewpoints such as performance, scalability, or the
difficulty of integration. In (Veerasamy
et al., 2014), the usage of both
CUDA and OpenCL is described in detail, but no
recommendation is given as to which is better to integrate.
Both have their own advantages, which
depend heavily on the actual problem to be
solved. Our main reason for using the OpenCL
framework over CUDA is that OpenCL is supported
by many hardware manufacturers, each implementing its
interfaces.
In (Yan et al., 2014), benchmarks show
how effectively the OpenCL framework
can be used on different hardware
components. The authors compared several hardware
components, such as multi-core CPUs and AMD and
NVIDIA GPUs. They also considered the differences
between their own and other existing results. They
measured the GFLOPS on both GPUs and CPUs and
reported their experiments for different kinds
of tasks.
Using the OpenCL framework is
challenging for those who do not have experience in
hardware-level programming. This is probably the
main reason why there are so many libraries for
OpenCL and for other multi-platform environments.
In (Viñas et al., 2015), the extension of the
Heterogeneous Programming Library with
OpenCL is discussed; this extension provides easier access to
the OpenCL framework. The paper reports a significant
decrease in the number of source code lines
needed.
There are many OpenCL-based graph libraries
and wrappers that can be used in realizing an
OpenCL-based model transformation tool. Using
GPUs in a graph library is introduced by (Che et al.,
2014). In that paper, solutions to several popular
graph problems are modified to obtain their
GPU-based versions. The authors also recognized that
programming the GPU can be difficult for a regular
programmer and that implementing a graph
application can take considerable effort. Therefore, they
created a library called BelRed. Its software
building blocks are implemented on top of the
OpenCL framework. The performance of the library
is demonstrated in a case study. The main advantages
of the BelRed library are its portability and the fact
that the programmer does not need to deal with writing
kernel code or with hardware-level details.
In (Xu et al., 2014), the authors show how
important and critical graph processing components
are nowadays. The paper also focuses on the
mapping of algorithms between the host and the GPU,
which is the biggest challenge in the effective usage
of GPUs. They compiled 12 graph applications for
the GPU device, studied their performance and
suggested several approaches to accelerate
GPU-based algorithms.
There are several papers and studies that collect
and classify model transformation tools. In
(Jakumeit et al., 2014), several tools
are described, for example GREAT, IncQuery,
Fujaba, Groove, Henshin, MOLA and Viatra2. None
of these tools can use the power of GPUs in model
processing. Our current research focuses on the
challenges of creating a general-purpose pattern
matching tool, which is the first step towards creating
a GPU-based model transformation engine. The
verification and validation of GEMP is carried out on
two levels: (1) Low-level functional properties are
tested using unit tests. (2) For testing and illustrating
the high-level non-functional properties, a domain
model with a very large data source is introduced.
3 INPUT DOMAIN MODEL
In this section, we introduce the input domain model
that is used to illustrate the non-functional
properties of GEMP, especially the performance and
the scalability. The Internet Movie Database (IMDb) (IMDb’s
website, 2015) is used as the input data.
IMDb is the largest film and TV show related
database. It has approximately 3.3 million titles and 6.5
million personalities (actors, directors, etc.). A huge
input model can be created from IMDb, and several
interesting and complex patterns can be searched in
this database for testing purposes. In the current
research, a text file-based version of the database is used, which
contains information on several domain concepts, such as
movies (subtitle, creation time, and rating), actors (with
the films they played in), and producers (with their films). Figure 1
shows an example of a pattern to be searched.
Figure 1: Example for a pattern to be searched.
4 ARCHITECTURE
The OpenCL kernel source code must be written in
OpenCL C, which is based on the C99 language (with some restrictions). The
language of GEMP is C++11. OpenCL versions are
backwards compatible. Since NVIDIA GPUs support
only OpenCL version 1.2, we decided to use this
version in order to maximize hardware independence.
4.1 Component Model
Each component has a well-separated functionality.
We used the interface-oriented programming
paradigm so that the services of GEMP can be accessed easily.
The most frequently used design patterns are the adapter
and the façade. For example, the
CGPUAccess component provides the connection
between the OpenCL API and the base libraries of the tool,
and it is implemented as an adapter, while
CTaskRunner is a façade. The domain model is
created in the CModelManagement component, which
can also be used for other domains. The most important
components of the framework are shown in Figure 2
and in the list below (with their key roles).
(1) CGPUAccess: It creates the main OpenCL
context and the command queue. Some objects are
delegated to other components which need access
to the GPU device. Inside of the component (2)
initializes the connection between the host and the
GPU device using the OpenCL API. At the same
time, this context stores each run-time object
provided by the OpenCL API.
(2) Business Logic (CPrephase1, CPrephase2,
CPhase1, CPhase2): This component realizes the
business logic (BL) of the tool. Each of its
inner components is easily exchangeable and
extendable. The BL uses the kernel source code
and is responsible for the compilation process, using
other components.
(3) CTaskRunner: The tasks are scheduled here. All
time-scheduling tasks must happen in this
component.
(4) CLogging: It is responsible not only for logging,
but also for providing the output models.
(5) IMDbAccess: It manages reading of the domain
model and provides the graph data to (6).
(6) CModelManagement: It is responsible for the
domain model creation. It can also create the input
graph and process the results.
Figure 2: The base component model with the most
important blocks.
4.2 Dynamic Behaviour and Testing
First of all, GEMP provides interfaces for
task evaluation and for configuration. As soon as
a task is chosen, it is queued for execution. The
CModelManagement prepares the domain model for
the tasks. In the current case, the IMDb source is the
only one that can be processed. The CTaskRunner
manages the configuration according to the user
settings. Pattern matching is executed in two main
steps. In the first step, we search for topological
matches of the pattern. Nodes are represented by
their IDs. In the second step, we take the topologically
matching parts of the host model (the results of the
first step) and evaluate the attribute constraints on them.
The reason for dividing the matching algorithm into
two steps is that copying all nodes of the host model
with all of their attributes to the GPU memory would
be inefficient. By using two steps, we can reduce the
amount of data to copy: we need to copy
attributes only for those nodes that are part of a
topologically matching structure.
Following the test-driven development
methodology (TDD) (Canfora et
al., 2006), unit tests are created right at the beginning
of the implementation. Test cases are created and the
expected results are calculated by hand. At every run
of the unit tests, the framework automatically compares the received
and the expected results. A few
examples of the unit tests: (1) There are cases when
there is no result. Handling an empty result buffer
must not cause a failure. (2) There are
scenarios in which the results overlap. We must find each
of them in this case. (3) There are test cases for error
handling. We must be able to log information about
all unwanted events from inside and outside of the
kernel source.
5 DATA MAPPING
The model graph is represented as a hash table, in which
the ID of each vertex and the list of its neighbours
are stored. The advantage of using a hash table is that
finding an element based on its ID requires O(1) time on average.
Although a hash table needs slightly more
memory, time is more important in the
current case. The result is stored in a different kind of
structure. There are two kinds of data that must be
mapped. In the first step of matching (topological
check), graphs consist of numerical elements (IDs).
In the second step (attribute check), attributes are
represented by strings (they are serialized to strings).
In the case of graph mapping, the graph is converted
as illustrated in Figure 3 in order to achieve the
required format (passing 2D arrays is not
allowed in OpenCL 1.2). The original, two-dimensional
structure of the graph is mapped into two
one-dimensional structures: (1) The first structure
contains the neighbour lists one after another, from
the first to the last vertex. (2) The second structure
contains the starting positions of the neighbour lists.
The second structure is a helper for processing the
first one. Using these two arrays and the size of
the second array, any kind of graph can be passed.
The advantages of this structure are its degree of
compression and the fast access to the elements of the
graph.
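The two-array mapping described above can be sketched in host-side C++ as follows. This is only a minimal illustration, not the GEMP implementation: the names FlatGraph, flatten, degree, neighbours and offsets are hypothetical, vertex IDs are assumed to be dense indices starting at 0, and one extra end marker is appended to the offset array (an assumption of this sketch) so that the length of every neighbour list can be computed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for the two one-dimensional arrays.
struct FlatGraph {
    std::vector<unsigned> neighbours; // all neighbour lists, one after another
    std::vector<unsigned> offsets;    // offsets[v] = start of v's list; last entry = end marker
};

// Maps a two-dimensional adjacency structure into the two flat arrays.
FlatGraph flatten(const std::vector<std::vector<unsigned>>& adjacency) {
    FlatGraph g;
    g.offsets.reserve(adjacency.size() + 1);
    for (const auto& list : adjacency) {
        g.offsets.push_back(static_cast<unsigned>(g.neighbours.size()));
        for (unsigned target : list) {
            assert(target < adjacency.size()); // IDs are assumed to be dense indices
            g.neighbours.push_back(target);
        }
    }
    // End marker: makes the length of the last list computable.
    g.offsets.push_back(static_cast<unsigned>(g.neighbours.size()));
    return g;
}

// Number of neighbours of vertex v, read from the helper array only.
unsigned degree(const FlatGraph& g, unsigned v) {
    return g.offsets[v + 1] - g.offsets[v];
}
```

For the adjacency lists {1, 2}, {0} and {0, 1}, the sketch produces the neighbour array 1 2 0 0 1 and the offset array 0 2 3 5. Both arrays are contiguous, so each could be copied into a one-dimensional OpenCL buffer.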
Figure 3: The input of the OpenCL kernel code must be a
one dimensional array.
Figure 4: Concatenate string to OpenCL kernel code.
In the second step, a different data structure is
needed. Here, the OpenCL buffer must be big
enough to hold every attribute. A helper array is again
used to store the starting position of each attribute. Figure 4
shows this kind of mapping.
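A possible host-side version of this string mapping is sketched below. The names PackedAttributes, packAttributes and attributeAt are hypothetical, and the final end marker in the helper array is an assumption of this sketch rather than a documented detail of GEMP.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container: one character buffer plus a helper array of start positions.
struct PackedAttributes {
    std::string chars;            // all attribute strings, concatenated
    std::vector<unsigned> starts; // starts[i] = first character of attribute i; last entry = end marker
};

// Concatenates the attributes and records where each one begins.
PackedAttributes packAttributes(const std::vector<std::string>& attributes) {
    PackedAttributes p;
    for (const std::string& a : attributes) {
        p.starts.push_back(static_cast<unsigned>(p.chars.size()));
        p.chars += a;
    }
    p.starts.push_back(static_cast<unsigned>(p.chars.size()));
    return p;
}

// Recovers attribute i from the packed form (what a worker thread would read).
std::string attributeAt(const PackedAttributes& p, std::size_t i) {
    assert(i + 1 < p.starts.size());
    return p.chars.substr(p.starts[i], p.starts[i + 1] - p.starts[i]);
}
```

Because the start positions are stored explicitly, no separator character is needed and empty attributes are handled without a special case.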
As for the result buffer (the output data from
the GPU), there is a common reserved space in the
global memory of the GPU device. Each thread can reserve
a part of it and fill it at any time. As soon as a thread
cannot reserve memory for a new
result, a buffer overflow occurs and is handled.
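One common way to let many threads reserve parts of a shared buffer is an atomically incremented write position. The sketch below shows this idea on the host side with std::atomic; whether GEMP uses exactly this mechanism is not stated in the paper, so the technique and the names ResultBuffer and reserve are assumptions. In an OpenCL 1.2 kernel, the same role could be played by the built-in atomic_add on a counter in global memory.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical shared result buffer with an atomic write position.
struct ResultBuffer {
    std::vector<unsigned> data;
    std::atomic<unsigned> next;  // first free slot
    std::atomic<bool> overflow;  // set when a reservation does not fit
    explicit ResultBuffer(std::size_t capacity)
        : data(capacity), next(0), overflow(false) {}
};

// Reserves `count` consecutive slots. Returns the start index,
// or -1 if the buffer is full (the caller then has to handle the overflow).
long reserve(ResultBuffer& buf, unsigned count) {
    unsigned start = buf.next.fetch_add(count);
    if (static_cast<std::size_t>(start) + count > buf.data.size()) {
        buf.overflow.store(true);
        return -1;
    }
    return static_cast<long>(start);
}
```

Since fetch_add returns the previous value, two threads can never receive overlapping slices. After an overflow, every later reservation fails as well, which gives the host a clear signal to read the results and start another round.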
6 KERNEL SOURCE
There are several important aspects of creating
effective kernel code; one of them is memory
management. There are four kinds of memory areas
in the OpenCL programming model: private, local,
global and host-side memory. Another aspect
is choosing the number of working threads
(referred to as the local group number). Choosing the
right number depends on the hardware device, and
therefore GEMP must recognize the GPU and
manage the configuration accordingly. Similarly,
the choice of data types matters on different GPUs.
We measured several cases: using unsigned integers is no worse than
using other types, and it provides a larger value range for storing
bigger graphs.
In the first kernel code, the size of a result (one
match of the searched pattern) is known, and every place in it must be filled. The
algorithm looks for a candidate for the current place
and, if one has been found, it tries to find a candidate
for the next place (Fig. 5). As soon as there is no
candidate left, the algorithm steps back and tries to
find another matching vertex for the previous place. If the
algorithm reaches the first element or the element
after the last one, it ends (with failure or success,
respectively). The algorithm can be illustrated by a
state machine that examines the current state and
decides whether to accept it and step to the next state.
The search does not go deeper in the graph than the
size of the pattern.
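The backtracking search described above can be sketched as a sequential C++ function that plays the role of one worker thread. This is an illustration under stated assumptions, not the GEMP kernel: the names Ids, hasEdge, fits and findMatch are hypothetical, the host graph is given as two flat arrays (neighbour lists plus start offsets with a final end marker), the pattern is given as adjacency lists without self-loops, place 0 is fixed to the start vertex of the thread, candidates are simply scanned over all host vertices for brevity, and only the first match per start vertex is returned.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<unsigned> Ids;

// True if the flattened host graph contains the edge from -> to.
bool hasEdge(const Ids& neighbours, const Ids& offsets, unsigned from, unsigned to) {
    for (unsigned i = offsets[from]; i < offsets[from + 1]; ++i)
        if (neighbours[i] == to) return true;
    return false;
}

// Can host vertex `cand` fill place `pos`, given the places 0..pos-1 already filled?
bool fits(const Ids& neighbours, const Ids& offsets, const std::vector<Ids>& pattern,
          const Ids& match, std::size_t pos, unsigned cand) {
    for (std::size_t j = 0; j < pos; ++j) {
        if (match[j] == cand) return false;            // each host vertex is used once
        for (unsigned q : pattern[j])                  // pattern edge j -> pos
            if (q == pos && !hasEdge(neighbours, offsets, match[j], cand)) return false;
    }
    for (unsigned q : pattern[pos])                    // pattern edge pos -> q
        if (q < pos && !hasEdge(neighbours, offsets, cand, match[q])) return false;
    return true;
}

// Returns the matched host vertices (one per pattern place), or an empty list on failure.
Ids findMatch(const Ids& neighbours, const Ids& offsets,
              const std::vector<Ids>& pattern, unsigned start) {
    assert(!pattern.empty() && !offsets.empty());
    const std::size_t k = pattern.size();
    const unsigned n = static_cast<unsigned>(offsets.size() - 1);
    assert(start < n);
    Ids match(k, 0);
    Ids next(k, 0);          // next candidate to try at each place
    match[0] = start;
    std::size_t pos = 1;
    while (pos >= 1 && pos < k) {
        bool found = false;
        while (next[pos] < n) {
            unsigned cand = next[pos]++;
            if (fits(neighbours, offsets, pattern, match, pos, cand)) {
                match[pos] = cand;
                found = true;
                break;
            }
        }
        if (found) {
            ++pos;                       // accept and step to the next place
            if (pos < k) next[pos] = 0;
        } else {
            --pos;                       // no candidate left: step back
        }
    }
    return pos == k ? match : Ids();     // after the last place: success; back at place 0: failure
}
```

In a host graph where vertices 0, 1 and 2 form a triangle and vertex 3 is connected only to vertex 0, searching for a triangle pattern from start vertex 0 yields the match 0, 1, 2, while the search from start vertex 3 steps back to the first place and ends with failure.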
Figure 5: The task is to find matching result to the pattern.
In the second phase, the attributes (strings) of the
vertices must be passed to the GPU device. Only the
attributes belonging to the results of the first phase are passed and
processed one by one. Each thread processes one
attribute (one vertex has one attribute). Processing
means checking the requested condition on the given
attribute. For instance, suppose we want to find each match
in which 4 actors play in the same movie and the actor
has the name Jack. The pattern graph has four nodes in
this case, and every topological match is found by the first
kernel. In the second step, only the attributes of these
results are copied and processed to find the matches in which
the name attribute contains the name Jack.
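The attribute check of the second phase can be illustrated with a short sequential sketch. The names Match and filterByAttribute are hypothetical, attributes are assumed to be indexed by vertex ID, and the condition is assumed to be a substring test on one chosen place of the pattern; on the GPU, each attribute would be examined by its own thread instead of a loop.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

typedef std::vector<unsigned> Match; // host vertex IDs, one per pattern place

// Keeps only those topological matches whose vertex at `place`
// has an attribute containing `needle`.
std::vector<Match> filterByAttribute(const std::vector<Match>& matches,
                                     const std::vector<std::string>& attributes,
                                     std::size_t place, const std::string& needle) {
    std::vector<Match> kept;
    for (const Match& m : matches) {
        assert(place < m.size());
        const std::string& attr = attributes[m[place]];
        if (attr.find(needle) != std::string::npos) kept.push_back(m);
    }
    return kept;
}
```

Only the attributes of vertices that occur in some topological match are ever looked at, which is the reason for running the attribute check as a separate second step.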
7 PERFORMANCE AND
SCALABILITY
Both the time and the memory consumption must
be monitored during execution in order to study the performance and the
scalability. The following time
measurement points are identified: (1) Reading the
input data. (2) Converting the data to the OpenCL input
format. (3) Preparing the kernel code, copying the
data to the GPU device, managing the computation
and reading the result. (4) Processing and reading the
result. (5) Creating the string and other buffers for the
second phase. (6) Running the kernel of the second
phase. (7) Providing the result of the second phase.
Regarding memory usage, there are also
predefined points at which the size of the memory is
limited for testing purposes: (1) Allocating memory
for the input model. (2) Creating the result buffer on the
GPU device. (3) Creating the string input buffer on the
GPU device in the second phase.
7.1 Input Model
Both the first (topological matching) and the second
(attribute checking) phase have to be able to deal with
large input graphs. GEMP divides the graphs
(generated from the IMDb database in the current
case) randomly and processes them in several rounds.
The missing results are processed on the host side
in parallel with the GPU threads. According to our
measurements, a heavily divided graph yields fewer
matches on the GPU device. The division of the input
graph was studied and the measurements were evaluated in an
earlier paper (Fekete and Mezei, 2015).
In this paper, we introduce one additional step in
the first kernel, namely the use of a pivot point in pattern
matching. We select the first vertex to be matched in the
pattern and find all of its candidates in the host model, thus
creating the starting points for matching. Then, the possible
starting points are counted, and a worker thread is started
for each of these vertices. This means that we can
significantly reduce the number of threads (all host
nodes vs. nodes matching the pivot point of the
pattern). We tested this with several examples using the
IMDb database and achieved a performance increase of more than 10%
in our test cases.
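The pivot step can be sketched as a simple pre-filter over the host vertices. The paper does not state which criterion decides whether a host vertex is a candidate for the pivot, so the sketch below assumes a degree-based test (a host vertex qualifies if it has at least as many neighbours as the pivot vertex of the pattern); the function name pivotCandidates and the offset array with a final end marker are likewise assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the host vertices that may be matched to the pivot vertex of the pattern.
// offsets[v]..offsets[v + 1] delimits the neighbour list of host vertex v.
std::vector<unsigned> pivotCandidates(const std::vector<unsigned>& offsets,
                                      unsigned pivotDegree) {
    std::vector<unsigned> starts;
    for (unsigned v = 0; v + 1 < offsets.size(); ++v) {
        unsigned degree = offsets[v + 1] - offsets[v];
        if (degree >= pivotDegree) starts.push_back(v); // assumed candidate test
    }
    return starts;
}
```

The size of the returned list is the number of worker threads to start. For a host graph with the offsets 0 3 5 7 8 and a pivot of degree 2, three threads are started instead of four, one for each of the vertices 0, 1 and 2.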
7.2 Result Buffer and Processing
If the number of worker threads is not
configured optimally for a round (which can be necessary if the graph
cannot be processed in a single round), the
computation can slow down. Because the number of
worker threads is this important,
three formulas were developed earlier
(Formulas 1-3). If the kernel must be started too
frequently (because of buffer overflow or a barely used buffer), the
performance decreases.
\[ \mathit{newThreadNum} = \mathit{initThreadNum} \cdot \frac{\mathit{maxBufferSize}}{\mathit{currentBufferSize}} \cdot C \tag{1} \]
\[ \mathit{newThreadNum} = \mathit{initThreadNum} \cdot C \tag{2} \]
\[ \frac{\mathit{maxBufferSize}}{\mathit{currentBufferSize}} > C \tag{3} \]
Here, newThreadNum refers to the number of
global worker threads that must be used in the
subsequent runs of the kernel code. The number of
global worker threads used in the test measurement is
denoted by initThreadNum. C can be any
positive number; it increases or decreases the
speed at which the buffer size changes. maxBufferSize
denotes how many numbers
can be stored in the buffer, and currentBufferSize denotes the size of the buffer used in the
current round.
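As a small worked example, Formulas 1-3 can be written as three C++ functions. This follows our reading of the formulas; the function names are hypothetical, and since the text does not spell out how the three formulas are combined, they are kept separate here instead of being merged into one decision rule.

```cpp
#include <cassert>

// Formula (1): scale the initial thread count by the buffer ratio and by C.
unsigned threadNumFromBufferRatio(unsigned initThreadNum, double maxBufferSize,
                                  double currentBufferSize, double c) {
    assert(currentBufferSize > 0.0 && c > 0.0);
    return static_cast<unsigned>(initThreadNum * (maxBufferSize / currentBufferSize) * c);
}

// Formula (2): scale the initial thread count by C only.
unsigned threadNumFromConstant(unsigned initThreadNum, double c) {
    assert(c > 0.0);
    return static_cast<unsigned>(initThreadNum * c);
}

// Formula (3): does the buffer ratio exceed C?
bool bufferRatioExceeds(double maxBufferSize, double currentBufferSize, double c) {
    assert(currentBufferSize > 0.0);
    return maxBufferSize / currentBufferSize > c;
}
```

With initThreadNum = 1000, maxBufferSize = 4096, currentBufferSize = 1024 and C = 0.5, Formula 1 gives 1000 * 4 * 0.5 = 2000 threads, Formula 2 gives 500 threads, and the condition of Formula 3 holds because 4 > 0.5.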
Figure 6: One attribute list can appear in several results.
One attribute can appear in several results, and
copying every occurrence would only be a waste of
resources. Instead, each attribute is copied only once,
as illustrated in Figure 6. In this case, an
additional helper structure is applied. The result of the
test measurement depends heavily on the input model
and on the pattern to be searched, but in this case we
could also achieve a performance gain of at least 10%.
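The idea of copying each attribute only once can be sketched as follows. The layout of the helper structure is not described in detail in the paper, so the index table used here is an assumption, as are the names UniqueAttributes and collectOnce; attributes are assumed to be indexed by vertex ID.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper structure: unique attribute values plus an index table.
struct UniqueAttributes {
    std::vector<std::string> values;          // every needed attribute, stored once
    std::vector<std::vector<unsigned>> slots; // slots[r][i] = index into values for entry i of result r
};

// Collects the attributes of all result vertices without duplicates.
UniqueAttributes collectOnce(const std::vector<std::vector<unsigned>>& results,
                             const std::vector<std::string>& attributes) {
    UniqueAttributes u;
    std::map<unsigned, unsigned> slotOf; // vertex ID -> index into u.values
    for (const auto& result : results) {
        std::vector<unsigned> row;
        for (unsigned id : result) {
            assert(id < attributes.size());
            auto it = slotOf.find(id);
            if (it == slotOf.end()) {
                // First occurrence of this vertex: copy its attribute.
                unsigned slot = static_cast<unsigned>(u.values.size());
                it = slotOf.insert(std::make_pair(id, slot)).first;
                u.values.push_back(attributes[id]);
            }
            row.push_back(it->second);
        }
        u.slots.push_back(row);
    }
    return u;
}
```

For the results {2, 0} and {0, 2}, only two attributes are copied instead of four, and the index table records that the two results refer to them in opposite order.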
8 CONCLUSIONS
In this paper, a general pattern matching tool (GEMP)
is presented, based on our earlier studies and solutions.
The tool contains two main steps with optional
pre-processing steps. These two steps ensure that only the
necessary attributes are copied to the GPU device,
thus reducing time and memory consumption. With the
new architecture, the applied domain model is much
easier to exchange, and all kinds of domain
models can be used. Users can access and configure
the tool using interface-oriented techniques, which
makes GEMP user-friendly and easily testable.
As the pattern matching part of the tool is now
complete, we are going to focus on graph rewriting in
the future, and we also need to study how we can
integrate our tool into existing tools. Obviously,
managing graph rewriting effectively on the GPU
device is not an easy task (consistency, performance,
memory management issues), and we will face several
challenges during the work. Some of them are already
known, e.g. which part of the graph needs to be
copied to the GPU device, and how is the update
applied? By studying and solving these points, we can
achieve a GPU-based model transformation tool with
full functionality. This is our next step, besides
creating further case studies and applying more tests.
ACKNOWLEDGEMENTS
This work was partially supported by the TÁMOP-
4.2.1.D-15/1/KONV-2015-0008 project.
REFERENCES
CUDA website, viewed 30 October 2015.
www.nvidia.com.
E. Jakumeit, S. Buchwald, D. Wagelaar, L. Dan, A.
Hegedus, M. Herrmannsdorfer, T. Hornf, E. Kalnina, C.
Krause, K. Lano, M. Lepper, A. Rensink, L. Rose, S.
Watzoldt & S. Mazanek, 2014. A survey and
comparison of transformation tools based on the
transformation tool contest. Special issue on
Experimental Software Engineering in the Cloud.
G. Canfora, A. Cimitile, F. Garcia, M. Piattini & C. Aaron
Visaggio, 2006. Evaluating advantages of test driven
development: a controlled experiment with
professionals. ISESE '06 Proceedings of the 2006
ACM/IEEE international symposium on Empirical
software engineering, USA.
IMDb website, viewed 30 October 2015,
www.imdb.com/interfaces.
Khronos Group, viewed 30 October 2015,
www.khronos.org/opencl.
Q. Xu, H. Jeon & A. M., 2014. Graph processing on gpus:
Where are the bottlenecks?. Workload
Characterization (IISWC).
S. Che, B. M. Beckmann, S. K. Reinhardt, 2014. Belred:
Constructing gpgpu graph applications with software
building blocks. High Performance and Embedded
Computing (HPEC).
T. Fekete & G. Mezei, 2015. Creating a GPGPU-
accelerated framework for pattern matching using a
case study. Eurocon2015, Salamanca, Spain.
Veerasamy, Bala Dhandayuthapani & Nasira, G. M., 2014.
Exploring the contrast on GPGPU computing through
CUDA and OPENCL. Journal on Software
Engineering.
Viñas, Moisés; Fraguela, Basilio B.; Bozkus, Zeki &
Andrade, Diego, 2015. Improving OpenCL
Programmability with the Heterogeneous Programming
Library. Procedia Computer Science, International
Conference On Computational Science, ICCS.
Yan, Xin; Shi, Xiaohua; Wang, Lina & Yang, Haiyan,
2014. An OpenCL micro-benchmark suite for GPUs
and CPUs. Journal of Supercomputing.