CkTail: Model Learning of Communicating Systems
S
´
ebastien Salva and Elliott Blot
LIMOS - UMR CNRS 6158, Clermont Auvergne University, France
Keywords:
Reverse Engineering, Model Learning, Communicating Systems, IOLTS, Dependency Graphs.
Abstract:
Event logs are helpful to figure out what is happening in a system or to diagnose the causes that led to an unex-
pected crash or security issue. Unfortunately, their growing sizes and lacks of abstraction make them difficult
to interpret, especially when a system integrates several communicating components. This paper proposes
to learn models of communicating systems, e.g., Web service compositions, distributed applications, or IoT
systems, from their event logs in order to help engineers understand how they are functioning and diagnose
them. Our approach, called CkTail, generates one Input Output Labelled Transition System (IOLTS) for every
component participating in the communications and dependency graphs illustrating another viewpoint of the
system architecture. Compared to other model learning approaches, CkTail improves the precision of the gen-
erated models by better recognising sessions in event logs. Experimental results obtained from 9 case studies
show the effectiveness of CkTail to recover accurate and general models along with component dependency
graphs.
1 INTRODUCTION
Using event logs to debug systems in the short or long
term is an approach more and more considered in the
Industry. Logs strongly help investigate issues on pro-
duction environments, and usually increase the de-
velopers’ ability to handle and detect failures. But,
it is well-known that analysing log entries is often a
long and frustrating process as it usually requires to
cover very large files. Several approaches based upon
model learning propose to ease log analysis by build-
ing models that generalise the behaviours of systems
encoded in logs and by making them readable (Krka
et al., 2010; Ohmann et al., 2014; Salva and Blot,
2019; Salva and Durand, 2015; Pastore et al., 2017;
Biermann and Feldman, 1972; Mariani and Pastore,
2008; Beschastnikh et al., 2014). We consider in the
paper passive model learning approaches that infer a
specification by gathering and analysing system exe-
cutions and concisely summarising the frequent inter-
action patterns as state machines that capture the sys-
tem behaviour (Ammons et al., 2002). The obtained
models, even if partial, can serve many purposes, e.g.,
they can be used as documentation, examined by de-
signers to find bugs, or can be given to testing meth-
ods.
This paper focuses on the generation of models
of communicating systems from event logs although
some works already proposed solutions (Mariani and
Pastore, 2008; Beschastnikh et al., 2014; Salva and
Blot, 2019). The algorithm given in (Mariani and Pa-
store, 2008) segments event logs into execution traces
by considering two factors, component identification
and user identification, then models are built using
the kBehavior algorithm (Mariani et al., 2011). The
CSight tool (Beschastnikh et al., 2014) takes as in-
puts trace sets, which have to be manually built by
hands w.r.t. to several restrictions. Models are then
generated and refined by means of invariants. We
also proposed the tool Assess (Salva and Blot, 2019),
which is specialised in the model learning of compo-
nent based systems from which the component inter-
actions are not observable. Assess splits event logs
into traces by looking for longer time delays between
messages. Then, it tries to detect implicit component
calls in traces and builds models encoding these calls
with new synchronisation actions. We observed that
these previous approaches suffer from one main is-
sue related to session recovery in event logs. We call
a session a temporary message interchange among
components forming a behaviour of the whole sys-
tem from one of its initial states to one of its final
states. Recognising sessions in event logs helps ex-
tract “complete” traces and build more precise mod-
els. It is quite straightforward to recognise sessions
when a mechanism based on session identification is
used. Unfortunately, many kinds of systems do not
use such mechanisms.
Salva, S. and Blot, E.
CkTail: Model Learning of Communicating Systems.
DOI: 10.5220/0009327400270038
In Proceedings of the 15th International Conference on Evaluation of Novel Approaches to Software Engineering (ENASE 2020), pages 27-38
ISBN: 978-989-758-421-3
Copyright
c
2020 by SCITEPRESS Science and Technology Publications, Lda. All rights reserved
27
Our approach, called Communicating system kTail
shortened CkTail, builds more precise models of com-
municating systems by better recognising sessions in
event logs with respect to 4 properties: association
of responses with their related requests, time delays,
data dependency and component identification. To
design CkTail, we choose to extend the k-Tail tech-
nique (Biermann and Feldman, 1972) with the capa-
bility to build one model called Input Output Labelled
Transition System (IOLTS) for every component of
the system under learning. k-Tail is indeed well-know
to quickly and efficiently build generalised models
from traces. Furthermore, our approach also goes fur-
ther in model learning by proposing the generation of
dependency graphs. The latter show in a simple way
the directional dependencies observed among compo-
nents. We believe that this kind of graph completes
the behavioural models and will be helpful to evalu-
ate different kinds of model properties, e.g., testability
or security. In the paper, we define component depen-
dency over three expressions formulating these situa-
tions: direct component requests, nested requests and
data dependency. This definition aims to avoid the in-
ference of ambiguous dependency relations involving
one component to several potential components.
We performed an empirical evaluation based on
event logs collected from 9 case studies to assess the
precision of the models and of the dependency graphs.
We show that CkTail is more effective than the three
previously cited approaches.
The paper is organized as follows: we recall some
definitions about the IOLTS model in Section 2. Our
approach is presented in Section 3 with a motivating
example. The next section shows some results of our
experimentations. Section 5 discusses related work.
Section 6 summarises our contributions and draws
some perspectives for future work.
2 THE IOLTS MODEL
We express the behaviours of communicating compo-
nents with IOLTSs. This model is defined in terms
of states and transitions labelled by input or output
actions, taken from a general action set L, which ex-
presses what happens. τ is a special symbol encoding
an internal (silent) action; it is common to denote the
set L τ by L
τ
.
Definition 1 (IOLTS). An Input Output Labelled
Transition System (IOLTS) is a 4-tuple hQ,q0,Σ,→i
where:
Q is a finite set of states; q0 is the initial state;
Σ {τ} L
τ
is the finite set of actions, with τ
the internal (unobservable) action. Σ
I
Σ is
the countable set of input actions, Σ
O
Σ is the
countable set of output actions, with Σ
O
Σ
I
=
/
0;
→⊆ Q × Σ {τ} × Q is a finite set of transitions.
A transition (q,a,q
0
) is also denoted q
a
q
0
.
A trace is a finite sequence of observable actions in
L
. For sake of readability, we also write l l
0
when
l is a subsequence of the sequence l
0
; last(l) refers to
the last element of l. Furthermore, to better match
the functioning of communicating systems, we as-
sume that an action has the form a(α) with a a la-
bel and α an assignment of parameters in P, with
P the set of parameter assignments. For example,
switch( f rom := c
1
,to := c
2
,cmd := On) is made up
of the label ”switch” followed by a parameter assign-
ment expressing the components involved in the com-
munication and a parameter of the switch command.
3 THE CkTail APPROACH
Given an event log, collected by monitoring tools on a
system under learning denoted SUL, CkTail extracts
traces, generates one IOLTS for every component or
software of SUL involved in the communications and
builds its dependency graph. Here, an IOLTS aims
at generalising the behaviours encoded in event logs
with inputs and outputs showing the messages re-
ceived and sent by a component.
The ability of CkTail to infer models is dependent
on several realistic assumptions made on SUL:
A1 Event Log: the communications among the
components can be monitored by different tech-
niques, on components, on servers, by means of
wireless sniffers, etc. But, event logs have to be
collected in a synchronous environment made up
of synchronous communications. Furthermore,
the messages have to include timestamps given by
a global clock for ordering them. For simplicity,
we consider having one event log at the end of the
monitoring process;
A2 Message Content: components produce mes-
sages that include two parameter assignments of
the form f rom := c, to := c expressing the com-
ponent source and the destination of a message.
Other parameter assignments may be used to en-
code data. Besides, a message is either identified
as a request or a response. Many protocols al-
low to easily and automatically distinguish both
of them;
ENASE 2020 - 15th International Conference on Evaluation of Novel Approaches to Software Engineering
28
A3 Device Collaboration: components can run
in parallel and communicate with each other. To
recognise sessions of the system in event logs, we
assume that components follow this behaviour:
they cannot run multiple instances; requests are
processed by a component on a first-come, first
served basis. Besides, every response is associ-
ated to a request and vice-versa.
The last assumption helps process event logs with
the aim of extracting the messages of each compo-
nent into disjointed sequences of individual sessions.
It is possible to replace this assumption by the more
classical one stating that messages include an identi-
fier allowing to observe whole collaborations among
components. This last assumption strongly eases the
trace extraction from event logs. Unfortunately, we
have observed that session identification is seldom
used with communicating systems. Instead, we have
chosen to consider the assumption A3. But it is worth
noting that if a session mechanism is used in SUL, the
first step of CkTail may be adapted (simplified) to still
detect dependencies among components.
3.1 CkTail Overview
Figure 1 illustrates an overview of the three steps of
CkTail. Initially, the user gives as inputs an event log
along with regular expressions. The raw messages
of the event log are firstly parsed and analysed with
these expressions to structure them into actions of the
form a(α) with a a label and α some parameter as-
signments. Figure 1 shows a list of 12 actions de-
rived from raw HTTP messages. The action struc-
ture meets the previous assumptions: the assignments
of the variables f rom and to indicate the sources and
destinations of the messages. For sake of readability,
the labels directly show whether an action encodes
either a request or a response. In this example, the
other parameter assignments are data expressing com-
mands (e.g., param:=heating, cmd:=on), temperature
values (e.g., param:=udevice, svalue:=66) or compo-
nent states (e.g., param:=udevice, svalue:=open). At
this stage, we can observe that there are 5 compo-
nents. But interpreting their interactions and what
they can do is still tricky because of lack of readability
and generalisation.
The first step of CkTail is called Trace analysis.
It covers the action list derived from the event log
and aims at segmenting it into traces that capture ses-
sions. A trace is a subsequence of the action list that
meets some constraints formulated from the assump-
tions A1-A3. These constraints are detailed in Section
3.2 and summarised as follows. A response is always
associated to its related request in a trace (A3). Nested
requests (a request to a component that also performs
another request before giving a response) are always
kept together in a trace (A3). A trace gathers mes-
sages exchanged between components interacting to-
gether in a limited time delay (A1). And, a chain of
messages sharing the same data expresses a data de-
pendency among several components. This chain of
messages must be kept in the same trace (A2).
During this process, the component interactions
are also analysed for detecting component dependen-
cies. These dependencies are given under the form
of component lists c
1
c
2
.. .c
k
expressing that a com-
ponent c
1
depends on a component c
2
, which itself
depends on another component and so on. The set
Deps gathers these component lists. The example
of set Deps of Figure 1 captures several dependen-
cies. For instance, d
1
G is given by req1 as c
1
di-
rectly calls G. d
1
Gd
2
is given from the nested re-
quests req1req3, showing that d
1
calls G, which itself
calls d
2
before answering to d
1
. d
2
d
3
is a data de-
pendency observed between d
2
and d
3
related to the
shared value (svalue:=68) exchanged between the two
components.
The second step Dependency Graph Generation
takes back the set Deps, constructs Direct Acyclic
Graphs (DAGs) from the component lists and even-
tually computes the transitive closure of the DAGs.
These DAGs only capture the dependencies given in
Deps, i.e. two different lists c
1
c
2
and c
2
c
3
are not
associated to avoid the representation of false links
among components. The dependency graphs illus-
trated in Figure 1 (top right) show the dependent com-
ponents of every component of SUL. The compo-
nents required by some other ones can be retrieved
though. These graphs also show in our example that
the component G (a gateway) takes a central place as
it is required for d1 and d3, and it depends on d2 and
d4.
The third step Model Generation builds IOLTSs.
In the figure, the Step 3A Trace Partitioning begins by
preparing the set Traces(SUL). Every action is dou-
bled to give a pair output/input by separating the no-
tion of source/destination by means of the new param-
eter id assigned either to the source or to the destina-
tion of a message. During this process, Traces(SUL)
is partitioned into as many trace sets as components
found in SUL. Each trace set T
c
gathers only the
traces related to the component c. The Step 3B IOLTS
Generation transforms every set T
c
into an IOLTS by
converting traces into IOLTS path cycles, which are
joined on the initial state only. In our example, as
we have 5 trace sets, we obtain 5 IOLTSs. Finally,
the Step 3C IOLTS Reduction, applies the k-Tail algo-
rithm to the IOLTSs in order to merge their equiv-
CkTail: Model Learning of Communicating Systems
29
Figure 1: Approach overview.
alent states. With k-Tail, the equivalent states are
those having the same k-future, i.e. the same event
sequences having the maximum length k. In our ex-
ample, the states in white of the two IOLTSs L
2
and
L
3
are merged.
With these IOLTSs and the DAGs, it becomes eas-
ier to observe that SUL is made up of two sensors d
1
and d
3
connected to a gateway G. The first is a motion
detector providing states and the second gives tem-
perature values. The gateway controls two actuators,
here two heating systems d
2
, d
4
, with respect to the
values provided by the sensors. When the d
1
state is
on, the heating systems are turned on. When d
3
sends
a temperature upper to 68, the gateway turns off the
heating system d
2
only and forwards the temperature
to d
2
in the meantime.
The CkTail steps are now detailed in the next sec-
tions.
3.2 Trace Analysis
As stated previously, this first step covers an event log
to extract the trace set Traces(SUL) and the set Deps
of dependency lists. The messages of the event log
are firstly parsed and analysed with regular expres-
sions to retrieve actions of the form a(α). If the user
is not able to give these regular expressions, several
approaches and tools have been proposed to automat-
ENASE 2020 - 15th International Conference on Evaluation of Novel Approaches to Software Engineering
30
ically mine patterns from log files (Fu et al., 2009;
Makanju et al., 2012; Vaarandi and Pihelgas, 2015;
Messaoudi et al., 2018; Zhu et al., 2018). These pat-
terns may be used to quickly derive regular expres-
sions. The resulting actions have to meet the require-
ments given by the assumptions A1-A3. We define
the following notations to express some of these re-
quirements:
Definition 2. Let a(α) be an action in L
τ
.
f rom(a(α)) = c denotes the source of the mes-
sage,
to(a(α)) = c denotes the destination,
components(a(α)) = { f rom(a(α)),to(a(α))},
time(a(α)) denotes the timestamps,
data(a(α)) = α \ components(a(α)) \ time(a(
α)),
isReq(a(α)), isResp(a(α)) are boolean expres-
sions expressing the nature of the message.
Once we have the list of actions, which we denote S,
this step segments it into traces by trying to recover
sessions. As we assume the latter are not identified,
we propose an algorithm that detects them by follow-
ing the assumptions A1-A3. To devise this algorithm,
we thoroughly derived a complete list of constraints
from A1-A3. Some of them are given in Table 1.
Given an action a(α), some constraints forbids start-
ing a new trace from a(α): with C1, a response must
be kept with its associated request in the same trace;
similarly, with C2, nested requests must remain to-
gether in the same trace. On the contrary, other con-
straints imply to cut the action list. For instance, C3
implies that a new component not yet observed that
sends a new request, is starting a new session. We
hence consider that this request starts a new trace.
Other constraints, e.g., C4, reflect incoherent cases
that do not meet our assumptions. C5 is a special con-
straint expressing that a component, which has pre-
viously participated in the current session, can send
a new request. The choice of keeping this new re-
quest or not in the current session is not obvious. To
make this choice, we propose to consider two other
factors, i.e., time delay and data dependency, with the
constraints C5.1 and C5.2. The notion of data depen-
dency will be presented in Section 3.3.
To be used by our trace analysis algorithm, we
have formulated these constraints with the notations
of Definition 2 completed with the following sets
and boolean expressions. We denote Lreq the set
of lists of pending requests made in parallel at a
given time, i.e. lists of actions a
1
(α
1
). ..a
k
(α
k
)
with isReq(a
i
(α
i
))
1ik
for which responses have
not yet been received. KC denotes the list of
Table 1: Some constraints derived from the assumptions
A1-A3.
C1 A response a
i
(α
i
) is associated to a request observed previously in
the same session
C2 A component that receives a request a
i
(α
i
) can call other compo-
nents to produce a response in the same session.
C3 A component not yet encountered that has no pending request can
send a new request a
i
(α
i
) and starts a new session.
C4 A component that previously requested another component, cannot
send a new request a
i
(α
i
), it needs to wait for the response of the
first one.
C5 A component that has completed its request flow (received or sent
requests) in a session, can send a new request a
i
(α
i
) to another
component. C5.1: if the time delay between this request and the
previous request made by this component is short, the request be-
longs to the session. C5.2: if this request shares data with previous
actions, then the request belongs to the same session as these ac-
tions.
Table 2: Constraint formalisation. C1 C2 and C5 are the
constraints allowing an action to be kept in a current trace.
Constraint
C1 isResp(a
i
(α
i
)) and !l Lreq: f rom(a
i
(α
i
)) = to(last(l)) and
to(a
i
(α
i
)) = f rom(last(l))
C2 isReq(a
i
(α
i
)) and Lreq
0
= {l Lreq | f rom(a
i
(α
i
)) =
to(last(l))} 6=
/
0 and ¬pendingRequest( f rom(a
i
(α
i
)))
C5 isReq(a
i
(α
i
)) and f rom(a
i
(α
i
)) KC and l Lreq :
f rom(a
i
(α
i
)) 6= to(last(l)) and ¬pendingRequest( f rom(a
i
(α
i
)))
known components involved in the session so far.
pendingRequest(c) is the boolean expression (c
KC,l Lreq,a(α) l : c components(a(α))) that
evaluates whether the component c has sent (resp. re-
ceived) a request and has not yet received (resp. sent)
the response. From these notations, we have formu-
lated the above constraints, listed their boolean terms
and studied all their possible permutations. Table 2
lists the constraints expressing that an action a
i
(α
i
)
must be kept in the current trace when they hold.
Algorithms 1 and 2 implement the above con-
straints and segment an action list S into traces.
This recursive algorithm is initially called with
KeepSplit(S,S); it returns Traces(SUL) and the fi-
nal component set C, which differs from KC, the set
of known components in a session. If the action list S
starts with a response (without having any request be-
fore), it is deleted (lines 1-3). Otherwise, a new trace
t is started with a first request. The list of pending re-
quests Lreq is updated with the procedure Extend and
the set of known components KC is completed w.r.t.
this request. Then, every action a
i
(α
i
) of the list S is
covered to decide whether the action is kept in the cur-
rent trace (line 9). If the constraint C1 holds (receipt
of a response associated to a previous request), a
i
(α
i
)
is added to the trace t. The associated pending request
is removed from Lreq with the procedure Trim. If C2
CkTail: Model Learning of Communicating Systems
31
holds (receipt of a request that is nested to a previous
request in Lreq), a
i
(α
i
) is also added to the trace t.
To check whether C2 holds, a subset Lreq
0
of Lreq is
constructed: it is composed of the request sequences
ended by a request a(α) such that a(α)a
i
(α
i
) gives
form to nested requests. If Lreq
0
has several candi-
dates, i.e. request lists, we keep the list l ended by the
request having the earliest time-stamp (line 13). With
this condition, we comply with A3 (first-come, first
served basis). This list l is augmented with the current
request a
i
(α
i
) by calling the procedure Extend (line
14). The set of known components KC is updated.
Finally, when C5 holds (line 15), we evaluate the con-
straints C5.1 or C5.2 implemented by the procedures
Checktime and Checkdatadeps. If one of these pro-
cedures returns true, the request a
i
(α
i
) is added to the
trace t. The procedure Extend completes Lreq with
a new request list starting by a
i
(α
i
). The set KC is
still updated w.r.t. the request a
i
(α
i
). For any other
case, we put the action a
i
(α
i
) into a new action list t
2
(line 18). Once all the actions have been covered, the
trace t is added to Traces(SUL). If the action list t
2
is not empty, the algorithm is called again to split it
(line 23).
The procedure Checktime takes a trace t and a re-
quest a
i
(α
i
). Its purpose is to check whether the time
delay between this request and the previous one made
by the same component c is strongly lower than the
average time delay between two requests made by c.
This average time delay is denoted T (c). Its computa-
tion is made on the action list S. In short, we filter out
all the responses along with the requests not made by
the component c in S. Then, we compute the mean of
time delays between every request pairs. Checktime
also filters the trace t to keep the requests made by the
component c = f rom(a
i
(α
i
)). The procedure com-
putes the time delay between the request a
i
(α
i
) and
the previous one in the trace t. If this time delay is
lower to T (c), C5.1 holds, hence the procedure re-
turns true.
The procedure Checkdatadeps implements the
constraint C5.2 and checks whether a data depen-
dency exists in the trace t. The notion of dependency
among components is discussed in the next section.
3.3 Dependency Graph Generation
Before generating dependency graphs, we need to ex-
press what a component dependency is. Definition 3
relies on three expressions to formulate this notion.
The first illustrates that a component c
1
depends on
another component c
2
when c
1
queries c
2
with a re-
quest. The second expression deals with nested re-
quests: if we have successive nested requests of the
Algorithm 1: KeepSplit(S, S
2
).
input : Action sequences S, S
2
output: Traces(SUL), Component set C
1 S
2
= a
1
(α
1
). .. a
k
(α
k
);
2 while isResp(a
1
(α
1
)) do
3 KeepSplit(S,a
2
(α
2
). .. a
k
(α
k
));
4 END;
5 t = a
1
(α
1
); Extend(Lreq,ε,t);
6 t
2
= ε;
7 KC = components(a
1
(α
1
));
8 i=2;
9 while i k do
10 Case C1 : t = t.a
i
(α
i
); Trim(Lreq, l);
11 Case C2 :
12 t = t.a
i
(α
i
);
13 take l Lreq
0
such that
l
2
6= l Lreq
0
: time(last(l) time(last(l
2
);
14 Extend(Lreq,l,t);
15 Case C5 and (Checktime(a
i
(α
i
),t) or
Checkdatadeps(a
i
(α
i
),S,t)) :
16 t = t.a
i
(α
i
);
17 Extend(Lreq,ε,t);
18 Else : t
2
= t
2
.a
i
(α
i
);
19 KC = KCcomponents(a
i
(α
i
));
20 i++;
21 Traces(SUL) = Traces(SUL) {t};
22 C = C KC;
23 if t
2
6= ε then
24 KeepSplit(S,t
2
);
form req1( f rom := c
1
,to := c)req2( f rom := c,to :=
c
2
), we define that c
1
depends on c, which itself de-
pends on c
2
and so on. The last expression refers to
data dependency. We say that c
1
depends on c
2
when
c
2
has sent a message a
1
(α
1
) with some data, if there
is a unique chain of messages a
1
(α
1
). ..a
k
(α
k
) from
c
2
sharing this data and if a
k
(α
k
) is a request whose
destination is c
1
.
Some dependency cases, or patterns, might have
been forgotten in this definition. For instance, when
there are two or more chains of messages sharing the
same data, addressed to the same component c, we ob-
serve that there is a data dependency among compo-
nents but we are unable to decide which dependency
relation is correct. Because of this ambiguity, none of
them is kept.
Definition 3 (Component Dependency).
Let c
1
,c
2
C, c
1
6= c
2
, and S an action list. We write
c
1
depends on c
2
iff:
1. c
1
calls c
2
: a(α) S : isReq(a(α)),
f rom(a(α)) = c
1
,to(a(α)) = c
2
;
2. there is a nested request between c
1
and
c
2
: a
1
(α
1
),a
2
(α
2
) S, a(α) S:
isReq(a
1
(α
1
))
(i=1,2)
, isResp(a(α)), f rom(a
1
(
α
1
)) = c
1
, to(a
1
(α
1
)) = f rom(a
2
(α
2
)),
ENASE 2020 - 15th International Conference on Evaluation of Novel Approaches to Software Engineering
32
to(a
2
(α
2
)) = c
2
, to(a(α)) = c
1
,
¬(a
1
(α
1
)a(α)a
2
(α
2
) S);
3. there is a longest chain of messages from c
2
ended by a request to c
1
sharing the same data α:
l = a
1
(α
1
)a
2
(α
2
). ..a
k
(α
k
) S : DS(l,c
1
,c
2
,α)
and l
0
= a
0
1
(α
0
1
)a
0
2
(α
0
2
). ..a
k
(α
k
)
S : DS(l
0
,c
1
,c
2
,α) and l
0
l, with
DS(a
1
(α
1
). ..a
k
(α
k
)c
1
,c
2
,α) the boolean ex-
pression f rom(a
1
(α
1
)) = c
2
to(a
k
(α
k
)) = c
1
isReq(a
k
(α
k
)) to(a
i
(α
i
)) = f rom(a
i+1
(α
i+1
)
)
1i<k)
α
\
(1ik)
α
i
.
The set Deps gathers component dependencies un-
der the form of component lists c
1
.. .c
k
. Compo-
nent dependencies are detected while Algorithm 1
builds traces by means of the procedures Extend and
Checkdatadeps. The procedure Extend detects the
two first component dependency cases of Definition
3. It uses the set of pending requests Lreq to com-
plete the set Deps. Indeed, Algorithm 1 builds Lreq
in such a way that a sequence of Lreq is either a re-
quest (Case C5) or a list of nested pending requests
(Case C2). The procedure covers the component se-
quences lc = c
1
.. .c
k
c
k+1
of Lreq and adds the depen-
dency lists in Deps (line 8).
The procedure Checkdatadeps(a
i
(α
i
),S,t)
checks whether the last expression of Definition
3 holds. If there is a unique chain of mes-
sages a
1
(α
1
). ..(a
i
(α
i
) sharing the same data
α data(a
i
(α
i
)) and finished by the request a
i
(α
i
)
then the dependency to(a
i
(α
i
)). f rom(a
1
(α
1
)) is
added to Deps (line 15). If this chain of messages is
a subsequence of the current trace t.a
i
(α
i
), then the
constraint C5.2 is satisfied. As a consequence, the
procedure also returns true to Algorithm 1 to indicate
that this request must be kept in the current trace.
Algorithm 3 can now generate dependency graphs
from the set Deps. It partitions Deps to group the de-
pendency lists starting by the same component in the
same subset. This partitioning is performed by the
equivalence relation
c
on C
given by l
1
,l
2
Deps,
with l
1
= c
1
.. .c
k
, l
2
= c
0
1
.. .c
0
k
, l
1
c
l
2
iff c
1
= c
0
1
.
Given a partition C
i
, and a component list l C
i
, Al-
gorithm 3 builds a path of the DAG Dg
i
such that the
n
th
state is labelled by the n
th
component of l. Algo-
rithm 3 finally computes the transitive closure of the
DAGs to make all component dependencies visible.
3.4 Model Generation
This last step, implemented by Algorithm 4, generates
an IOLTS for every component previously encoun-
tered and stored in C. To build IOLTSs, the traces of
Algorithm 2.
1 Procedure Trim(Lreq, l) is
2 l
0
= remove(last(l));
3 Lreq = Lreq \ {l} {l
0
};
4 Procedure Extend(Lreq,l,t) is
5 l
0
= l.last(t) = a
1
(α
1
). .. a
k
(α
k
);
6 Lreq = Lreq \ {l} {l
0
};
7 //Compute component dependencies
8 lc = c
1
.. .c
k
c
k+1
such that c
i
= f rom(a
i
(α
i
))
(1 jk)
,
c
k+1
= to(a
k
(α
k
));
9 Deps = Deps {lc};
10 Procedure Checktime(a
i
(α
i
),t) is
11 t
2
= t\{a(α) | f rom(a
i
(α
i
)) 6= f rom(a(α)) or
isResp(a(α))};
12 return
(time(a
i
(α
i
)) time(last(t
2
))) << T( f rom(a
i
(α
i
))));
13 Procedure Checkdatadeps(a
i
(α
i
),S,t) is
14 if α data(a
i
(α
i
)), l = a
1
(α
1
)a
2
(α
2
). .. a
i
(α
i
) S :
DS(l,to(a
i
(α
i
)), f rom(a
1
(α
1
)),α) and
l
0
= a
0
1
(α
0
1
)a
0
2
(α
0
2
). .. a
i
(α
i
) S :
DS(l
0
,to(a
i
(α
i
)), f rom(a
1
(α
1
)),α) and l
0
l then
15 Deps = Deps {to(a
i
(α
i
)). f rom(a
1
(α
1
))};
16 if l t.a
i
(α
i
)) then
17 return true;
18 else
19 return false;
Algorithm 3: Device Dependency Graphs Generation.
input : Deps
output: Dependency graph set DG
1 foreach C
i
Deps/
c
do
2 foreach c
1
c
2
.. .c
k
C
i
do
3 add the path s
c
1
s
c
2
.. .s
c
k1
s
c
k
to Dg
i
;
4 Dg
0
i
is the transitive closure of Dg
i
;
5 DG = DG {Dg
0
i
};
Traces(SUL) are transformed to integrate the notions
of input and output. Given a trace a
1
(α
1
). ..a
k
(α
k
),
every action is doubled by separating the component
source and the destination of the message. For an
action a
i
(α
i
), the algorithm writes a new trace com-
posed of the output !a
i
(α
i1
) sent by the source of the
message, followed by the input ?a
i
(α
i2
) received by
the destination (line 3). The source and the destina-
tion are now separated by a new assignment on the
parameter id. Next, these traces are segmented into
subsequences in such a way that a subsequence cap-
tures the behaviour of one component only. Subse-
quences are also grouped by component in trace sets
denoted T
c
with c a component of C (lines 4,5).
These trace sets are now lifted to the level
of IOLTSs. Given the trace set T
c
, a trace t =
a
1
(α
1
). ..a
k
(α
k
) T is transformed into the path
CkTail: Model Learning of Communicating Systems
33
q0
a
1
(α
1
)
q
1
.. .q
k1
(a
k
(α
k
)
q0 such that the states
q
1
.. .q
k1
are new states. These paths are joined on
the state q0 to build the IOLTS L
c
:
Definition 4 (IOLTS Generation). Let T
c
= {t
1
,. ..,
t
n
} be a trace set. L
c
= hQ,q0, Σ,→i is the IOLTS
derived from T
c
where:
q0 is the initial state.
Q,Σ, are defined by the following rule:
t
i
=a
1
(α
1
)...a
k
(α
k
)
q0
a
1
(α
1
)
q
i1
...q
ik1
a
k
(α
k
)
q0
Finally, Algorithm 4 applies the k-Tail algorithm to
generalise and reduce the IOLTSs by merging the
equivalent states having the same k-future. We use
k = 2 as recommended in (Lorenzoli et al., 2008; Lo
et al., 2012).
Algorithm 4: IOLTS Generation.
input : Traces(SUL)
output: LTSs L
c
1
.. .L
c
k
1 T = {};
2 foreach t = a
1
(α
1
). .. a
k
(α
k
) Traces(SUL)) do
3 t
0
=!a
1
(α
11
)?a
1
(α
12
). ..!a
k
(α
k1
)?a
k
(α
k2
) such that
α
i1
= α
i
{id := f rom(a
i
(α
i
))},
α
i2
= α
i
{id := to(a
i
(α
i
))}(1 i k);
4 foreach c C do
5 T
c
= T
c
{t
0
\ {a(α) t
0
| id := c / α}
6 foreach T
c
with c C do
7 Generate the LTS L
c
from T
c
;
8 Merge the equivalent states of L
c
with k-Tail(k = 2,L
c
);
4 PRELIMINARY EVALUATION
Our approach is implemented in Java and is released
as open source at https://github.com/Elblot/CkTail.
With this implementation, we conducted some exper-
iments in order to provide answers for the following
questions:
RQ1: Can CkTail derive models that capture cor-
rect behaviours of SUL? RQ1 studies how the
models accept valid traces, i.e. traces extracted
from event logs but not used for the model gener-
ation, compared to the model learning tools spe-
cialised for communicating systems;
RQ2: Can CkTail build models that reject abnor-
mal behaviours? This time, RQ2 investigates how
the models accept invalid traces;
RQ3: Is CkTail able to detect accurate dependen-
cies among components?
4.1 Empirical Setup
RQ1 and RQ2 evaluate the precision of the models
generated by CkTail. We here compare CkTail with
3 other tools, CSight, Assess and the tool suite pro-
posed in (Mariani and Pastore, 2008) based upon the
tool kbehavior, which we denote LFkbehavior. To
perform an unbiased evaluation, we need to take into
consideration how these tools work. Assess is spe-
cialised to the model learning of component-based
systems systems from event logs in which the inter-
actions are not observable. Assess segments an event
log into traces every time it detects longer time delays
between messages. Then, it builds models integrat-
ing special actions expressing component calls and
proposes two strategies to synchronise these models
called Loosely-coupling and Decoupling. The latter
returns more general models than the former by al-
lowing the call of any component any time. We con-
sider both strategies in this evaluation. The approach
given in (Mariani and Pastore, 2008) can infer mod-
els of communicating systems by segmenting an event
log for each component of the system under learning
and by applying kbehavior on the trace sets to gener-
ate models. Finally, CSight only takes trace sets, one
set for each component.
The study has hence been conducted on several
use cases with several configurations. Firstly, from a
set of 7 devices (3 sensors, 2 gateways, 2 actuators),
we built 6 IoT systems denoted Con f 1 to Con f 6, by
varying the number of devices and the behaviours of
the gateway(s) after the receipt of data from the sen-
sors. One or two gateways can be used , themselves
interconnected to at least two sensors and one actu-
ator. In each configuration, the components follow
a known dependency scheme, which will be used to
evaluate RQ3. We also extracted a log from another
IoT system denoted Con f 7, where 8 sensors send data
to a gateway. We were unable to build models from
any of these configurations with CSight after 5 hours
of computation, which was our limit for each exper-
iment. We observed that the first steps of CSight
were achieved, but these were always followed by
successive time-outs while the model refinement step.
Therefore, to compare CSight and CkTail, we took
back two trace sets available with the CSight imple-
mentation. The first one was extracted from TCP logs,
and the second one from logs of the AlternatingBit
protocol. As we have trace sets instead of logs with
these two use cases, we only compare CSight with the
third step of CkTail.
ENASE 2020 - 15th International Conference on Evaluation of Novel Approaches to Software Engineering
34
Figure 2: Percentage of valid traces accepted by the models
generated with CkTail, Assess, Lfkbehavior.
4.2 RQ1: Can CkTail Derive Models
that Capture Correct Behaviours of
SUL?
To measure model correctness, we used a cross vali-
dation process, which partitions the event logs of ev-
ery configuration Conf1 to 7 in one training set for
the model generation and one testing set. We limited
the process to one round, as segmenting an event log
cannot arbitrarily be done: we choose to separate an
event log into two parts with an approximative ratio
of 70% and 30%, taking care not to separate actions
that belong to the same session thanks to our knowl-
edge of the configurations. To avoid any bias, model
correctness is evaluated by generating valid trace sets
from the testing set by considering 3 ways used by the
3 tools. A first trace set is generated by calling Algo-
rithm 1. A second trace set is obtained by splitting
the event log every time a long time delay is detected
among the messages (as in Assess). The last trace set
is achieved by extracting the messages of the event
log that share the same component identification (as
in LFkbehavior). We obtained around 65 to 285 valid
traces for Conf 1 to 7.
Results: Figure 2 shows the percentage of valid
traces accepted for each configuration and tool. The
bar-diagram firstly shows that the two configurations
TCP and AlternatingBit are not sufficient to effi-
ciently compare CkTail and CSight as the models
given by both tools accept all the valid traces. But
these have few states and the valid trace sets are small.
As stated earlier, we were unable to apply CSight on
larger trace sets. With the other configurations, the
models that accept the most of valid traces are always
those generated by CkTail. These models accept an
average of 91.90% of valid traces. The models built
by Assess tend to have close results as they accept
83.61% of the valid traces with the Loosely-coupling
strategy and 86.42% with the Decoupling strategy.
Figure 3: Percentage of invalid traces accepted for each
method and configuration.
The models given by LFkbehavior accept an average
of 37.48% of valid traces in our experiments.
4.3 RQ2: Can CkTail Build Models that
Accept Abnormal Behaviours?
We evaluated how the models generated previously
accept invalid traces. The latter were generated by
applying these mutations on the valid traces used in
RQ1: repetition of actions, inversion of a request
with its associated response, and permutation of one
request in a sequence of nested requests. We built
about 100 invalid traces for the configurations Conf1
to Conf7, and 20 invalid traces for TCP and Alternat-
ingBit. Then, we measured the proportions of invalid
traces accepted by the models inferred from the same
configurations and training sets used for RQ1.
Results: Figure 3 shows the proportions of invalid
traces accepted by the models given by each tool in
each configuration. The bar-diagram reveals that only
the models produced by Assess accept invalid be-
haviours. The Decoupling strategy of Assess gives
the highest percentages (up to 80%). The main reason
behind these results is that Assess is not designed to
learn models from message exchanged among com-
ponents. After studying the models, we indeed ob-
served that many request/response pairs are separated
into different models; as the Assess strategies allow
repetitive component calls, the models accept succes-
sive requests or responses. It results that these models
accept many invalid traces.
The results given with RQ1 and RQ2 tend to show
that the models produced by CkTail offer the best pre-
cision: not only they accept the highest ratio of valid
traces, but also reject all the invalid ones (as CSight
and LFkbehavior).
CkTail: Model Learning of Communicating Systems
35
4.4 RQ3: Is CkTail Able to Detect
Accurate Dependencies among
Components?
As CkTail is the only model learning tool able to infer
dependency graphs, we chose to evaluate this feature
by comparing the DAGs returned by CkTail to the de-
pendency graphs we manually built with our knowl-
edge of the systems under learning. For this study, we
took back the configurations for which we have event
logs. We listed the component dependency relations
found by CkTail and studied them to also provide the
incorrect dependencies.
Table 3: Recall and Precision of CkTail to detect compo-
nent dependencies; Recall is the percentage of the real de-
pendencies that are detected; Precision is the percentage of
detected dependencies that are correct.
Recall
of the dependencies
found by CkTail
Precision
of the dependencies
found by CkTail
Con f 1 100% 100%
Con f 2 61.5% 100%
Con f 3 83% 100%
Con f 4 73% 100%
Con f 5 64% 100%
Con f 6 80% 100%
Con f 7 100% 100%
Results: Table 3 shows the recall and precision of
the component dependencies detected by CkTail. On
average, CkTail finds 80.27% of the real dependencies
and all of them are correct. The missing dependencies
(recall below 100%) stem from Definition 3, which
strictly specifies what a dependency is in our context,
but also restricts the possible dependency patterns to
avoid ambiguity. We indeed observed in these exper-
iments that CkTail was able to find several chains of
messages sharing the same data addressed to the same
component c at the same time, meaning there are sev-
eral components that potentially have a dependency
with c. But, as we are unable to state which compo-
nent relationship is right, our algorithm leaved them
aside.
5 RELATED WORK
We observed in the literature that few approaches
were proposed to learn models from communicating
systems. Groz et al. introduced in (Groz et al., 2008)
an algorithm to generate a controllable approximation
of components through active testing. This kind of ac-
tive technique implies that the system is testable and
can be queried. The learning of the components is
done in isolation. A recent work lifts this constraint
by testing a system with unknown components by
means of a SAT solving method (Petrenko and Avel-
laneda, 2019). In contrast, our approach is passive,
and only learns models from logs. Requirements are
hence quite different.
Mariani et al. proposed in (Mariani and Pastore,
2008) an automatic detection of failures in log files by
means of model learning. Their approach segments
an event log with two strategies: per component or
per user. The former generates one model for each
component. Our evaluation shows that the trace seg-
mentation algorithm of CkTail, which relies on more
properties, improves the model precision.
CSight (Beschastnikh et al., 2014) is another tool
specialised in the model learning of communicat-
ing systems, where components exchange messages
through synchronous channels. It is assumed that
both the channels and components are known. Be-
sides, CSight requires specific trace sets, which are
segmented with one subset by component. CSight
follows five stages: 1) log parsing and mining of in-
variants 2) generation of a concrete FSM that captures
the functioning of the whole system by recomposing
the traces of the components; 3) generation of a more
concise abstract FSM; 4) model refinement with in-
variants that must hold in models, and 5) generation
of Communicating FSM (CFSM). With CkTail, we
do not assume that the trace sets are already prepared.
We are given an event log from which CkTail tries
to detect sessions. Regarding the model generation,
CSight allows the use of internal actions (not used
in the communications). The current requirements
of CkTail prevent from considering internal actions
to build models, but we believe that this feature could
be added to CkTail in a future work. In theory, CSight
should yield more precise models than those given by
CkTail because CFMS are refined by checking the sat-
isfiability of invariant. But, in practice, we observed
that invariant mining and satisfiability checking are
both costly and prevent CSight from taking as input
medium to large trace sets.
Compared to the previous approaches, CkTail also
has the capability of detecting component dependen-
cies and expresses them with DAGs.
Prior to this paper, we proposed in (Salva and
Blot, 2019) the approach and tool called Assess. Its
assumptions are different from those required with
CkTail or CSight. The main difference lies in the fact
that the communications among components are as-
sumed hidden (not available in event logs). Therefore,
Assess tries to detect implicit calls of components and
adds new synchronisation actions in the models to ex-
ENASE 2020 - 15th International Conference on Evaluation of Novel Approaches to Software Engineering
36
press them. Its algorithm is hence specific to this
assumption. We compared CkTail with Assess and,
as expected, we showed that Assess builds imprecise
models when event logs include communications.
6 CONCLUSION
This paper has proposed CkTail, an approach that
learns models of communicating systems from event
logs. Our algorithm improves the model precision
by integrating the identification of dependency rela-
tions among components and by better detecting ses-
sions in event logs to extract traces. Unlike CSight,
which targets the same kind of systems, CkTail re-
quires as inputs one event log only. Then, it builds ex-
ecution traces while trying to recognise complete ses-
sions with respect to 4 constraints, whereas the other
approaches rely on one or two rules for the trace seg-
mentation. The constraints used by CkTail are specif-
ically related to communicating systems and restrict
the trace generation w.r.t. the association of request-
s/responses, time delay, data dependency, component
identification. Besides, CkTail infers DAGs show-
ing the component dependencies. They offer another
viewpoint of the component interactions and system
architecture, and they may be used to different pur-
poses, e.g., testability measurement, or security anal-
ysis.
As future work, we firstly plan to evaluate CkTail
on further kinds of systems, e.g., Web service com-
positions. The trace analysis step relies upon some
assumptions for finding sessions in event logs when
these are not identified by means of a session mech-
anism. But, if sessions are clearly identified in mes-
sages, these assumptions can be relaxed and the algo-
rithm reduced. We will investigate this possibility in a
future work to redesign the first step of CkTail so that
it also supports session identification.
ACKNOWLEDGEMENT
Research supported by the French Project
VASOC (Auvergne-Rh
ˆ
one-Alpes Region):
https://vasoc.limos.fr/
REFERENCES
Ammons, G., Bod
´
ık, R., and Larus, J. R. (2002). Mining
specifications. SIGPLAN Not., 37(1):4–16.
Beschastnikh, I., Brun, Y., Ernst, M. D., and Krishna-
murthy, A. (2014). Inferring models of concurrent
systems from logs of their behavior with csight. In
Proceedings of the 36th International Conference on
Software Engineering, ICSE 2014, pages 468–479,
New York, NY, USA. ACM.
Biermann, A. and Feldman, J. (1972). On the synthesis of
finite-state machines from samples of their behavior.
Computers, IEEE Transactions on, C-21(6):592–597.
Fu, Q., Lou, J.-G., Wang, Y., and Li, J. (2009). Execu-
tion anomaly detection in distributed systems through
unstructured log analysis. 2009 Ninth IEEE Interna-
tional Conference on Data Mining, pages 149–158.
Groz, R., Li, K., Petrenko, A., and Shahbaz, M. (2008).
Modular system verification by inference, testing and
reachability analysis. In Suzuki, K., Higashino, T.,
Ulrich, A., and Hasegawa, T., editors, Testing of Soft-
ware and Communicating Systems, pages 216–233,
Berlin, Heidelberg. Springer Berlin Heidelberg.
Krka, I., Brun, Y., Popescu, D., Garcia, J., and Medvi-
dovic, N. (2010). Using dynamic execution traces and
program invariants to enhance behavioral model infer-
ence. In Proceedings of the 32Nd ACM/IEEE Interna-
tional Conference on Software Engineering - Volume
2, ICSE ’10, pages 179–182, New York, NY, USA.
ACM.
Lo, D., Mariani, L., and Santoro, M. (2012). Learning ex-
tended fsa from software: An empirical assessment.
Journal of Systems and Software, 85(9):2063 2076.
Selected papers from the 2011 Joint Working IEEE/I-
FIP Conference on Software Architecture (WICSA
2011).
Lorenzoli, D., Mariani, L., and Pezz
`
e, M. (2008). Auto-
matic generation of software behavioral models. In
Proceedings of the 30th International Conference on
Software Engineering, ICSE’08, pages 501–510, New
York, NY, USA. ACM.
Makanju, A., Zincir-Heywood, A. N., and Milios, E. E.
(2012). A lightweight algorithm for message type ex-
traction in system application logs. IEEE Transactions
on Knowledge and Data Engineering, 24(11):1921–
1936.
Mariani, L. and Pastore, F. (2008). Automated identification
of failure causes in system logs. In Software Reliabil-
ity Engineering, 2008. ISSRE 2008. 19th International
Symposium on, pages 117–126.
Mariani, L., Pastore, F., and Pezze, M. (2011). Dy-
namic analysis for diagnosing integration faults. IEEE
Transactions on Software Engineering, 37(4):486–
508.
Messaoudi, S., Panichella, A., Bianculli, D., Briand, L., and
Sasnauskas, R. (2018). A search-based approach for
accurate identification of log message formats. In Pro-
ceedings of the 26th Conference on Program Compre-
hension, ICPC ’18, pages 167–177, New York, NY,
USA. ACM.
Ohmann, T., Herzberg, M., Fiss, S., Halbert, A., Palyart,
M., Beschastnikh, I., and Brun, Y. (2014). Behavioral
resource-aware model inference. In Proceedings of
the 29th ACM/IEEE International Conference on Au-
tomated Software Engineering, ASE ’14, pages 19–
30, New York, NY, USA. ACM.
Pastore, F., Micucci, D., and Mariani, L. (2017). Timed k-
tail: Automatic inference of timed automata. In 2017
CkTail: Model Learning of Communicating Systems
37
IEEE International Conference on Software Testing,
Verification and Validation (ICST), pages 401–411.
Petrenko, A. and Avellaneda, F. (2019). Learning commu-
nicating state machines. In Tests and Proofs - 13th
International Conference, TAP 2019, Held as Part of
the Third World Congress on Formal Methods 2019,
Porto, Portugal, October 9-11, 2019, Proceedings,
pages 112–128.
Salva, S. and Blot, E. (2019). Reverse engineering be-
havioural models of iot devices. In 31st International
Conference on Software Engineering & Knowledge
Engineering (SEKE), Lisbon, Portugal.
Salva, S. and Durand, W. (2015). Autofunk, a fast and
scalable framework for building formal models from
production systems. In Proceedings of the 9th ACM
International Conference on Distributed Event-Based
Systems, DEBS ’15, pages 193–204, New York, NY,
USA. ACM.
Vaarandi, R. and Pihelgas, M. (2015). Logcluster - a data
clustering and pattern mining algorithm for event logs.
In 2015 11th International Conference on Network
and Service Management (CNSM), pages 1–7.
Zhu, J., He, S., Liu, J., He, P., Xie, Q., Zheng, Z., and Lyu,
M. R. (2018). Tools and benchmarks for automated
log parsing. CoRR, abs/1811.03509.
ENASE 2020 - 15th International Conference on Evaluation of Novel Approaches to Software Engineering
38