Behavior Modeling of a Distributed Application for Anomaly Detection
Amanda Viescinski¹ (https://orcid.org/0000-0002-9125-1598), Tiago Heinrich¹ (https://orcid.org/0000-0002-8017-1293), Newton C. Will² (https://orcid.org/0000-0003-2976-4533) and Carlos Maziero¹ (https://orcid.org/0000-0003-2592-3664)
¹Federal University of Paraná, Curitiba, Brazil
²Federal University of Technology - Paraná, Dois Vizinhos, Brazil
Keywords: Intrusion Detection, Distributed Computing, Security.
Abstract: Computational clouds offer services in different formats, aiming to adapt to the needs of each client. Such distributed systems rely on the communication and management of services and tools through the exchange of messages, so security in these environments is an important factor. However, implementing secure systems to protect information remains a difficult goal to achieve. In addition to prevention mechanisms, a common approach to achieve security is intrusion detection, which can be carried out by anomaly detection. This technique does not require prior knowledge of attack patterns, since the normal behavior of the monitored environment is used as a basis for detection. This work proposes a behavioral modeling technique for distributed applications using the traces of operations of their nodes, allowing the development of a strategy to identify anomalies. The chosen strategy consists of modeling the normal behavior of the system, which is arranged in sets of n-grams of events. Our goal is to build functional and effective models that make it possible to detect anomalies in the system with reduced false positive rates. The results obtained through the evaluation of the models highlight the feasibility of using n-grams to represent the correct activities of a system, with favorable results in terms of false positive rate and accuracy.
1 INTRODUCTION
Distributed architectures are responsible for performing communication among services, processes, and applications, in order to carry out different operations that can be shared among different users and services. This type of communication requires data sharing between operations in distinct networks, environments, or even infrastructures. Consequently, data will travel through several networks, some of which may be unknown, raising security concerns and making it important to guarantee the confidentiality, integrity, and availability of these data.
One of the strategies used to achieve these guarantees is intrusion detection, which aims to monitor system components, such as the network (Mishra et al., 2017; Borkar et al., 2017; Jose et al., 2018; Khraisat et al., 2019) and local storage and processes (Lanoë et al., 2019), making it possible to identify abnormal operations and incidents. Often, this monitoring is carried out by an Intrusion Detection System (IDS), which can
consist of software and/or hardware deployed at different points of the system (Benmoussa et al., 2015).
The main goal of an IDS is to distinguish between
benign and malicious behavior. Thus, these systems
are based on pattern recognition. The recognition approaches most adopted by IDSs are: signature detection, which is based on the identification of standard threat strategies; and anomaly-based detection, in which the normal behavior of the monitored system is used as a basis for detection.
In the anomaly-based detection approach, models
are constructed from the observation of legitimate or
expected behaviors of the system components (for ex-
ample, users, network connections, or applications).
In this way, a behavioral model is used as a reference
during the monitoring of the system, in order to enable
the identification of unknown patterns (Callegari et al.,
2016).
This identification procedure may rely on the use of a global clock to verify the sequence of events in the system. In a distributed system, however, ordering events is a complex task; as a result, existing approaches often adopt solutions that do not require event synchronization, since the identification process performed by an IDS targets a single host, which
only notifies a higher-level entity if an anomaly is found (Totel et al., 2016).
A frequent issue in the behavioral modeling of
systems is related to the volume of data to be manipu-
lated. When monitoring even small software systems,
an enormous amount of data is produced (Lorenzoli
et al., 2006). Therefore, dealing with the collected
data in such a way as to generate enough information
to identify the normal behaviors of a system may be
challenging. In this sense, one of the main goals is to properly extract, store, and interpret meaningful data.
The contributions usually found in the literature
focus on the construction of representative models of a
distributed system using automata. This technique con-
sists of modeling known behaviors by states and tran-
sitions representing possible events and state changes
in the system. Aiming at the identification of anoma-
lies and intrusion detection, this paper proposes a new
technique for the construction of behavioral models of
distributed applications through the operation traces of
their nodes. The partial models obtained are arranged
in sets of n-grams of events and combined to obtain more generic, yet representative, models.
Thus, the contributions of the present work are:
- the representation of normal behaviors of a distributed system through n-grams of events;
- the construction of models that effectively represent normal behaviors;
- a strategy to assess the ability to detect anomalous behavior using the generated models;
- a validation of the model using real-world data (extracted from a distributed application); and
- a detailed assessment of the feasibility of the proposed model.
The remainder of this paper is structured as follows:
Section 2 presents works that have been carried out in
the context of this research; Section 3 describes our
proposal; Section 4 presents the experimental evalua-
tion of the proposed solution; and Section 5 concludes
the paper and discusses future work.
2 RELATED WORK
Intrusion detection in distributed applications through
the identification of abnormal behaviors is widely stud-
ied in the literature, using mainly automata. In this
context, the identification of anomalous subsequences
in sequences of application events based on the Common Object Request Broker Architecture (CORBA) is explored by Stillerman et al. (1999). Text
messages in log files are used by Fu et al. (2009) in the construction of automata that represent the normal flow of each component of the system and a Gaussian model that checks the execution time between state transitions, allowing the detection of anomalies related to reduced bandwidth and delays between state transitions.
Automata and temporal properties are combined by
Totel et al. (2016) to model the possible states that the
system can assume and the sequence in which these
states occur, creating a lattice that identifies new se-
quences that were not perceived with the ordering of
events. These sequences are then joined and generalized; as the number of traces used to create the automata increases, more states are learned, but fewer temporal properties are met. The work is extended by Lanoë et al. (2019), seeking to increase the efficiency of the automaton construction by using the most significant sequences in the lattice, rather than the whole lattice, and by generating several intermediate models that are generalized repeatedly until a final, stable model is obtained.
Automata are also combined with n-grams for the detection of anomalies in Web systems. An n-gram is a contiguous sequence of n items from a given sample, typically text; n-grams are used in various areas of computer science, mainly in natural language processing, sequence analysis, and data compression (Broder et al., 1997). In the approach presented by Jiang et al. (2006), n-grams represent the contiguous subsequences of requests between the components of a trace, and automata are used to connect the n-grams to characterize those traces. A technique based on the payload of network packets, capable of detecting content-based attacks carried out in application protocols such as Hypertext Transfer Protocol (HTTP) and File Transfer Protocol (FTP), is presented by Angiulli et al. (2015), with the n-grams generated from sliding windows and applied to learn frequent byte sequences from each block. The authors conclude that a low value of n makes it difficult to recognize attacks, while a high value of n makes it easier to identify an anomaly but also produces more false positives. Other proposals that use n-grams to detect anomalies in Web systems are described by Zolotukhin et al. (2012) and Vartouni et al. (2018).
In addition, Wressnegger et al. (2013) seek to identify environments where anomaly detection is more favorable than classifying n-grams as benign and malicious, or vice versa. The study discusses types of data that can be represented in n-grams, such as text and binary network protocols, as well as traces of system calls, evaluating the criteria cited for each type of data.
Anomaly detection techniques in distributed appli-
cations are strongly focused on automata and temporal
properties, with the use of n-grams for this purpose being explored only in Web applications and application protocols. These works present satisfactory results and demonstrate the feasibility of using n-grams to model systems for anomaly-based intrusion detection. However, they do not show whether their solutions are applicable to distributed systems and do not consider the ordering of the information present in the dataset. Consequently, the relationship between the sequences of states of the system is also ignored, and the distributed characteristics of the system are not considered by these works. Thus, the purpose of the present work is to explore this gap in the current literature, applying n-grams to the construction of models for distributed applications, using system event logs as a basis for modeling, which is not covered by the presented related work.
3 ANOMALY DETECTION BASED ON n-GRAMS
Intrusion Detection Systems (IDS) have been widely used to automate the intrusion detection procedure.
Among the detection techniques used, the anomaly-
based approach is more effective when identifying new
attacks, compared to the signature-based one (Liao
et al., 2013). Anomaly-based methodology can be
split into two phases: (1) the learning phase, in which
the model that defines the normal behavior for the ap-
plication is built, using data from the real environment
or synthetically generated data; and (2) the detection
phase, where the current behavior of the system is
compared to the previously built model, searching for
anomalous behavior. Thus, any deviation from the
normal behavior model is considered an attack.
The modeling technique proposed in this work con-
sists in representing the system behaviors by n-grams
of events. These are extracted from a dataset of correct
event sequences collected from the system under study.
The general objective of this work is to achieve the
construction of models of normal behaviors of real dis-
tributed applications that allow the creation of effective
anomaly-based intrusion detection systems.
We consider a distributed application as a set of
processes that are coordinated through the message
exchange over a network. In each process, sending
or receiving a message is considered an event. In
addition, each process records its events at the node
that executes it, respecting the local ordering.
Another relevant premise is that the message exchange between system processes is carried out through reliable communication channels. This can be achieved, for example, through the use of the Transmission Control Protocol (TCP) (Beschastnikh et al., 2014); spurious messages may nonetheless occur.
We do not assume the presence of a global physical clock. Our time assumption is based on a logical clock which, through the happened-before relation (Lamport, 1978), allows us to order events according to a global causal order. Thus, the n-grams used in our model are built using the causal information extracted from the nodes' logs. Such n-grams capture the causal relations among events, which are better suited to represent the behavior of a distributed system than the simpler event sequences observed in each node. The use of a global logical clock allows a distributed structure to be evaluated effortlessly, since there is no need to synchronize the entire architecture with a physical global clock. Finally, our strategy differs from related work in the field, which focuses only on individual logs and not on global ones.
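To make this time assumption concrete, the sketch below (our own illustration, not the authors' implementation) shows how Lamport timestamps can be assigned to per-node event logs once send and receive events have been matched; the Event fields and the send/receive matching convention are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Event:
    proc: str                # process/node identifier (hypothetical field names)
    kind: str                # "send" or "recv"
    msg_id: Optional[str]    # identifier used to match a send with its receive
    ts: int = 0              # Lamport timestamp, assigned below

def lamport_timestamps(traces: dict[str, list[Event]]) -> None:
    """Assign Lamport timestamps in place; traces maps each process to its
    locally ordered event list. Spurious receives (no matching send anywhere)
    are assumed to have been filtered out beforehand."""
    send_ts: dict[str, int] = {}      # msg_id -> timestamp of its send event
    clock = {p: 0 for p in traces}    # one logical clock per process
    pos = {p: 0 for p in traces}      # next event still to be stamped, per process
    progress = True
    while progress:
        progress = False
        for p, events in traces.items():
            while pos[p] < len(events):
                e = events[pos[p]]
                if e.kind == "recv" and e.msg_id not in send_ts:
                    break             # wait until the matching send has been stamped
                base = send_ts[e.msg_id] if e.kind == "recv" else 0
                clock[p] = max(clock[p], base) + 1   # happened-before: recv after send
                e.ts = clock[p]
                if e.kind == "send":
                    send_ts[e.msg_id] = e.ts
                pos[p] += 1
                progress = True
```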
3.1 Modeling
The modeling technique proposed in this work assumes that there is a dataset with the records of events that occurred in a given distributed application. This dataset is composed of a set of executions with different durations. The main motivation is to capture the greatest possible diversity of heterogeneous behaviors: the more representative the dataset, the better we can understand how a model adapts and how well it behaves in the environment. The set of all executions registered in the dataset is defined as $E$.
An execution $E_i$ represents possible behaviors that may occur in the system. Thus, an execution $E_i \in E$ is defined as the set of traces $T$ of the processes that are part of it: $E_i = \{T_{i1}, T_{i2}, \ldots\}$. In each execution of a distributed application, multiple processes are involved ($p_1, \ldots, p_j$); the number of processes involved may vary depending on the execution. Each trace corresponds to a process, regardless of whether the processes are on the same host or not. Therefore, $j$ defines the maximum number of processes involved in execution $E_i$.
In this paper, a trace is the event log file of each process. As such, a trace $T_{ij}$ is defined by the sequence of events that occurred in process $p_j$ during execution $E_i$, in chronological order: $T_{ij} = [e_{ij1}, e_{ij2}, \ldots]$, where $e_{ijk}$ is the $k$-th event that occurred in process $p_j$ during execution $E_i$. In simplified form, we can write $T = [e_1, e_2, \ldots]$, leaving process $p_j$ and execution $E_i$ implicit when they are not needed. Consequently, $E_i$ corresponds to a single distributed execution of the system, which contains $j$ traces.
The construction of a model depends on ensuring
that only correct behaviors of the system are used, that is, traces without attacks or any type of anomaly. Therefore, the set of correct executions is defined as $C$, a subset of $E$ ($C \subseteq E$). The correct traces collected individually by each process must be composed of events that characterize finite executions of the application to be modeled. This implies that each execution $E_i$ must terminate properly (without deadlocks). Stored events must not be encrypted and must contain relevant information, for example, the source and destination nodes.
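As an illustration of this formalization (not part of the original paper), an execution $E_i$ can be represented as a mapping from process identifiers to their locally ordered event lists, reusing the Event structure from the sketch above; the log-line format assumed here is purely hypothetical.

```python
# Hypothetical log line format: "<kind> <msg_id> <peer>", one event per line,
# recorded in local chronological order by each process.
def load_trace(proc: str, path: str) -> list[Event]:
    trace = []
    with open(path) as fh:
        for line in fh:
            kind, msg_id, _peer = line.split()
            trace.append(Event(proc=proc, kind=kind, msg_id=msg_id))
    return trace

# An execution E_i = {T_i1, T_i2, ...}: one trace per participating process.
def load_execution(log_files: dict[str, str]) -> dict[str, list[Event]]:
    return {proc: load_trace(proc, path) for proc, path in log_files.items()}
```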
3.2 n-Grams Definition
An $n$-gram is defined as a sequence of $n$ consecutive causally-related events, which can be described as $seq = [e_1, e_2, \ldots, e_n]$. The set of all the distinct $n$-grams of size $n$ observed in a given execution $E$ is defined as $G(E, n)$. It is obtained using a three-step algorithm applied to the traces present in $E$ (see the sketch after this list):
1. The traces $T$ are analyzed to find the send-receive event pairs, i.e. the events corresponding to the sending of a message $m$ by a node and the receiving of $m$ by another node; messages that do not have a sending event or a receiving event are considered spurious and ignored;
2. The logical timestamps of all events $e$ in all traces $T \in E$ are calculated using Lamport's algorithm (Lamport, 1978), and events on the same process are considered causally linked; this is crucial to define inter-node causal relations;
3. For each event $e$ in each trace $T$, all the possible sequences of $n$ consecutive causally-related events starting at $e$ are found, possibly involving more than one node. Each such sequence constitutes an $n$-gram.
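A minimal sketch of this three-step procedure, under the same hypothetical event representation as before: it matches send/receive pairs, derives the causal successor relation from the local order plus the matched pairs (the essence of steps 1 and 2), and then enumerates the causally-related sequences of length $n$ by depth-first traversal (step 3).

```python
def ngrams(execution: dict[str, list[Event]], n: int) -> set[tuple]:
    """Return G(E, n): the distinct n-grams of causally-related events of one execution."""
    # Step 1: index receive events by message id; unmatched messages are spurious.
    recv_of = {e.msg_id: e
               for events in execution.values() for e in events if e.kind == "recv"}

    # Steps 1-2: causal successors of an event are the next event on the same
    # process and, for a send, the matching receive on the destination process.
    succ: dict[int, list[Event]] = {}
    for events in execution.values():
        for i, e in enumerate(events):
            nxt = [events[i + 1]] if i + 1 < len(events) else []
            if e.kind == "send" and e.msg_id in recv_of:
                nxt.append(recv_of[e.msg_id])
            succ[id(e)] = nxt

    # Step 3: depth-first enumeration of all sequences of n causally-related events.
    label = lambda e: (e.proc, e.kind, e.msg_id)   # hypothetical event label
    grams: set[tuple] = set()

    def walk(e: Event, seq: tuple) -> None:
        seq = seq + (label(e),)
        if len(seq) == n:
            grams.add(seq)
            return
        for s in succ[id(e)]:
            walk(s, seq)

    for events in execution.values():
        for e in events:
            walk(e, ())
    return grams
```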
Fig. 1 shows an example of an execution with two clients and one server, containing all the events (identified by letters) and two sequences of 4-grams that can be captured in this execution. From event a, it is possible to construct four 4-grams: abcd, aefb, abci and aefg. Similarly, taking event m as the beginning, it is possible to build the 4-grams mghn, mnok and mnop, besides the 4-gram mghi already highlighted in the figure. In the execution of Fig. 1, there are 27 distinct $n$-grams with $n = 4$ in total. In addition, several other event sequences can be formed using different values of $n$.
3.3 Model Definition
The $n$-gram calculation technique presented in Section 3.2 allows extracting the behaviors of a single execution $E$.
Figure 1: Example of n-grams from an execution (clients C1 and C2 and a server S exchanging messages; events labeled a to p along physical time).
Considering the characteristics of the dataset, this procedure must be repeated for each execution $E$. Only after obtaining the $n$-grams of all executions can global models of the system be built.
The proposed modeling technique includes two strategies, based on elementary operations of set theory, for building global models. Each strategy aims to build a model $M$ to represent the normal behavior of the system. Therefore, a model $M$ is simply a set of $n$-grams of the same length that attempts to represent the correct behavior of the system under study. Only the correct executions of the system ($E \in C$) should be used to build such a model.
The first proposed strategy comprises grouping all the $n$-grams of the executions $E$ that compose the dataset, removing $n$-gram replicas so as to keep only one instance of each $n$-gram. In summary, a union model $M_{\cup}(n)$, defined by the set of the $n$-grams of size $n$ present in at least one correct execution, is constructed: $M_{\cup}(n) = \bigcup_{E \in C} G(E, n)$.
In the second strategy, we group into a single model each distinct $n$-gram that appears in all correct executions. Thus, a simple intersection model $M_{\cap}(n)$ is defined by the set of the $n$-grams of size $n$ present in all correct executions: $M_{\cap}(n) = \bigcap_{E \in C} G(E, n)$.
A desired feature of the dataset is that each correct execution describes a specific behavior. These executions are expected to be distinct from each other. Therefore, the requirement that an $n$-gram be present in all of them can be very restrictive, making the model small (with few $n$-grams) and consequently very selective. Theoretically, this can lead to a high false-positive rate when detection uses it as a baseline.
Seeking to reduce the excessive selectivity of the simple intersection model, we divided the set of correct executions into subsets. After that, partial models were built from these subsets using the union logic. Finally, the intersection logic was applied on top of the partial models to obtain a more comprehensive final intersection model, which we call the wide intersection model.
Formally, the wide intersection model is constructed as follows: first, the set of correct executions $C$ is divided into $p$ partitions $P_{1 \ldots p}$, balanced with each other (with respect to the number and size of the executions in each). For each partition $P_i$, we calculated a partial union model
$M'_i$ from its executions: $M'_i(n) = \bigcup_{E \in P_i} G(E, n)$. The partial models $M'_i$ are combined into a wide intersection model $M_{\cap u}$ using intersection logic; in this way, this model contains the $n$-grams present in all partial models: $M_{\cap u}(n) = \bigcap_{i \in \{1, \ldots, p\}} M'_i(n)$.
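The three constructions can be summarized with elementary set operations. The sketch below assumes the ngrams() function from Section 3.2 and lists of correct executions; it only illustrates the definitions above.

```python
from functools import reduce

def union_model(correct_execs: list, n: int) -> set:
    """M_union(n): n-grams present in at least one correct execution."""
    return set().union(*(ngrams(e, n) for e in correct_execs))

def intersection_model(correct_execs: list, n: int) -> set:
    """M_inter(n): n-grams present in every correct execution."""
    return reduce(set.intersection, (ngrams(e, n) for e in correct_execs))

def wide_intersection_model(partitions: list, n: int) -> set:
    """M_wide(n): intersection of the per-partition union models M'_i."""
    partial = [union_model(part, n) for part in partitions]  # one M'_i per partition P_i
    return reduce(set.intersection, partial)
```

With a single partition this reduces to the union model, and with one execution per partition to the simple intersection model, matching the discussion that follows.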
A wide intersection model $M_{\cap u}$ encompasses the others, being a hybrid between the union model $M_{\cup}$ and the simple intersection model $M_{\cap}$, and it can represent both: if we make a single partition ($p = 1$), we have $M_{\cap u} = M_{\cup}$; if we make one partition per correct execution ($p = |C|$), we have $M_{\cap u} = M_{\cap}$. The choice of the value of $p$ ($1 \leq p \leq |E|$) is a parameter to be evaluated in future work.
In this paper, each $n$-gram of the model represents a small part of the correct behavior of the system. The model construction phase is performed offline, as is the detection phase. Hence, offline anomaly detection using a model $M$ is performed as follows: given an execution $E$ to be analyzed, its set of $n$-grams $G(E, n)$ is extracted; if all $n$-grams of $G(E, n)$ are present in the model $M$ (i.e., if $G(E, n)$ is contained in $M$), the execution is considered normal; otherwise, it is anomalous:

$G(E, n) \subseteq M$? yes $\rightarrow$ correct execution; no $\rightarrow$ anomalous execution.
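In code, this strict rule is a simple containment test (a sketch under the same assumptions as the earlier snippets):

```python
def is_anomalous(execution: dict, model: set, n: int) -> bool:
    """Strict rule: the execution is normal only if G(E, n) is fully contained in M."""
    return not ngrams(execution, n) <= model
```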
4 VALIDATION
To validate the proposed models, we define a validation scenario using data collected from a real distributed application. For this purpose, we chose the dataset built by Lanoë et al. (2019), which represents a real distributed application with several components and contains both normal and anomalous behavior, which is of interest to this work. Finally, it is important to note that the data representation used in this dataset is well suited for extracting the information needed to build the n-grams.
4.1 Scenario
The scenario presented in the dataset consists of executions of XtreemFS (Quobyte Inc, 2020), a fault-tolerant distributed file system. Four main components are needed to provide its services:
- Object Storage Device (OSD), responsible for storage;
- Metadata and Replica Catalog (MRC), which performs metadata storage, directory tree management, and access control;
- Directory Service (DIR), which connects all other components and registers system services; and
- Clients, which perform system requests and manage file states (such as volume creation/mounting and file creation).
The dataset produced by Lanoë et al. (2019) consists of 126 executions; each execution contains traces for all nodes in the system. The data gathering lasted between one minute and five hours, resulting in traces with an average size of 39 KB. The number of nodes in the system also varies: in some executions there may be up to two active clients (making requests to the application), and in others, none.
All nodes in the system were configured to record their individual traces, containing information about the events of sending or receiving messages. No information regarding the local physical time was stored with the events, only their sequence; thus, causality information must be rebuilt later. For this, the local events at each node are stored sequentially, following their order of occurrence. During the trace analysis, the order of local events and the relationship between matching sending and receiving events are used to rebuild causality dependencies among them and to define a partial order on the events (Totel et al., 2016).
Two types of behavior are represented: (i) situations of normal system behavior, in which information was collected during a period of execution with expected behaviors, and (ii) situations of abnormal system behavior, in which information was collected during a period when the system was under attack; 16.3% of the collected data represents abnormal behavior. In order to collect traces of executions with anomalous behaviors, an insecure version of XtreemFS was implemented by Lanoë et al. (2019), so that four known attacks against its integrity could be carried out:
- NewFile, which adds the metadata of a file on the MRC server but does not insert its contents on the OSD server;
- DeleteFile, which deletes the file's metadata on the MRC server but keeps its content on the OSD server;
- Chmod, which changes the file access policy on the MRC server, even without authorization; and
- Chown, which modifies the owner of the file in the MRC metadata.
Each attack was carried out in four different contexts: (c1) no active clients; (c2) with clients active in the environment, but before they carry out operations; (c3) after the operations are carried out; and (c4) originating from an address that does not belong to the client. The traces of these executions allow the evaluation of the detection efficiency of each model
previously built based on the normal behavior of the
system.
4.2 Results
For each correct execution $E$ present in the dataset, $n$-grams were calculated with different values of $n$, from 3 to 15. Grams smaller than 3 cannot adequately describe a correct behavior, and those larger than 15 impose a high computational cost during the model building and anomaly detection phases.
To evaluate the effectiveness of the built models, it is necessary to analyze their classification capacity in the face of correct executions that were not used during their construction. For this, after obtaining the grams of all executions, $C$ was separated into training and validation data, i.e., $C$ was segmented into $p$ individual validation partitions $PV_1, \ldots, PV_p$. We distributed the executions in each $PV_p$ so as to balance the size and number of traces. In summary, $PV_p$, $C$ and $A$ (the set of anomalous executions) are partitions¹ of $E$, i.e., $\bigcup_p PV_p = C$, $C \cup A = E$ and $C \cap A = \emptyset$.
Seeking to increase the efficiency of the model creation algorithm, we created a single file containing all the grams found in the traces that make up each partition, i.e., we built partial union validation models $MV'_i$ from the executions $E$ of each partition.
Next, we generated the validation models $MV$ using a $k$-fold cross-validation scheme (Arlot and Celisse, 2010). For the evaluations presented in this paper, we consider $k = 5$. In this scheme, we separate one partition of executions for validation and combine the other $k - 1$ partitions to produce the models, resulting in $k$ validation models ($MV_{1 \ldots k}$). We combine the partitions following the union and intersection logics described in Section 3.3.
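A sketch of this k-fold scheme (our own illustration; the balancing of partitions is omitted and the partition list is assumed to have k entries): each fold holds one partition out for validation and builds a model from the remaining k−1 partitions, using either the union or the wide intersection strategy.

```python
def kfold_models(partitions: list, n: int, strategy: str = "union") -> list:
    """Build one validation model MV per held-out partition."""
    folds = []
    for held_out in range(len(partitions)):                # k folds
        training = [p for i, p in enumerate(partitions) if i != held_out]
        if strategy == "union":
            # union over every execution of the k-1 training partitions
            model = union_model([e for part in training for e in part], n)
        else:
            # wide intersection over the k-1 partial union models
            model = wide_intersection_model(training, n)
        folds.append((model, partitions[held_out]))        # (MV, validation data)
    return folds
```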
To evaluate $MV_{\cup}$ and $MV_{\cap u}$, each $MV$ was tested using all the correct traces in $C$ that were not used to compose that model, i.e., those separated during the model construction. The result returned by the test of each $MV$ is the percentage of $n$-grams found in the model in relation to the total number of $n$-grams present in the analyzed trace.
As the relation $\subseteq$ only defines whether a trace $t$ belongs to the model or not, that is, a binary decision (correct or suspicious), it becomes necessary to adopt a less rigid criterion. For this reason, an acceptance rate was adopted, seeking to make the classification of normal and anomalous behaviors more flexible. Thus, the result returned by the test of each model $M$ is the percentage of the $n$-grams present in the analyzed trace that are also found in the model. This parameter reflects the threshold above which the system considers a trace normal, mirroring the identification behavior that would occur in an IDS.

¹A set $X$ can be divided into disjoint subsets called partitions $P_i$, such that $\bigcup_i P_i = X$ and $P_i \cap P_j = \emptyset$ for all $i \neq j$.
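The acceptance-rate criterion can be sketched as follows (again an illustration, with the threshold left as a free parameter; a threshold of 1.0 recovers the strict containment rule of Section 3.3):

```python
def acceptance_rate(execution: dict, model: set, n: int) -> float:
    """Fraction of the trace's n-grams that are also found in the model."""
    grams = ngrams(execution, n)
    return len(grams & model) / len(grams) if grams else 1.0

def classify(execution: dict, model: set, n: int, threshold: float) -> str:
    """E.g. threshold=0.994 for the union models with n in {3, 6}, as reported below."""
    return "normal" if acceptance_rate(execution, model, n) >= threshold else "anomalous"
```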
In order to understand which acceptance rate is most appropriate for each type of model, the accuracy of each model was evaluated as its acceptance rate was varied. For the union validation models $MV_{\cup}$, an adequate acceptance rate of 99.4% was obtained, with $n = 3$ and $n = 6$. Thus, if at least 99.4% of the grams of a trace $t$ are known to the model, the trace is considered normal according to $MV_{\cup}$.
The same evaluation was performed for the wide intersection validation models $MV_{\cap u}$. In this scenario, the accuracy was lower when compared to $MV_{\cup}$, with 60% being the maximum value reached, for $n = 9$ and $n = 12$. For an acceptance rate of 100% ($t \subseteq MV_{\cap u}$), in almost all cases the accuracy achieved was 46%. This was because the models classified any $t$ as suspect; that is, anomalous traces were classified correctly, but all the correct traces were also erroneously classified as suspicious. Conversely, for a 55% acceptance rate, the models classify any $t$ as normal. Thus, the acceptance rate adopted for these models was 75%.
With the appropriate acceptance rate defined for both models, the next assessment refers to the false positive rate ($T_{FP}$). The results presented are the arithmetic average over the $k$ validation models built, a practice commonly adopted for $k$-fold cross-validation. Fig. 2 shows the $T_{FP}$ achieved by the models according to the values of $n$. As expected, $MV_{\cap u}$ had the highest rate compared to $MV_{\cup}$, reaching a 54% difference at $n = 15$. This shows that collecting only the most common $n$-grams generates very restricted models that are unable to distinguish correct behavior from an anomaly. Another point that can be observed in Figure 2 is that the higher the value of $n$, the more specific the $n$-grams become. Consequently, the amount of behavior represented is reduced, making the models more prone to errors.
Seeking to analyze how satisfactory the results returned by the built models are, the true positive rate $T_{TP}$ was verified, as shown in Fig. 3. A first point to note is that small values of $n$ ($\leq 4$) are not able to adequately represent a behavior, classifying any $t$ as normal activity. Still analyzing the size variation of the $n$-grams, $MV_{\cap u}$ improves its detection capacity significantly as the values attributed to $n$ increase. This behavior indicates that $MV_{\cap u}$ allows attacks to be identified more easily but, on the other hand, increases the difficulty of correctly identifying normal traces.
In this scenario, $MV_{\cup}$ proved to be stable regardless of the variation in $n$, with $T_{TP}$ remaining around 58%.
Figure 2: Rate of false positives for different values of $n$ in $MV_{\cup}$ and $MV_{\cap u}$.
Figure 3: True positive rate for different values of $n$ in $MV_{\cup}$ and $MV_{\cap u}$.
Figure 4: Accuracy for different values of $n$ in $MV_{\cup}$ and $MV_{\cap u}$.
However, there is a difficulty in detecting context c1 (no active clients): in none of these attacks were the corresponding traces classified as suspect. This highlights two points: (i) it is presumable that the traces used for the construction of the models do not contain a sufficient number of normal behaviors within that same context (but without the attack) to adequately represent the system; (ii) the models are able to detect anomalies only in scenarios where there are active clients; abnormal traces that precede correct client activities are classified as normal behaviors. On the other hand, the detection capacity was satisfactory, since with a relatively small size of $n$ the results are similar to those of larger grams, and higher processing costs can be avoided.
Finally, it is necessary to determine how accurate the models are, i.e., what the relevance of the classifications based on them is. Fig. 4 shows the precision ($P$) obtained by each model as a function of $n$. Analyzing $MV_{\cup}$, it is possible to state that increasing the size of the $n$-grams is not a good strategy, as the number of incorrect results tends to increase. For $MV_{\cap u}$, it is not possible to make the same statement, since its behavior remained stable in most of the tested cases.
Considering all the analyses performed, the most suitable size for the $n$-grams in union-based models is $n = 6$, since: it presented the best accuracy (75%) compared to the other sizes; its $T_{FP}$ is reasonable (9%) and among the lowest presented by the model; its detection capacity ($T_{TP}$) is similar to that of larger $n$-grams; and the dispersion of its results is minimal (86%), that is, it returns more relevant results than irrelevant ones.
Thus, it can be verified that the union strategy generates models with a good capacity to differentiate correct behavior from an anomaly. On the other hand, the intersection strategy generates more restricted models, tending to classify most behaviors as anomalous. Regarding the sizes of the $n$-grams, larger values of $n$ are more effective in detecting attacks, since they represent more specific behaviors.
5 CONCLUSION AND FUTURE WORK
This paper proposed a technique to model the normal behavior of a distributed application using $n$-grams of events. We described the procedures for obtaining the $n$-grams from the execution traces of a real distributed application and for constructing normality models of the system through set operations. The proposed technique does not depend on a global clock to understand the relationship between the events that occurred in the system, which is one of the contributions of this work.
Two strategies for constructing normality models (union and wide intersection) were evaluated. The models were evaluated with respect to their ability to represent the normal behavior of the system and to detect anomalies. For this evaluation, we used a dataset containing real executions of a distributed application and a set of real attacks executed in different contexts. The results obtained show that the union-based models offer a good capacity to model normal behavior, with promising results in terms of accuracy and true positive rates. The intersection-based models proved to be more sensitive for detecting anomalies, but they do not model normal behavior as well as the union-based ones. In addition, we observed that $n$-grams with larger values of $n$ detect attacks more easily, but are more prone to false positives. Within the context of this research, the analysis based on requiring total containment between the sets proved to be an impracticable approach, since the models are unable to represent all possible behaviors that can occur within a system.
The initial results are promising, but some questions remain open and will be explored in future research. There is a clear need to increase the number of normal executions and attack traces. Thus, we plan to increase the dataset size for both normal executions and attacks,
seeking to obtain a broader and more balanced repre-
sentation of the possible behaviors of the distributed
application.
Finally, the construction of intersection-based models presented a significant constraint; adding flexibility to the combination of these subsets is an opportunity to circumvent this limitation and to build less restricted models. Another approach is to introduce the notion of generalization into the representation of the $n$-grams, in order to obtain a more comprehensive (and at the same time more compact) representation of the normal behavior of the system.
ACKNOWLEDGEMENTS
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brazil
(CAPES) - Finance Code 001. The authors also thank
the UFPR and UTFPR Computer Science departments.
REFERENCES
Angiulli, F., Argento, L., and Furfaro, A. (2015). Exploiting n-gram location for intrusion detection. In Intl Conf on Tools with Artificial Intelligence, Vietri sul Mare, Italy. IEEE.
Arlot, S. and Celisse, A. (2010). A survey of cross-validation
procedures for model selection. Statistics Surveys, 4.
Benmoussa, H., Abou El Kalam, A., and Ait Ouahman,
A. (2015). Distributed intrusion detection system
based on anticipation and prediction approach. In Intl
Conf on Security and Cryptography, Colmar, France.
SciTePress.
Beschastnikh, I., Brun, Y., Ernst, M. D., and Krishnamurthy,
A. (2014). Inferring models of concurrent systems
from logs of their behavior with CSight. In Intl Conf
on Software Engineering, Hyderabad, India. ACM.
Borkar, A., Donode, A., and Kumari, A. (2017). A survey on
intrusion detection system (IDS) and internal intrusion
detection and protection system (IIDPS). In Intl Conf
on Inventive Computing and Informatics, Coimbatore,
India. IEEE.
Broder, A. Z., Glassman, S. C., Manasse, M. S., and Zweig,
G. (1997). Syntactic clustering of the web. Computer
Networks and ISDN Systems, 29(8).
Callegari, C., Pagano, M., Giordano, S., and Berizzi, F.
(2016). A novel histogram-based network anomaly
detection. In Intl Conf on Security and Cryptography,
Lisbon, Portugal. SciTePress.
Fu, Q., Lou, J.-G., Wang, Y., and Li, J. (2009). Execution
anomaly detection in distributed systems through un-
structured log analysis. In Intl Conf on Data Mining,
Miami, FL, USA. IEEE.
Jiang, G., Chen, H., Ungureanu, C., and Yoshihira, K. (2006).
Multiresolution abnormal trace detection using varied-length n-grams and automata. IEEE Transactions on Systems, Man, and Cybernetics, 37(1).
Jose, S., Malathi, D., Reddy, B., and Jayaseeli, D. (2018).
A survey on anomaly based host intrusion detection
system. Journal of Physics: Conference Series, 1000.
Khraisat, A., Gondal, I., Vamplew, P., and Kamruzzaman,
J. (2019). Survey of intrusion detection systems: tech-
niques, datasets and challenges. Cybersecurity, 2(1).
Lamport, L. (1978). Time, clocks, and the ordering of events
in a distributed system. Communications of the ACM,
21(7).
Lanoë, D., Hurfin, M., Totel, E., and Maziero, C. (2019).
An efficient and scalable intrusion detection system
on logs of distributed applications. In Intl Conf on
ICT Systems Security and Privacy Protection, Lisbon,
Portugal. Springer.
Liao, H.-J., Lin, C.-H. R., Lin, Y.-C., and Tung, K.-Y. (2013).
Intrusion detection system: A comprehensive review.
Journal of Network and Computer Applications, 36(1).
Lorenzoli, D., Mariani, L., and Pezzè, M. (2006). Inferring
state-based behavior models. In Intl Wksp on Dynamic
Systems Analysis, Shanghai, China. ACM.
Mishra, P., Pilli, E. S., Varadharajan, V., and Tupakula, U.
(2017). Intrusion detection techniques in cloud envi-
ronment: A survey. Journal of Network and Computer
Applications, 77.
Quobyte Inc (2020). XtreemFS - fault-tolerant distributed
file system. http://www.xtreemfs.org.
Stillerman, M., Marceau, C., and Stillman, M. (1999). Intru-
sion detection for distributed applications. Communi-
cations of the ACM, 42(7).
Totel, E., Hkimi, M., Hurfin, M., Leslous, M., and Labiche,
Y. (2016). Inferring a distributed application behav-
ior model for anomaly based intrusion detection. In
European Dependable Computing Conf, Gothenburg,
Sweden. IEEE.
Vartouni, A. M., Kashi, S. S., and Teshnehlab, M. (2018).
An anomaly detection method to detect web attacks
using stacked auto-encoder. In Iranian Joint Congress
on Fuzzy and Intelligent Systems, Kerman, Iran. IEEE.
Wressnegger, C., Schwenk, G., Arp, D., and Rieck, K. (2013). A close look on n-grams in intrusion detection: anomaly detection vs. classification. In Wksp on Artificial Intelligence and Security, Berlin, Germany. ACM.
Zolotukhin, M., Hämäläinen, T., and Juvonen, A. (2012).
Online anomaly detection by using n-gram model and
growing hierarchical self-organizing maps. In Intl Wire-
less Communications and Mobile Computing Conf, Li-
massol, Cyprus. IEEE.