Profiling and Discriminating of Containerized ML Applications in Digital Data Marketplaces (DDM)

Lu Zhang¹, Reginald Cushing¹, Ralph Koning¹, Cees de Laat² and Paola Grosso¹
¹Multiscale Networked Systems (MNS), University of Amsterdam, Amsterdam, The Netherlands
²Complex Cyber Infrastructure (CCI), University of Amsterdam, Amsterdam, The Netherlands
Keywords:
Digital Data Marketplaces (DDM), System Calls, N-gram, Profiling, Containers.
Abstract:
A Digital Data Marketplace (DDM) facilitates secure and trustworthy data sharing among multiple parties. For instance, training a machine learning (ML) model on data from multiple parties normally yields higher prediction accuracy. It is crucial to enforce the data usage policies during the execution stage. In this paper, we propose a methodology to distinguish programs running inside containers by externally monitoring their system call sequences. To support container portability and the need to retrain ML models, we also investigate the stability of the proposed methodology for 7 typical containerized ML applications across different execution platform OSs and training data sets. The results show that our proposed methodology can distinguish between applications over various configurations with an average classification accuracy of 93.85%; it can therefore be integrated as an enforcement component in DDM infrastructures.
1 INTRODUCTION
A Digital Data Marketplace (DDM) provides a digital infrastructure to facilitate data sharing in a secure and trustworthy manner. The collaborating parties, e.g. data providers and algorithm providers, normally come to a DDM with a pre-agreed policy describing which algorithm can work on which dataset. A DDM should include a component to enforce those policies (Zhang et al., 2019).
The compute algorithms in DDMs are normally encapsulated in containers to gain better portability (Canon and Younge, 2019). Multiple containers may run on the same platform with permissions to execute on different data sets, so it is crucial to ensure that each data set is accessed only by the container running the authorized algorithm. To avoid information leakage, the DDM infrastructure is not allowed to access the containers or the algorithm source code directly. This raises the question we answer in this paper: "How can containerized applications be distinguished relying only on external monitoring?"
System calls provide an interface to the services offered by an operating system, and the behavior of a computer program can be well modelled as system call sequences (Forrest et al., 1996). In this paper, we propose an architecture to distinguish running containers by establishing system call profiles; this can be implemented as an enforcement component in a DDM. An authorized party builds an authorized profile of a computing algorithm after verifying its source code. The program behavior is modelled by the occurrences of fixed-length subsequences of system calls (n-grams). The execution platform monitors system calls in real time and can, at the end of execution, distinguish the programs running inside containers based on their system call profiles. The dissimilarity between profiles is computed with cross entropy. The system triggers an alarm if there is a mismatch between the policy and the classification results.
However, system calls are highly localized and the generated trace files depend on the configuration of the execution platform (Forrest et al., 2008). Moreover, an authorized algorithm may be retrained multiple times with different training data sets, and, to support the portability of containers, an authorized profile may be reused across platforms with different OSs. To address these points, we investigate the suitability of our methodology over different Linux distributions and different training sets for 7 typical DDM applications. Based on our experimental results, the proposed profiling and distance calculation methodology shows stable results over different platform configurations, e.g. OSs and training data sets. Finally, we demonstrate that the proposed methodology achieves an average classification accuracy of 93.85%.
2 RELATED WORK
There are several recent studies modelling program behaviour with system calls. The work in (Paek et al., 2006) proposed to use the frequency of individual system calls to build the normal profile of an application. Such profiles may lose important information about the application behaviour because they do not capture the sequential relationships among system calls. In addition, this method is vulnerable to mimic attacks: exploiting the profiles, adversaries can mimic a benign program to perform malicious actions (Varghese and Jacob, 2007). (Forrest et al., 1996) proposed to profile the normal behavior of a running process as a set of short sequences of system calls. The profile is an enumeration of all fixed-length subsequences, and an anomaly is flagged if a sufficient number of new short subsequences occur. (Hofmeyr et al., 1998) proposed a similar profiling methodology, called STIDE, in which the authors use Hamming distances to calculate the dissimilarities used to detect anomalies. These works provide a good starting base for our research.
(Suratkar et al., 2019) proposed in their recent work to use Hidden Markov Models to model the normal behavior of applications. The work in (Xiao et al., 2019) proposed to train an LSTM model on the benign behavior of a running program. However, those methods are computationally expensive and require a large amount of data.
The work in (Das et al., 2017) aimed to analyze Android app behavior using system calls. They built the app profile based on the frequency distribution of individual system calls and concluded that system calls are not sufficient to classify application behavior. In our work we demonstrate that this is possible as long as the profiling methodology is more refined than the one adopted by these authors.
3 ARCHITECTURE
We propose an architecture to enforce the data access and usage policy among collaborating parties during the execution phase of a data sharing application. The architecture can be implemented as one of the enforcement components of a DDM infrastructure.
As shown in Figure 1, the data providers and algorithm providers first agree on a policy, which describes the purpose of the computing algorithm, e.g. on which specific data set the algorithm may operate. For portability, the computing algorithms in DDMs are containerized.
Figure 1: A DDM enforcement component to distinguish and verify running algorithms inside containers with system call profiles.
Secondly, the algorithm provider sends its compute algorithm, as one or multiple images, together with the policy to an authorized party for verification. This party checks the source code and verifies whether the given algorithm complies with the policy. If so, the authorized party generates profiles of the application images from system call traces and sends them to the execution platform. These profiles are digitally signed and encrypted with the public key of the execution platform, which greatly reduces the risk of mimic attacks on the profiles (Varghese and Jacob, 2007).
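As a hypothetical sketch of this protection step, the following assumes the Python 'cryptography' package; the hybrid scheme (sign the profile, encrypt it under a symmetric session key, wrap that key with the platform's RSA public key) and all names are illustrative, not the authors' implementation:

```python
# Illustrative sketch only: key handling and message framing are assumptions.
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding

def protect_profile(profile_bytes, authority_priv, platform_pub):
    """Sign a profile with the authority's RSA key, then encrypt it so that
    only the execution platform can read it (hybrid RSA + symmetric)."""
    signature = authority_priv.sign(
        profile_bytes,
        padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                    salt_length=padding.PSS.MAX_LENGTH),
        hashes.SHA256())
    session_key = Fernet.generate_key()           # symmetric key for the bulk data
    ciphertext = Fernet(session_key).encrypt(profile_bytes + b"||" + signature)
    wrapped_key = platform_pub.encrypt(           # only the platform can unwrap
        session_key,
        padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                     algorithm=hashes.SHA256(), label=None))
    return ciphertext, wrapped_key
```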
Thirdly, both data objects and compute objects are sent to the execution platform. To keep the algorithm confidential, the compute containers are normally executed remotely by their owners. The platform monitors the system calls of running containers and feeds the information to a classifier. The classifier discriminates the running program inside each container by computing the dissimilarities between the observed tracefiles and all authorized profiles.
4 PROFILE GENERATION
It is important to consider the behavioural variance of system call traces when programs run with different configurations. For instance, the traces of a specific ML algorithm may change with different training data sets or across multiple runs. This may cause false alarms when we use system call profiles to distinguish compute applications. A good profiling and similarity computation methodology reduces this noise and extracts only the information that actually represents the algorithm behaviour.
We define the self-variance of a profiling methodology to be the dissimilarity among system call profiles of the same application. It represents how sensitive a profiling methodology is to the intrinsic variability of the running application. The quantitative definition of self-variance depends on the dissimilarity measure used in the classifier. Higher self-variance is likely to cause a higher false negative rate when making decisions.
To investigate our proposed methodologies, we set up the experimental environment illustrated in Figure 2. We want to emulate the actions of the authorized third party in Figure 1 when it generates the profile, as well as the actions of the execution platform when it monitors running applications.
Figure 2: Profile generation setup: a single physical host running Docker containers and the Sysdig probe in the Linux kernel.
We run containers within a single physical node. We use Sysdig to monitor the generated system calls because it is specifically designed for containers. The probe is placed in the Linux kernel of the host machine and traces all the system calls generated by the container. The input data is accessed as a volume of the Docker container. As shown in Figure 2, the result is a raw trace file which serves as the input of the Filtering component. The Filtering component filters out all system calls generated by the container runtime. Finally, the Profile Generator generates profiles with the proposed methodology, which is introduced in Section 6.
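As a minimal sketch of this capture-and-filter pipeline, the following assumes Sysdig is installed on the host; the container name, the output format and the list of runtime process names are our own illustrative assumptions, not the authors' filtering script:

```python
import subprocess

# Assumed runtime process names to filter out (illustrative only).
RUNTIME_PROCS = {"containerd-shim", "runc", "docker-init"}

def capture_and_filter(container_name, raw_path, filtered_path, seconds=600):
    # Capture "<process> <syscall>" lines for syscall-entry events of one
    # container; 'timeout' stops the capture after the run has finished.
    with open(raw_path, "w") as raw:
        subprocess.run(
            ["timeout", str(seconds), "sysdig",
             "-p", "%proc.name %evt.type",
             f"container.name={container_name} and evt.dir=>"],
            stdout=raw)
    # Filtering component: drop calls generated by the container runtime.
    with open(raw_path) as raw, open(filtered_path, "w") as out:
        for line in raw:
            proc, _, syscall = line.strip().partition(" ")
            if syscall and proc not in RUNTIME_PROCS:
                out.write(syscall + "\n")
```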
5 CLASSIC N-GRAM PROFILES
Recently, plenty of works have used fixed-length subsequences of system calls to characterize the behavior of running processes (Khreich et al., 2017; Varghese and Jacob, 2007; Subba et al., 2017). They segment a system call trace into fixed-length subsequences with a sliding window of length n (normally 3 - 6). The subsequences are called n-grams, with n being the length of the subsequence. Suppose we have a sequence of system calls such as fstat, mmap, close, open, read, write, ···. This can be segmented into a list of n-grams of length 3: {fstat, mmap, close}, {mmap, close, open}, {close, open, read}, {open, read, write}, ···.
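A minimal sketch of this sliding-window segmentation in Python (the function and variable names are ours):

```python
def ngrams(trace, n=3):
    """Segment a system call trace into overlapping n-grams of length n."""
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]

calls = ["fstat", "mmap", "close", "open", "read", "write"]
print(ngrams(calls))
# [('fstat', 'mmap', 'close'), ('mmap', 'close', 'open'),
#  ('close', 'open', 'read'), ('open', 'read', 'write')]
```

Note that a trace of length T yields T - n + 1 overlapping n-grams.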
The traditional n-gram profiling methodology was proposed in (Forrest et al., 1996). It builds the profile as an enumeration of all occurring n-grams, and the deviation between two profiles is computed as the number of n-grams that are distinct between them. With the setup illustrated in Figure 2, we aim to investigate the self-variance of traditional n-gram profiles. The application under test trains a fraud detector and the operating system is Ubuntu 18.04. We first run a docker container 10 times with the same training data set and investigate the self-variance of the generated n-gram profiles. As seen in Figure 2, Sysdig traces cover the entire container runtime, so it is interesting to filter the runtime-generated system calls out of the raw trace file. We filter out these calls with a text processing script we wrote for this purpose, as illustrated by the Filtering module in Figure 2.
Table 1 shows the average number of n-grams in the 10 profiles. We distinguish between profiles generated by the entire container and profiles with the container runtime calls filtered out. The similarity of two n-gram profiles is measured as the proportion of n-gram entries that are common to both files. We do a pair-wise comparison of the n-grams in the 10 traces and show the average number of pair-wise common n-grams in the middle column. Finally, we present the number of n-grams common to all 10 traces in the last column.
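A sketch of this classic set-based comparison follows; the choice of denominator is our assumption, since the text only states "proportion of common entries":

```python
def classic_similarity(profile_a: set, profile_b: set) -> float:
    """Proportion of n-gram entries common to two classic profiles,
    each profile being the set of distinct n-grams in one tracefile."""
    common = profile_a & profile_b
    return len(common) / max(len(profile_a), len(profile_b))
```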
We first focus on the unfiltered traces. The average number of n-grams per profile is 1520; this reduces to 1310 common n-grams in the pair-wise comparison, and finally only 1205 n-grams are common to all 10 tracefiles. Only around 86% of the n-gram entries are common between any two profiles (the ratio 1310/1520), and the overall similarity is only 79% (1205/1520). Looking at the filtered profiles we see that the pair-wise similarity is approximately 88% and the overall similarity is 82%. This is only slightly better than the unfiltered case, and the likelihood of false negatives remains high.
From this we conclude that the classic n-gram profiling and distance computation methodology is likely to generate false alarms if we adopt it for discriminating containerized applications. In the next sections we propose a different profiling methodology that solves this issue.
Table 1: Average number of n-grams for the fraud detection application, before and after filtering of runtime-generated system calls, in three categories: all n-grams, pair-wise common n-grams and overall common n-grams.

                   No. n-grams   No. pair-wise common n-grams   No. overall common n-grams
Before Filtering   1520          1310                           1205
After Filtering    1370          1200                           1125
6 N-GRAM FREQUENCY DISTRIBUTIONS
The traditional n-gram method presented in Section 5 builds the profile from distinct n-grams without their occurrence frequencies, which convey information describing the behavior of a program. Using the same application (fraud detection) and the 10 filtered profiles obtained in the previous experiment, we produce the occurrence distribution.
Figure 3 shows the CDF of the occurrence distribution of each n-gram in one run of our application. The x-axis is the number of occurrences of each n-gram.

Figure 3: CDF of the occurrence distribution of n-grams in one filtered tracefile, i.e. a tracefile with the container runtime calls filtered out.
There are in total 1350 distinct n-grams in this profile. Their occurrence numbers range from 1 to as high as $10^5$. As shown in the figure, more than 50% of the n-grams occur very rarely, with an occurrence count of 1. We expect that those rarely occurring n-grams contain little, or at least less important, information about the actual program behavior. On the other hand, n-grams which occur often provide a clear signature of the application.
This observation is in fact confirmed when we compare the n-gram entries that differ between two profiles of our application: almost all of those distinct n-gram entries occur only once in the profile. In this sense, we confirm that the traditional n-gram profiling methodology, which treats all distinct n-gram entries equally, does not model the actual behaviour of an application well and may result in a lower classification rate.
Hence we propose to profile a containerized program with the frequency distribution of its n-grams. We expect that the distinctive signatures in the applications' profiles will result in lower variance. To confirm this assumption we need a method to calculate profile distances when profiles are expressed as frequency distributions of n-grams.
The dissimilarity of two profiles will then be computed using the cross entropy. Suppose $OD_p = \{nc_1^{(p)}, nc_2^{(p)}, \cdots, nc_N^{(p)}\}$ denotes the occurrence distribution in tracefile $tr_p$, where $nc_i^{(p)}$ indicates the occurrence count of the $i$-th n-gram in tracefile $tr_p$ and $N$ denotes the total number of distinct n-grams. Similarly, $OD_q = \{nc_1^{(q)}, nc_2^{(q)}, \cdots, nc_M^{(q)}\}$ denotes the occurrence counts of the $M$ n-gram entries in tracefile $tr_q$. In most scenarios, the two distributions have $M \neq N$.
To obtain the cross entropy, the two input distributions need to have an equal number of entries. To accomplish this we define a procedure to adjust the sets. We first compute the union set of $OD_p$ and $OD_q$:

$$OD_u = OD_p \cup OD_q = \{nc_1, nc_2, \cdots, nc_L\} \quad (1)$$

$$L \geq M, \qquad L \geq N \quad (2)$$

$L$ denotes the cardinality of the union set $OD_u$. We create two sets $\hat{OD}_p$ and $\hat{OD}_q$ with $L$ entries; we add all the n-grams contained in $OD_u$ but not in the original $OD_p$, with an occurrence count of zero, to form the new set $\hat{OD}_p$. We follow the same procedure to form $\hat{OD}_q$:

$$\hat{OD}_p = \{nc_1^{(p)}, nc_2^{(p)}, \cdots, nc_N^{(p)}, 0, \cdots, 0\} \quad (3)$$

$$\hat{OD}_q = \{nc_1^{(q)}, nc_2^{(q)}, \cdots, nc_M^{(q)}, 0, \cdots, 0\} \quad (4)$$
To avoid zero probabilities, we adopt Laplace smoothing, specifically add-one smoothing, to calculate the frequency distribution:

$$fd_i = \frac{nc_i + 1}{\sum_{i=1}^{L} nc_i + L} \quad (5)$$

$$\text{laplace smoothing}: \hat{OD}_p \rightarrow FD_p \quad (6)$$

$$\text{laplace smoothing}: \hat{OD}_q \rightarrow FD_q \quad (7)$$

$FD_p = \{fd_1^{(p)}, fd_2^{(p)}, \cdots, fd_L^{(p)}\}$ denotes the frequency distribution of each n-gram entry after applying Laplace smoothing to tracefile $tr_p$. Similarly, $FD_q$ denotes the smoothed frequency distribution for $tr_q$.
After this procedure, we can compute the cross entropy of two tracefiles $tr_p$ and $tr_q$ as:

$$C(tr_p, tr_q) = \sum_{i=1}^{L} \left( fd_i^{(p)} - fd_i^{(q)} \right) \cdot \log \frac{fd_i^{(p)}}{fd_i^{(q)}} \quad (8)$$

The value of the cross entropy has a lower bound of 0, reached when the two distributions are identical; a small value indicates high similarity between the two distributions.
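A minimal sketch of Eqs. (1)-(8) in Python, assuming a trace is a list of system call names (the function name and the n=3 default are ours):

```python
from collections import Counter
from math import log

def profile_distance(trace_p, trace_q, n=3):
    """Dissimilarity of two traces following Eqs. (1)-(8): align the two
    n-gram occurrence distributions on their union, apply add-one
    (Laplace) smoothing, and sum the symmetric log-ratio terms."""
    od_p = Counter(tuple(trace_p[i:i + n]) for i in range(len(trace_p) - n + 1))
    od_q = Counter(tuple(trace_q[i:i + n]) for i in range(len(trace_q) - n + 1))
    union = set(od_p) | set(od_q)                    # Eq. (1): the union set OD_u
    L = len(union)
    tot_p = sum(od_p.values()) + L                   # Eq. (5) denominator for tr_p
    tot_q = sum(od_q.values()) + L
    fd_p = {g: (od_p.get(g, 0) + 1) / tot_p for g in union}   # add-one smoothing
    fd_q = {g: (od_q.get(g, 0) + 1) / tot_q for g in union}
    # Eq. (8): symmetric log-ratio terms summed over the union set.
    return sum((fd_p[g] - fd_q[g]) * log(fd_p[g] / fd_q[g]) for g in union)
```

Note that each term of Eq. (8) is non-negative and vanishes only when the two smoothed frequencies coincide, which gives the lower bound of 0 and the symmetry of the measure.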
7 SELF-VARIANCE AND MUTUAL DISTANCE
In this section we validate our frequency distribution
profile methodology by looking at the self-variance
and the distance of profiles calculated with cross-
entropy, which we define as follows.
Suppose $T_M = \{t_1^{(M)}, t_2^{(M)}, \cdots, t_P^{(M)}\}$ represents the set of tracefiles for Application $M$ and $T_N = \{t_1^{(N)}, t_2^{(N)}, \cdots, t_Q^{(N)}\}$ the set of tracefiles for Application $N$.

The self-variance of Application $M$ with tracefile set $T_M$ is the average value of the cross entropy over all pairs of distinct tracefiles in $T_M \times T_M$:

$$\text{self variance}(T_M) = \text{average}(C(t_i^{(M)}, t_j^{(M)})) \quad (9)$$

$$(t_i^{(M)}, t_j^{(M)}) \in T_M \times T_M \quad (10)$$

Similarly, the average mutual distance between two applications $M$ and $N$ is calculated as:

$$\text{Mutual Distance}(T_M, T_N) = \text{average}(C(t_i^{(M)}, t_j^{(N)})) \quad (11)$$

$$(t_i^{(M)}, t_j^{(N)}) \in T_M \times T_N \quad (12)$$
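Under the same assumptions as the earlier sketch, Eqs. (9)-(12) reduce to averages of pairwise distances; a sketch, using distinct pairs for the self-variance:

```python
from itertools import combinations, product
from statistics import mean

def self_variance(traces):
    """Eq. (9): average pairwise distance within one application's traces."""
    return mean(profile_distance(p, q) for p, q in combinations(traces, 2))

def mutual_distance(traces_m, traces_n):
    """Eq. (11): average distance across two applications' trace sets."""
    return mean(profile_distance(p, q) for p, q in product(traces_m, traces_n))
```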
To validate our methodology we select 7 typical DDM applications from the field of machine learning. Each application is encapsulated in a Docker container (Merkel, 2014). All the algorithms are written in Python and primarily rely on TensorFlow, an open source library that helps to develop and train ML models. The Docker images of all applications share a common underlying image, Python 3.6 on Debian Stretch, and the algorithm script runs on top of it. We run the containers of all 7 applications with the experimental setup depicted in Figure 2. The host operating system is Ubuntu 18.04 and we run each application 4 times.
We compute the self-variance and the average mutual distances among all 7 containerized applications. The application names and the computed results are shown in Table 2.
The self-variance of each application is shown as the value on the diagonal. The self-variances are small compared to the mutual distances, which implies that our proposed classification methodology can provide a lower false negative rate than the classic n-gram method.
When we look at the mutual distances between applications we observe a number of interesting features. First, the distances range from 0.38 (the App3-App5 pair) to 23 (the App4-App6 pair). This large variation can be explained by observing that a number of our applications use the same libraries; we therefore expect those applications to be more difficult to distinguish. Conversely, App4 has a unique set of libraries and therefore shows larger distances to all other applications.
8 STABILITY
The system call trace files of a running program depend on the configuration of the execution environment. In this section we investigate the stability, quantified as self-variance, of the generated profiles over different OSs and training sets. We mainly focus on 3 applications: App1 - Unbalanced classifier, App2 - Text classification and App3 - Collaborative filtering, as they are widely used ML algorithms in DDMs.
8.1 Stability over Different Platform OSs
Table 2: Self-variance and mutual distances among all 7 containerized applications.

       Application name          No. n-grams   APP 1   APP 2   APP 3   APP 4   APP 5    APP 6    APP 7
APP 1  Unbalanced classifier     1370          0.064   5.21    1.42    12.5    3        3.35     5.56
APP 2  Text classification       5491          -       0.038   4.81    20      6.09     6.68     9.23
APP 3  Collaborative filtering   1394          -       -       0.002   18      0.38     0.66     5.2
APP 4  Federated learning        1140          -       -       -       0.045   22.3     23       19
APP 5  S2S learning              1452          -       -       -       -       0.0007   0.58     2.76
APP 6  Train a LSTM              1948          -       -       -       -       -        0.0009   4.74
APP 7  Train a quasi-svm         2204          -       -       -       -       -        -        0.02

One benefit of containerization is portability over different platforms. In DDMs the same application will run at different times on different platforms, and the OS of the chosen execution platforms may vary. It would obviously be very convenient if a baseline profile generated for an application on one OS could be used for classification on another OS. To assess this we need to determine the stability of the generated profiles across platforms, as this influences the classifier accuracy. In our experiment we ran the three applications on 3 host machines, each configured with a different OS: Ubuntu 18.04, CentOS 7 and Debian GNU/Linux 9. For each application, the container was run 5 times with the same training data set. We then calculated the cross entropy for all application profiles produced in one OS and the cross entropy for pairwise comparisons of application profiles across different OSs.
Figure 4: Stability of the profiles, expressed as cross entropy, for 3 model applications over execution platforms running three OSs: Ubuntu, Debian and CentOS.
As shown in Figure 4, there are in total 6 groups of boxes. Each group contains the results for runs of the same application: App1 in red, App2 in black and App3 in blue. The 3 groups on the right show the cross entropy values of profiles generated on machines with one specific OS. The 3 groups on the left show the cross-platform variance of the profiles when running on two host machines with different OSs.
Not surprisingly, we observe that the variance across platforms is application-dependent. According to Figure 4, the profiles of App2 suffer from more variance than the other two. One explanation is that App2 is more resource intensive, so more device management system calls are inserted into the program's behavioral traces. This generates more rarely occurring n-grams, which contribute to the higher variance. We can also observe that Debian GNU/Linux 9 provides the most stable profiles for all 3 applications.
When we compare the variances for runs within the same OS, with values ranging from 0 to 0.27, to those for runs across different OS pairs, with values ranging from 0 to 0.34, the cross-platform variance is only slightly larger. In addition, as shown in Table 2, the mutual distances among the 3 applications are 5.21 (App1-App2), 1.42 (App1-App3) and 4.81 (App2-App3). The absolute values of the cross-platform variance are smaller than the mutual distances, which indicates that the variance of profiles over different platform OSs is unlikely to cause false negatives. These results strongly suggest that our proposed methodology supports container portability with quite stable performance, at least for typical DDM ML algorithms.
8.2 Stability over Different Training Data Sets
Not only is it important to determine the stability of profiles across OSs, it is likewise essential to investigate their stability over different training data sets. There are two reasons for this. First, to reduce the risk of data leakage in a DDM, the authorized party is not allowed to access the data objects directly or to execute the compute algorithm on them. This means that the data used to generate the profile at the authorized party normally differs from the data used for classification on the execution platform. Secondly, a DDM customer may retrain a machine learning model multiple times with different training data sets to obtain higher accuracy. To determine the stability in this case, we trained the ML model of each application with 5 training data sets of various sizes.
Figure 5 shows the stability of the application profiles, expressed as cross entropy values for all pairs in the Cartesian product of the tracefile sets. The cross entropy between profiles of the same application is small, in the large majority of cases with values lower than 0.1. In particular, the profiles of App1 are the most stable across the 5 training data sets, with cross entropy values ranging from 0 to 0.15 including outliers. App2 and App3 have a few outliers with cross entropy higher than 0.1; App2's outliers reach values between 0.17 and 0.7. As explained in Section 8.1, this is because more device management system calls are generated in this case and inserted into the program's behavioral traces.
Figure 5: Stability of the three model application profiles over all data sets, expressed as cross entropy.

From our two experiments, on stability over different OSs and over different training data sets, we can conclude that the profiles are fairly closely clustered together, as seen by the low cross entropy values. This means we expect our methodology to deliver good results independently of the OS and training data set with which the reference profiles were generated.
9 CLASSIFICATION ACCURACY
In this section we investigate how well our proposed classifier distinguishes containerized applications. In each round, we randomly choose one profile per application as the reference profile and measure how accurately our classifier labels the remaining application profiles. Table 3 shows the confusion matrix of the classifier for 6 containerized ML algorithms. Each row reports the classification results for one application. The last column shows the mean and standard deviation of the classification accuracy for each application over 40 rounds.
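A sketch of one evaluation round as we understand it, reusing profile_distance from Section 6; the nearest-reference decision rule is our reading of the classifier:

```python
import random

def classification_round(profiles_by_app):
    """One round: pick a random reference trace per application, then
    classify every remaining trace by the smallest profile distance
    to a reference. Returns the per-application accuracy."""
    refs = {app: random.choice(ts) for app, ts in profiles_by_app.items()}
    correct, total = {}, {}
    for app, traces in profiles_by_app.items():
        for t in traces:
            if t is refs[app]:
                continue  # skip the reference itself
            predicted = min(refs, key=lambda a: profile_distance(t, refs[a]))
            total[app] = total.get(app, 0) + 1
            correct[app] = correct.get(app, 0) + (predicted == app)
    return {app: correct.get(app, 0) / total[app] for app in total}
```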
We observe that the classifier always achieves an accuracy of 100% for App2, App6 and App7, no matter how the authorized profiles are built. The prediction accuracy is only 86.7% for App1, with 209 samples classified as App3 and 22 samples classified as App6. False classifications mainly occur between App3 and App5, and between App5 and App6, because their mutual distances, as shown in Table 2, are only 0.38 and 0.58. For the applications with lower average classification accuracy, App1, App3 and App5, the standard deviation is also higher. This indicates that the selection of the authorized profiles plays an important role in the performance of the classifier. Even so, the classifier predicts some applications sharing the same libraries with 100% accuracy, and the overall accuracy across all applications is as high as 93.85%.
10 DISCUSSION
The work presented so far focuses on distinguishing the applications running inside containers based on system call monitoring. We must stress that our methodology is not concerned with the maliciousness of the code; the focus is on whether the code is authorized to run. Detecting malicious code and intrusions in real time is out of scope for the current paper, but we will extend our architecture in Figure 1 to include this as a separate component.
The performance of our proposed methodology is lower for applications that are very similar to each other. It will be a focus of our future work to determine whether more fine-grained classifiers, such as Support Vector Machines (SVM), Decision Trees or KNN (k-nearest neighbours), can improve the current classification accuracy. These more refined methods require many more tracefiles than the ones we currently use. In many DDMs such a large set of tracefiles is difficult to collect, hence our method will often remain the one adopted, and we believe it delivers more than sufficient discrimination power, as seen by its overall accuracy of 93.85%.
Another interesting aspect to consider is the effect of the tracefile size, i.e. the number of distinct n-grams it contains. As we can see from Table 2, the applications we chose produce tracefiles containing between 1140 and 5491 distinct n-grams. When we look at Table 3 we see no obvious correlation between the number of n-grams and the accuracy, which is a further good indicator of the suitability of our method, as it is application size agnostic.
11 CONCLUSION AND FUTURE WORK
In this paper we propose an architecture to distinguish programs running inside containers by monitoring system calls. We propose to profile an algorithm container with its n-gram occurrence distribution and to compute dissimilarities with cross entropy. The methodology allows profile reuse across execution platforms with different OSs and different training data sets. We showed that we can achieve high classification accuracy on typical ML applications. Our method can be easily incorporated in DDMs currently under development.
Table 3: The confusion matrix of the classifier for 6 applications running with various platform OSs and training data sets.

        APP 1   APP 2   APP 3   APP 5   APP 6   APP 7   mean (%) ± std
APP 1   1529    0       209     0       22      0       86.7 ± 0.15
APP 2   0       1760    0       0       0       0       100 ± 0
APP 3   0       0       1623    137     0       0       92.2 ± 0.15
APP 5   0       0       61      1483    216     0       84.2 ± 0.21
APP 6   0       0       0       0       1760    0       100 ± 0
APP 7   0       0       0       0       0       1760    100 ± 0
Overall accuracy: 93.85%

In the future we want to profile a compute container with multi-dimensional metrics, e.g. incoming and outgoing network traffic and CPU usage. This will allow us to build richer signatures and reduce classification inaccuracy. Additionally, we want to extend our architecture to include intrusion detection modules, which will not simply classify an application as authorized or not, but also try to determine its behaviour at runtime.
ACKNOWLEDGEMENTS
This paper builds upon the work done within the Dutch NWO research project 'Data Logistics for Logistics Data' (DL4LD, www.dl4ld.net), supported by the Dutch Top consortia for Knowledge and Innovation 'Institute for Advanced Logistics' (TKI Dinalog, www.dinalog.nl) of the Ministry of Economy and Environment in The Netherlands and the Dutch Commit-to-Data initiative (https://commit2data.nl/).
REFERENCES
Canon, R. S. and Younge, A. (2019). A case for portability and reproducibility of hpc containers. In 2019 IEEE/ACM International Workshop on Containers and New Orchestration Paradigms for Isolated Environments in HPC (CANOPIE-HPC), pages 49–54.

Das, P. K., Joshi, A., and Finin, T. (2017). App behavioral analysis using system calls. In 2017 IEEE Conference on Computer Communications Workshops, INFOCOM WKSHPS 2017, pages 487–492. Institute of Electrical and Electronics Engineers Inc.

Forrest, S., Hofmeyr, S., and Somayaji, A. (2008). The evolution of system-call monitoring. In 2008 Annual Computer Security Applications Conference (ACSAC), pages 418–430. IEEE.

Forrest, S., Hofmeyr, S. A., Somayaji, A., and Longstaff, T. A. (1996). A sense of self for unix processes. In Proceedings 1996 IEEE Symposium on Security and Privacy, pages 120–128. IEEE.

Hofmeyr, S. A., Forrest, S., and Somayaji, A. (1998). Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180.

Khreich, W., Khosravifar, B., Hamou-Lhadj, A., and Talhi, C. (2017). An anomaly detection system based on variable n-gram features and one-class svm. Information and Software Technology, 91:186–197.

Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239):2.

Paek, S.-H., Oh, Y.-K., Yun, J., and Lee, D.-H. (2006). The architecture of host-based intrusion detection model generation system for the frequency per system call. In 2006 International Conference on Hybrid Information Technology, volume 2, pages 277–283. IEEE.

Subba, B., Biswas, S., and Karmakar, S. (2017). Host based intrusion detection system using frequency analysis of n-gram terms. In TENCON 2017 - 2017 IEEE Region 10 Conference, pages 2006–2011. IEEE.

Suratkar, S., Kazi, F., Gaikwad, R., Shete, A., Kabra, R., and Khirsagar, S. (2019). Multi hidden markov models for improved anomaly detection using system call analysis. In 2019 IEEE Bombay Section Signature Conference (IBSSC), pages 1–6. IEEE.

Varghese, S. M. and Jacob, K. P. (2007). Process profiling using frequencies of system calls. In The Second International Conference on Availability, Reliability and Security (ARES'07), pages 473–479. IEEE.

Xiao, X., Zhang, S., Mercaldo, F., Hu, G., and Sangaiah, A. K. (2019). Android malware detection based on system call sequences and lstm. Multimedia Tools and Applications, 78(4):3979–3999.

Zhang, L., Cushing, R., Gommans, L., De Laat, C., and Grosso, P. (2019). Modeling of collaboration archetypes in digital market places. IEEE Access, 7:102689–102700.