Needles in a Haystack: Using PORT to Catch Bad Behaviors within

Application Recordings

Preston Moore

, Thomas Wies

, Marc Waldman

, Phyllis Frankl

and Justin Cappos

New York University, U.S.A.

Manhattan College, U.S.A.

Keywords:

Domain Speciﬁc Languages, Event Processing, Environmental Bugs.

Abstract:

Earlier work has proven that information extracted from recordings of an application’s activity can be tremen-

dously valuable. However, given the many requests that pass between applications and external entities, it has

been difﬁcult to isolate the handful of patterns that indicate the potential for failure. In this paper, we propose

a method that harnesses proven event processing techniques to ﬁnd those problematic patterns. The key addi-

tion is PORT, a new domain speciﬁc language which, when combined with its event stream recognition and

transformation engine, enables users to extract patterns in system call recordings and other streams, and then

rewrite input activity on the ﬂy. The former task can spot activity that indicates a bug, while the latter produces

a modiﬁed stream for use in more active testing. We evaluated PORT’s capabilities in several ways, starting

with recreating the mutators and checkers utilized by an earlier work called SEA to modify and replay the

results of system calls. Our re-implementations achieved the same efﬁcacy using fewer lines of code. We also

illustrated PORT’s extensibility by adding support for detecting malicious USB commands within recorded

trafﬁc.

1 INTRODUCTION

‘‘Actions speak louder than words...” - Unknown

It is a well established principle that, in the wake

of an application failure, its actions during execution

can provide clues to the root cause. Such information

cannot only help correct the cause of failure, but also

prevent its repetition through the creation of better test

methods. The challenge is how to identify and extract

this data from large and detailed sources like appli-

cation logs, system call traces, or application record-

ings. In other words, how does one accurately de-

scribe what activity is important and what you should

do when you ﬁnd it?

In considering this question, we drew inspira-

tion from two sources. The ﬁrst is our recent study

that conﬁrmed the value of monitoring and modi-

fying an application’s interactions with its environ-

ment (Moore et al., 2019). Using a technique known

as SEA (Simulating Environmental Anomalies), the

study demonstrated that when an application fails, the

causal properties will be visible in the results of the

system calls it made. Further, the study afﬁrmed these

results could be captured and simulated for testing

against other applications. The second source was

the signiﬁcant amount of literature supporting the use

of event processing techniques over large streams of

data (Agrawal et al., 2008; Hirzel, 2012; Hirzel et al.,

2013; Dayarathna and Perera, 2018). We posited

that techniques currently used to identify problems in

manufacturing environments, or patterns in network

outages, could also be used to accurately recognize

target sequences in large application activity streams.

Building upon these successes, we introduce

a tool that utilizes event processing techniques to

identify behaviors that may cause applications to

fail. What makes this possible is PORT (Pattern

Observation, Recognition, and Transformation), a

new domain speciﬁc language we designed with the

goal of describing these behaviors in a briefer and

more easily understood manner than conventional

languages. The descriptions can be used to search

recordings of an application’s actions across a variety

of “activity representations.” That is, it can search ac-

tivities like system calls or remote procedure calls and

determine if an application either executed a desired

behavior or avoided an undesired one. Further, PORT

can specify a set of modiﬁcations to be made if a par-

ticular activity sequence is encountered. By combin-

ing passive monitoring and active activity modiﬁca-

tion, PORT can aid in identifying bugs in a wide vari-

ety of programs that might be missed by other testing

strategies.

In order to illustrate PORT’s usefulness, we built

Moore, P., Wies, T., Waldman, M., Frankl, P. and Cappos, J.

Needles in a Haystack: Using PORT to Catch Bad Behaviors within Application Recordings.

DOI: 10.5220/0011142300003266

In Proceedings of the 17th International Conference on Software Technologies (ICSOFT 2022), pages 137-145

ISBN: 978-989-758-588-3; ISSN: 2184-2833

137

a prototype compiler for the language and carried

out a three part evaluation. The ﬁrst part consisted

of creating more concise, readable and maintainable

implementations of the anomalies described in our

earlier work on the SEA technique (Moore et al.,

2019). Next, we demonstrate how PORT allowed us

to quickly add support for recorded streams of USB

trafﬁc enabling us to detect and simulate BADUSB-

style attacks (Hak5, 2022) and device ID conﬂicts.

And, ﬁnally, we demonstrate how quickly PORT pro-

grams can process recordings taken from real world

network and compression applications.

The main contributions in this work can be sum-

marized as follows:

• We create a new domain speciﬁc language, PORT,

that allows for concise descriptions of patterns

that may be found and transformed in an appli-

cation’s activity stream.

• We show how PORT allows for concise descrip-

tions of anomalies.

• We demonstrate that PORT can be extended to

other activity representations.

• We provide an open source implementation

of PORT at: https://github.com/pkmoore/

crashsimlang.

2 BACKGROUND AND

MOTIVATION

Our decision to construct a large tool like PORT was

not one made lightly. It came about after initial ef-

forts revealed existing tools could not meet our re-

quirements. In this section we explain how creating

a distinct language, helped us expand SEA’s applica-

bility to other activity types, such as calls to library

functions and remote procedure calls

2.1 Our Motivating Example

The initial impetus for this work was a desire to

expand the use of our Simulating Environmental

Anomalies (SEA) technique (Moore et al., 2019).

This effort afﬁrmed that problematic environmental

properties, known as anomalies, are visible in com-

munications between an application’s components.

We found that, once captured, these anomalies could

be simulated and tested against other applications.

Results of system calls made during execution were

recorded, modiﬁed, replayed and the application’s re-

sponse was noted. Using this strategy, the authors

were able to identify a number of bugs in major ap-

plications.

As a concrete example of the above consider the

“Unusual File types” anomaly discussed in the SEA

paper. This anomaly may be problematic when an

application running under Linux attempts to open and

read data from a ﬁle on disk, as Linux requires special

procedures to handle these tasks. The SEA technique,

allows a user to modify the return value of a stat call

to simulate the presence of such a ﬁle. PORT simpli-

ﬁes the simulation by using proven event processing

techniques and adding features that make the tool ag-

nostic to the way an application’s activity has been

recorded.

2.2 Building a New Domain Speciﬁc

Language

For it to work as we wanted it to, PORT needed to

meet some speciﬁc criteria. For starters, it had to

be able to identify speciﬁc patterns as they appear in

a recording of application activity. Simple models,

such as a deterministic ﬁnite automaton (i.e. a DFA

or a FSA) enhanced to operate on complex structures

do not work because they cannot match certain pat-

terns. In addition, we needed a language that could

easily capture and manipulate the internal contents

of events, like argument data, pointer addresses, for

reuse in subsequent operations.

One possibility is to deploy more expressive au-

tomata models, such as register automata (Kamin-

ski and Francez, 1994) or session automata (Bollig

et al., 2014) that can output modiﬁed streams. Sev-

eral feature-rich event processing languages and li-

braries do have these capabilities, but the process is

by no means straightforward. In many cases, produc-

ing such an output stream would require falling back

on the fully-featured nature of a host language (e.g.

Java) – a situation we hoped to avoid.

In the initial stages of PORT’s development,

we also evaluated several complex event processing

(CEP) languages. These languages could provide

some of the pattern and predicate matching primitives

that we wished to incorporate. Sadly, these languages

either lacked other required features or were too com-

plex for the easy-to-use system we wanted to offer.

Typically, programs for these complex event process-

ing engines are written in the engine’s build or host

language, such as Java, Scala or Python that bring

with them a great deal of boilerplate code. Recent

studies have afﬁrmed that excessive and complicated

code patterns can harm understanding and maintain-

ability (Gopstein et al., 2017). Further, it means that

the author and future maintainers of a program must

be ﬂuent in this host language.

ICSOFT 2022 - 17th International Conference on Software Technologies

138

3 LANGUAGE OVERVIEW

The PORT language allows its users to completely

describe a mutator, or a program that can recognize

a particular event stream and optionally produce a

modiﬁed version of that stream. We compile a given

PORT programs into an enhanced transducer that can

operate over complex data structures. The transducer

consists of states, and a series of rules that govern

when the current state should changes and also what

modiﬁcations should be made to the input stream.

The transducer rules consist of logical compar-

isons between an event’s parameter values, and the

values stored in the transducer’s registers or the liter-

als speciﬁed directly in a program’s code. Using event

processing techniques, PORT maps activity onto a

stream of events, that each consist of a unique identi-

ﬁer (e.g. the name of the function being called) and a

list of parameter values (e.g. the argument and return

values of the called function).

Parameter values are drawn from a set of basic

data values, such as strings and numbers, as well as

user-deﬁnable record types. PORT allows users to ig-

nore parameter values that are irrelevant for the partic-

ular task at hand and to create a single abstract event

from several semantically related, but different event

identiﬁers. In this manner, the transducer can pro-

cess abstract events rather than individually specify-

ing each event.

A PORT transducer processes an input event

stream as follows. Each item in an input sequence

is examined and the transducer’s internal state is up-

dated accordingly. Output is then produced based on

the rules described in the transducer’s program. In

this way, the transducer itself can be thought of as

a sequence of actions that may or may not be exe-

cuted based on the values in the input stream. The

state of the transducer keeps track of the next action

to be executed, as well as a valuation of a ﬁnite num-

ber of registers that hold data values. Whether or not

an action is executed depends on a combination of the

transducer state and the values of the current event.

Therefore, executing an action consists of reading the

next event from the input stream, which may update

some of the registers, and then writing the next event

in the output stream.

Figure 1 shows the grammar of PORT’s core lan-

guage. A program is split into two parts: the preamble

consisting of type and event deﬁnitions, and the body

of the program which consists of an action expression.

Preamble. An abstract event consists of a name and

a record of named ﬁelds that hold data values of inter-

est. Consider the event deﬁnition:

event rd {read fdesc: Number@0}

program ::= (typedef | eventdef )

∗

action

typedef ::= type id{id:t @ n (,id:t @n)

∗

};

eventdef ::= event id variant (| variant)

∗

;

variant ::= {id id:t @ n (,id: t @n)

∗

}

action ::= pattern -> id (e) | not pattern

| action;action | action*

pattern ::= id(p) with b

p ∈ pexp ::= {id: p (,id: p)

∗

}

| regid | c

e ∈ vexp ::= regid | c | e + e | . . .

b ∈ bexp ::= true | false

| e == e | p and p | . . .

t ∈ texp ::= String | Number

| id | . . .

Figure 1: Grammar of PORT’s core language.

| {recv fdesc: Number@0};

This deﬁnition maps the concrete events named read

and recv to the abstract event rd. The parameter val-

ues of the concrete events are abstracted to a record

consisting of a single ﬁeld fdesc that holds a value of

type Number. The notation fdesc: Number@0 in each

variant indicates that the value of fdesc is the 0th pa-

rameter of the corresponding concrete event. Note

that all variants must map their parameters to the same

record type. If an event deﬁnition describes an ab-

stract event in terms of a single variant for which the

concrete event name coincides with the name of the

abstract event, then the former can be omitted in the

variant.

The abstraction mechanism used for these param-

eter lists can also be applied to the values themselves.

For instance, the 1st parameter of an fstat system call

is a status buffer that consists of a list of values. Type

deﬁnitions can be used to abstract such compound val-

ues into records. The following PORT code deﬁnes an

abstract fstat event that tracks only the device iden-

tiﬁer, inode number, and mode of the status buffer:

type SB {dev: String@0, ino: String@1,

mode: String@2};

event fstat {fdesc: Number@0, sbuf: SB@1};

Body. The action expression in the body of a PORT

program describes how the input event stream is trans-

formed to the output event stream. An individual in-

put event is matched and transformed by an atomic

action of the form

(p) with b -> id

(e)

This action matches the next (abstract) input event

against the pattern id

(p) subject to the constraint b.

The action is triggered if the name of the matched

event is id

and its record satisﬁes the constraints im-

posed by p and b. The semantics of pattern matching

is similar to the way expressions are pattern matched

in functional programming languages. In particular, a

variable occurring in a pattern refers to a register of

the transducer. If the pattern matches the event, then

the register is assigned to the corresponding value.

For example, consider the pattern:

fstat({fdesc: fd2, sbuf: {dev:rdev, stino:

rino2}})

Needles in a Haystack: Using PORT to Catch Bad Behaviors within Application Recordings

139

When matched against the event

fstat({fdesc: 4, sbuf: {dev:"st_dev=makedev(0,

4)", ino: 42, mode="S_IFCHR|0666"}})

the match would succeed and assign the value 4 to

rdev, and "S_IFCHR|0666" to register rino2.

The Boolean expression b is evaluated after the

initial match of id

(p) succeeds. If b evaluates to true

then the action takes effect. Otherwise, the match

fails and the registers are reset to their original values

before id

(p) was matched.

When an atomic action takes effect, the matched

input event is consumed and an output event is ap-

pended to the output stream. This output event is de-

scribed by id

(e) in the output clause of the action. If

= id

, then e can be a partial record expression,

describing only those parts of the input record that

should be modiﬁed by the action. Record ﬁelds that

are not speciﬁed by e are copied from the input event

to the output event. If the record types of the input

and output events differ, e must describe the output

record completely.

For example, consider the atomic action:

fstat({fdesc:rfd2, sbuf:{dev:rdev, ino:rino2}})

with rfd2 == rfd and rdev == "st_dev=makedev

(0, 4)" -> fstat({sbuf:{ino: rino}});

This action matches the fstat event given above,

assuming the register rfd has value 4 before the

match. Moreover, if the register rino has value

43 before the action is executed, then the action

produces the output event:

fstat({fdesc:4, sbuf:{dev:"st_dev=makedev(0,4)"

, ino: 43, mode="S_IFCHR|0666"}})

For convenience, we add a syntactic short-hand that

allows one to more compactly express common

pattern types. First, one often needs to express that

the value of a matched record ﬁeld is equal to the

current value of a register. In the action above, the

ﬁeld fdesc of the matched fstat event must be equal

to rfd for the match to succeed. This constraint can

be expressed more succinctly by replacing rfd2 with

?rfd in the pattern of the action. This ensures that the

matched value is equal to rfd without changing the

value of rfd. The equality rsd2 == rsd can then be

omitted from the with clause.

Next, the with clause can also be omitted alto-

gether from an atomic action, in which case b defaults

to true. Likewise, the output clause can be omitted

and the matched input event can simply be copied to

the output stream. Finally, if an action is replacing

only the value of a ﬁeld, and the action is not depen-

dent on the old value, then the modiﬁed ﬁeld value

can be speciﬁed directly in the pattern using the nota-

tion -> e. Here, the expression e determines the new

value to be stored in the ﬁeld.

Using this syntactic short-hand, the action given

above can be expressed more compactly as:

fstat({fdesc: ?rfd, sbuf: {dev:"st_dev=makedev

(0, 4)", ino: -> rino}});

Sequencing, Implicit Repetition, and Negation.

Atomic actions can be sequenced to form com-

pound action expressions that match and transform

sequences of events, action

;action

. PORT simpli-

ﬁes the handling of unbounded event sequences by us-

ing implicit repetition semantics. If the next event in

the input stream is not matched by the current atomic

action in the action sequence, the event is simply

copied to the output stream. The transducer moves on

by attempting to match the next input event against

the current atomic action.

Sometimes, it is necessary to constrain this im-

plicit repetition by disallowing the appearance of

certain events in the input stream before an event

matched by the current atomic action is encountered.

This can be done by negated patterns, which take the

form not id(p) with b. If an event that matches the

pattern id(p) with b is encountered before the next

atomic action in the program takes effect, then the

transducer aborts.

Explicit Repetition. PORT also supports explicit

repetition of actions, which is indicated using a

Kleene star, action*, similar to standard regular ex-

pression syntax. The generated transducer accepts

zero or more repetitions of the speciﬁed sequence of

events. Output is only produced if a complete repeti-

tion of sequence is encountered.

4 EVALUATION

Once we had an implementation of PORT, we de-

signed a set of experiments to evaluate its effective-

ness in real world situations. Speciﬁcally, we aim to

answer the following questions:

• Can PORT express the anomalies used by SEA to

identify bugs?

• How easy is it to extend PORT to support activity

representations other than system calls?

• What problems can be addressed by employ-

ing PORT on non-system-call activity representa-

tions?

• Can PORT process input streams in a reasonable

amount of time?

ICSOFT 2022 - 17th International Conference on Software Technologies

140

1 event Statbuf {mode: String@2};

2 event anystat {stat sb: Statbuf@1}

3 | {lstat sb: Statbuf@1}

4 | {fstat sb: Statbuf@1};

5 anystat({sb: {mode: -> "st_mode=S_IFBLK"}});

Figure 2: A PORT program that identiﬁes a stat, lstat,

or fstat call and modiﬁes the ST_MODE member of its

statbuf output parameter to contain the value "S_IFBLK".

This indicates that the ﬁle being examined is a block device

rather than a regular ﬁle.

4.1 Expressing SEA Anomalies

Given that this work is motivated in large part by

a desire to expand the utility of the SEA technique,

our ﬁrst experiment aims to reproduce the anomalies

described in (Moore et al., 2019). Speciﬁcally, we

test PORT’s ability to recreate the study’s unusual ﬁle

type mutator, and its cross-disk ﬁle move checkers,

which were used to identify the bulk of the bugs that

were found.

Creating the Unusual File type Mutator. For the

ﬁrst part of this experiment, we used PORT to imple-

ment an “unusual ﬁle type” mutator. As illustrated

in Figure 2,this mutator takes an input trace that con-

tains a call to either stat(), fstat(), or lstat()

and modiﬁes its result data structure so its ST_MODE

member will indicate an unusual ﬁle type. As can be

seen in Figure 2, this task can be expressed with only

a few lines of PORT code. Lines 1 through 4 deﬁne

what stat(), fstat(), and lstat() calls look like,

and which parameter contains the result buffer. Line

6 generates an accepting state that, when entered, out-

puts a system call with a modiﬁed value in the return

structure’s st_mode ﬁeld. The output can then be used

to modify the results of a running application’s system

calls to complete the SEA technique.

The original implementation of this mutator

in (Moore et al., 2019) consisted of 55 lines of Python

code, much of it error-prone state management code.

Our shorter PORT mutators offers several major ad-

vantages. First, they omit boilerplate code associ-

ated with general purpose languages because com-

mon functions are generically implemented, eliminat-

ing the need for users to do so manually. Next, each

statement deﬁnes a speciﬁc state which ignore any

system calls not dealt with in the PORT program. Fi-

nally, instead of relying on fragile string manipula-

tion, PORT’s operators make it easier to modify the

parameters of system call.

Supporting Cross-Disk Move Checkers. In the

second part of this experiment we tested whether

PORT can implement the “checkers” used in SEA

1 event usbhid { src: String@0, dst: String@1,

2 data: String@13, transfertype:

String@10 };

3 num1 <- "00:00:1e:00:00:00:00:00";

4 num2 <- "00:00:1f:00:00:00:00:00";

5 src <- "2.1.1";

6 dst <- "host";

7 usbhid({src: ?src, dst: ?dst, data: ?num1})

-> usbhid({data: ->num2});

Figure 3: A demonstration PORT program that matches

USB activity indicating the ’1’ key is being pressed and

transforming it to a new frame where the ’2’ key is being

pressed.

to determine if an application can correctly move a

ﬁle from one disk to another. This task is a com-

mon source of bugs in Linux applications because the

Linux rename() system call does not support moving

ﬁles from one disk to another. Moore et al. identiﬁed

the steps required to correctly perform such a move

by examining the source code of the “mv” command.

The team then implemented a set of checkers to iden-

tify situations where an application does not complete

this task correctly. In real world applications, these

checkers were able to identify bugs in many popular

applications and libraries that offer ﬁle movement ca-

pabilities.

We evaluated each of the four checkers listed in

Moore et al.’s work and determined that PORT could

implement three of them. For example, we were able

to replace the 45 lines of difﬁcult to read and main-

tain Python code in the “File Replaced During Copy”

checker with a clearer 7 line PORT program. This ex-

ercise did expose one of PORT’s shortcomings, which

is it cannot currently implement the “Extended File

Attributes” checker. Because PORT does not include

a list data structure, it can not capture the values

getxattr() and ensure they have all been applied

with a corresponding call to setxattr(). Without

this, an application cannot preserve a ﬁle’s extended

attributes and re-apply them after the move. Though

we are considering such a feature for future imple-

mentation, we do not currently support it. Such an ex-

tension could hurt program clarity and make it harder

to reason about mutator behavior.

4.2 Extending PORT to Other Activity

Representations

A key feature of PORT is it easy to add support

for new activity representations. To demonstrate this

we implemented support for streams of USB activ-

ity. This format was chosen because of its reliance on

numerous parties correctly implementing a standard

protocol. Using PORT on streams of USB activity re-

Needles in a Haystack: Using PORT to Catch Bad Behaviors within Application Recordings

141

quired implementing an an extension that lets PORT

process USB frames and developing some way to

capture communications between USB devices. For

the latter, we used Wireshark because of its excellent

trafﬁc capture and dissection capabilities (Wireshark,

2022). For the former, implementing such a trans-

former was a straightforward task taking only around

three and a half. Together, these two components al-

lowed us to write PORT programs that could both

identify patterns and transform streams of USB ac-

tivity in minutes.

BADUSB. As one test scenario, we settled

upon the recent type of USB-based attack known

as BADUSB (Hak5, 2022). These attacks utilize

small USB devices that resemble thumb drives, but

when plugged into a computer, register themselves

as human interface devices, and then rapidly send

keystrokes to execute malicious commands. Our goal

was to construct PORT programs to both recognize

these attacks within a recording of a machine’s USB

trafﬁc and transform an innocent recording into one

containing the attack. The modiﬁed program could be

replayed to assess if a computer’s defensive measures

are able detect the attacks.

The ﬁrst PORT program we wrote detects a USB

device attempting to bypass powershell’s security pol-

icy. This is a common starting point for BADUSB

attacks that seek to execute complex payloads. The

program we wrote detects USB frames that contain

a sequence of “scan codes” which spell out “pow-

ershell -Exec bypass.” Detecting this string is criti-

cal because it explicitly disables security controls, a

step that should only be taken under special circum-

stances. Using this program (see abbreviated version

in Figure 3) we were able to detect the target se-

quence in streams of USB trafﬁc recorded from a real

computer using a standard USB keyboard. A more

advanced example of PORT’s capabilities with USB

streams involves simulating a BADUSB attack. This

program also identiﬁes and transforms USB human

interface device frames to yield key presses that spell

out “powershell -Exec bypass,” achieving our goal of

transforming an innocent stream into a malicious one.

Device ID Conﬂicts. Our second test involved us-

ing SEA to simulate a USB device receiving an inap-

propriate “vendor ID” or “product ID” from its man-

ufacturer. Incorrect identiﬁers can cause a device to

not work correctly (wrongid, 2014). This malfunction

has been problematic enough these devices must be

disabled (barscanner, 2009). We wrote a PORT pro-

gram that monitors a stream of USB trafﬁc for USB

device registrations and stores the ﬁrst vendorID and

productID ﬁeld it encounters into registers. When

subsequent registrations are encountered, their iden-

Utility and Operation Exec. Time No. Syscalls

gzip compress ﬁle 0.110 17

gzip decompress ﬁle 0.107 35

rar compress ﬁle 0.112 109

rar decompress ﬁle 0.109 87

bzip decompress ﬁle 0.102 25

ncat server 0.103 43

socat server 0.108 71

http.server server 0.114 21

rsync client 0.132 274

ssh client 0.159 850

ftp client 0.160 891

scp client 0.135 490

telnet client 0.106 23

BADUSB 0.111 1116 lines

ID Conﬂict 0.117 18992 lines

Figure 4: Average time required to process the speciﬁed

recording based on 100 executions.

tiﬁers are rewritten using these stored values. This

produces a new stream of USB activity where many

devices share incorrect device identiﬁers that can be

used after the fashion of SEA to test a system’s re-

sponse to such a mis-conﬁguration.

4.3 PORT’s Performance

Our ﬁnal experiment evaluates the time required for

PORT to identify speciﬁc patterns within a realistic

set of test traces from eight widely used network ap-

plications and four popular compression utilities. Five

of the applications are clients that operate by con-

necting to an appropriate service. Three of the ap-

plications were servers, and were recorded as they

accepted a connection from an appropriate client.

The compression utilities were recorded as they com-

pressed or decompressed a ﬁle. These recordings

were made with strace and then processed using a

PORT program. For the network applications the pro-

gram identiﬁes the sequence of system calls that im-

plement a client or server’s request handling loop.

The compression utility recordings were processed

using a separate program that ﬁnds the read/write loop

responsible for carrying out a compression or decom-

pression operation

. Table 4 shows the average time

required to complete the speciﬁed operation based on

one hundred executions, as well as the number of sys-

tem calls being processed in the recording. This per-

formance evaluation was run on a laptop using a four

core processor running at 3.4 ghz with 16 gigabytes

of memory. Our PORT compiler comes with a script

to reproduce these results with one command.

The results in Table 4 show that PORT’s process-

ing time increases in line with the total number of sys-

Recordings are pre-processed to remove system calls

related to executable loading and process creation.

ICSOFT 2022 - 17th International Conference on Software Technologies

142

tem calls in the recording. We anticipate that much of

this processing cost is associated with initializing the

Python interpreter and it is likely that PORT’s perfor-

mance is closely tied to disk throughput.

4.4 Threats to Validity

While we conducted this evaluation as rigorously as

possible, there are a few areas where some ambigu-

ity may exist. First, in our work with USB activ-

ity, we limited ourselves to US English keyboards.

Other keyboard languages and designs may require

enhancements to our transformer or programs. Ad-

ditionally, our performance evaluation samples from

only a handful of programs that were selected by pop-

ularity rather than at random. There may be programs

that would diverge from the performance trend we re-

port above. Future work can determine how widely

this phenomena occurs and if any subsequent modiﬁ-

cations are required.

5 RELATED WORK

One of the ultimate goals of developing PORT was to

make it easier for developers to create tools capable

of conducting program-level testing. To design such

a language, we consulted previous work in process-

ing sequences of events, such as system calls, RPC

invocations or web-browser events. Below, we dis-

cuss some of the more signiﬁcant work in these areas.

System Call Stream Processing Applications.

System call based intrusion detection systems fall

into two categories: misuse and anomaly detection.

The former search for known patterns of application-

speciﬁc system call sequences known as intrusion sig-

natures (García-Teodoro et al., 2009), while the lat-

ter assumes that any deviations from “normally ob-

served” system call sequences are malicious (Forrest

et al., 1996).

Forrest et al. (Forrest et al., 1996) proposed an

anomaly detection system that catalogs witnessed pat-

terns within a database. An application’s system call

stream is monitored and any deviation triggers a pre-

deﬁned security policy.

Ko et al. (Ko et al., 1994) proposes converting

each system call in a stream to a standard audit-policy

record format that can be matched against program

policy. However, the audit-policy can only be applied

to one system call at a time, and does not support rules

to recognize speciﬁc chains of system calls. Another

alternative is Systrace (Provos, 2003), which uses an

associated policy language to describe any action pre-

scribed when a rule evaluates to true. Phoebe (Zhang

et al., 2020) identiﬁes patterns of system call failures

during normal program execution to test the reliabil-

ity of an application when a failure occurs. The down-

side is that more elaborate fault-injection tests cannot

be generated from these sequences.

It is likely that the previously cited FSA-based

programs can be improved by applying recent ad-

vances in inference modeling algorithms (Mariani

et al., 2017; Walkinshaw et al., 2013; Emam and

Miller, 2018; Beschastnikh et al., 2014). Yet, these

algorithms lack the conciseness and ﬂexibility found

in PORT. PORT does not require training sets and is

expressive enough to specify both frequent and “nee-

dle in the haystack” event sequences with just a few

lines of code.

Event Stream Processing Languages and Algo-

rithms. PORT can be categorized as a stream pro-

cessing language, which means it is domain-speciﬁc

and designed for expressing streaming applications.

In this section we look at previous work in this area

Pattern matching over event streams is a paradigm

that looks for possible matches against a previously

deﬁned set of rules. Collectively, these matches form

a pattern. Languages written for this purpose are sig-

niﬁcantly richer than those used for regular expres-

sion matching (Agrawal et al., 2008), and typically

provide automatic support for naming, type check-

ing, ﬁltering, aggregating, classifying and annotation

of incoming events. They also provide many bene-

ﬁts over traditional stream-based text processing lan-

guages, such as sed (McMahon, 1979) and awk (Aho

et al., 1979).

Though PORT is a stream processing language,

it does not require all of the features typically in-

cluded in this sort of system (Dayarathna and Perera,

2018). Rather PORT seems to ﬁt within the special

case known as complex event processing (CEP). Data

items in input streams of these systems are referred to

as raw events, while items in output streams are called

composite (or derived) events. A CEP system uses

patterns to inspect sequences of raw events and gen-

erate a composite event for each match (Hirzel et al.,

2013)

MatchRegex (Hirzel, 2012) is a CEP engine for

IBM’s Stream Processing Language. Predicates de-

ﬁned on the individual events appearing in the stream

can be utilized in the regular expression-based pattern

matching engine. MatchRegex supports regular ex-

pression operators, such as “Kleene star” and “Kleene

plus” over patterns consisting of predicates (boolean

expressions).

Needles in a Haystack: Using PORT to Catch Bad Behaviors within Application Recordings

143

Though these CEP systems are capable of recog-

nizing the same stream patterns as PORT, they do not

incorporate the transformation primitives required by

the applications envisioned for PORT. CEP systems

are meant to be used solely to recognize additional

patterns. It is the combination and interplay of pat-

tern matching and transformation primitives that dis-

tinguishes PORT from CEP systems.

6 CONCLUSION

One can gain a lot of value from analyzing an appli-

cation’s activity. Unfortunately, the volume of activ-

ity an application produces makes it difﬁcult to sep-

arate out unimportant sequences. In this work, we

demonstrate how our new domain speciﬁc language,

PORT, offers a way to write concise, yet, expressive

descriptions of application activity sequences. These

descriptions can be compiled into programs that both

recognize the described activity sequence and modify

its contents in order to facilitate more active testing.

We used this capability to recreate the successful pro-

grams from earlier work on the SEA technique and

showed that SEA can be extended to other activity

representations, such as recorded USB trafﬁc.

REFERENCES

Agrawal, J., Diao, Y., Gyllstrom, D., and Immerman, N.

(2008). Efﬁcient pattern matching over event streams.

In Wang, J. T., editor, Proceedings of the ACM SIG-

MOD International Conference on Management of

Data, SIGMOD 2008, Vancouver, BC, Canada, June

10-12, 2008, pages 147–160. ACM.

Aho, A. V., Kernighan, B. W., and Weinberger, P. J.

(1979). Awk-a pattern scanning and processing lan-

guage. Softw. Pract. Exp., 9(4):267–279.

barscanner (2009). Barscanner Stopped Functioning. https:

//bugzilla.kernel.org/show_bug.cgi?id=13411.

Beschastnikh, I., Brun, Y., Ernst, M. D., and Krishna-

murthy, A. (2014). Inferring models of concurrent

systems from logs of their behavior with csight. In

Jalote, P., Briand, L. C., and van der Hoek, A., ed-

itors, 36th ICSE, Hyderabad, India - May 31 - June

07, 2014, pages 468–479. ACM.

Bollig, B., Habermehl, P., Leucker, M., and Monmege, B.

(2014). A robust class of data languages and an appli-

cation to learning. Log. Methods Comput. Sci., 10(4).

Dayarathna, M. and Perera, S. (2018). Recent advance-

ments in event processing. ACM Comput. Surv.,

51(2):33:1–33:36.

Emam, S. S. and Miller, J. (2018). Inferring extended

probabilistic ﬁnite-state automaton models from soft-

ware executions. ACM Trans. Softw. Eng. Methodol.,

27(1):4:1–4:39.

Forrest, S., Hofmeyr, S. A., Somayaji, A., and Longstaff,

T. A. (1996). A sense of self for unix processes. In

1996 IEEE Symposium on Security and Privacy, May

6-8, 1996, Oakland, CA, USA, pages 120–128. IEEE

Computer Society.

García-Teodoro, P., Díaz-Verdejo, J., Maciá-Fernández, G.,

and Vázquez, E. (2009). Anomaly-based network

intrusion detection: Techniques, systems and chal-

lenges. Computers & Security, 28(1):18–28.

Gopstein, D., Iannacone, J., Yan, Y., DeLong, L., Zhuang,

Y., Yeh, M. K.-C., and Cappos, J. (2017). Understand-

ing misunderstandings in source code. In Proceedings

of the 2017 11th Joint Meeting on Foundations of Soft-

ware Engineering, ESEC/FSE 2017, page 129–139,

New York, NY, USA. Association for Computing Ma-

chinery.

Hak5 (2022). Usb rubber ducky. https://docs.hak5.org/

usb-rubber-ducky-1/.

Hirzel, M. (2012). Partition and compose: parallel complex

event processing. In Bry, F., Paschke, A., Eugster,

P. T., Fetzer, C., and Behrend, A., editors, Proceed-

ings of the Sixth ACM International Conference on

Distributed Event-Based Systems, DEBS 2012, Berlin,

Germany, July 16-20, 2012, pages 191–200. ACM.

Hirzel, M., Andrade, H., Gedik, B., Jacques-Silva, G.,

Khandekar, R., Kumar, V., Mendell, M. P., Nasgaard,

H., Schneider, S., Soulé, R., and Wu, K. (2013). IBM

streams processing language: Analyzing big data in

motion. IBM J. Res. Dev., 57(3/4):7.

Kaminski, M. and Francez, N. (1994). Finite-memory au-

tomata. Theor. Comput. Sci., 134(2):329–363.

Ko, C., Fink, G., and Levitt, K. N. (1994). Automated de-

tection of vulnerabilities in privileged programs by ex-

ecution monitoring. In 10th ACSAC 1994, 5-9 Decem-

ber, 1994 Orlando, FL, USA, pages 134–144. IEEE.

Mariani, L., Pezzè, M., and Santoro, M. (2017). Gk-tail+

an efﬁcient approach to learn software models. IEEE

Trans. Software Eng., 43(8):715–738.

McMahon, L. E. (1979). SED: a Non-interactive Text Edi-

tor. Bell Telephone Laboratories.

Moore, P., Cappos, J., Frankl, P. G., and Wies, T. (2019).

Charting a course through uncertain environments:

SEA uses past problems to avoid future failures. In

Wolter, K., Schieferdecker, I., Gallina, B., Cukier, M.,

Natella, R., Ivaki, N. R., and Laranjeiro, N., editors,

30th IEEE International Symposium on Software Re-

liability Engineering, ISSRE 2019, Berlin, Germany,

October 28-31, 2019, pages 1–12. IEEE.

Provos, N. (2003). Improving host security with system call

policies. In Proceedings of the 12th USENIX Secu-

rity Symposium, Washington, D.C., USA, August 4-8,

2003. USENIX Association.

Walkinshaw, N., Taylor, R., and Derrick, J. (2013). Infer-

ring extended ﬁnite state machine models from soft-

ware executions. In Lämmel, R., Oliveto, R., and

Robbes, R., editors, 20th Working Conference on Re-

verse Engineering, WCRE 2013, Koblenz, Germany,

October 14-17, 2013, pages 301–310. IEEE Computer

Society.

ICSOFT 2022 - 17th International Conference on Software Technologies

144

Wireshark (2022). Wireshark.org. https://www.wireshark.

org/.

wrongid (2014). wrong Vendor-Id and Product-Id. https:

//bugzilla.kernel.org/show_bug.cgi?id=87631.

Zhang, L., Morin, B., Baudry, B., and Monperrus, M.

(2020). Realistic error injection for system calls.

CoRR, abs/2006.04444.

Needles in a Haystack: Using PORT to Catch Bad Behaviors within Application Recordings

145