Glassbox: Dynamic Analysis Platform for Malware Android
Applications on Real Devices
Paul Irolla and Eric Filiol
Laboratoire de Cryptologie et Virologie Opérationnelles (CVO Lab),
École d'Ingénieurs du Monde Numérique (ESIEA),
38 Rue des Docteurs Calmette et Guérin, 53000 Laval, France
Keywords:
Dynamic Analysis, Android, Malware Detection, Automatic Testing.
Abstract:
Android is the most widely used smartphone OS with 82.8% market share in 2015 (IDC, 2015). It is therefore
the most widely targeted system by malware authors. Researchers rely on dynamic analysis to extract malware
behaviors and often use emulators to do so. However, using emulators leads to new issues: malware may detect
emulation and, as a result, refrain from executing its payload to prevent analysis. Dealing with virtual device
evasion is a never-ending war and comes with a non-negligible computation cost (Lindorfer et al., 2014). To
overcome this state of affairs, we propose a system that does not use virtual devices for analysing malware
behavior. Glassbox is a functional prototype for the dynamic analysis of malware applications. It executes
applications on real devices in a monitored and controlled environment. It is a fully automated system that
installs, tests and extracts features from the application for further analysis. We present the architecture of
the platform and we compare it with existing Android dynamic analysis platforms. Lastly, we evaluate the
capacity of Glassbox to trigger application behaviors by measuring the average coverage of basic blocks on
the AndroCoverage dataset (AndroCoverage, 2016). We show that it executes on average 13.52% more basic
blocks than the Monkey program.
1 INTRODUCTION
Google reacted to the rise of malware with a dy-
namic analysis platform, named Bouncer (Lock-
heimer, 2012), that analyzes applications before the
release on Google Play. This security model is cen-
tralized and acts before the distribution of applica-
tions. Whereas this system suffers from limitations
like virtual device evasion, it has helped reduce
the spread of malware by 40% (Lockheimer, 2012).
Android antivirus companies use another central-
ized security model which acts after the distribution
of applications. Because applications only have access to
restricted resources and permissions, antivirus pro-
grams cannot perform their analysis on the device, as it often
requires root permissions and extensive resources.
Hence, the static analysis is externalized onto the
company servers. As a result, it can give quick re-
sponses — each application being analyzed just once.
This is a shift of security model for the common user
toward centralization.
Users have been accustomed to a decentralized security
model, i.e. a personal antivirus. This security model
does not allow much room for manoeuvre, because
any antivirus needs to be quick enough not to bother
the user; otherwise another, quicker antivirus
will be chosen. Whereas antivirus programs have implemented
heuristic algorithms, they remain limited by this security
model. Hence, the shift of security model is
an opportunity for building more complex systems
that require more resources to run. It enables secu-
rity systems to use advanced research techniques, like
behavioral detection with dynamic analysis, or detec-
tion based on feature similarity from static analysis.
Malware authors made their strategy evolve with
the rise of Bouncer and other dynamic analysis sys-
tems. They have started to hide the payload execu-
tion behind emulation detection and/or the requirement
of a user interaction. For example, the reverse engineering of the
sample described in (Dharmdasani, 2014) shows that malware
currently use emulation evasion. Emulator settings
can be modified to mimic the appearance of a real de-
vice, but there are many ways of detecting Android
emulation. Indeed, the Morpheus tool (Jing et al.,
2014) suggests that this war is already lost, as its authors
found around 10,000 heuristics to detect Android em-
ulation. Trying to modify the emulator to look real
is therefore probably a waste of time. Under such conditions, we
need to redefine the problem.
This is why we are presenting Glassbox, a dy-
namic analysis platform for Android malware appli-
cations on real devices. Glassbox is an environment
for the controlled execution of applications, where the
Android OS and the network are monitored and have
the capacity to block some actions of the analyzed ap-
plication. This environment is paired with a program
that automates the installation, the testing of applica-
tions and the cleaning of the environment afterwards.
We called this program Smart Monkey, as a reference
to Monkey, the Android tool for generating pseudo-
random UI events. The objective of Glassbox is to col-
lect features for a machine learning algorithm that clas-
sifies applications as malware or benign. In the fol-
lowing sections we present related work on
dynamic analysis systems and expose the archi-
tecture of both Glassbox and Smart Monkey. Finally,
we will present the average coverage of basic blocks
of Smart Monkey on the AndroCoverage Dataset (An-
droCoverage, 2016).
2 RELATED WORK
Dynamic analysis systems have been designed by aca-
demic researchers since 2010 (Bläsing et al., 2010)
to circumvent the limitations of static
analysis, namely code morphism and obfuscation.
Since that time, many systems have been released.
For this study we have built a classification of a selection of
these systems, presented in Table 1 and Table 2. The
classification takes into account three categories: the
dynamic features collected by the analysis, the strate-
gies set up to automate application testing and,
finally, the use of real devices across the history of dynamic
analysis systems. We discuss the results in the following
sub-sections.
2.1 Features Analyzed
Since the rise of Android dynamic analysis systems,
the use of system calls has been the leading ap-
proach. System calls are the functions of the kernel
space available to the user space. They give the capacity
to manipulate files on storage or to control processes.
System calls can describe a program's behavior from
a low-level perspective. The retrieval of those calls
can be achieved mainly in two ways:
Virtual Machine Introspection This is a tech-
nique available for emulators, which enables the
host to monitor the guest. It cannot be detected by
the guest since it is out of its reach and it is there-
fore convenient for security analysis. Andrubis
(Lindorfer et al., 2014), CopperDroid (Tam et al.,
2015) and DroidScope (Yan and Yin, 2012) take
advantage of VMI to retrieve, unseen by the target
malware, all system calls made by the guest
Android virtual machine.
Strace/ptrace Strace is a Linux utility for de-
bugging processes. It can monitor system calls,
signal deliveries and changes of process state.
Strace uses the ptrace system call to monitor an-
other process's memory and registers. This sec-
ond method is by far the simplest and the most
straightforward one, as the only task here is the
automation of the strace execution. Moreover, it
directly targets the system calls of the application
we need to monitor. That is why this method has been
adopted in most of the literature, namely Crow-
droid (Burguera et al., 2011), Maline (Dimjašević
et al., 2016), (Canfora et al., 2015) and (Afonso
et al., 2015). We have also chosen to use the
strace utility for system call monitoring (a minimal
collection sketch is given after this list). Despite
the theoretical possibility for malware to detect
that it is being debugged, we found no evidence
of this.
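As an illustration of this collection step, here is a minimal sketch (ours, not the implementation of any cited tool) that attaches strace to a running application over adb and counts syscall names; it assumes a rooted device where strace and timeout are available in the shell, and the package name is hypothetical.

import subprocess
import re
from collections import Counter

def strace_syscall_names(package, duration=60):
    """Attach strace to a running app over adb and count syscall names.
    Assumes a rooted device with strace and timeout available, and that
    the target application is already running."""
    # Resolve the pid of the target application.
    pid = subprocess.check_output(
        ["adb", "shell", "pidof", package], text=True).strip()
    # Follow threads (-f) and stop after `duration` seconds; strace writes
    # its trace to stderr, which we merge into stdout.
    proc = subprocess.Popen(
        ["adb", "shell", "su", "-c", f"timeout {duration} strace -f -p {pid}"],
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    counts = Counter()
    for line in proc.stdout:
        # A trace line looks like: openat(AT_FDCWD, "/data/...") = 3
        match = re.search(r"([a-z_0-9]+)\(", line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Example usage (hypothetical package name):
# print(strace_syscall_names("com.example.app").most_common(10))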
System calls seem to give good results for classifica-
tion: Maline reported a 96% accuracy rate, and (Can-
fora et al., 2015) reported a 94.9% accuracy rate on un-
seen applications with syscall frequencies only. In-
deed, syscalls capture low-level behaviors of both
Java code and native code.
The second most collected feature is taint track-
ing information, as it reveals data leakage. It works
by instrumenting the Dalvik VM interpreter.
The information we do not want to leak is called a
source. Sources of personal data are tainted, like
the phone number or the contact list. Each time a
tainted source or value is used in a method call, the
DVM interpreter taints the returned value. With this
simple mechanism, we can observe the propagation
of the tainted information regardless of its transfor-
mations. A function that can transmit information
outside of the system, like network requests or SMS,
is called a sink. If a tainted value is used in a
sink, it means the data source has leaked. This enables the de-
tection of data leakage even if the data has been ciphered
or encoded. An application that leaks data is not nec-
essarily a malware, as data leakage is the business of
both malware and the user-tracking frameworks of com-
mercial applications, which constitute
a large part of goodware applications. Whereas this
feature gives useful insights into application behav-
ior for manual analysis, its utility for automatic mal-
ware detection remains to be proven. Moreover, the im-
plementation and execution of taint tracking is costly,
which led us not to choose this feature for now.
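To make the source/sink mechanism concrete, here is a toy model (ours, unrelated to any cited implementation) of taint propagation; the class and function names are illustrative only.

class Tainted:
    """A value carrying the set of source labels it derives from."""
    def __init__(self, value, labels):
        self.value = value
        self.labels = set(labels)          # e.g. {"PHONE_NUMBER"}

def combine(a, b):
    """Any operation on a tainted input taints its result."""
    labels = set()
    for x in (a, b):
        if isinstance(x, Tainted):
            labels |= x.labels
    raw = (a.value if isinstance(a, Tainted) else a) + \
          (b.value if isinstance(b, Tainted) else b)
    return Tainted(raw, labels) if labels else raw

def sink_send_http(data):
    """A sink: report a leak when tainted data is about to leave the system."""
    if isinstance(data, Tainted) and data.labels:
        print("LEAK of", data.labels, "->", data.value)

phone = Tainted("+33600000000", {"PHONE_NUMBER"})   # tainted source
payload = combine("id=", phone)                     # propagation survives transformation
sink_send_http(payload)                             # leak detected at the sink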
Table 1: Comparative state of the art of dynamic analysis systems.
Reference | Tool name | Dynamic features used | App testing strategies | Objectives & comments
Thomas Bläsing et al. 2010 | AASandbox | System calls (name) | Monkey | Data for malware/benign classification # Virtual device
Iker Burguera et al. 2011 | Crowdroid | System calls (name) | Crowdsourced app interactions | Data for malware/benign classification # Real device
Cong Zheng et al. 2012 | SmartDroid | Taint tracking, +? | UI brute force; restriction of execution to targeted activities | Data for classification or manual analysis # Virtual device
Lok Kwong Yan et al. 2012 | DroidScope | System calls (all content); Java calls (all content); taint tracking | - | Data for classification or manual analysis # Virtual device
Vaibhav Rastogi et al. 2013 | AppsPlayground | Taint tracking; targeted Android API Java calls | Monkey; UI brute force; broadcast events; text fields filling | Malware/benign classification # Virtual device
Martina Lindorfer et al. 2014 | Andrubis | App Java calls (all content); system calls (name, +?); shared libraries targeted calls (name, +?); taint tracking; DNS/HTTP/FTP/SMTP/IRC (all content) | Monkey; broadcast events; all possible app services; all possible app activities | Data for classification or manual analysis # Virtual device
Mingyuan Xia et al. 2015 | AppAudit | Taint tracking | | Malware/benign classification; data leaks detector # Symbolic execution
Vitor Monte Afonso et al. 2014 | - | Targeted Android API Java calls (name); system calls (name) | Monkey; broadcast events | Malware/benign classification, 96.66% accuracy # Virtual device
Kimberly Tam et al. 2015 | CopperDroid | System calls (all content); Binder data | Broadcast events; text fields filling, +? | Data for classification or manual analysis # Virtual device
Gerardo Canfora et al. 2015 | - | System calls (name) | Monkey | Malware/benign classification, 94.9% accuracy (unseen applications) # Virtual device
Marko Dimjašević et al. 2016 | Maline | System calls (name) | Monkey; broadcast events | Malware/benign classification, 96% accuracy # Virtual device
Michelle Y. Wong et al. 2016 | IntelliDroid | Taint tracking | Targeted inputs leading to suspicious Android API calls | Data for classification or manual analysis # Virtual device
Gerardo Canfora et al. 2016 | - | Measures of resource consumption (CPU, network, memory, storage I/O) | Monkey | Malware/benign classification, 99.52% accuracy # Virtual device
- | Glassbox | Java calls (name); system calls (name); HTTP/HTTPS requests (all content) | Monkey; UI brute force; broadcast events; real SMS/call; text fields filling | Data for malware/benign classification # Real device
Table 2: Legend.
+? | The paper is not clear enough on these details and we cannot be sure that the list is exhaustive
call (name) | Only the name of the call is used, in order to get its appearance frequency
# | Comment
- | No data
(blank) | Data exists but is irrelevant for this study
Java calls are another feature of interest, as they cap-
ture explicit behaviors of the application. There
are several ways to collect them:
Application Instrumentation This strategy
does not need any modification of the Android
source code and is not dependent on the Android ver-
sion. The application can be modified in or-
der to dump targeted method parameters and re-
turn values. APImonitor (pjlantz, 2012) is a tool
that enables the instrumentation of targeted Java
calls. It reverses the application into smali, a hu-
man-friendly format equivalent to the Java byte-
code, with the baksmali (JesusFreke, 2009) utility.
Then, it adds monitoring routines around the targeted
calls and recompiles the code with the smali (Je-
susFreke, 2009) utility. This strategy is used by
the authors of (Afonso et al., 2015); a minimal
sketch of this workflow is given after this list.
DVM/ART Instrumentation The DVM
(Dalvik Virtual Machine) or ART (Android Run-
Time, since Android version 4.4) is the sys-
tem that interprets and executes all the applica-
tion instructions. All Java calls converge to this
component. Hence, by hooking the execution of
DVM/ART, one can monitor and control all Java
calls, their arguments and their return values. This
implies modifying the Android source code
and compiling it into a custom ROM. This is the
strategy we chose for collecting Java calls.
We prefer this method because it keeps the application
behaviors pristine and, in particular, does not induce
additional bugs. Andrubis (Lindorfer et al., 2014)
and DroidScope (Yan and Yin, 2012) use similar
approaches for tracing method calls.
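As an illustration of the application-instrumentation workflow described above, the sketch below drives baksmali/smali from Python to rebuild a dex file after patching its smali code; the jar names, paths and the patching callback are assumptions, and the resulting APK still needs to be zipaligned and re-signed.

import subprocess
from pathlib import Path

def repackage_with_hooks(dex_file, workdir="smali_out", patch=lambda path: None):
    """Disassemble a classes.dex, let `patch` edit the .smali files (e.g. to
    wrap targeted calls with monitoring routines), then reassemble the dex.
    Tool jar names and locations are assumptions for this sketch."""
    # 1. Disassemble the dex into human-readable smali files.
    subprocess.run(["java", "-jar", "baksmali.jar", "disassemble",
                    dex_file, "-o", workdir], check=True)
    # 2. Patch every smali file; the actual hook insertion is out of scope here.
    for smali_file in Path(workdir).rglob("*.smali"):
        patch(smali_file)
    # 3. Reassemble the patched smali into a new classes.dex.
    subprocess.run(["java", "-jar", "smali.jar", "assemble",
                    workdir, "-o", "classes.dex"], check=True)
    # The new classes.dex must then be put back into the APK,
    # which has to be zipaligned and re-signed before installation.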
The remaining features are highly marginal.
Andrubis reported the retrieval of targeted
shared library calls. Another kind of data is network
communications: only Andrubis reported the utiliza-
tion of features from network communications, but
without any further details. Our system makes use of
Panoptes (Filiol and Irolla, 2015) for gathering plain
text and encrypted web communications. A recent
study (Canfora et al., 2016) shows that measuring
resource consumption is a promising trail for malware
detection. It reports 99.52% accuracy with
global measures of CPU usage, network usage, RAM
usage and storage I/O usage.
2.2 Automated Testing Strategies
Dynamic analysis does not consist of simply launching the
application and waiting for the malware to show off its ma-
licious behaviors. Malware uses logic bombs
to hide the payload. A logic bomb is a malicious
piece of code that is executed only after a condition is trig-
gered. This means we need to test each application as a
real user would. To achieve this objec-
tive, several strategies have been used in the past:
Black Box Testing Strategies This class of
strategies does not take the application source
code into account; it focuses on sending inputs
to the application without any prior information.
This is the most commonly used strategy. Monkey
(https://developer.android.com/studio/test/monkey.html) is
a dedicated tool created by Google for this task.
It generates random events at a fast pace. Events
range from system events (home/wifi/bluetooth/
sound volume, etc.) to navigation events (mo-
tion, click). Because of its capacity to quickly
explore application activities, it has been used
by most dynamic analysis systems (AASandbox
(Bläsing et al., 2010), AppsPlayground (Rastogi
et al., 2013), Andrubis (Lindorfer et al., 2014),
(Afonso et al., 2015), (Canfora et al., 2015 and
2016), Maline (Dimjašević et al., 2016)). Monkey
is sometimes confused in the literature with Monkey Runner
(https://developer.android.com/studio/test/monkeyrunner/index.html),
a Python library for writing Android test routines.
White Box Testing Strategies This class of
strategies takes the application source code into
account. It focuses on sending specific inputs in
the application for triggering targeted code paths.
It requires the information from the static analy-
sis of the application. Parsing the code is needed,
to find the target methods and all their trigger-
ing conditions. SmartDroid (Zheng et al., 2012)
and IntelliDroid (Wong and Lie, 2016) determine
all paths to sensitive API calls, then execute one
of the paths to the target with dynamic analy-
sis. Another kind of White Box testing strategy
is symbolic execution where dynamic analysis is
done by simulating the execution of the applica-
tion static code. AppAudit (Xia et al., 2015) uses
this technique for finding data leaks with symbolic
taint tracking.
Grey Box Testing Strategies This class of
strategies partially takes the application's source
code into account. It focuses on testing all visible
inputs the application declares or displays (UI).
It usually takes the output of the application to
generate the next inputs. Andrubis uses a Grey
Box strategy when it tests all possible application
services and activities, because it gets the infor-
mation from the application manifest. AppsPlay-
ground also uses Grey Box testing with its
Intelligent Execution, where windows, widgets and ob-
jects are uniquely identified to know when an ob-
ject has already been explored. Grey Box testing
has the advantage of using tests that the appli-
cation is likely to respond to, contrary to Black
Box testing, without requiring the process-
ing of a Control Flow Graph, as White Box testing does.
It is thus the most efficient way of testing applica-
tions. This is why we use this strategy in Glassbox.
2.3 Real Devices
The use of real devices for dynamic analysis started
with Crowdroid (Burguera et al., 2011), a crowd-
sourcing-based analysis. Whereas this approach gives
good results, one cannot ask users to execute real mal-
ware on their personal devices. So this system can only
be an option for finding malware in the wild, once a
trained machine learning algorithm is already available.
BareDroid (Mutti et al., 2015) is a system which manages real devices
at large scale for dynamic analysis. Whereas Bare-
Droid cannot be considered a
dynamic analysis system, because it does not anal-
yse applications itself, it brought two major results for our
study. First, real devices are a scalable solution for
dynamic analysis systems, financially and in execu-
tion time, compared to virtual devices. Second, using
real devices drastically improves the features detected for
malware families that often rely on emulator evasion,
like Android.HeHe, Android Pincer, and OBAD.
3 ARCHITECTURE OVERVIEW
Glassbox (Figure 1) is a modular system distributed
among one or several phones and a computer. Each
part is detailed in the following sections.
3.1 Android Instrumentation
A custom Android OS has been built, based on the
Android Open Source Project (AOSP, https://source.android.com/). The objec-
tive here is to dynamically log each Java call of a tar-
geted application. This involves hooking these calls
at a point through which all of them pass. We instru-
mented ART (Android RunTime, https://source.android.com/devices/tech/dalvik/index.html), the Android man-
aged runtime system that executes application instruc-
tions. With the default parameters, we found that ART
(at least until Android Marshmallow) has the follow-
ing behaviors that are important for our study:
The first time Android is launched, Java Android
API libraries and applications are optimized and
compiled to a native code format called OAT (Sa-
banal, 2015).
Each time a new application is installed, it is opti-
mized and compiled to OAT format.
Java methods can be executed in three ways: by
an OAT JUMP instruction to the method address,
by the ART interpreter for non-compiled methods
(debugging purposes mostly), or via the Binder
for invoking a method from another process or
with Java Reflection. Details on the Android
Binder can be found in (Schreiber, 2011) chapter
4.
A straightforward way of hooking Java calls is to in-
strument the ART interpreter. Unfortunately, only a
few calls are executed through it, because most of the
code is compiled into OAT and is therefore not in-
terpreted. We forced all calls to be interpreted by
disabling several optimizations. The first one is the
compilation to OAT itself. Disabling it causes calls
to be interpreted instead of executed natively. But other opti-
mization mechanisms come into play, namely direct
branching and inlining.
The boot classpath contains the Android frame-
work (Figure 2) and the core libraries. They are always
compiled to OAT, resulting in a boot.oat file. This file
is mapped into memory by the Zygote process, started
at the initialisation of Android. To launch an ap-
plication, its main activity is given to Zygote as a pa-
rameter. When Zygote is called that way, it forks and
starts the given activity. It means that every application
has access to the same instance of the Android frame-
work and core libraries. Direct branching is an op-
timisation that replaces framework/core method calls
by their actual addresses in memory, so these calls do
not pass through the interpreter. We disable that
optimisation.
Inlining is an optimisation that replaces calls to short
and frequently used methods with their actual code.
Although it slightly increases the application size in
memory, runtime performance is improved. As there
is no method call any more, it cannot be hooked in the
ART interpreter. We disable that optimisation as well.
A monitoring routine is added to the ART interpreter
that logs any method call from a targeted application
pid. A sample of a Java call capture is shown in the
annexes. All these modifications slow down the global
execution of Android. Whereas this is not noticeable for
most applications, gaming applications are
visibly slowed down by this approach.
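As an illustration, the following minimal sketch (an assumption about how our log output could be consumed, not the exact implementation) collects the Java call lines emitted by the modified interpreter through adb logcat; the log tag is hypothetical.

import subprocess

def collect_java_calls(tag="GlassboxART", limit=10000):
    """Stream adb logcat and keep the Java-call lines emitted by the
    instrumented ART interpreter. The monitoring routine described above
    already restricts logging to the targeted application pid."""
    proc = subprocess.Popen(
        ["adb", "logcat", "-v", "brief", f"{tag}:I", "*:S"],
        stdout=subprocess.PIPE, text=True)
    calls = []
    for line in proc.stdout:
        # Each line carries one method signature, e.g.
        # "I/GlassboxART( 1234): java.lang.String java.lang.StringBuilder.toString"
        calls.append(line.split(":", 1)[-1].strip())
        if len(calls) >= limit:
            break
    proc.terminate()
    return calls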
Lastly, the phone ships with a real SIM card,
to lure malware payloads triggered by SMS/MMS/calls.
Figure 1: Glassbox — Architecture overview. (The figure shows Smart Monkey driving the automated testing of applications on the phone running the instrumented Android and collecting the logged app features (syscalls, Java calls); Panoptes acting as a wifi access point that intercepts HTTP/HTTPS requests and responses; ClamAV checking whether a request carries known malware and WoT scoring the trustworthiness of the contacted urls; requests judged acceptable reaching the Internet while the others are answered by the Inetsim network services simulation; network features and WoT reputation scores being collected as outputs.)
Many malware samples may use it to steal money with
premium numbers, and because we use a real SIM
card this would actually cost us money. We modified
the telephony framework of the Android API to re-
ject all outgoing communications except those to our own
phone number. When a forbidden call is made, the
calling UI pops up and closes after about one second.
This way, applications that rely on calls and SMS
do not crash.
3.2 Network Control & Monitoring
All communications of the instrumented phone pass
through a transparent SSL/TLS interception proxy be-
hind a wifi access point. This is set up by Panoptes (Fil-
iol and Irolla, 2015). To understand how it works, we
need to describe a part of the TLS handshake. Here is
the regular behavior of an https request on Android:
Android has a keystore of all the root certificates the
system trusts. When an SSL/TLS request is initiated,
the requested server sends its certificate. It contains
identifying information, such as the domain name
(which must match the contacted domain name), and a sig-
nature that can only be verified with the right root
CA. The server certificate is tested against each trusted
root CA and, if one matches, the communication is ac-
cepted. Extended information on the TLS handshake
can be found in the RFC 2246 memo (Dierks and
Allen, 1999).
Figure 2: Android architecture overview.
For our interception system to work, an SSL/TLS
root certificate from a custom certification authority
(CA) is implanted in the keystore of Android. When
the device requests an https web page, the request goes
through the proxy. It is parsed and a new one is ini-
tialised to be sent to the original recipient. The re-
sponse is encapsulated in a new SSL/TLS response
signed by our custom certificate. This custom certifi-
cate is dynamically generated with the recipient's iden-
tifying information and our custom root CA private
key. As the communication is signed by a certifica-
tion authority known by the client, An-
droid accepts it without any warning. Finally, all
HTTP/HTTPS communications are logged and a re-
port can be generated, which is convenient for manual
analysis if needed.
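The dynamic certificate generation can be sketched with the openssl command line; the file names and the subject field below are assumptions, and Panoptes' actual implementation may differ.

import subprocess

def forge_leaf_certificate(domain, ca_cert="rootCA.crt", ca_key="rootCA.key"):
    """Generate a leaf certificate for `domain`, signed by our custom root CA.
    The proxy then presents this certificate to the client instead of the
    real server certificate (a sketch of the principle only)."""
    key, csr, crt = f"{domain}.key", f"{domain}.csr", f"{domain}.crt"
    # Fresh private key and certificate signing request for the contacted domain.
    subprocess.run(["openssl", "req", "-new", "-newkey", "rsa:2048", "-nodes",
                    "-keyout", key, "-subj", f"/CN={domain}", "-out", csr],
                   check=True)
    # Sign the request with the root CA implanted in the Android keystore.
    subprocess.run(["openssl", "x509", "-req", "-in", csr, "-CA", ca_cert,
                    "-CAkey", ca_key, "-CAcreateserial", "-days", "365",
                    "-out", crt], check=True)
    return crt, key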
This system has been extended to support the manipu-
lation of requests. The objective is to restrict the pro-
liferation of malware and the damage it may pro-
duce. As Glassbox runs malware, it may have a neg-
ative impact on its environment. An extreme measure
would be to disconnect the system from the internet, but
we would then see little or no malicious behavior at all for
numerous applications. Our design is a trade-off be-
tween safety and behavior detection:
ClamAV (Kojm, 2004) is used to detect known
malware sent through the network. If malware is
detected, the payload is removed from the request,
which is then redirected to Inetsim (Hungenberg and Eck-
ert, 2013), a network services simulation server
that replies consistently to requests. This forbids
communication between the application and
the internet without crashing the application.
For all other requests, we assess the reputation of
the domain name or IP address with the Web of
Trust (WoT) API (https://www.mywot.com/wiki/API). WoT is a browser extension
that filters urls based on reputation rat-
ings. These ratings come mainly from users. If
the request contacts a known address with a good
reputation, we forbid the application under test
from reaching it and the request is redirected to Inetsim. The
advantages are twofold. The application cannot
damage a respectable website, and it pre-filters
behaviors for classification.
Finally, features are collected from the content of the
communications, along with the WoT reputation scores. A
sample of a network capture is shown in the annexes.
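The filtering policy described above can be summarised by the following sketch; scan_with_clamav and wot_reputation are hypothetical helpers standing in for the ClamAV scan and the WoT API query, and the threshold is an assumption.

GOOD_REPUTATION = 60   # hypothetical threshold on the WoT reputation score

def route_request(request, scan_with_clamav, wot_reputation):
    """Decide whether a proxied request reaches the real Internet or Inetsim.
    - Known-malicious payloads are stripped and answered by Inetsim.
    - Destinations with a good reputation are protected: the request is also
      redirected to Inetsim so the app cannot harm a respectable website.
    - Everything else is forwarded, and the score is kept as a feature."""
    if scan_with_clamav(request.body):
        request.body = b""                 # remove the known-malware payload
        return "inetsim", None
    score = wot_reputation(request.host)
    if score is not None and score >= GOOD_REPUTATION:
        return "inetsim", score
    return "internet", score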
3.3 Automated Application Testing
Smart Monkey is an automated testing program based
on Grey Box strategies. The context of the applica-
tion is determined at runtime for the automatic explo-
ration. We use UIAutomator (https://stuff.mit.edu/afs/sipb/project/android/docs/tools/help/uiautomator/index.html), a tool that can dump
the hierarchy tree of the UI elements currently present
on screen. It enables us to monitor the attributes of each UI
element at runtime. For a smart exploration, we need
to know whether we have already processed an element. Un-
fortunately, elements do not carry a unique iden-
tifier. Nonetheless, we found that elements can be
identified to some degree:
Strong Identification Elements can have an
associated ID string set by developers. Concate-
nated with the current activity name, this gives robust
identification, but for most elements this
field is empty. Similarly, a content
description is sometimes associated with the ele-
ment; we can also use it for strong identifica-
tion.
Partial Identification If we do not have ac-
cess to the previous values, which happens most
of the time, we can use less discriminative val-
ues. Textfields can be set with an initial value, or a
printed text can be associated with them. With no better
option available, we use the element dimensions
to identify it. Obviously, when an element is only par-
tially identified, there is a risk of false positives
(a sketch of this identification scheme is given after this list).
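A minimal sketch of this identification scheme, parsing a UIAutomator dump pulled over adb; the fingerprint fields follow the strong/partial identification above, and the file paths are assumptions.

import subprocess
import xml.etree.ElementTree as ET

def dump_ui_elements(activity):
    """Dump the current UI hierarchy with UIAutomator and fingerprint each element.
    Strong identification uses resource-id or content-desc concatenated with the
    activity name; otherwise we fall back on text and bounds (partial, may collide)."""
    subprocess.run(["adb", "shell", "uiautomator", "dump",
                    "/sdcard/window_dump.xml"], check=True)
    subprocess.run(["adb", "pull", "/sdcard/window_dump.xml", "."], check=True)
    elements = []
    for node in ET.parse("window_dump.xml").getroot().iter("node"):
        strong = node.get("resource-id") or node.get("content-desc")
        if strong:
            fingerprint = f"{activity}:{strong}"
        else:
            # Partial identification: weaker, false positives are possible.
            fingerprint = f"{activity}:{node.get('text')}:{node.get('bounds')}"
        elements.append((fingerprint, node.get("class"), node.get("clickable")))
    return elements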
Moreover, each element carries a list of actions it can
trigger. Our automatic exploration consists of system-
atically triggering all actions of all elements of all ac-
tivities. We do not try every combination of actions, as
it would not scale and would be mostly redundant. To this
basic general process, we add targeted actions to trig-
ger more sophisticated behaviors:
Textfields of interest are detected, such as phone
number, first or last name, email address, IBAN,
country, city, street address, password or PIN
code. These textfields are filled with consis-
tent values accordingly. For this task, we use
databases of realistic data (samples can be found
in the annexes). Uncategorised textfields are filled
with a pseudo-random string.
The order in which actions are done matters. For example,
login and password textfields must be filled before
validating. In the exploration, filling textfields and
check-boxes takes precedence over the rest.
An application can register a receiver for an Android
broadcast event, like a change of phone state
or wifi state. This can be done statically in the ap-
plication manifest, or dynamically. Dynamically regis-
tered receivers can be hidden from static analysis
with obfuscation. To trigger the receiver code,
we test applications with a list of broadcast events
that are often used by malware (a partial list is
given in the annexes, and a triggering sketch follows this
list). Moreover, real SMS and phone
calls are sent to the real device's own number.
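The broadcast triggering step can be done with the activity manager over adb, as in the hedged sketch below; the events file corresponds to the annex sample.

import subprocess

def fire_broadcast_events(events_file="broadcast-events.txt"):
    """Send each broadcast action from the list to the device under test.
    Note that recent Android versions restrict implicit broadcasts; on the
    devices we target, sending the plain action is enough."""
    with open(events_file) as handle:
        for action in (line.strip() for line in handle if line.strip()):
            subprocess.run(["adb", "shell", "am", "broadcast", "-a", action],
                           check=False)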
We finally use the Monkey program during the analy-
sis. It can help to trigger behaviors requiring complex
input combinations that Smart Monkey could miss. At
the end comes the cleaning phase. For our real device,
we keep a whitelist of legitimate processes and installed
applications (regular, system and device admin-
istrator applications). Non-authorised processes are
killed and non-whitelisted applications are uninstalled.
Important phone settings like wifi, data network and
sound are reset to predefined values.
4 EXPERIMENTATION
4.1 Performance Measure
We use the average coverage of basic blocks to quan-
tify the performance of application code cov-
erage. It is a measure of the performance of the Smart
Monkey component of Glassbox. Here are the definitions
of the vocabulary used in the experimentation:
A basic block is an uninterrupted section of in-
structions. A basic block begins at the start of the
program or at the target of a control transfer in-
struction (JUMP/CALL/RETURN). It ends at the
next control transfer instruction.
The basic block coverage for an application is the
number of unique basic blocks executed at run-
time divided by the number of unique basic blocks
present in the source code.
The average coverage of basic blocks is the sum
of the basic block coverage of all applications di-
vided by the number of applications.
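In symbols (our notation), with $B_i$ the set of unique basic blocks of application $i$ executed at runtime, $S_i$ the set of unique basic blocks present in its code, and $n$ the number of applications:

\[ cov_i = \frac{|B_i|}{|S_i|}, \qquad \overline{cov} = \frac{1}{n} \sum_{i=1}^{n} cov_i \]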
4.2 Dataset
We use the AndroCoverage Dataset (AndroCoverage,
2016) for our experimentation. It contains 100 appli-
cations from F-Droid (https://f-droid.org/), which is a repository of free
and open source software (FOSS) applications. We have manu-
ally selected them with the following criteria for each
application:
It does not depend on a third party library or ap-
plication as an automatic tool would be unable to
install it.
It does not require root privileges. To remain com-
patible with the configuration of as many testing tools
as possible, we stick with regular privileges.
It does not depend on local or temporary remote
data. We want the application to be usable world-
wide and in the long-term. This category excludes
applications for a temporary event or a specific
country.
Our goal is to use applications which show a large
variety of different and steady behaviors. This is why
we predict that performance on the AndroCoverage
Dataset will be overestimated compared to the aver-
age of real applications. This dataset is to be used to
compare the performance of different automated test-
ing tools on the same ground.
The AndroCoverage Dataset is supplied with tools
which instrument the applications, adding monitoring
routines for code coverage. These tools are partially
based on BBoxTester (Zhauniarovich et al., 2015),
a tool for measuring the code coverage of Black Box
testing of Android applications.
4.3 Methodology
The research community has used different strategies for au-
tomated application testing, with different evaluation
methods and different datasets. To promote the suc-
cessful strategies for future research in the domain,
we need a standard for experimentation. Other-
wise, we cannot compare the results objectively. The
study titled Automated Test Input Generation for
Android: Are We There Yet? (Choudhary et al., 2015)
re-evaluates on the same ground 5 pub-
lished automated testing tools for Android. The ex-
perimental results are far from what had been
claimed in the published papers. Moreover, accord-
ing to this study, the Monkey program has the best
performance of all, at around 53% average statement cover-
age on 68 selected applications. Either
all research on Android automated testing is no
better than a random event generator, or the evaluation
methodology lacks pertinence. Our opinion is that a
better methodology can highlight the contribution of
the main studies in the field.
To summarize, these observations reveal several
problems with the experimental results:
(1) They are currently not reproducible.
(2) They cannot be compared to each other.
(3) They do not highlight the contribution of
the evaluated testing method compared to the Monkey
program.
To answer these problems, we propose the follow-
ing rules (the number before each rule refers to the
problem it solves):
(2) A common performance measure. We pro-
pose the average coverage of basic blocks. State-
ment coverage (also called line coverage) is con-
sidered the weakest code coverage measure by
specialists in software testing. This metric should
not be used when another one is available. For an
argued reflection about coverage metrics, we refer
to the paper What is Wrong with Statement Cov-
erage (Cornett, 1999).
(1)(2) A common dataset and common tools for
instrumenting the applications. We propose the
AndroCoverage Dataset (AndroCoverage, 2016).
(1)(2) A common configuration: Monkey ar-
guments, a fixed seed for every random number
generator used, and application versions. This
information is either present in the annexes of this
document or on the AndroCoverage GitHub web
page.
(3) To assess the performance of the combination
of both Monkey and the evaluated testing method
(in our case, Smart Monkey). Evaluated sep-
arately, the performances of Monkey and of the eval-
uated method do not highlight the new code paths
that have been triggered by the evaluated method.
A complex method might not seem successful
even though it triggered complex condi-
tions that Monkey could never find. Moreover, the
Monkey program is embedded in every Android de-
vice (real and virtual), it interacts at a very fast
pace with the application and produces good re-
sults. Hence, in an operational situation, it makes
sense to use it in addition to any research tool.
4.4 Results
Smart Monkey usually runs Monkey at the beginning
of the analysis. For a fair trial, we tested its code
coverage performance with and without Mon-
key. The configuration of the Monkey tool is
described in the annexes. The results are presented in
Table 3.
The Monkey program tends to generate bugs with
the instrumentation process. For a significant propor-
tion of applications (16%) we are unable to get the cov-
erage rate. We note that the same applications crash
with Monkey and with Smart Monkey, so the crash rate
has no effect on the performance comparison between
the two programs.
The raw results do not give enough insight into
the contribution of Smart Monkey. We are interested
in the new paths that have been triggered compared
to the Monkey program. Therefore, we calculate the
increase in average coverage of basic blocks of Smart
Monkey compared to Monkey:
\[ \frac{smartmonkey_{ac}}{monkey_{ac}} = 1.1352 \]
where $smartmonkey_{ac}$ is the average coverage of basic blocks of Smart Monkey and $monkey_{ac}$ is the average coverage of basic blocks of Monkey.
Table 3: Code coverage results.
Method | Classes average coverage | Methods average coverage | Blocks average coverage | Crash rate
Monkey | 32.93% | 35.05% | 36.32% | 16%
Smart Monkey (w/o Monkey) | 34.84% | 36.68% | 37.73% | 0%
Smart Monkey (with Monkey) | 37.12% | 41.6% | 41.23% | 16%
This means that the testing strategy we have set up in
Smart Monkey leads to an average increase of 13.52% in
basic block coverage.
5 LIMITATIONS
Dynamic analysis systems that allow internet com-
munications are vulnerable to fingerprinting. Our
platform is not an exception. For example, Bouncer
(Lockheimer, 2012) has been the target of remote
shell attacks (Percoco and Nicholas, 2012) that en-
abled the fingerprinting of the system. The malware
gets some information on the system and sends it to
a command and control server. Hence, the malware
author can reshape the trigger conditions of the logic
bomb. We accept this risk for now. A solution
halfway between shutting down all communications
and no filtering at all could be to strip all outgoing
information: POST request contents, GET url vari-
ables, cookies and metadata fields. This could lead to a
loss of behaviors, and the negative impact of such a so-
lution needs to be measured (a sketch of the GET-variable
stripping step is given below). In any case, a smart mal-
ware author will eventually find a way to leak remote
shell outputs.
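Such a stripping step could look like the following sketch, which removes the GET variables of an outgoing URL; whether this is sufficient, and how much behavior is lost, is precisely what would need to be measured.

from urllib.parse import urlsplit, urlunsplit

def strip_get_variables(url):
    """Drop the query string and fragment of an outgoing request URL, keeping
    only scheme, host and path (POST contents, cookies and metadata fields
    would be handled separately by the proxy)."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

# strip_get_variables("http://example.com/GetInfo.ashx?appid=7ffc&uuid=ffffffff")
# returns "http://example.com/GetInfo.ashx"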
The network monitoring has limitations. First, it
cannot currently handle all protocols: POP, IMAP
and FTP, for instance, are simply blocked. Indeed,
the communications are parsed to get their con-
tent, destination and metadata, so the parsing
needs to be adapted to each protocol. Adding all
protocols one by one is an impossible task, so
we would need to measure protocol usage and im-
plement the most used ones. Lastly, there is a coun-
termeasure to our SSL/TLS interception, namely cer-
tificate pinning. The requirement of the interception
is the implantation of a custom root certificate in the
Android keystore of trusted certificates. An applica-
tion can choose to discard the Android keystore and
to embed its own. Therefore, when a communication
encrypted with our custom certificate is checked, it
is rejected. This technique is used in
many banking applications (Filiol and Irolla, 2015).
Indeed, the point of view of the bank is that the user OS
cannot be trusted. Although we have no evidence that
it happens for malware, it may be used by avant-
garde malware, and others would follow the trail. It
is inconvenient for malware authors to buy a certifi-
cate signed by a certification authority, as a pay-
ment trace could identify them. Despite that, it is
possible to get a valid certificate from Let's Encrypt
(https://letsencrypt.org/), or to take control of a legitimate server via hacking and
use it as a relay for the C&C server. In these cases,
certificate pinning could be used for hiding commu-
nications from analysts or interception systems. A
counter to this technique is instrumentation. By mon-
itoring arguments of the SSL/TLS encryption method,
one can get the plaintext communications. We have
done it manually for some banking applications (Fil-
iol and Irolla, 2015) with APImonitor (pjlantz, 2012),
but doing it automatically is another issue. Appli-
cations that use certificate pinning generally embed
their own library for SSL/TLS encryption, so detect-
ing dynamically which call is the SSL/TLS encryption
method can be challenging.
Last, the cleaning phase of Glassbox fits the secu-
rity needed for a prototype. However, to move to an
operational situation with malware that could execute
0-day root exploits, we need a real factory-reset of
the phone. This is why we plan to integrate the open
source project BareDroid as a part of Smart Monkey,
for its factory-reset capability on real devices.
6 CONCLUSION & FUTURE
RESEARCH
This paper contributes to the domain of dynamic anal-
ysis systems for Android in three ways. First, we pre-
sented Glassbox, a functional prototype of a platform
that uses real devices, controls network and GSM
communications to some extent, and monitors Java
calls, system calls and network communication con-
tent. Second, we experimented with Smart Monkey, an
automatic testing tool with a Grey Box testing strat-
egy. We showed that it enhances application code
coverage compared to the common Black Box test-
ing tool called Monkey. Last, we presented to the research
community a method for the evaluation of automated
testing tools. This method covers the problems of re-
producibility, of comparison with other works and of
measuring the contribution of the tool. We made
the dataset available on Github under the name An-
droCoverage.
The next step is to run Glassbox on malware and benign
applications and to feed the extracted features to a ma-
chine learning algorithm. We are working on the clas-
sification of these data with a neural network.
REFERENCES
Afonso, V. M., de Amorim, M. F., Grégio, A. R. A., Jun-
quera, G. B., and de Geus, P. L. (2015). Identify-
ing android malware using dynamically obtained fea-
tures. Journal of Computer Virology and Hacking
Techniques, 11(1):9–17.
AndroCoverage (2016). Androcoverage dataset. [Online]
https://github.com/androcoverage/androcoverage.
Bläsing, T., Batyuk, L., Schmidt, A. D., Camtepe, S. A.,
and Albayrak, S. (2010). An android application sand-
box system for suspicious software detection. In Ma-
licious and Unwanted Software (MALWARE), 5th In-
ternational Conference on, pages 55–62.
Burguera, I., Zurutuza, U., and Nadjm-Tehrani, S. (2011).
Crowdroid: behavior-based malware detection system
for android. In Proceedings of the 1st ACM workshop
on Security and privacy in smartphones and mobile
devices, pages 15–26. ACM.
Canfora, G., Medvet, E., Mercaldo, F., and Visaggio, C. A.
(2015). Detecting android malware using sequences
of system calls. In Proceedings of the 3rd Interna-
tional Workshop on Software Development Lifecycle
for Mobile, pages 13–20. ACM.
Canfora, G., Medvet, E., Mercaldo, F., and Visaggio, C. A.
(2016). Acquiring and analyzing app metrics for ef-
fective mobile malware detection. In Proceedings of
the 2016 ACM on International Workshop on Security
And Privacy Analytics, pages 50–57. ACM.
Choudhary, S. R., Gorla, A., and Orso, A. (2015). Auto-
mated test input generation for android: Are we there
yet?(e). In Automated Software Engineering (ASE),
2015 30th IEEE/ACM International Conference on,
pages 429–440. IEEE.
Cornett, S. (1999). What is wrong
with statement coverage. [Online]
http://www.bullseye.com/statementCoverage.html.
Dharmdasani, H. (2014). Android.hehe: Mal-
ware now disconnects phone calls. [On-
line] https://www.fireeye.com/blog/threat-
research/2014/01/android-hehe-malware-now-
disconnects-phone-calls.html.
Dierks, T. and Allen, C. (1999). The tls protocol version
1.0. [Online] http://www.ietf.org/rfc/rfc2246.txt.
Dimjašević, M., Atzeni, S., Ugrina, I., and Rakamarić,
Z. (2016). Evaluation of android malware detection
based on system calls. In Proceedings of the 2016
ACM on International Workshop on Security And Pri-
vacy Analytics, pages 1–8. ACM.
Filiol, E. and Irolla, P. (2015). (In)security of mobile
banking... and of other mobile apps. Black Hat Asia 2015.
Hungenberg, T. and Eckert, M. (2013). Inetsim: Internet
services simulation suite.
IDC (2015). Smartphone os market share, 2015 q2.
[Online] http://www.idc.com/prodserv/smartphone-
os-market-share.jsp.
JesusFreke (2009). Github - smali readme. [Online]
https://github.com/JesusFreke/smali.
Jing, Y., Zhao, Z., Ahn, G.-J., and Hu, H. (2014). Mor-
pheus: automatically generating heuristics to detect
android emulators. In Proceedings of the 30th Annual
Computer Security Applications Conference, pages
216–225. ACM.
Kojm, T. (2004). Clamav.
Lindorfer, M., Neugschwandtner, M., Weichselbaum, L.,
Fratantonio, Y., v. d. Veen, V., and Platzer, C. (2014).
Andrubis 1,000,000 apps later: A view on cur-
rent android malware behaviors. In 2014 Third In-
ternational Workshop on Building Analysis Datasets
and Gathering Experience Returns for Security (BAD-
GERS), pages 3–17.
Lockheimer, H. (2012). Android and security. [Online]
http://googlemobile.blogspot.fr/2012/02/android-
and-security.html.
Mutti, S., Fratantonio, Y., Bianchi, A., Invernizzi, L., Cor-
betta, J., Kirat, D., Kruegel, C., and Vigna, G. (2015).
Baredroid: Large-scale analysis of android apps on
real devices. In Proceedings of the 31st Annual Com-
puter Security Applications Conference, pages 71–80.
ACM.
Percoco and Nicholas, J. (2012). Adventures in bouncer-
land.
pjlantz (2012). Droidbox - apimonitor.wiki. [Online]
https://code.google.com/archive/p/droidbox/wikis/
APIMonitor.wiki.
Rastogi, V., Chen, Y., and Enck, W. (2013). Appsplay-
ground: automatic security analysis of smartphone ap-
plications. In Proceedings of the third ACM confer-
ence on Data and application security and privacy,
pages 209–220. ACM.
Sabanal, P. (2015). Hiding behind art.
Schreiber, T. (2011). Android binder - android interprocess
communication.
Tam, K., Khan, S. J., Fattori, A., and Cavallaro, L. (2015).
Copperdroid: Automatic reconstruction of android
malware behaviors. In NDSS.
Wong, M. Y. and Lie, D. (2016). Intellidroid: A targeted
input generator for the dynamic analysis of android
malware.
Xia, M., Gong, L., Lyu, Y., Qi, Z., and Liu, X. (2015). Ef-
fective real-time android application auditing. In Se-
curity and Privacy (SP), 2015 IEEE Symposium on,
pages 899–914. IEEE.
Yan, L. K. and Yin, H. (2012). Droidscope: seamlessly re-
constructing the os and dalvik semantic views for dy-
namic android malware analysis. In Presented as part
of the 21st USENIX Security Symposium (USENIX Se-
curity 12), pages 569–584.
Zhauniarovich, Y., Philippov, A., Gadyatskaya, O., Crispo,
B., and Massacci, F. (2015). Towards black box test-
ing of android apps. In 2015 Tenth International
Conference on Availability, Reliability and Security
(ARES), pages 501–510.
Zheng, C., Zhu, S., Dai, S., Gu, G., Gong, X., Han, X., and
Zou, W. (2012). Smartdroid: an automatic system for
revealing ui-based trigger conditions in android appli-
cations. In Proceedings of the second ACM workshop
on Security and privacy in smartphones and mobile
devices, pages 93–104. ACM.
ANNEXES
Data Samples Used in Smart Monkey
$> head random-iban.txt
AL94283405797977629281563659
AL60726122350056756457999447
AL23793884960503665784521815
AL91081264763546250859672884
$> head broadcast-events.txt
android.intent.action.BOOT_COMPLETED
android.intent.action.BATTERY_CHANGED
android.net.conn.CONNECTIVITY_CHANGE
android.intent.action.USER_PRESENT
Sample of a Java Calls Capture
$> logcat
[...]
void java.lang.StringBuilder.<init>
java.lang.StringBuilder java.lang.StringBuilder.append
java.lang.StringBuilder java.lang.StringBuilder.append
java.lang.String java.lang.StringBuilder.toString
void com.energysource.szj.android.Log.i
android.os.Looper android.os.Looper.getMainLooper
[...]
Sample of a Network Capture
[...]
<header>
<method>R0VU</method>
<scheme>aHR0cA==</scheme>
<host>MTE1LjE4Mi4zMC42OA==</host>
<port>ODA=</port>
<path>L0dldEluZm8uYXNoeD9hcHBpZD03ZmZjN2JlOTJmM2M0YTdmYTA4MzUxZTNkNTNmOThkYSZhcHB2ZXI9Mjc2JnY9MS4wLjQmY2xpZW50PTImcG49Y29tLmdwLnNlYXJjaCZ1c2VydmVyPTIuMCZhZHR5cGU9MiZjb3VudHJ5PWZyJm50PTImbW5vPTIwODE1JnV1aWQ9ZmZmZmZmZmYtZWIwOS05NDcwLTUzY2UtYmMxYjAwMDAwMDAwJm9zPTYuMC4xJmRuPUFPU1Arb24rSGFtbWVySGVhZCZzaXplPTEwODAqMTc3NiZjYz00JmNtPTM4LjQwJnJhbT0xODk5NTA4a2I=</path>
<http_version>SFRUUC8xLjE=</http_version>
<host>Y2ZnLmFkc21vZ28uY29t</host>
<Connection>S2VlcC1BbGl2ZQ==</Connection>
<User-Agent>QXBhY2hlLUh0dHBDbGllbnQvVU5BVkFJTEFCTEUgKGphdmEgMS40KQ==</User-Agent>
</header>
<content/>
[...]
NB: all field values are encoded in base64.
Monkey Configuration
monkey -s 0 --pct-syskeys 0 --pct-appswitch 0 --throttle 50 -p <package-name> -v 500

-s 0: The seed of the random number generator is fixed to 0.
--pct-syskeys 0: No system key events are sent, such as Home, Back, Start Call, End Call, or Volume inputs.
--pct-appswitch 0: No startActivity() calls are issued, as calling the instrumentation activity another time breaks it.
--throttle 50: The delay between events is fixed to 50 milliseconds.
500: A total of 500 events are sent.