Instance-based Anomaly Method for Android Malware Detection
Borja Sanz, Igor Santos, Xabier Ugarte-Pedrero, Carlos Laorden, Javier Nieves and Pablo G. Bringas
S3Lab, University of Deusto, Avenida de las Universidades 24, Bilbao, Spain
Keywords:
Security, Malware, Android.
Abstract:
The usage of mobile phones has increased in our lives because they offer nearly the same functionality as
a personal computer. Besides, the number of applications available for Android-based mobile devices has
increased. Android application distribution is based on a centralized market where the developers can upload
and sell their applications. However, as it happens with any popular service, it is prone to misuse and, in
particular, malware writers can use this market to upload their malicious creations. In this paper, we propose a
new method that, based upon several features that are extracted from the AndroidManifest file of the legitimate
applications, builds an anomaly detection system able to detect malware.
1 INTRODUCTION
Smartphones have become an indispensable gadget in
our daily lives. We check our email, browse the Inter-
net, or play games with our friends, wherever we are.
However, in order to take advantage of every possibil-
ity they may offer, applications have to be previously
installed in the devices. In the past, the installation of
applications was uncomfortable for the users because
the process was complicated: users had to look for the
desired application in the Internet and, after finding it,
they had to install it in their devices.
Afterwards, new methods for distribution and in-
stallation were developed taking advantage of the In-
ternet connection available in mobile devices. Users
can now install any application without a personal
computer, by using application stores that are already
installed in the devices. Apple’s AppStore was the
first online store to bring this new paradigm for users.
Since then, other vendors such as RIM, Microsoft or
Google have adopted the same model and deployed
application stores for their devices. These factors
have contributed to the popularity of smartphones and
thus, the number of applications has increased. In par-
ticular, Apple’s App Store offers more than 800,000
applications to their users
1
while Google’s Play Store,
Androids official application store, hosts 675,000
apps
2
.
Unfortunately, application markets are also sus-
1
http://www.apple.com/pr/library/2013/01/
28Apple-Updates-iOS-to-6-1.html
2
http://officialandroid.blogspot.com.es/search?q=675000
ceptible of hosting malware. In order to deal with
these threats, Android and iOS use different ap-
proaches. While Apple applies a very strict review
process for the submitted applications performed by
at least two reviewers, Android relies on its secu-
rity permission system and on the user’s sound judge-
ment. However, users may not have security con-
sciousness and may not read the required permis-
sions before installing an application (Mylonas et al.,
2012). Despite these efforts, both vendors have
hosted malware in their stores (Egele et al., 2011;
Zhou and Jiang, 2012). Therefore, both models are
not sufficient to ensure user’s safety and new models
should be developedand deployed in order to improve
the security of the devices.
With regards to Android malware, Zhou et al.
(Zhou and Jiang, 2012) created a big Android mal-
ware collection between 2010 and 2011. They ob-
tained 23 samples in January 2011 and 1,260 in Oc-
tober 2011 (which represented an increase of over
5,000%). They conducted a thorough study on this
subject and analysed its evolution in recent times.
They concluded that Android malware has shown a
rapid increase in both sophistication and number of
new samples. Besides, they show that more than 80%
of the malware samples repackage legitimate apps
and 93% of them exhibit a botnet-like capability.
Several approaches have been proposed to detect
these malicious software in Android. Shabtai et al.
(Shabtai et al., 2010) trained several machine learn-
ing models using the following features: the count
of elements, attributes and namespaces of the parsed
Android Package File (.apk). To validate their mod-
387
Sanz B., Santos I., Ugarte-Pedrero X., Laorden C., Nieves J. and G. Bringas P..
Instance-based Anomaly Method for Android Malware Detection.
DOI: 10.5220/0004529603870394
In Proceedings of the 10th International Conference on Security and Cryptography (SECRYPT-2013), pages 387-394
ISBN: 978-989-8565-73-0
Copyright
c
2013 SCITEPRESS (Science and Technology Publications, Lda.)
els, they selected features using three selection meth-
ods: Information Gain, Fisher Score and Chi-Square.
Their approach achieved 89% of accuracy classifying
applications into only 2 categories: tools or games.
Albeit this research was not explicitly focused in mal-
ware detection, their authors suggest the usage of this
technique to detect malware as further work. Be-
sides, other proposals use dynamic analysis for the
detection of malicious applications. Crowdroid (Bur-
guera et al., 2011) is an approach that analyses the
behaviour of the applications through device usage
features. Blasing et al. created AASandbox (Blas-
ing et al., 2010), which is a hybrid dynamic-static
approximation. The approach is based on the anal-
ysis of the logs for the low-level interactions obtained
during execution. Shabtai and Elovici (Shabtai and
Elovici, 2010) also proposed a Host-Based Intrusion
Detection System (HIDS) which used machine learn-
ing methods to determine whether an application is
malware or not. Google itself has also deployed a
framework for the supervision of applications called
Bouncer. Oberheide and Miller 2012
3
revealed how
the system works: it is based in QEMU and performs
both static and dynamic analysis.
Against this background, we present a new tech-
nique for the detection of malicious Android executa-
bles. This approach uses several features extracted
by analysing the Manifest file of Android applica-
tions. In particular,we use the
uses-permission
and
the
uses-feature
tags within the manifest file. Us-
ing these features of several legitimate applications,
we build an instance-based anomaly detection method
able to detect anomalous malicious applications.
The reminder of this paper is organised as follows.
Section 2 details the generation of the dataset. Sec-
tion 3 presents the permissions used in our approach.
Section 4 describes our anomaly detection method.
Section 5 shows the results obtained for the empirical
evaluation. Section 6 details the related work. Finally,
section 7 discusses the results and shows the avenues
of further work.
2 DATASET DESCRIPTION
In this section, we describe the dataset collected to
validate our method. The dataset is composed of both
benign and malicious software. In order to compose
the dataset, the following requirements were consid-
ered: (i) it must be heterogeneous, showing diversity
in the types of applications available in the Android
market, and (ii), it must be proportional to the number
3
http://jon.oberheide.org/files/summercon12-bouncer.pdf
of samples of each category in the Android market.
2.1 Benign Software Dataset
We gathered 1,811 Android samples of diverse types
and categorised them using the same scheme that An-
droid market follows. To this end, we used an unof-
ficial library called android-market-api
4
that retrieves
the category of a given application. From the cate-
gorised samples, we selected a subsample to be part
of the final benign software dataset. The methodology
employed was the following:
1. Determine the Number of Total Samples. In order
to facilitate the training of machine-learning mod-
els, it is usually desirable for both categories to be
balanced. Otherwise, the resulting model can be
biased towards one of the classes, and the results
may not be fully representative. Therefore, given
that the number of malware samples is inferior to
the benign ones, we reduced the number of benign
applications to meet the number of the malicious
ones.
2. Determine the Number of Samples for each Be-
nign Category. We decided to balance the dataset
according to the observed distribution in the An-
droid market, and therefore, selected the number
of applications consequently.
3. Types of Application. There are different types
of applications: native ones (developed by means
of the Android SDK), web (developed through
HTML, JavaScript and CSS, which only launch
a Webkit frame) and widgets (simple applications
displayed in the Android desktop). All these ap-
plications have the same core, but different fea-
tures. To represent the heterogeneity of Android,
we included samples of each different type in the
final dataset.
4. Selection of the Samples for each Category. Once
the number of applications for each category was
determined, we randomly selected the applica-
tions, avoiding different versions of the same ap-
plication. Table 1 shows the distribution of the
legitimate applications in the categories.
2.2 Malicious Software Dataset
The malicious samples were obtained thanks to the
company VirusTotal
5
. VirusTotal offers a series of
services called VirusTotal Malware Intelligence Ser-
vices, which allows researchers to obtain samples
from their databases.
4
https://code.google.com/p/android-market-api/
5
http://www.virustotal.com/
SECRYPT2013-InternationalConferenceonSecurityandCryptography
388
Table 1: Number of benign software applications.
Category Number Category Number
Action and Arcade 32 Multimedia and video 23
Books 10 Music and audio 12
Business 1 News and magazines 7
Cards Games 2 Personalization 6
Casuals 10 Photography 6
Comics 1 Productivity 27
Communication 20 Puzzles 16
Sports 5 Races 6
Enterprise 4 Sales 3
Entertainment 16 Society 25
Finance 3 Tools 80
Health 3 Transportation 2
Libraries and Demos 2 Travels 2
Lifestyle 4 Weather 8
Medicine 1
Total: 333
Initially, we collected 2,808 samples. Next, we
normalized the different outputs of the different an-
tivirus vendors. The goal of this step was to deter-
mine their reliability detecting malware in Android.
To this end, we assumed that every sample that was
detected as malware by at least one antivirus was, in-
deed, malware. Then, we evaluated the detection rate
of each antivirus engine with respect to the complete
malware dataset:
a
i
=
n
n
i
(1)
where n
i
is the number of samples detected by the i
th
antivirus and n is the total number of malware sam-
ples. Then, we evaluated each malware sample taking
into account the weight of each antivirus. For this
evaluation, we applied the next metric:
w =
#A
i=1
a
i
A (2)
being A = (a
1
,a
2
,...,a
) the set of weights of the
antiviruses that detect that particular sample. There-
fore, w rates the detection taking into account the
antiviruses that detect the sample. We determined
a threshold to discard irrelevant samples from the
dataset that was set empirically to 0.1, which provided
us a total number of 1,202 malware samples. Finally,
we also removed every duplicated sample so the final
malware dataset was composed of 333 unique sam-
ples.
3 FEATURE ENGINEERING
In this section, we review the different feature sets
we have considered for the detection of Android
malware. We gathered these features from the
AndroidManifest.xml
file that is packed with ev-
ery Android application. To this end, we first ex-
tracted the permissions used by each application us-
ing the Android Asset Packaging Tool (
aapt
), avail-
able within the set of tools provided by the Android
SDK.
3.1 Permissions of the Application
The structure for declaring a
uses-permission
6
in
the
AndroidManifest.xml
file is shown in Figure 1.
<uses-permission android:name=
"string" />
Figure 1: General template for a
uses-permission
within
the AndroidManifest file.
In this way, there are several strings that
are used for declaring the permission us-
age of the different Android applications
such as
android.permission.CAMERA
or
android.permission.SEND SMS
.
We processed the
AndroidManifest.xml
file
searching for the
uses-permission
tag and retrieved
the string declaring the type of permission. After
that, we generated an input vector for each of the 130
6
http://developer.android.com/guide/topics/manifest/
uses-permission-element.html
Instance-basedAnomalyMethodforAndroidMalwareDetection
389
possible permissions, with a binary feature indicat-
ing whether the permission is present or not in the
analysed Android application. For example, Figure 2
shows the permission declaration of an Android ap-
plication.
...
<uses-permission android:name="
android.permission.SEND_SMS" />
<uses-permission android:name="
android.permission.INTERNET" />
<uses-permission android:name="
android.permission.READ_CONTACTS" />
...
Figure 2: Example of permission declaration in an applica-
tion.
The input vector of permissions for this ap-
plication would be composed of 127 zeros, rep-
resenting the permissions not used and 1s for
the 3
uses-permission
tags declared (
SEND SMS,
INTERNET
, and
READ CONTACTS
).
We selected this feature set for two main reasons:
first, the gathering process has a low computing over-
head and, second, these features represent the be-
haviour that an application may implement.
3.2 Uses-features of the Manifest File
The
AndroidManifest.xml
file shows other features
apart from permissions. Moreover, this information
may be relevant for the task of detecting malware.
The structure for declaring a
uses-feature
7
in the
AndroidManifest.xml
file is shown in Figure 3.
...
<uses-feature
android:name="string"
android:required =["true"|"false"]
android:glEsVersion ="integer" />
...
Figure 3: Declaration of features in an application.
The
android:name
attribute determines the fea-
ture (e.g., camera, gps) used by the application,
the
android:required
attribute determines whether
that feature is mandatory for the correct handling of
the application or not, and the
glEsVersion
attribute
determines the version of OpenGL, if used. These at-
tributes are represented in the Manifest file under the
tag
<uses-feature>
and determine some of the fea-
tures, both software and hardware, that are required
for the correct execution of an application. In this
7
http://developer.android.com/guide/topics/manifest/
uses-feature-element.html
way, the use of Bluetooth or the camera are deter-
mined by the tags
android.hardware.bluetooth
and
android.hardware.camera
. Besides, these el-
ements of the Manifest file only inform about the
behaviour of the application and are not mandatory.
Even though this information is not mandatory, it is
used sometimes by other services or applications in
order to improve the interaction between the applica-
tions. Nevertheless, due to the optional character of
these features, many applications lack these fields.
In our dataset, the features extracted are related
to the use hardware such as localization by means
of the GPS, Wi-Fi, or proximity sensors. In light of
this context, we considered this information relevant
in order to determine whether an application is mal-
ware or not, because it adds some information com-
plementary to permissions and provides us with a be-
havioural view of the inspected application. In order
to use these features as input vectors for machine-
learning, we processed the
AndroidManifest.xml
file searching for the
uses-features
tag and gath-
ered the string declaring the type of feature. After
that, we generated an input vector for each of the
37 (34 hardware and 3 software) possible features,
with a binary feature indicating whether the feature
is present or not in the analysed Android application.
As an example, Figure 4 shows the declaration of
uses-features
for an Android application.
...
<uses-feature
android:name=
"android.hardware.camera"
android:required ="false" />
<uses-feature
android:name=
"android.hardware.bluetooth" />
...
Figure 4: Example of the declaration of the
uses-features
in an application.
The input vector of features for this applica-
tion will be composed of 35 zeros, representing
the not used
uses-features
and 1s representing
the 2 used features (
android.hardware.camera
and
android.hardware.bluetooth
).
4 ANOMALY BASED METHOD
Anomaly detection approaches model normality and
try to identify outlier occurrences. In this way, ev-
ery deviation to this model is considered anomalous.
Through the representation described in the previous
section, our method represents Android applications
SECRYPT2013-InternationalConferenceonSecurityandCryptography
390
as points in the feature space. When an application is
being inspected, our method starts by computing the
features to represent the sample as a point in the fea-
ture space. This point is then compared with the pre-
viously calculated points of legitimate applications.
To this end, distance measures are required. In this
study, we have used the following distance measures:
Manhattan Distance. This distance between two
points v and u is the sum of the lengths of the
projections of the line segment between the points
onto the coordinate axes:
d(x,y) =
n
i=0
|x
i
y
i
| (3)
where x is the first point; y is the second point;
and x
i
and y
i
are the i
th
component of the first and
second point, respectively.
Euclidean Distance. This distance is the length
of the line segment connecting two points. It is
calculated as:
d(x,y) =
n
i=0
q
v
2
i
u
2
i
(4)
where x is the first point; y is the second point;
and x
i
and y
i
are the i
th
component of the first and
second point, respectively.
Cosine Similarity. It is a measure of similarity
between two vectors by finding the cosine of the
angle between them (Tata and Patel, 2007). Since
we are measuring distance and not similarity we
have used 1CosineSimilarity as a distance mea-
sure:
d(x,y) = 1 cos(θ) = 1
~v·~u
||~v||·||~u||
(5)
where~v is the vector from the origin of the feature
space to the first point x, ~u is the vector from the
origin of the feature space to the second point y,
~v ·~u is the inner product of ~v and ~u. ||~v||·||~v|| is
the cross product of~v and~u. This distance ranges
from 0 to 1, where 1 means that the two evidences
are completely different and 0 means that the evi-
dences are the same (i.e., the vectors are orthogo-
nal between them).
By means of these measures, we are able to com-
pute the deviation of an application with respect to
a set of legitimate applications. Other distance mea-
sures, such as Mahalanobis distance were discarded
due to their complexity. One characteristic of our ap-
proach is its simplicity, which allows its deployment
in smartphones with low processing capabilities.
Since we have to compute the distance of any ap-
plication to the points representing valid applications,
a combination metric is required in order to obtain
a final distance value which considers every measure
performed. To this end, our system employs 3 sim-
plistic rules: (i) select the mean value, (ii) select the
lowest distance value and (iii) select the highest value
of the computed distances.
In this way, when our method inspects an appli-
cation, a final distance value is acquired, which will
depend on both the chosen distance measure and a
combination rule.
5 EMPIRICAL VALIDATION
To evaluate our method, we used the dataset described
in section 2, composed by 666 samples. Specifically,
we followed the next configuration for the empirical
validation:
1. Cross Validation: We performed a 5-fold cross-
validation over the benign samples to divide them
into 5 different divisions of the data into training
and test sets.
2. Calculating Distances and Combination Rules:
We extracted the
uses-permissions
and
uses-features
of the applications and com-
bined 3 different measures with 3 different
combination rules described in section 4 to
obtain a final measure of deviation for each
testing instance. More accurately, we applied the
following distances: (i) the Manhattan Distance,
(ii) the Euclidean Distance, and (iii) the Cosine
Similarity. For the combination rules we tested
the following ones: (i) the mean value, (ii) the
lowest distance and (iii) the highest distance.
3. Defining Thresholds: For each measure and com-
bination rule, we established 10 different thresh-
olds to determine whether a sample is valid or not.
The lowest threshold was configured to produce
no false negatives, while the highest one was set
to produce no false positives.
4. Testing the Method: We evaluated the method by
measuring these parameters:
True Positive Ratio (TPR), also known as sen-
sitivity: TPR = TP/(TP+ FN) where TP are
the number of applications correctly classified
(true positives) and FN is the number of appli-
cations misclassified as valid ones.
False Positive Ratio
(FPR), that is the number
of legitimate applications misclassified as mal-
ware: FPR = FP/(FP+ TN) where FP is the
number of valid apps incorrectly detected as
malicious while TN is the number of valid apps
correctly classified.
Instance-basedAnomalyMethodforAndroidMalwareDetection
391
Accuracy
, which is the total number of hits di-
vided by the number of the instances in the
dataset: Accuracy = (TP+ TN)/(P+ N).
Area Under ROC Curve
(Singh et al., 2009),
establishes the relation between false negatives
and false positives.
Table 2 shows the obtained results. When we ap-
plied Manhattan distance, we obtained the best AUC
value (0.88), using average as the combination rule.
The accuracy obtained was around 85% for this com-
bination. Using euclidean distance, we obtained more
than 0.90 of AUC and 87.57% of accuracy. Finally,
using cosine distance, we obtained the best results:
0.91 of AUC and nearly 90% of accuracy.
In general, the results obtained surpassed 0.8 of
AUC and 80% of accuracy for all distances, consid-
ering the average as a combination rule in the three
cases.
6 RELATED WORK
In order to tackle the problem of growing malware in
Android, researchers have begun to explore this area
using the experience acquired in other platforms. We
can distinguish two different approaches. Dynamic
approaches execute the sample in an isolated environ-
ment and collect data about its execution. These ap-
proaches require high computational efforts and are
not suitable for the deployment on smartphones. Be-
sides, static approaches analyse the samples without
executing them. Some attempts are based on sig-
nature scanning, that is, detecting known patterns
present in malicious applications, while others try to
implement generic approaches to distinguish patterns
in benign or malicious applications.
Shabtai and Elovici (Shabtai et al., 2012) pre-
sented Andromaly”, a framework for detecting mal-
ware on Android mobile devices. This framework
collected 88 features and events and, then, applied
machine-learning algorithms to detect abnormal be-
haviours. Their dataset was composed of 4 self-
written pieces of malware, as well as goodware sam-
ples, both separated into two different categories
(games and tools). Their approach achieved a 0.99
area under ROC curve and 99% of accuracy.
Despite these results, their framework required the
acquisition of a huge number of features and events,
overloading the device and, consequently, draining
the battery. Our approach, in contrast, extracts the
data from the
AndroidManifest.xml
file, which is a
trivial process. Although our results are not as sound
as theirs, our approach requires much less computa-
tional efforts. In addition, our dataset is larger and
sparser in malware samples than theirs.
Regarding the signature based approach, Schmidt,
Camtepe, and Albayrak (Schmidt et al., 2010) fo-
cused on a static and light-weight analysis of the sam-
ples. They used system calls as features and simple
classifiers to detect malicious behaviours. Both ap-
proaches do not prevent the installation of malware
in the devices. Our system evaluates each application
before its installation, considering several features ex-
tracted from the manifest file, obtaining similar re-
sults to those obtained in previous work.
Peng et al. (Peng et al., 2012) proposed an ap-
proach to rank the risk of Android applications us-
ing probabilistic generative models. They selected
the permissions of the applications as key feature.
Specifically, they chose the top 20 most frequently re-
quested permissions in their dataset, composed by 2
benign software collections, obtained from the An-
droid application store Google Play (157,856 and
324,658 samples, respectively) and 378 unique sam-
ples of malware. They obtained a 0.94 AUC as best
result. Nevertheless, the unbalanced nature of their
dataset makes it difficult to directly compare the re-
sults with our approach. In fact, our approach is based
on anomaly detection, as it measures the deviation of
any sample to a set of benign applications. In addi-
tion, we complemented the information provided by
the permissions with the
uses-features
, enhancing
the results and approaching the results to those ob-
tained by previous methods. In summary, our ap-
proach prevents the installation of malware on the de-
vices, instead of monitoring the execution of the ap-
plications, thus saving device resources and prevent-
ing undesirable consequences.
7 CONCLUSIONS AND FUTURE
WORK
Smartphones and tablets are flooding both consumer
and business markets and, therefore, these devices
manage a large amount of information. Thus, mal-
ware writers have found in these devices a new source
of income and therefore the number of malware sam-
ples has grown exponentially in these platforms.
In this paper, we presented a new malicious soft-
ware detection approach that is inspired in anomaly
detection systems. In contrast to other approaches,
this method only needs to previously label goodware
and measures the deviation of a new sample respect to
normality (applications without malicious intentions).
Although anomaly detection systems tend to produce
high error rates (specially, false positives), our exper-
SECRYPT2013-InternationalConferenceonSecurityandCryptography
392
Table 2: Results for different combination measures and distance rules. The results in bold are the best for each combination
rule and distance measure.
(a) Manhattan distance.
Comb. Thres. TPR FPR AUC Acc.
Average
8755.51128 1.00000 1.00000
0.88167
50.00%
17468.67613 0.90991 0.21321 84.83%
26181.84099 0.62042 0.10811 75.62%
34895.00584 0.51952 0.06006 72.97%
43608.17069 0.42042 0.03303 69.37%
52321.33555 0.31471 0.01802 64.83%
61034.50040 0.27387 0.00601 63.39%
69747.66525 0.10751 0.00300 55.23%
78460.83011 0.01862 0.00000 50.93%
87173.99496 0.00000 0.00000 50.00%
Max.
47404.00000 1.00000 1.00000
0.65852
50.00%
54503.66667 0.98498 0.99099 49.70%
61603.33334 0.90811 0.94595 48.11%
68703.00000 0.74595 0.78679 47.96%
75802.66667 0.62943 0.53153 54.89%
82902.33334 0.52853 0.04204 74.32%
90002.00001 0.35796 0.00901 67.45%
97101.66667 0.02643 0.00000 51.32%
104201.33334 0.00480 0.00000 50.24%
111301.00001 0.00000 0.00000 50.00%
Min.
0.00000 1.00000 1.00000
0.72431
50.00%
3470.00000 0.54294 0.13814 70.24%
6940.00000 0.52853 0.09009 71.92%
10410.00000 0.41502 0.06006 67.75%
13880.00000 0.34054 0.02703 65.68%
17350.00001 0.17898 0.00000 58.95%
20820.00001 0.10450 0.00000 55.23%
24290.00001 0.02583 0.00000 51.29%
27760.00001 0.00661 0.00000 50.33%
31230.00001 0.00000 0.00000 50.00%
(b) Euclidean distance.
Comb. Thres. TPR FPR AUC Acc.
Average
70.35688 1.00000 1.00000
0.906084463
50.00%
95.24643 0.97057 0.47147 74.95%
120.13597 0.93814 0.23423 85.20%
145.02552 0.88649 0.13514 87.57%
169.91506 0.54655 0.09309 72.67%
194.80461 0.46066 0.04805 70.63%
219.69415 0.37898 0.03303 67.30%
244.58370 0.27628 0.00601 63.51%
269.47324 0.04444 0.00300 52.07%
294.36279 0.00000 0.00000 50.00%
Max.
217.72460 1.00000 1.00000
0.67797888
50.00%
230.60165 0.98559 0.99399 49.58%
243.47870 0.93393 0.96096 48.65%
256.35575 0.82763 0.88288 47.24%
269.23281 0.68468 0.62462 53.00%
282.10986 0.58799 0.15015 71.89%
294.98691 0.44505 0.01201 71.65%
307.86396 0.05165 0.00000 52.58%
320.74102 0.00721 0.00000 50.36%
333.61807 0.00000 0.00000 50.00%
Min.
0.00000 1.00000 1.00000
0.729316704
50.00%
19.63557 0.64985 0.30030 67.48%
39.27114 0.56336 0.23423 66.46%
58.90671 0.54294 0.13814 70.24%
78.54228 0.53333 0.09610 71.86%
98.17784 0.43063 0.06306 68.38%
117.81341 0.34054 0.02703 65.68%
137.44898 0.13634 0.00000 56.82%
157.08455 0.01562 0.00000 50.78%
176.72012 0.00000 0.00000 50.00%
(c) Cosine distance
Comb. Thres. TPR FPR AUC Acc.
Average
0.07126978 1.00 1.00
0.914959103
50.00%
0.17446314 0.89 0.16 86.28%
0.27765650 0.86 0.08 89.04%
0.38084985 0.79 0.04 87.24%
0.48404321 0.33 0.01 66.22%
0.58723657 0.33 0.01 66.22%
0.69042993 0.33 0.01 66.22%
0.79362328 0.33 0.01 66.22%
0.89681664 0.33 0.01 66.22%
1.00001000 0.00 0.00 50.00%
Max.
0.34282500 1.00 1.00
0.529312195
50.00%
0.41584556 0.99 0.98 50.12%
0.48886611 0.96 0.97 49.70%
0.56188667 0.87 0.80 53.18%
0.63490722 0.87 0.80 53.06%
0.70792778 0.87 0.80 53.06%
0.78094833 0.87 0.80 53.06%
0.85396889 0.87 0.80 53.06%
0.92698944 0.87 0.80 53.06%
1.00001000 0.00 0.00 50.00%
Min.
-0.10000000 1.00 1.00
0.803523343
50.00%
0.02222333 0.64 0.04 79.85%
0.14444667 0.33 0.01 66.22%
0.26667000 0.33 0.01 66.22%
0.38889333 0.33 0.01 66.22%
0.51111667 0.33 0.01 66.22%
0.63334000 0.33 0.01 66.22%
0.75556333 0.33 0.01 66.22%
0.87778667 0.33 0.01 66.22%
1.00001000 0.00 0.00 50.00%
imental results show low FPR values. The number
of malicious samples discovered up-to-date is limited:
signature scanning methods are effective and efficient
solutions for current malware. Nevertheless, we con-
sider that malware authors will soon apply obfusca-
tion techniques making difficult the detection process.
This possibility was explored by Rastogi et al. (Ras-
togi et al., 2013). In this way, our approach reduces
the necessity to collect malware samples (i.e., it is
not necessary to update a signature database), as it is
based on anomaly detection. In addition, our method
is based on features that are extracted from the mani-
fest file, making possible to prevent the installation of
malicious software.
However, this approach presents several limita-
tions. By means of an internet connection, a be-
nign application can download a malicious payload
and change its behaviour. In order to detect this be-
haviour, a dynamic approach is required. Unfortu-
nately, dynamic approaches cannot be deployed in
Instance-basedAnomalyMethodforAndroidMalwareDetection
393
current smartphones due to their computational and
battery limitations.
Future work is oriented in two main directions. On
the one hand, other distance measures and combina-
tion rules could be tested. On the other hand, there
are other static features that could be used to improve
the detection ratio, that could be obtained from the
AndroidManifest.xml
file or from the binary class
(e.g., strings, API calls). The use of different features
could reduce the risk of incorrectly classifying benign
applications that have permission usage declarations
similar to malicious samples.
ACKNOWLEDGEMENTS
This research was partially supported by the
Basque Government under the research project
‘BRANKA4U: Evoluci´on de los servicios bancarios
hacia el futuro’ granted by the ETORGAI 2011 pro-
gram.
REFERENCES
Blasing, T., Batyuk, L., Schmidt, A.-D., Camtepe, S. A.,
and Albayrak, S. (2010). An android application sand-
box system for suspicious software detection. In Mali-
cious and Unwanted Software (MALWARE), 2010 5th
International Conference on, pages 55–62. IEEE.
Burguera, I., Zurutuza, U., and Nadjm-Tehrani, S. (2011).
Crowdroid: behavior-based malware detection system
for android. In Proceedings of the 1st ACM workshop
on Security and privacy in smartphones and mobile
devices, pages 15–26. ACM.
Egele, M., Kruegel, C., Kirda, E., and Vigna, G. (2011).
Pios: Detecting privacy leaks in ios applications. In
Proceedings of the Network and Distributed System
Security Symposium.
Mylonas, A., Kastania, A., and Gritzalis, D. (2012). Del-
egate the smartphone user? security awareness in
smartphone platforms. Computers & Security.
Peng, H., Gates, C., Sarma, B., Li, N., Qi, Y., Potharaju, R.,
Nita-Rotaru, C., and Molloy, I. (2012). Using proba-
bilistic generative models for ranking risks of android
apps. In Proceedings of the 2012 ACM conference on
Computer and communications security, pages 241–
252. ACM.
Rastogi, V., Chen, Y., and Jiang, X. (2013). Evaluating an-
droid anti-malware against transformation attacks.
Schmidt, A.-D., Camtepe, A., and Albayrak, S. (2010).
Static smartphone malware detection. In proceedings
of the 5th Security Research Conference (Future Se-
curity 2010), ISBN, pages 978–3.
Shabtai, A. and Elovici, Y. (2010). Applying behavioral
detection on android-based devices. Mobile Wire-
less Middleware, Operating Systems, and Applica-
tions, pages 235–249.
Shabtai, A., Fledel, Y., and Elovici, Y. (2010). Automated
static code analysis for classifying android applica-
tions using machine learning. In Computational Intel-
ligence and Security (CIS), 2010 International Con-
ference on, pages 329–333. IEEE.
Shabtai, A., Kanonov, U., Elovici, Y., Glezer, C., and Weiss,
Y. (2012). andromaly: a behavioral malware detection
framework for android devices. Journal of Intelligent
Information Systems, pages 1–30.
Singh, Y., Kaur, A., and Malhotra, R. (2009). Compara-
tive analysis of regression and machine learning meth-
ods for predicting fault proneness models. Interna-
tional Journal of Computer Applications in Technol-
ogy, 35(2):183–193.
Tata, S. and Patel, J. M. (2007). Estimating the selectiv-
ity of tf-idf based cosine similarity predicates. ACM
SIGMOD Record, 36(2):7–12.
Zhou, Y. and Jiang, X. (2012). Dissecting android malware:
Characterization and evolution. In Security and Pri-
vacy (SP), 2012 IEEE Symposium on, pages 95–109.
IEEE.
SECRYPT2013-InternationalConferenceonSecurityandCryptography
394