A Novel Approach for Android Malware Detection and Classification
using Convolutional Neural Networks
Ahmed Lekssays(1,a), Bouchaib Falah(1,b) and Sameer Abufardeh(2,c)
(1) School of Science and Engineering, Al Akhawayn University in Ifrane, Ifrane, Morocco
(2) Math, Science, and Tech. Dept., University of Minnesota Crookston, Crookston, MN, U.S.A.
Keywords:
Malware, Android, Machine Learning, Classification, Convolutional Neural Networks.
Abstract:
Malicious software, or malware, has been growing exponentially over the last decades according to antivirus
vendors. This growth is due to advanced techniques that malware authors use to evade detection. Hence, the
traditional methods that antivirus vendors deploy are insufficient to protect people’s digital lives. In this
work, we address the problem of mobile malware detection and classification with a new approach for Android
applications based on Convolutional Neural Networks (CNN). The paper proposes a static analysis method that
detects malware using malware visualization. In our approach, we first convert Android applications in APK
format into grayscale images. Since malware from the same family shares patterns, we then design a machine
learning model that classifies Android applications as malware or benign based on pattern recognition. The
dataset used in this research combines a self-made dataset, built by using public APIs to scan APK files
downloaded from open sources on the internet, with a research dataset provided by the University of New
Brunswick, Canada. Using our proposed solution, we achieved an accuracy of 84.9% in detecting mobile malware.
1 INTRODUCTION
According to Statista, there are 2.53 billion Android
devices in the world, and this number is expected to
reach 2.87 billion in 2020 (Statista, 2020). On the
other hand, Kaspersky Lab detected more than 6 million
malware packages in 2018 (Lab., 2019). These numbers
show how critical mobile security is and the challenges
it has faced in recent years. It is an open war between
malware writers and antivirus (AV) vendors.
Due to the limitations of signature-based detection,
AV vendors rely on heuristic methods to detect malware.
These methods are based on rules defined by AV security
experts that specify which behaviors or operations
software is restricted from executing. The main
drawback of heuristic-based methods is that they
generate many false positives, since not all software
that accesses certain files is malware. However, they
help in detecting new malware without comparing it with
a signature saved in AV vendors’ databases
(Michie et al., 1994).
(a) https://orcid.org/0000-0001-5783-8638
(b) https://orcid.org/0000-0001-5086-0808
(c) https://orcid.org/0000-0002-9893-8923
Building on these two techniques, AV vendors use a
hybrid technique that combines signature-based and
heuristic-based methods to detect malware. Two types of
analysis are performed on software to determine whether
it is malware: static analysis and dynamic analysis
(Zhao and Liu, 2007). Both are done in a sandboxed
environment for security purposes. In static analysis,
the malware analyst does not execute the malware.
Instead, he or she tries to reverse engineer it to
recover its source code, and then checks string
signatures, byte sequences, opcodes, etc. Dynamic
analysis has two main components: memory and network.
Here, the malware analyst executes the malware and
watches its behavior in the sandboxed environment,
checking its run-time behavior and the API endpoints
and calls that it tries to access. This gives an idea
of the severity of the malware and of the measures that
security professionals can implement to stop it
(Zhao and Liu, 2007).
Nowadays, the probability of success of machine
learning in classification problems has increased
thanks to three main factors: (1) the increase of
commercial feeds that helped new malware to exist,
(2) cheaper computing power, and (3) the evolution of
machine learning as an independent computer science
discipline in which big companies are investing, which
helps researchers by providing the tools to innovate in
the field.
Lekssays, A., Falah, B. and Abufardeh, S.
A Novel Approach for Android Malware Detection and Classification using Convolutional Neural Networks.
DOI: 10.5220/0009822906060614
In Proceedings of the 15th International Conference on Software Technologies (ICSOFT 2020), pages 606-614
ISBN: 978-989-758-443-5
Copyright © 2020 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved
Machine learning approaches have shown encouraging
results toward achieving a high malware detection rate
without human interaction. As a result, AV vendors and
their research and development teams have begun
deploying machine learning classifiers such as neural
networks, decision trees, and logistic regression
(Krizhevsky et al., 2012).
Malware analysis, as an independent discipline in
cybersecurity, treats malware detection as a binary
classification problem: any file is analyzed to
determine whether it is malware or not. If it is
malware, it is then labeled according to its type and
family, based on its behavior, by a classification
mechanism. The main purpose of this work is to
introduce a different approach for malware detection on
Android that uses the visual characteristics of malware
and deep learning for pattern recognition.
Contributions. In this work, we have:
• Combined and preprocessed a dataset containing
benign and malicious Android applications;
• Developed a machine learning model based on CNN to
detect and classify mobile application samples as
benign or malware;
• Evaluated the suggested model through comparisons
with other defined models.
Outline. The rest of the paper is organized as
follows. Section 2 introduces the different malware
types. Section 3 introduces the different malware
analysis methodologies, and Section 4 discusses the
related work. Section 5 presents our methodology in
detail. Section 6 explains how malware is processed as
an image. We share our results in Section 7. Finally,
we present our conclusions and future work in
Section 8.
2 MALWARE TYPES
Malware is a compound of two words: malicious and
software. Malware programs are designed and implemented
to damage a system or to execute malicious commands on
it, which may lead to unwanted actions for the user
such as gathering sensitive information, disrupting
normal computer operations, gaining control over the
computer system, spying on the user’s daily activities
by gaining access to mobile sensors, and destroying the
mobile system. The word malware is the general term
used to describe any malicious software. However,
malware can be technically divided into the following
categories depending on its goal.
• Adware: A type of malware that automatically
displays advertisements. It is used to gather data
about users’ interests and to earn revenue from the
displayed advertisements.
• Spyware: A type of malware that tracks the daily
activities of users without their knowledge. It depends
on the data that it gets from mobile sensors and other
running applications. Spyware can collect sensitive
data through keystroke logging, data harvesting, and
activity monitoring.
• Virus: A type of malware that can copy itself and
spread on the mobile system. Viruses can be transported
on any medium, including but not limited to email
attachments, social media messages, and malicious
links.
• Worm: A type of malware that can spread over the
network by exploiting operating system vulnerabilities.
The major difference between a worm and a virus is that
a virus depends on human action to spread, while a worm
can replicate itself and spread without any human
interaction.
• Trojan: A type of malware that makes itself appear
as a normal file or application to trick users into
downloading and installing it. A trojan can give
unauthorized remote access to the infected mobile
phone. It is usually designed in a client-server
architecture where the server is installed on the
attacker’s machine, and the client is the trojan
itself. It is used to steal private information
including but not limited to logins, financial data,
and cryptocurrency wallets. In addition, it is used to
enable devices on the victim’s mobile phone, such as
the front camera, and to spy on users’ activities such
as keystrokes and files.
• Rootkit: A type of malicious software designed to
access other mobile phones remotely and control them
without being detected.
• Backdoor: Software that allows access to compromised
mobile phones. It gives the attacker an entry point to
the mobile phone without the consent of the user.
• Ransomware: Malicious software that restricts the
user from accessing his or her files by encrypting
them. The decryption happens after
the attacker receives money from the victim. The
payment is usually done using cryptocurrencies
like Bitcoin or Ethereum.
Malware detection and classification has been one of
the key problems for AV vendors since their foundation.
They have been detecting malware using signature-based
methods and heuristic methods. A malware signature is a
hash that uniquely identifies a specific malware
sample. Signatures also helped AV vendors detect all
the malware belonging to a certain family through a
generic signature.
From experiments, researchers found that all malware
in a certain family shares properties, behaviors, and a
generic signature of that family. This idea is a
fundamental concept on which this work is built.
However, once AV vendors started to detect such
malware, malware authors tried to bypass the
implemented security mechanisms. They started to write
polymorphic and metamorphic malware to avoid matching
generic signatures. Polymorphic malware uses a
polymorphic engine to mutate while the original
algorithm stays intact. One of the main techniques used
in writing polymorphic malware is encryption.
Metamorphic malware translates its binary code into a
temporary representation, edits that representation,
and then translates the edited version back to machine
code. This form of mutation can be achieved by several
techniques that may go down to the architectural level
of the mobile phone, such as inserting NOP instructions
or changing the machine instructions completely.
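To illustrate why exact-match signatures are fragile against such mutations, the following minimal sketch hashes two byte strings that differ by a single inserted byte. The byte strings are hypothetical stand-ins for two variants of one program, and SHA-256 is only one possible choice of hash; the text does not name the hash that AV vendors use.

```python
import hashlib

def signature(payload: bytes) -> str:
    """Return a SHA-256 hex digest, used here as a stand-in for an exact-match signature."""
    return hashlib.sha256(payload).hexdigest()

# Hypothetical byte strings standing in for two variants of the same program.
original = bytes([0x12, 0x34, 0x56, 0x78])
variant = bytes([0x12, 0x34, 0x00, 0x56, 0x78])  # one padding byte inserted

# A toy signature database that only knows the original sample.
known_signatures = {signature(original)}

print(signature(original) in known_signatures)  # the stored sample matches
print(signature(variant) in known_signatures)   # the mutated sample does not
```

A single inserted byte is enough to change the digest, which is the weakness that polymorphic and metamorphic malware exploit.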
3 MALWARE ANALYSIS METHODOLOGIES
3.1 Static Analysis
Malware static analysis is based on the analysis of
the source code. It is called static because the
analysis is done without running the malware in a
sandboxed environment. It is also called code analysis,
since it examines the code without executing the
program. It gives an overall idea of the code structure
and allows auditing the code to check whether it
adheres to industry standards. Automated tools help
security analysts and developers perform static
analysis. The main advantage of static analysis is that
it can reveal bugs that do not manifest themselves
during execution. Nevertheless, static analysis is just
a first step towards analyzing the behavior and the
effects of a certain application (Schmidt et al., 2009).
3.2 Dynamic Analysis
Dynamic analysis is a technique used in computer
forensics and software testing in order to test and eval-
uate an application by executing it in a sandboxed en-
vironment in real-time. The malware analyst keeps
an eye on the behavior of the application in terms of
CPU, memory, and network usage. Automated tools
can help in this process by raising alerts in case of
suspicious activity by the application (Schmidt et al.,
2009).
4 RELATED WORK
Researchers have explored malware detection in Android
applications from different perspectives. In this work,
we explore malware detection using deep learning,
specifically convolutional neural networks. Many
approaches detect malware based on network traffic,
permissions, memory behavior, and CPU usage. In our
research, however, we follow a static approach that
detects malware by converting malware to images.
Ahmadi et al. worked on novel feature extraction,
selection, and fusion for Windows malware family
classification. Their work tried to keep up with the
evolution of modern malware, which is designed with
mutation characteristics such as polymorphism and
metamorphism. These characteristics lead to an
exponential growth in the number of variants of each
malware. They developed a novel paradigm that
classifies malware variants effectively by grouping
malware based on feature extraction, selection, and
fusion (Ahmadi et al., 2016).
Saxe and Berlin approached the problem of malware
detection differently. They used deep neural networks
to detect malware based on two-dimensional binary
program features. They took Windows malware binaries
and designed a system that learns from the binary
features. Their approach achieves a usable detection
rate at a low false-positive rate (Saxe and Berlin,
2015).
Yajamanam et al. chose a different approach. They
integrated computer vision by computing GIST
descriptors for image-based malware classification.
GIST descriptors are image features that have recently
been widely used in the field of malware
classification. In their research, they implemented,
tested, and analyzed a malware score based on GIST
descriptors. It is a potential advantage for
the field of malware classification. Their research
was based on Windows malware (Yajamanam et al.,
2018).
Li et al. opted for hybrid malicious code detection
using deep learning. They suggested a new Android
classification method called HADM, which stands for
Hybrid Analysis for Detection of Malware. They start by
extracting static and dynamic information, which they
then convert into vector-based representations. The
method combines features extracted through deep
learning with the original features, which resulted in
an increased detection rate (Li et al., 2015).
Tong and Yan developed a hybrid approach for mobile
malware detection on Android that adopts both static
and dynamic analysis. They collected execution data of
sample malware and benign applications using a netlink
technology to generate patterns of system calls. They
built a malicious pattern set and a normal pattern set
in order to compare the patterns of malware and benign
applications. To classify unknown applications, they
collect system call data dynamically and then compare
it with the previously built patterns (Tong and Yan,
2017).
Narudin et al. evaluated machine learning classifiers
for mobile malware detection. They used various network
traffic features, grouped into four categories: basic
information, content-based, time-based, and
connection-based. They used their own dataset. They
conducted multiple experiments and found that k-nearest
neighbor is an efficient classifier for malware
detection (Narudin et al., 2016).
5 METHODOLOGY
5.1 General Overview of the Solution
The solution suggested in this paper is inspired by
previous research on malware detection and
classification on the Windows platform (Ahmadi et al.,
2016). The idea is to convert malware to images, based
on the observation that images of different malware
samples from the same family appear similar, as shown
in Fig. 1: they have common visual characteristics.
This idea is not common in research, since it has been
tested only on Windows. In addition, if the malware is
embedded in another application, it retains the same
visual characteristics, so it produces an image similar
to those of its family.
Figure 1: Visualizing Malware as Grayscale Image.
Figure 2: Charger Malware Family Visualization.
The work in (Ahmadi et al., 2016) focused on computing
image-based features to characterize malware precisely.
They used GIST descriptors to compute texture features
without going through segmentation. This step resulted
in a feature vector of 960 features for each malware
sample. However, their work used just 320 features
because they found that 320 is the optimal number of
features needed to identify malware. In addition, they
suggested that 60 features is the minimum that can be
used, with an error of 30%.
The feature vectors are used to train a k-nearest
neighbor classifier with Euclidean distance. The
conversion process expects a malware binary file, in
our case the APK file, which is read as a vector of
8-bit unsigned integers and structured as a 2D array.
This array is then visualized as a grayscale image with
pixel values in the range [0, 255].
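The conversion described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the exact procedure used in the experiments: the image width of 256 pixels and the zero-padding of the last row are assumptions, since the text does not state how the 2D shape is chosen, and the byte string below merely stands in for the contents of an APK file.

```python
import numpy as np

def bytes_to_grayscale(data: bytes, width: int = 256) -> np.ndarray:
    """Reshape raw bytes into a 2D uint8 array (rows x width), zero-padding the last row."""
    flat = np.frombuffer(data, dtype=np.uint8)
    height = -(-flat.size // width)              # ceiling division
    padded = np.zeros(height * width, dtype=np.uint8)
    padded[:flat.size] = flat                    # copy the file bytes, leave padding at 0
    return padded.reshape(height, width)

# Stand-in for the bytes of an APK file (in practice: open(path, "rb").read()).
fake_apk = bytes(range(256)) * 3 + b"\x50\x4b"
image = bytes_to_grayscale(fake_apk, width=256)
print(image.shape, image.dtype)
```

Each byte maps directly to one pixel intensity, so no scaling is needed to stay within [0, 255].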
Visualizing malware as an image, as in the example
shown in Fig. 2, has many benefits, especially that
sequences of a binary can be identified easily.
Moreover, malware authors usually change only small
portions of the code in order to write new variants,
typically after the malware has been caught by
antivirus software. Images are therefore a good tool to
detect such changes while keeping a global view of the
structure of the malware. Hence, different malware can
be grouped into families based on their visual
properties and easily identified from images.
5.2 Dataset
The dataset used in this research is the Android
Malware Dataset (CICAndMal2017) (Lashkari et al.,
2018). This dataset was built by running both malware
and benign applications on real Android smartphones in
order to capture the exact running behavior of the
applications. Research has shown that simulators often
result in inconsistent application behavior, which
might change the
end results. In addition, some malware is smart enough
to detect emulated environments. The dataset is
composed of 10,854 samples (4,354 malware and 6,500
benign) from different sources. The benign applications
were downloaded from Google Play between 2015 and 2017.
However, the dataset authors ran only 5,491
applications (426 malware and 5,065 benign). Due to
storage and computational power limitations, we used a
sample of 852 applications (426 malware and 426 benign)
in this research. This sampling also balances the two
classes and avoids bias toward the larger one. The
malware samples in this dataset are classified into
four categories:
Adware
Ransomware
Scareware
SMS Malware
The samples cover 42 unique malware families within
the four categories above.
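The text does not state how the 426 benign applications were selected. One common option is random undersampling of the larger class, sketched below with hypothetical sample identifiers and the class sizes reported above; the fixed seed is an assumption made only for reproducibility.

```python
import random

def balance_by_undersampling(malware, benign, seed=0):
    """Return equally sized malware and benign lists by sampling the larger class."""
    rng = random.Random(seed)
    n = min(len(malware), len(benign))
    return rng.sample(malware, n), rng.sample(benign, n)

# Hypothetical identifiers with the class sizes reported above.
malware_ids = [f"malware_{i}" for i in range(426)]
benign_ids = [f"benign_{i}" for i in range(5065)]

malware_sel, benign_sel = balance_by_undersampling(malware_ids, benign_ids)
print(len(malware_sel), len(benign_sel), len(malware_sel) + len(benign_sel))
```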
6 MALWARE IMAGES PREPROCESSING
6.1 Labeling
In this research, we deal with a binary classification
problem using convolutional neural networks. In order
to build a balanced and unbiased dataset, we downloaded
benign Android applications from the Play Store.
Whether an application is benign or malware was
determined using the VirusTotal API. VirusTotal is a
web service that provides malware analysis based on
file signatures, backed by a large database
(VirusTotal, 2020).
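A labeling step of this kind can be sketched offline as follows. The snippet does not call VirusTotal; it only shows how a file digest could be computed for lookup and how a scan summary could be mapped to a label. The dictionary keys are loosely modeled on per-verdict counters in a scan report, and both those keys and the detection threshold are assumptions for illustration, not the rule used in this work.

```python
import hashlib

def sha256_of(data: bytes) -> str:
    """Hash the file contents; scan reports are commonly looked up by such a digest."""
    return hashlib.sha256(data).hexdigest()

def label_from_stats(stats: dict, min_detections: int = 1) -> str:
    """Map per-verdict engine counts to 'malware' or 'benign' (threshold is an assumption)."""
    flagged = stats.get("malicious", 0) + stats.get("suspicious", 0)
    return "malware" if flagged >= min_detections else "benign"

# Hypothetical scan summaries shaped like per-verdict counters.
clean_report = {"malicious": 0, "suspicious": 0, "undetected": 60}
flagged_report = {"malicious": 23, "suspicious": 1, "undetected": 36}

digest = sha256_of(b"example apk bytes")
print(len(digest))
print(label_from_stats(clean_report), label_from_stats(flagged_report))
```

Raising `min_detections` trades missed malware for fewer false positives from individual engines.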
6.2 Data Pre-processing
Before feeding the CNN, a normalization step is needed
to decrease the sizes of the images and to make them
uniform. A CNN expects a set of labeled images of the
same size. This was a problem in this research, since
the images have different dimensions. To overcome this
problem, we used mean subtraction and normalization.
6.2.1 Mean Subtraction
Mean subtraction is a widely used technique for
preprocessing images for CNNs. It subtracts the mean
across every individual feature in the data, which can
be interpreted geometrically as centering the data
cloud around the origin along every dimension. In our
implementation, we used NumPy arrays. We implemented
this operation as: X -= np.mean(X), where X is the
NumPy array that holds the data, assuming grayscale
images.
6.2.2 Normalization
Normalization refers to scaling the data dimensions so
that they are approximately on the same scale. We
implemented this step by dividing by the standard
deviation after zero-centering the data. The
implementation using NumPy is: X /= np.std(X), where X
is the NumPy array that holds the data, assuming
grayscale images.
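Put together, the two operations can be sketched as a single helper. The sketch follows the scalar form given in the text (one global mean and one global standard deviation); the cast to floating point, the guard for constant images, and the random batch are additions made here for illustration.

```python
import numpy as np

def standardize(X: np.ndarray) -> np.ndarray:
    """Zero-center X and scale it to unit standard deviation using global statistics."""
    X = X.astype(np.float64)   # work on a float copy; uint8 cannot hold negative values
    X -= np.mean(X)            # mean subtraction
    std = np.std(X)
    if std > 0:                # guard against constant images
        X /= std               # normalization
    return X

# Hypothetical batch: 4 grayscale images of 8x8 pixels with values in [0, 255].
rng = np.random.default_rng(0)
batch = rng.integers(0, 256, size=(4, 8, 8), dtype=np.uint8)
out = standardize(batch)
print(out.shape, out.dtype)
```

Note that applying the in-place operators directly to a `uint8` array would fail or wrap around, which is why the sketch converts to floating point first.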
6.2.3 GIST Feature Extraction
In this research, we worked with two types of data.
First, we fed the images obtained from the
preprocessing phase to the CNN. Second, we extracted
GIST feature vectors and used the first 320 values.
This was implemented using the LearGist Python wrapper,
since the official Python library seems to be
unmaintained [15]. The idea is to take a preprocessed
image of fixed size and compute its GIST descriptor,
which results in a feature vector of 960 values, of
which we used 320 because researchers concluded that
320 values are enough to get optimal results. In
addition, researchers stated that only 60 values give
accurate results up to 60% (Douze et al., 2009).
6.2.4 Malware Images Classification
In this paper, we used the k-nearest neighbor
algorithm. k-Nearest Neighbor (k-NN) is a supervised
learning algorithm that is widely used in machine
learning and data mining. It is a classification
algorithm in which learning is based on how similar one
vector is to another. It computes the distance between
the unclassified sample and the already classified data
to make the classification. The main distance measures
used in k-NN are the Euclidean distance and the
Manhattan distance (Cover and Hart, 1967). In this
research, we used the Euclidean distance between two
points p and q, given by the following formula:
d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}    (1)
The KNN algorithm is as follows:
1. Receive an unclassified data point;
2. Measure the distance (Euclidean, Manhattan,
Minkowski, or weighted) from the new data point to all
the data that is already classified;
3. Sort the distances;
4. Take the K smallest distances (nearest neighbors),
where K is the number of neighbors to select. The
default value for K is 1;
5. Gather the classes of the nearest neighbors and
choose the most frequent one;
6. Assign the new data point to the class chosen in
Step 5.
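The steps above can be sketched in plain Python as follows. This is a minimal illustration with hypothetical two-dimensional vectors, not the implementation used in the experiments, where the inputs would be 320-value GIST vectors; the majority vote in the last step is the usual way of choosing the class when K is greater than 1.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Euclidean distance between two equal-length vectors, as in Eq. (1)."""
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(train_vectors, train_labels, query, k=1):
    """Classify query by majority vote among its k nearest training vectors."""
    # Steps 2 and 3: measure the distance to every labeled vector, then sort.
    distances = sorted(
        (euclidean(vec, query), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    # Step 4: keep the labels of the k nearest neighbors.
    nearest_labels = [label for _, label in distances[:k]]
    # Steps 5 and 6: pick the most frequent label and assign it to the query.
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-D feature vectors and labels.
X_train = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
y_train = ["benign", "benign", "malware", "malware"]

print(knn_predict(X_train, y_train, (0.2, 0.1), k=1))
print(knn_predict(X_train, y_train, (0.8, 0.9), k=3))
```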
7 RESULTS
7.1 Environment
The experiments were run on Ubuntu 18.04 LTS with
built-in GPUs. We used the Deep Learning Virtual
Machine provided by Microsoft Azure with SSD storage.
The programming language that we used is Python because
it supports many machine learning libraries and tools
such as TensorFlow, Keras, Anaconda, Jupyter, NumPy,
matplotlib, SciPy, Pandas, and scikit-learn. The
usefulness of each is as follows:
• TensorFlow is “an open-source software library for
numerical computation using data flow graphs. The graph
nodes represent mathematical operations, while the
graph edges represent the multidimensional data arrays
(tensors) that flow between them” (TensorFlow, 2020).
• Keras is a high-level neural networks API, written
in Python and capable of running on top of TensorFlow,
CNTK, or Theano (Keras, 2020).
• Anaconda is a free and open-source distribution of
the Python and R programming languages for scientific
computing that simplifies package management and
deployment (Anaconda, 2020).
• Jupyter Project exists to develop open-source
software, open standards, and services for interactive
computing across dozens of programming languages
(Jupyter, 2020).
• NumPy, or Numerical Python, is the most universal
and versatile library both for pros and beginners
(NumPy, 2020).
• Matplotlib is a flexible library for creating graphs
and visualizations (Matplotlib, 2020).
• Pandas is a well-known and high-performance tool for
presenting data frames (Pandas, 2020).
• Scikit Learn implements a wide range of
machine-learning algorithms and makes it easy to plug
them into actual applications (kit Learn, 2020).
Figure 3: Training and Validation Accuracy vs. Epochs
with 10 Epochs.
7.2 Architectures
In order to achieve high accuracy, we built several
architectures to detect malware using convolutional
neural networks. We varied parameters such as the
learning rate, batch size, and number of epochs for
each architecture, and implemented different
architectures with different layers and activation
functions.
7.2.1 CNN A: 3c 2D
The architecture consists of:
1. Input layer NxN pixels (N = 128)
2. Convolutional Layer (32 filters of size 3x3)
3. Max Pooling layer
4. Convolutional Layer (64 filters of size 3x3)
5. Max Pooling layer
6. Convolutional Layer (128 filters of size 3x3)
7. Max Pooling layer
8. Flatten Layer
9. Densely-connected layer (64 neurons)
10. Densely-connected layer (1 neuron)
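As a rough check of the layer list above, the following sketch traces the feature-map sizes and counts the trainable parameters. It assumes stride-1 convolutions without padding, 2x2 max pooling, and a single grayscale input channel; none of these settings is stated in the text, so the resulting numbers are illustrative only.

```python
def conv_out(size, kernel=3):
    """Spatial size after a stride-1 convolution without padding."""
    return size - kernel + 1

def pool_out(size, window=2):
    """Spatial size after non-overlapping max pooling (floor division)."""
    return size // window

def conv_params(in_ch, out_ch, kernel=3):
    """Kernel weights plus one bias per filter."""
    return kernel * kernel * in_ch * out_ch + out_ch

size, channels, total = 128, 1, 0   # N = 128, one grayscale channel (assumed)
for filters in (32, 64, 128):
    total += conv_params(channels, filters)
    size, channels = pool_out(conv_out(size)), filters
    print(f"after conv({filters}) + pool: {size}x{size}x{channels}")

flat = size * size * channels       # output of the flatten layer
total += flat * 64 + 64             # dense layer with 64 neurons
total += 64 * 1 + 1                 # output layer with 1 neuron
print("flattened units:", flat)
print("trainable parameters:", total)
```

Under these assumptions, the feature maps shrink from 128x128 to 14x14 and almost all parameters sit in the first densely-connected layer.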
We used the ReLU activation function for all layers
except the last one, for which we used sigmoid. We
fixed the learning rate to 0.01, the batch size to 16,
the number of epochs to 10, and the loss function to
binary cross-entropy. The accuracy that we achieved
with this architecture is 84.9%. Fig. 3 describes the
behavior of the model in terms of accuracy. In
addition, Fig. 4 shows the training vs. validation loss.
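For reference, the sigmoid output and the binary cross-entropy loss mentioned above can be written out in plain Python. The logits and labels below are hypothetical, the convention that label 1 means malware is an assumption, and the clipping constant is a common numerical safeguard rather than a setting taken from the experiments.

```python
import math

def sigmoid(z):
    """Squash a real-valued logit into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical logits from the final neuron; label 1 = malware, 0 = benign (assumed).
logits = [2.0, -1.5, 0.3, -3.0]
labels = [1, 0, 0, 1]
probs = [sigmoid(z) for z in logits]

print([round(p, 3) for p in probs])
print(round(binary_cross_entropy(labels, probs), 4))
```

The loss grows quickly when the model is confident and wrong, as for the last sample, which is what drives the weight updates during training.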
Figure 4: Training and Validation Loss vs. Epochs with 10
Epochs.
Figure 5: Training and Validation Accuracy vs. Epochs
with 100 Epochs.
Figure 6: Training and Validation Loss vs. Epochs with 100
Epochs.
We also experimented with the architecture by
increasing the number of epochs to 100. The model kept
its good detection performance. Fig. 5 and Fig. 6 show
the accuracy and the loss of the model over 100 epochs.
Figure 7: Training and Validation Accuracy vs. Epochs
with 10 Epochs.
Figure 8: Training and Validation Loss vs. Epochs with 10
Epochs.
7.2.2 CNN B: 2c 2D
The architecture consists of:
1. Input layer NxN pixels (N = 128)
2. Convolutional Layer (32 filters of size 3x3)
3. Max Pooling layer
4. Convolutional Layer (64 filters of size 3x3)
5. Max Pooling layer
6. Flatten Layer
7. Densely-connected layer (1 neuron)
We used the ReLU activation function for all layers
except the last one, for which we used softmax. We
fixed the learning rate to 0.001, the batch size to 16,
the number of epochs to 10, and the loss function to
binary cross-entropy. The accuracy that we achieved
with this architecture is 68.1%. Fig. 7 describes the
behavior of the model in terms of accuracy. In
addition, Fig. 8 shows the training vs. validation loss.
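One property of softmax is worth keeping in mind when reading this configuration: applied to a single output neuron, it returns 1.0 for every input, whereas sigmoid still varies with the input. The sketch below only demonstrates this mathematical property with hypothetical logits; the text does not state whether it contributed to the lower accuracy of this architecture.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sigmoid(z):
    """Squash a real-valued logit into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical logits of a single output neuron.
for z in (-4.0, 0.0, 4.0):
    print(z, softmax([z]), round(sigmoid(z), 4))
```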
We also experimented with the architecture by
increasing the number of epochs to 100.
The model kept its good detection performance. Fig. 9
and Fig. 10 show the accuracy and the loss of the model
over 100 epochs:
Figure 9: Training and Validation Accuracy vs. Epochs
with 100 Epochs.
Figure 10: Training and Validation Loss vs. Epochs with
100 Epochs.
8 CONCLUSION AND FUTURE WORK
In this research, we studied the feasibility of
detecting and classifying mobile malware by treating it
as images. We presented a static analysis approach
using deep learning and convolutional neural networks.
It differs from most work on mobile malware detection,
which focuses on detecting and classifying mobile
malware based on its behavior. In addition, this work
presented the challenges of applying such an approach
to mobile applications, since they are usually large
files. We used the proposed approach to detect and
classify malware from the binary files of Android
applications. We tested our approach with different
architectures and compared the results. We achieved an
accuracy of 84.9% with one architecture and 68.1% with
another. So, both architectures are useful for malware
detection. Our model can detect variants of malware or
other unknown malware based on the training data. This
is an added value that overcomes some of the problems
facing signature-based detection systems.
Although we achieved 84.9% accuracy, our work still has
limitations and challenges. The first limitation is the
dataset. Our model learned from previous malware
samples, and the dataset that we used had a limited
number of them. Also, because the majority of Android
malware datasets, as well as some malware families and
benign applications, are private, we had to collect our
own data. The second limitation is processing power.
Processing images requires high-performance GPUs that
are not available at the university lab.
This work is an attempt to build an advanced malware
detection system. In the future, we plan to increase
the accuracy of our system and reduce the number of
false positives by using new image feature extraction
and computer vision techniques that can help with
malware classification. In addition, we plan to
introduce multi-level classification instead of binary
classification to precisely determine the type and
family of malware.
ACKNOWLEDGEMENTS
We would like to thank all fellow students, faculty,
and staff for supporting this research. Special thanks
to Mr. Saad Taame for his help in data pre-processing.
REFERENCES
Ahmadi, M., Ulyanov, D., Semenov, S., Trofimov, M., and
Giacinto, G. (2016). Novel feature extraction, selec-
tion and fusion for effective malware family classifi-
cation. In Proceedings of the sixth ACM conference
on data and application security and privacy, pages
183–194.
Anaconda (2020).
Cover, T. and Hart, P. (1967). Nearest neighbor pattern clas-
sification. IEEE transactions on information theory,
13(1):21–27.
Douze, M., Jégou, H., Sandhawalia, H., Amsaleg, L., and
Schmid, C. (2009). Evaluation of gist descriptors
for web-scale image search. In Proceedings of the
ACM International Conference on Image and Video
Retrieval, pages 1–8.
Jupyter, P. (2020).
Keras (2020).
kit Learn, S. (2020).
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Im-
agenet classification with deep convolutional neural
networks. In Advances in neural information process-
ing systems, pages 1097–1105.
Lab., K. (2019). Kaspersky lab detected more than 6 million
malware package.
Lashkari, A. H., Kadir, A. F. A., Taheri, L., and Ghor-
bani, A. A. (2018). Toward developing a system-
atic approach to generate benchmark android malware
datasets and classification. In 2018 International Car-
nahan Conference on Security Technology (ICCST),
pages 1–7. IEEE.
Li, Y., Ma, R., and Jiao, R. (2015). A hybrid malicious
code detection method based on deep learning. In-
ternational Journal of Security and Its Applications,
9(5):205–216.
Matplotlib (2020).
Michie, D., Spiegelhalter, D. J., Taylor, C., et al. (1994).
Machine learning. Neural and Statistical Classifica-
tion, 13(1994):1–298.
Narudin, F. A., Feizollah, A., Anuar, N. B., and Gani,
A. (2016). Evaluation of machine learning classi-
fiers for mobile malware detection. Soft Computing,
20(1):343–357.
NumPy (2020).
Pandas (2020).
Saxe, J. and Berlin, K. (2015). Deep neural network based
malware detection using two dimensional binary pro-
gram features. In 2015 10th International Conference
on Malicious and Unwanted Software (MALWARE),
pages 11–20. IEEE.
Schmidt, A.-D., Bye, R., Schmidt, H.-G., Clausen, J., Ki-
raz, O., Yuksel, K. A., Camtepe, S. A., and Albayrak,
S. (2009). Static analysis of executables for collab-
orative malware detection on android. In 2009 IEEE
International Conference on Communications, pages
1–5. IEEE.
Statista (2020). Number of smartphone users worldwide
from 2014 to 2020 (in billions).
TensorFlow (2020).
Tong, F. and Yan, Z. (2017). A hybrid approach of mobile
malware detection in android. Journal of Parallel and
Distributed computing, 103:22–31.
VirusTotal (2020). Analyze suspicious files and urls to de-
tect types of malware, automatically share them with
the security community.
Yajamanam, S., Selvin, V. R. S., Di Troia, F., and Stamp,
M. (2018). Deep learning versus gist descriptors for
image-based malware classification. In ICISSP, pages
553–561.
Zhao, Z. and Liu, H. (2007). Spectral feature selection for
supervised and unsupervised learning. In Proceed-
ings of the 24th international conference on Machine
learning, pages 1151–1157.