Interpretable Android Malware Detection Based on Dynamic Analysis
Arunab Singh, Maryam Tanha, Yashsvi Girdhar and Aaron Hunter
School of Computing and Academic Studies, British Columbia Institute of Technology, Canada
Keywords:
Malware, Dynamic Analysis, Android, Security.
Abstract:
Android has emerged as the dominant operating system for smart devices, which has consequently led to the
proliferation of Android malware. In response to this, different analysis methods have been suggested for the
identification of Android malware. In this paper, we focus on so-called dynamic analysis, in which we run
applications and monitor their behaviour at run-time rather than analyzing the source code and resources (which
is called static analysis). One approach to dynamic analysis is to use machine learning methods to identify
malware; essentially we run a large set of applications that may or may not be malware, and we learn how to
tell them apart. While this approach has been successfully applied, both academic and industrial stakeholders
exhibit a stronger interest in comprehending the rationale behind the classification of apps as malicious. This
falls under the domain of interpretable machine learning, with a specific focus on the research field of mobile
malware detection. To fill this gap, we propose an explainable ML-based dynamic analysis framework for
Android malware. Our approach provides explanations for the classification results by indicating the features
that are contributing the most to the detection result. The quality of the explanations is assessed using stability
metrics.
1 INTRODUCTION
Mobile malware poses substantial security risks, im-
pacting both individuals and enterprises by exploit-
ing valuable data stored on mobile devices. Android,
holding a 72% market share (Statista, 2023), is a
prime target due to its open nature. Cyberattacks,
such as mobile banking threats (Kaspersky, 2023), underscore this vulnerability despite efforts by Google and OEMs (original equipment manufacturers) (Kovacs, 2023).
Dynamic analysis is an important technique for
Android malware detection. This approach involves
running potentially malicious Android applications in
a controlled environment to monitor their behavior in
real-time. By observing how an app interacts with the
device, network, and other applications, security ana-
lysts can uncover hidden malicious activities that may
not be evident through static analysis alone. Dynamic
analysis provides valuable insights into an app’s run-
time behavior, enabling the detection of malware that
exhibits polymorphic or obfuscated characteristics.
There is a great deal of research on Android mal-
ware detection through dynamic analysis, much of
which is based on using machine learning (ML) tech-
niques (Yan and Yan, 2018). While ML models show
promise in detecting malicious apps, their underly-
ing selection processes are often complex, making it
challenging to explain why an app has been classi-
fied as malicious. This lack of transparency hinders
the ability to understand the key features or behav-
iors that triggered the classification result, which is
crucial for both security analysts and app developers.
Moreover, the absence of explainability makes it diffi-
cult to identify false positives and false negatives, po-
tentially damaging the reputation of legitimate apps
and allowing some malicious ones to remain unde-
tected. Researchers are actively working on methods
to enhance the interpretability of ML models, as it is not only about making accurate predictions but also
about providing insights into why certain decisions
are made, thereby improving the overall effectiveness
and trustworthiness of ML-based systems. Explain-
able ML methods have not been employed extensively
for dynamic analysis. Moreover, some of the existing
research studies on dynamic analysis rely on inacces-
sible or outdated tools. A notable trend is the scarcity
of open-source initiatives or comprehensive instruc-
tions, hindering result reproduction and validation.
The main contributions of this paper are summa-
rized as follows:
• We present a clear methodology for producing a tool for dynamic malware analysis. The list of apps used, the feature extraction and selection, and the developed ML models (for which we also addressed the class imbalance issues to improve the results) are open-source and publicly available to the research community on GitHub (https://github.com/maryam-tanha/DynamicAndroidMalwareAnalysis). This will facilitate the reproducibility of our research.
• We employ an explanation technique for interpreting the predictions made by our ML models and evaluate its effectiveness. In this way, we can identify the features that contribute the most to the detection of an app as malware, as well as evaluate the quality of the explanations. To the best of our knowledge, this study is the first work attempting to provide a systematic approach to using explainable ML methods for dynamic analysis of Android malware.
• We have created datasets of traces for recent Android apps (goodware and malware), which are well-suited for dynamic analysis. Also, using our open-source scripts, researchers can generate new datasets in an easy and seamless manner.
2 BACKGROUND AND RELATED
WORK
2.1 Preliminaries
There are two main approaches to Android malware
detection and analysis: static analysis and dynamic
analysis. The static analysis approach involves ana-
lyzing the code and resources of an app without ex-
ecuting it. One problem with static analysis is that it
is generally unable to detect new malicious behaviour
that happens at run-time (Yan and Yan, 2018). It is also not effective in identifying malicious deformation techniques such as Java reflection and dynamic code loading (Pan et al., 2020). Hence, a comprehensive approach to malware analysis cannot be based solely on static methods. Dynamic analysis refers to
the approach where the behaviour of an application
is monitored while it is executed in a controlled en-
vironment. In this manner, we can monitor things
like system calls, network traffic, user interactions
and inter-process communication. One advantage of
this approach is that it can identify zero-day attacks
that have not previously been observed. Moreover, it
can identify malware that has been deliberately writ-
ten to evade static analysis (Yan and Yan, 2018). Compared with static analysis, there are fewer studies on dynamic analysis, and the area needs further investigation by the research community. Therefore, in this paper, our focus is on dynamic analysis for Android malware detection.
2.2 Interpretable Machine Learning for
Dynamic Analysis
There are many studies that have used ML meth-
ods (including deep learning) for dynamic analysis
of Android malware. However, researchers have
raised concerns regarding the over-optimistic classi-
fication results produced by various ML-based mal-
ware detection techniques (Arp et al., 2022; Pendle-
bury et al., 2019). Furthermore, the approaches to
Android malware detection mainly rely on black-box
models. Consequently, security analysts frequently find themselves questioning the reliability of predictions from such highly accurate ML-based malware detection models and pondering the suitability of model selection before deployment. In response
to these issues, several studies introduced methods
aimed at explaining the predictions made by ML-
based malware detection models. Examples of such
studies for static analysis are (Fan et al., 2020; Wu
et al., 2021; Liu et al., 2022b). While there are studies that attempt to provide explanations for dynamic analysis, such as (De Lorenzo et al., 2020), none of the existing ones have investigated using eXplainable Artificial Intelligence (XAI) techniques (Dwivedi et al., 2023) for interpreting the results of classification for dynamic analysis.
2.3 Using System Calls for ML-Based
Dynamic Analysis
In the following, we give a brief summary of the
main recent studies that employed system calls as
their only set of features or along with other features.
We refer interested readers to (Yan and Yan, 2018;
Liu et al., 2022a; Razgallah et al., 2021) for more
comprehensive surveys of dynamic analysis for An-
droid malware detection. SpyDroid (Iqbal and Zulkernine, 2018) is an Android malware dynamic analysis framework that utilizes multiple malware detectors. It supports monitoring CPU and memory usage as well as kernel system calls. The Android malware detection system (AMDS) (Zhang et al., 2022) relies on
system call traces, which are processed using N-gram
analysis. The experimental findings illustrate that
AMDS exhibits the ability to detect threats at an early
stage with high accuracy. TwinDroid (Razgallah et al., 2022) is a dataset of 1000 system call traces
for Android apps. An early version of TwinDroid is
used in (Razgallah and Khoury, 2021) for Android
malware detection by using n-grams as well as Term
Frequency–Inverse Document Frequency (TF–IDF)
weight vector of system calls and ML classification
methods. MALINE (Dimjašević et al., 2016) employs system call frequency and system call dependency for indicating malicious behaviour. Another recent study (Mahdavifar et al., 2020) detects malware families based on different categories of system calls by applying a semi-supervised deep learning method. Sequences of system calls are used in
Deep4MalDroid (Hou et al., 2016) to construct a
weighted directed graph, which is then given to a deep
learning framework for malware detection. Droid-
Scribe (Dash et al., 2016) provides a framework for
malware family classification. It utilizes (Tam et al.,
2015) as its dynamic analysis component to capture system call information.
3 METHODOLOGY
In this section, we explain our approach for develop-
ing an Android malware detection system based on
dynamic analysis using interpretable machine learn-
ing.
3.1 Data Collection and Pre-Processing
We used the AndroZoo dataset (Allix et al., 2016) for
obtaining our goodware and part of our malware. An-
droZoo is commonly used by the research community for Android malware detection and analysis. Note that AndroZoo does not explicitly label apps as malware. Instead, it provides the number of antivirus engines from VirusTotal (VirusTotal, 2023) that flagged the app as malware. Specifically, each app in AndroZoo has a property called vt_detection. A non-zero value for this property means that one or more antivirus tools have flagged the app as malware; otherwise, the app is considered to be goodware (vt_detection=0). To collect our malware samples, we chose apps whose vt_detection value is greater than 4 (to have more confidence in their label). A similar approach has been employed in (Pendlebury et al., 2019). More-
over, we also downloaded malware from VirusShare
(VirusShare, 2023), which is a repository of malware
samples. The quality of the dataset is of crucial im-
portance for Android malware ML-based detection.
If the dataset is not up-to-date or not representative
of the population under study, the conclusions may
be invalid. Therefore, while selecting the apps, we
considered addressing issues including realistic class
ratio and temporal inconsistency.
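As an illustration, the following is a minimal sketch of how this selection could be scripted against the AndroZoo index, assuming its latest.csv file with columns such as sha256, vt_detection, and dex_date (column names may vary across index versions):

```python
# Sketch: selecting candidate goodware and malware from the AndroZoo index.
# Assumes a local copy of the AndroZoo "latest.csv" index with columns
# "sha256", "vt_detection", and "dex_date" (hedged assumption).
import pandas as pd

index = pd.read_csv("latest.csv", usecols=["sha256", "vt_detection", "dex_date"])

# Keep both classes in the same timeframe (2022) to avoid the temporal
# inconsistency discussed in Section 3.1.2.
in_2022 = index["dex_date"].astype(str).str.startswith("2022")

goodware = index[in_2022 & (index["vt_detection"] == 0)]  # no AV flags
malware = index[in_2022 & (index["vt_detection"] > 4)]    # > 4 flags for label confidence
print(len(goodware), len(malware))
```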
3.1.1 Realistic Class Ratio and Addressing Class
Imbalance
To have realistic settings, the class ratio between be-
nign and malware classes at test time is chosen sim-
ilar to the one at deployment time. In the Android
app domain, malware is the minority class. Similar
to (Pendlebury et al., 2019; Miranda et al., 2022), we
used the average overall ratio of 10% Android mal-
ware in our test samples to represent the real-world
scenario. Moreover, we addressed the dataset imbal-
ance issue by using class weights and undersampling
techniques. More details are provided in Section 4.
3.1.2 Addressing Temporal Inconsistency
Temporal inconsistency (i.e., when malware and benign apps are selected randomly from different time periods) (Miranda et al., 2022; Cavallaro et al., 2023)
unrealistically improves the performance of the ML-
based malware detection approaches. Additionally,
explanations of ML-based Android malware detec-
tion suggest that the models can accurately pre-
dict malware by relying on temporal-related features
rather than the actual characteristics of malicious and
benign behaviors (Liu et al., 2022b). To prevent
temporal inconsistency, we chose both malware and
goodware from the same timeframe, i.e., 2022.
3.2 Feature Selection
There are different features that may be included in
developing a dynamic analysis solution. Many of the
primary studies on dynamic analysis selected features
associated with system calls and dynamic activities
such as network access and memory dump to capture
malicious behavior (Razgallah et al., 2021; Liu et al., 2022a; Khalid and Hussain, 2022). Among such features,
system calls are the mainstream and have been fa-
vored by the research community for dynamic anal-
ysis of Android malware. In contrast to API calls,
Linux kernel system calls are independent of the Android Operating System (OS) version, rendering
them more robust against strategies employed by mal-
ware to evade detection. Therefore, in this paper, we
focus primarily on the analysis of system calls. All
apps interact with the platform where they are ex-
ecuted by requesting services through a number of
available system calls. These calls define an interface that allows apps to read/write files, send/receive data
through the network, read data from a sensor, make a
phone call, etc. Legitimate apps can be characterized
by the way they use such an interface, which facil-
itates the identification of malicious components inserted into a seemingly harmless app and, more generally, other forms of malware.

Figure 1: Proposed framework.
3.3 Feature Extraction Process
We explored a variety of tools to collect system calls
produced by running an Android app in an emula-
tor. We mainly used adb (ADB, 2023), Android
Emulator (Android Studio, 2023), and strace (strace,
2023). Moreover, Monkey (monkey, 2023) was em-
ployed to simulate random user interactions with an
app. We have examined all of the traces from our
tests and we focus on extracting system calls and their
frequencies. Certain system calls might be indicative
of malicious behavior, such as trying to access sen-
sitive data or performing unauthorized action. Also,
unusual or high frequencies of certain system calls
might indicate that it is performing some action ex-
cessively, which could be an indication of malicious
activity. Thus, we identified a list of distinctive system
calls used in all the strace logs for malware and good-
ware. For each system call, its frequency is included
in our feature vector.
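To make this step concrete, here is a minimal sketch (assuming one strace log file per app in strace's default text output format, including the PID prefixes produced by strace -f) of turning a trace into a system-call frequency vector:

```python
# Sketch: building a system-call frequency vector from a strace log.
# Assumes default strace text output; lines may carry a "[pid N]" or
# bare-PID prefix when "strace -f" is used (hedged assumption).
import re
from collections import Counter

SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\w+)\(")

def syscall_frequencies(log_path):
    counts = Counter()
    with open(log_path, errors="replace") as log:
        for line in log:
            match = SYSCALL_RE.match(line)
            if match:
                counts[match.group(1)] += 1
    return counts

def to_feature_vector(counts, vocabulary):
    # "vocabulary" is the list of distinctive system calls observed across
    # all traces (146 unique calls in our experiments); missing calls get 0.
    return [counts.get(name, 0) for name in vocabulary]
```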
3.4 Classification Models and
Explanation Method
We have opted to employ Random Forest (RF)
(Breiman, 2001) and eXtreme Gradient Boost (XG-
Boost) (Chen et al., 2015), as our classification meth-
ods. Both of these methods are ensemble learning techniques, which combine the predictions of multiple ML models. To interpret the
prediction results from the above classifiers, we em-
ployed Local Interpretable Model-agnostic Explana-
tions (LIME) (Ribeiro et al., 2016), which is a widely
used approach in various other fields (Belle and Pa-
pantonis, 2021). Apart from its model-agnostic na-
ture and the ability to offer localized interpretabil-
ity, LIME is favored for its simple implementation
and lower computational overhead when compared to
similar techniques in the field of eXplainable Artifi-
cial Intelligence (XAI) (Dwivedi et al., 2023). More-
over, LIME has been used for explaining the ML-
based predictions for static analysis of Android mal-
ware (Fan et al., 2020).

Table 1: Scenarios.
Scenario No. Goodware No. Malware
S1-10 2430 270
S2-15 2295 405
S3-20 2160 540

Table 2: Model performance evaluation metrics and their computed values for the first three scenarios.
Scenario Classifier Precision Recall F1-Score
S1-10 RF 1.00 0.33 0.50
S1-10 XGBoost 0.79 0.50 0.61
S2-15 RF 0.93 0.47 0.62
S2-15 XGBoost 0.81 0.57 0.67
S3-20 RF 0.89 0.53 0.67
S3-20 XGBoost 0.78 0.60 0.68

LIME operates by altering specific feature values within a single data sample and then monitoring the resulting impact on the output, essentially acquiring the classification label of the modified sample. Once predictions are created for a collection of perturbed input samples, LIME proceeds to approximate the decision boundary and calculate the weights that denote the significance of individual features. It should be noted that LIME is a post hoc explainable method, i.e., it is applied to a model after training. Figure 1 shows our proposed framework.
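As an illustration of this part of the pipeline, the following is a minimal sketch using scikit-learn and the lime package on stand-in data (the frequency matrix, labels, and system-call names below are placeholders, not our released dataset):

```python
# Sketch: training an ensemble classifier on system-call frequency vectors
# and explaining one prediction with LIME. X, y, and syscall_names are
# synthetic stand-ins for our real traces.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

rng = np.random.default_rng(0)
X = rng.integers(0, 500, size=(300, 146)).astype(float)  # 146 syscall features
y = rng.integers(0, 2, size=300)
syscall_names = [f"syscall_{i}" for i in range(X.shape[1])]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

explainer = LimeTabularExplainer(
    X, feature_names=syscall_names,
    class_names=["goodware", "malware"], discretize_continuous=True)

# Explain a single instance with its top five features, as in Section 4.3.
explanation = explainer.explain_instance(X[0], clf.predict_proba, num_features=5)
print(explanation.as_list())
```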
4 PERFORMANCE EVALUATION
AND ANALYSIS OF RESULTS
4.1 Experimental Setup
We carried out our experiments on a Tensorbook with
Ubuntu 22.04 OS, an Intel Core i7 processor (14 cores), an NVIDIA RTX 3070 Ti GPU, and 32 GB of memory (DDR5-4800). We randomly chose 2,798 benign apps from AndroZoo and 624 malware samples from AndroZoo and VirusShare. The Android emulator was used to emu-
late a Pixel 6 (API 31) device. Moreover, 146 unique
system calls were identified by analyzing the traces
for benign apps and malware; these have been incor-
porated in our feature vector. Finally, the classifica-
tion algorithms are implemented using Scikit-Learn
ML library (scikit-learn, 2023).
4.2 Performance Evaluation
We used the grid search method with 10-fold cross-validation for hyperparameter tuning. Since our dataset is imbalanced, high accuracy does not necessarily mean a high detection rate for our positive class (i.e., malware). Therefore, we have looked into other metrics that better represent the performance of our classifiers, as follows.
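A minimal sketch of this tuning step with scikit-learn follows (the parameter grid is illustrative, not the exact grid from our experiments):

```python
# Sketch: hyperparameter tuning via grid search with 10-fold cross-validation,
# scored with F1 rather than accuracy because of the class imbalance.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in imbalanced dataset (about 10% positives), mirroring our class ratio.
X, y = make_classification(n_samples=500, n_features=20, weights=[0.9], random_state=0)

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 30]}  # illustrative
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=10, scoring="f1")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```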
In our first set of experiments, we examined three
scenarios, each maintaining a consistent 10% mal-
ware ratio during testing, while the training data con-
tained 10% (S1-10), 15% (S2-15), and 20% (S3-20)
malware in these respective scenarios. Table 1 shows
the distribution of goodware and malware in the train-
ing sets of our scenarios. Note that although maintain-
ing a realistic class ratio during training can reduce
the overall error rate, different ratios can be employed
to adjust the decision boundary, thereby balancing the
trade-off between false positives and false negatives (and subsequently affecting precision and recall). Table 2
summarizes the obtained values of performance met-
rics from the three scenarios. If we look at the recall
values of the malware class, XGBoost has a higher
value in all three scenarios. Therefore, XGBoost per-
forms better than RF in terms of detecting malware
(i.e., it correctly identifies malware with higher prob-
ability). It is evident from the three scenarios that when the ratio of malware in the training set goes up, the decision boundary moves towards the goodware class, i.e., the recall increases for malware but
the precision is reduced. Therefore, if our objective
is to improve malware detection (i.e., recall), we can
increase the malware ratio in the training set. Overall,
XGBoost has a better performance compared with RF
classifier due to its higher F1-score. Also, we can see the increase in F1-score when we increase the malware ratio (yielding a more balanced dataset) in S2-15 and S3-20 compared with the S1-10 scenario. This results from the fact that the F1-score is the harmonic mean of precision and recall. So, the closer the values of precision and recall are to each other, the better (higher) the F1-score will be.
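For instance, plugging the S1-10 RF values from Table 2 (precision P = 1.00, recall R = 0.33) into the F1 formula reproduces the tabulated score:

```latex
F_1 = \frac{2PR}{P + R} = \frac{2 \times 1.00 \times 0.33}{1.00 + 0.33} \approx 0.50
```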
Addressing Class Imbalance Using Under-
Sampling: In this scenario (named s4-u-10), we
applied the Tomek links under-sampling technique (He and Ma, 2013) to our training set.
Tomek links consist of pairs of closely located
instances belonging to different classes. Tomek
links are present when two samples are each other’s
nearest neighbors. By eliminating the majority class
instances in each pair, the separation between the two
classes is widened, thus facilitating the classification
process. Figure 2 shows the results of this scenario
for different classifiers and in comparison with
s1-10 (as our base scenario). Figure 2a displays 21% and 14% improvements in recall and F1-score, respectively, for the RF classifier. Moreover, Figure 2b illustrates 12.6%, 6%, and 9.8% improvements in precision, recall, and F1-score, respectively, for the XGBoost classifier.
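A minimal sketch of this step, assuming the imbalanced-learn (imblearn) package:

```python
# Sketch: dropping the majority-class member of each Tomek link from the
# training set, using imbalanced-learn (hedged: not necessarily the exact
# configuration used in our experiments).
from imblearn.under_sampling import TomekLinks
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=20, weights=[0.9], random_state=0)

tomek = TomekLinks(sampling_strategy="majority")  # remove only majority samples
X_res, y_res = tomek.fit_resample(X, y)
print(len(y), "->", len(y_res))
```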
Figure 2: Performance comparison of the s1-10 scenario with scenarios addressing class imbalance.

Addressing Class Imbalance Using Class Weights: In this scenario, named s5-w-10, we used class weights to address the class imbalance problem. By assigning higher weights to the minority class (i.e.,
malware class), the model is enabled to give more im-
portance to malware samples during training and mit-
igate the bias towards the majority class (i.e., benign
class). The class weights are inversely proportional to
their frequency of occurrence in the dataset. Figure
2 illustrates the performance of the classifiers when
adding the class weights. In Figure 2a, we observe 60% and 30% improvements in recall and F1-score, respectively, for the RF classifier. Similarly, Figure 2b shows 26% and 18% improvements in recall and F1-score, respectively, for the XGBoost classifier. It
should be noted that adding class weights is only ef-
fective (i.e., the improvement in the metrics is notice-
able) if the dataset is highly imbalanced (i.e., ratio of
10% or less for malware in our experiments).
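A sketch of how such weighting could be configured for both classifiers (in scikit-learn, class_weight="balanced" makes weights inversely proportional to class frequencies; for XGBoost, scale_pos_weight plays a similar role):

```python
# Sketch: class weights inversely proportional to class frequency.
# "balanced" assigns each class the weight n_samples / (n_classes * count).
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier  # assumes the xgboost package is installed

rf = RandomForestClassifier(class_weight="balanced", random_state=0)

# scale_pos_weight up-weights the positive (malware) class; a common choice
# is the negative/positive count ratio, e.g. 2430 / 270 = 9 for S1-10.
xgb = XGBClassifier(scale_pos_weight=2430 / 270, random_state=0)
```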
Considering the two techniques that we used for
addressing the class imbalance and improving the
performance of models, the class-weight approach shows more promise in terms of improving metrics compared with the Tomek links technique (s4-u-10).
Particularly, when adding class weights, the amount
of increase in F1-score is 16% for the RF classifier
and 8.2% for the XGBoost classifier compared with
the Tomek links scenario (s4-u-10). A drawback to
under-sampling is that we have eliminated informa-
tion that may be valuable; however, in a large dataset,
this may become less of a concern as the number of
removed benign samples is much lower than the total
number of such samples.
4.3 Analysis of LIME Explanations
So far, our focus has been on analyzing the perfor-
mance of our classifiers and mitigating the class im-
balance issue for dynamic analysis. We do not have
any information about the contributions of different
features (i.e., the frequency of system calls) to the
predictions made by our classifiers. In other words, there is an important question to answer: "Why is an Android app classified as malware by our ML models?". To provide some insights into the prediction results, we have utilized the LIME explainability technique. Furthermore, we focus on the XGBoost classifier
results with class weights (scenario s5-w-10) since it outperforms the RF classifier.

(a) APK package name: com.jiccos.jicco
(b) APK package name: com.wGuitarProTabs
(c) APK package name: com.hmsqw.customer
Figure 3: Examples of LIME explanations for three true positives in the testing set.
We analyzed the true positives (15 samples) for
our testing set, which includes a total of 300 An-
droid apps out of which 30 apps are malware. As we
mentioned earlier, LIME provides local explainabil-
ity, i.e., it indicates the most contributing features to
the prediction result for a test instance. We configured LIME to show the top five features for a prediction in our experiments. Figure 3 shows output examples of applying LIME to three test instances. Looking at the prediction probabilities, Figure 3a shows that the XGBoost classifier is 100% confident that com.jiccos.jicco is malware. Similarly, the confidence of the classifier is 92% and 99% when detecting malware in Figure 3b and Figure 3c, respectively. Moreover, the top five system calls (based on their frequencies) are shown in Figure 3. For instance, the most important system calls contributing to the prediction in Figure 3a are exit_group, sched_get_priority_max, futex, write, and listen, with weights of 0.16, 0.13, 0.05, 0.07, and 0.06, respectively. Note that the descriptions of system calls can be found in (syscalls(2)–Linux manual page, 2023).
Additionally, the explanation provides the reasons why the model makes this prediction. Table 3 displays the top four system calls that are present in most of the true positives based on the output of LIME.

Table 3: Top system calls and their presence in true positives.
System call Presence in true positives
exit_group 66%
sched_get_priority_max 46%
sched_getscheduler 33%
futex 26%
Explanation Quality: One important aspect of ex-
planations to consider is their stability, i.e., the
changes in explanations when utilizing the explain-
able method (in our case LIME) multiple times under
the same conditions. We use two stability metrics,
which were defined in (Visani et al., 2022) for LIME,
namely Variables Stability Index (VSI) and Coeffi-
cients Stability Index (CSI). High VSI values ensure
that the variables (features) obtained in various LIME
instances are consistently identical. Conversely, low
values indicate unreliable explanations, as different
LIME calls may yield entirely different variables to
explain the same machine learning decision. Regard-
ing CSI, high values guarantee the reliability of LIME
coefficients for each feature. Conversely, low values
advise practitioners to exercise caution: for a given
feature, the initial LIME call may yield a specific co-
efficient value, but subsequent calls are likely to pro-
duce different values. As the coefficient represents
the impact of the feature on the ML model decision,
obtaining different values corresponds to significantly
distinct explanations. The average of the VSI values is 90.6% with a standard deviation of 5%, whereas the average of the CSI values is 97.6% with a standard deviation of 3.8%. Therefore, in our experiments for malware detection, LIME results are more stable with respect to CSI than VSI.
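As a rough illustration of the idea behind these indices (the exact definitions are in (Visani et al., 2022)), one can call LIME repeatedly on the same instance and measure how consistently the same features are selected. The sketch below computes the average pairwise Jaccard similarity of the selected feature sets as a simple VSI-like proxy, reusing explainer, clf, and X from the earlier sketch:

```python
# Sketch: a simplified stability check in the spirit of VSI. This Jaccard
# agreement is a proxy, not the exact index of Visani et al. (2022).
from itertools import combinations

def repeated_feature_sets(instance, n_repeats=10, num_features=5):
    runs = []
    for _ in range(n_repeats):  # LIME's sampling is random, so runs differ
        exp = explainer.explain_instance(instance, clf.predict_proba,
                                         num_features=num_features)
        runs.append({name for name, _ in exp.as_list()})
    return runs

runs = repeated_feature_sets(X[0])
jaccards = [len(a & b) / len(a | b) for a, b in combinations(runs, 2)]
print("mean pairwise Jaccard:", sum(jaccards) / len(jaccards))
```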
Limitations of Our Approach: It should be noted
that system calls are inherently low-level and lack the
extensive information found in the more semantically
enriched Android APIs. Therefore, mapping them di-
rectly to the malware’s behavior is a challenging task.
A dataset that mapped high-level Android APIs to dif-
ferent representations of low-level system call traces
was created in (Nisi et al., 2019). However, that dataset is not accessible, so we could not use it to map the system calls to API calls and subsequently relate them to an app's behavior. Moreover, we examined the re-
ports from VirusTotal for some of our malware in the
testing set but the features listed are mainly related to
static analysis of those apps.
5 CONCLUSION AND FUTURE
WORK
In this paper, we presented an interpretable ML-based
dynamic analysis approach for Android malware de-
tection. We focused on extracting the frequency of
system calls as features to train our ML models. We
applied LIME to the prediction results of the ML
models to indicate the most contributing features to
each prediction. Finally, we evaluated the stability of
LIME explanations for true positives of our model.
To facilitate the reproducibility of our research, all the code and data are publicly accessible. Some of
the directions for future work are as follows. One
can use an enhanced user interaction simulator such
as ARES (Romdhana et al., 2022), rather than rely-
ing on the random interactions generated by Monkey.
Moreover, using other interpretable ML techniques
and comparing their explanation quality with LIME
is a promising area that needs further exploration.
REFERENCES
ADB (2023). https://developer.android.com/tools/adb. Accessed: July 2023.
Allix, K., Bissyandé, T. F., Klein, J., and Le Traon, Y. (2016). Androzoo: Collecting millions of android apps for the research community. In Proceedings of the 13th International Conference on Mining Software Repositories, MSR '16, pages 468–471, New York, NY, USA. ACM.
Android Studio (2023). https://developer.android.com/studio/run/emulator. Accessed: July 2023.
Arp, D., Quiring, E., Pendlebury, F., Warnecke, A., Pier-
azzi, F., Wressnegger, C., Cavallaro, L., and Rieck, K.
(2022). Dos and don’ts of machine learning in com-
puter security. In 31st USENIX Security Symposium
(USENIX Security 22), pages 3971–3988.
Razgallah, A., Khoury, R., and P., J.-B. (2022). Twindroid: A dataset of android app system call traces and trace generation pipeline. In Proceedings of the 19th International Conference on Mining Software Repositories.
Belle, V. and Papantonis, I. (2021). Principles and practice of explainable machine learning. Frontiers in Big Data, page 39.
Breiman, L. (2001). Random forests. Machine learning,
45:5–32.
Cavallaro, L., Kinder, J., Pendlebury, F., and Pierazzi, F.
(2023). Are machine learning models for malware de-
tection ready for prime time? IEEE Security & Pri-
vacy, 21(2):53–56.
Chen, T., He, T., Benesty, M., Khotilovich, V., Tang, Y.,
Cho, H., Chen, K., Mitchell, R., Cano, I., Zhou, T.,
et al. (2015). Xgboost: extreme gradient boosting. R
package version 0.4-2, 1(4):1–4.
Dash, S. K., Suarez-Tangil, G., Khan, S., Tam, K., Ahmadi,
M., Kinder, J., and Cavallaro, L. (2016). Droidscribe:
Classifying android malware based on runtime behav-
ior. In 2016 IEEE Security and Privacy Workshops
(SPW), pages 252–261. IEEE.
De Lorenzo, A., Martinelli, F., Medvet, E., Mercaldo, F.,
and Santone, A. (2020). Visualizing the outcome
of dynamic analysis of android malware with viz-
mal. Journal of Information Security and Applica-
tions, 50:102423.
Dimjašević, M., Atzeni, S., Ugrina, I., and Rakamarić, Z. (2016). Evaluation of android malware detection based on system calls. In Proceedings of the 2016 ACM on International Workshop on Security And Privacy Analytics, pages 1–8.
Dwivedi, R., Dave, D., Naik, H., Singhal, S., Omer, R., Pa-
tel, P., Qian, B., Wen, Z., Shah, T., Morgan, G., et al.
(2023). Explainable ai (xai): Core ideas, techniques,
and solutions. ACM Computing Surveys, 55(9):1–33.
Fan, M., Wei, W., Xie, X., Liu, Y., Guan, X., and Liu, T.
(2020). Can we trust your explanations? sanity checks
for interpreters in android malware analysis. IEEE
Transactions on Information Forensics and Security,
16:838–853.
He, H. and Ma, Y. (2013). Imbalanced learning: foundations, algorithms, and applications. Wiley-IEEE Press.
Hou, S., Saas, A., Chen, L., and Ye, Y. (2016). Deep4maldroid: A deep learning framework for android malware detection based on linux kernel system call graphs. In 2016 IEEE/WIC/ACM International Conference on Web Intelligence Workshops (WIW), pages 104–111. IEEE.
Iqbal, S. and Zulkernine, M. (2018). Spydroid: A frame-
work for employing multiple real-time malware de-
tectors on android. In 2018 13th International Con-
ference on Malicious and Unwanted Software (MAL-
WARE), pages 1–8. IEEE.
Kaspersky (2023). Android mobile security threats. https://www.kaspersky.com/resource-center/threats/mobile. Accessed: July 2023.
Kovacs, E. (2023). New Samsung Message Guard protects mobile devices against zero-click exploits. https://www.securityweek.com/new-samsung-message-guard-protects-mobile-devices-against-zero-click-exploits/. Accessed: July 2023.
Liu, Y., Tantithamthavorn, C., Li, L., and Liu, Y. (2022a).
Deep learning for android malware defenses: a sys-
tematic literature review. ACM Computing Surveys,
55(8):1–36.
Liu, Y., Tantithamthavorn, C., Li, L., and Liu, Y. (2022b).
Explainable ai for android malware detection: To-
wards understanding why the models perform so well?
In 2022 IEEE 33rd International Symposium on Soft-
ware Reliability Engineering (ISSRE), pages 169–
180. IEEE.
Mahdavifar, S., Kadir, A. F. A., Fatemi, R., Alhadidi,
D., and Ghorbani, A. A. (2020). Dynamic an-
droid malware category classification using semi-
supervised deep learning. In 2020 IEEE Intl Conf
on Dependable, Autonomic and Secure Computing,
Intl Conf on Pervasive Intelligence and Computing,
Intl Conf on Cloud and Big Data Computing, Intl
Conf on Cyber Science and Technology Congress
(DASC/PiCom/CBDCom/CyberSciTech), pages 515–
522. IEEE.
Miranda, T. C., Gimenez, P.-F., Lalande, J.-F., Tong, V.
V. T., and Wilke, P. (2022). Debiasing android mal-
ware datasets: How can i trust your results if your
dataset is biased? IEEE Transactions on Information
Forensics and Security, 17:2182–2197.
monkey (2023). https://developer.android.com/studio/test/other-testing-tools/monkey. Accessed: July 2023.
Nisi, D., Bianchi, A., and Fratantonio, Y. (2019). Exploring syscall-based semantics reconstruction of android applications. In 22nd International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2019), pages 517–531.
Pan, Y., Ge, X., Fang, C., and Fan, Y. (2020). A systematic
literature review of android malware detection using
static analysis. IEEE Access, 8:116363–116379.
Pendlebury, F., Pierazzi, F., Jordaney, R., Kinder, J., and Cavallaro, L. (2019). TESSERACT: Eliminating experimental bias in malware classification across space and time. In 28th USENIX Security Symposium (USENIX Security 19), pages 729–746.
Razgallah, A. and Khoury, R. (2021). Behavioral classifi-
cation of android applications using system calls. In
2021 28th Asia-Pacific Software Engineering Confer-
ence (APSEC), pages 43–52. IEEE.
Razgallah, A., Khoury, R., Hallé, S., and Khanmohammadi, K. (2021). A survey of malware detection in android apps: Recommendations and perspectives for future research. Computer Science Review, 39:100358.
Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pages 1135–1144.
Romdhana, A., Merlo, A., Ceccato, M., and Tonella,
P. (2022). Deep reinforcement learning for black-
box testing of android apps. ACM Transactions on
Software Engineering and Methodology (TOSEM),
31(4):1–29.
Khalid, S. and Hussain, F. B. (2022). Evaluating dynamic analysis features for android malware categorization. In Proceedings of the International Wireless Communications and Mobile Computing Conference (IWCMC).
scikit-learn (2023). https://scikit-learn.org/stable/. Ac-
cessed: Oct 2023.
Statista (2023). Global mobile os market share 2023. https://www.statista.com/statistics/272698/global-market-share-held-by-mobile-operating-systems-since-2009/. Accessed: July 2023.
Accessed: July 2023.
strace (2023). https://man7.org/linux/man-pages/man1/strace.1.html. Accessed: July 2023.
syscalls(2)–Linux manual page (2023). https://man7.org/linux/man-pages/man2/syscalls.2.html. Accessed: Nov 2023.
Tam, K., Khan, S. J., Fattori, A., and Cavallaro, L. (2015).
Copperdroid: Automatic reconstruction of android
malware behaviors. In Ndss, pages 1–15.
VirusShare (2023). https://virusshare.com/. Accessed: Nov
2023.
VirusTotal (2023). Virustotal - home. https://www.virustotal.com/gui/home/upload. Accessed: August 2023.
Visani, G., Bagli, E., Chesani, F., Poluzzi, A., and Ca-
puzzo, D. (2022). Statistical stability indices for lime:
Obtaining reliable explanations for machine learning
models. Journal of the Operational Research Society,
73(1):91–101.
Wu, B., Chen, S., Gao, C., Fan, L., Liu, Y., Wen, W., and
Lyu, M. R. (2021). Why an android app is classified
as malware: Toward malware classification interpreta-
tion. ACM Transactions on Software Engineering and
Methodology (TOSEM), 30(2):1–29.
Yan, P. and Yan, Z. (2018). A survey on dynamic mo-
bile malware detection. Software Quality Journal,
26(3):891–919.
Zhang, X., Mathur, A., Zhao, L., Rahmat, S., Niyaz, Q.,
Javaid, A., and Yang, X. (2022). An early detection
of android malware using system calls based machine
learning model. In Proceedings of the 17th Interna-
tional Conference on Availability, Reliability and Se-
curity, pages 1–9.