Distance Metric Learning using Particle Swarm Optimization to Improve
Static Malware Detection
Martin Jureček and Róbert Lórencz
Faculty of Information Technology, Czech Technical University in Prague, Czech Republic
Keywords:
Distance Metric Learning, Malware Detection, Static Analysis, Heterogeneous Distance Function, Particle
Swarm Optimization, k-Nearest Neighbor.
Abstract:
Distance metric learning is concerned with finding appropriate parameters of a distance function with respect
to a particular task. In this work, we present a malware detection system based on static analysis. We use a
k-nearest neighbors (KNN) classifier with a weighted heterogeneous distance function that can handle nominal
and numeric features extracted from the portable executable file format. Our proposed approach determines the
weights of the features using the particle swarm optimization algorithm. The experimental results indicate that
KNN with the weighted distance function improves classification accuracy significantly.
1 INTRODUCTION
In recent years, the trend has been to build malware detection frameworks on machine learning
algorithms. Thanks to cloud-based computing, which makes large-scale data processing more affordable,
employing machine learning for malware detection has become practical to deploy. The problem to be solved
is to detect malware that has never been seen before. While signature-based detection systems
(Kephart and Arnold, 1994) identify known malicious programs, these systems can be bypassed by unknown
malware. However, signature-based methods remain popular because of their low false positive rate. Instead
of using static signatures, an effective alternative is to use machine learning methods to detect malware.
Malware detection techniques are typically classified into two categories depending on how the code
is analyzed: static and dynamic analysis. Static analysis (Nath and Mehtre, 2014), (Alrabaee et al., 2016)
aims at extracting information about the structure of a file. Disassembly is one static-analysis technique
used to extract various features from executables. Dynamic analysis (Or-Meir et al., 2019),
(Egele et al., 2012) examines a program while it is executed in a real or virtual environment.
Our research is based on static analysis, and the feature vectors used in the experiments contain data from
the portable executable (PE) file format. Several works (Saad et al., 2019), (Damodaran et al., 2017) have
described various limitations of static analysis. The most important drawback is that data captured by
static analysis does not describe the complete behavior of a program, since the program is not executed.
However, dynamic analysis is more time-consuming than static analysis, and there are anti-virtual-machine
technologies that evade detection systems based on dynamic analysis. Consequently, dynamic analysis can be
impractical for the large volume of samples that reach antivirus vendors every day. For these reasons,
static analysis still has its place in malware detection systems.
A good similarity measure plays an important role in the performance of geometric-based classifiers,
such as k-nearest neighbors (KNN). The similarity between two feature vectors is determined by the
distance between them. The distance between two feature vectors with the same class label should be
minimized, while the distance between two feature vectors of different classes should be maximized.
A distance metric learning algorithm aims at finding the most appropriate parameters of the metric
with respect to some optimization criterion. This task is typically formulated as an optimization problem,
and in this work it is tied to the malware detection problem: we learn the distance function used in the
KNN classifier for malware detection. Note that learning the distance metric is an important preprocessing
step which is often ignored in practice.
The main contribution of this paper is in finding appropriate weights for the heterogeneous distance
function used in the KNN classifier and, as a result, improving the classification accuracy of the malware
detection system. Searching for the most suitable weights with respect to classification accuracy can be
considered an optimization problem. Evolutionary algorithms, swarm algorithms, and other heuristics (Luke,
2013) are suitable for this task. In our experiments, a biologically motivated algorithm called particle
swarm optimization (PSO) was used to solve the problem. Experimental results indicate that the performance
of KNN using the weighted distance function is considerably better than the performance of KNN without
weights.
The rest of the paper is organized as follows. Section 2 briefly reviews related work in the field of
malware detection based on data mining techniques; some weighted distance functions and distance metric
learning techniques are also reviewed there. Our proposed malware detection model and the theoretical
background are presented in Section 3. The experimental setup and results are presented in Section 4.
Conclusions and future work are given in Section 5.
2 RELATED WORK
In this section, we briefly review works related to malware detection based on machine learning
techniques. We also review several approaches to finding the most suitable feature weights for a distance
function used in KNN or in other techniques working with distances.
2.1 Malware Detection
Over the past two decades, a large number of malware detection techniques have been proposed. To
evade malware classifiers, malware writers usually employ obfuscation techniques such as encryption,
binary packers, or self-modifying code. In recent years, many malware researchers have focused on
data mining and machine learning algorithms to detect unknown malware (Gandotra et al., 2014), (Ye
et al., 2017).
(Schultz et al., 2000) were the first to introduce data mining techniques for the detection of malicious
code. The authors used three different kinds of features: information from the PE header, string features,
and byte sequences extracted from binaries. They applied three machine learning algorithms, Naive Bayes,
Multinomial Naive Bayes, and a rule induction algorithm called Ripper (Cohen, 1996), and compared them
with a signature-based method. Their results indicate that the data mining detection rate on previously
unknown malware was twice as high as that of the signature-based method.
(Shafiq et al., 2009) extracted structural information from the PE file format and selected the features
most important for distinguishing between benign files and malware. The authors used three feature
selection algorithms, Redundant Feature Removal (RFR), Principal Component Analysis (PCA), and Haar
Wavelet Transform (HWT) (Witten et al., 2016), and applied five machine learning classifiers: an
instance-based learner (IBk), a decision tree (J48), naive Bayes (NB), an inductive rule learner (RIPPER),
and a support vector machine (SVM) trained with sequential minimal optimization. The authors concluded
that J48 outperforms the rest of the classifiers in terms of detection accuracy.
More recently, (Zhong and Gu, 2019) improved the performance of deep learning models by organizing
them into a tree structure called the Multiple-Level Deep Learning System (MLDLS). Each deep learning
model focuses on a specific malware family, so the MLDLS can handle complex malware data distributions.
Experimental results indicate that the proposed method outperforms SVM, decision trees, a single deep
learning model, and an ensemble-based approach.
2.2 Weighted Distance Functions for
KNN Classifier
The k-nearest neighbors classifier (Cover and Hart, 1967) is one of the simplest and best-known
nonparametric algorithms in machine learning. Several approaches have been proposed to increase the
performance of KNN. The work (Ghosh, 2006) presents a technique for estimating the optimal parameter k.
Many studies focus on the similarity measure itself: (Yu et al., 2008) study distance measures based on
statistical analysis and present a boosted heterogeneous measure for similarity estimation, and (Hsu and
Chen, 2008) derive conditions for the stability of distance functions in high-dimensional space.
Several distance functions are surveyed in (Wilson and Martinez, 1997). To improve results, many
weighting schemes have been proposed; a review of feature weighting methods for lazy learning algorithms
is given in (Wettschereck et al., 1997).
2.3 Distance Metric Learning
Distance metric learning is an active research area (Yang and Jin, 2006), (Kulis et al., 2013). Distance
metric learning is defined as follows. Let $T = \{(x_1, c_1), \dots, (x_m, c_m)\}$ be a training set of $m$
feature vectors $x_i$ in a $d$-dimensional metric space $S$, with class labels $c_i$. The goal is to learn a
linear transformation $L : S \rightarrow S$, where the squared distance between two feature vectors $x_i$ and
$x_j$ is defined as $d(x_i, x_j) = \|L(x_i - x_j)\|^2$. Note that $d$ is a valid metric if and only if the
matrix $L$ has full rank. We reformulate the squared distance as $D(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j)$,
where $M = L^T L$. The matrix $M$ is guaranteed to be positive semidefinite, and the distance $D$ is called a
Mahalanobis metric. Note that when $M$ equals the identity matrix, $D$ reduces to the Euclidean distance
metric. The goal is to find a matrix $M$, estimated from the data, that leads to the highest classification
accuracy of the KNN classifier.
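For illustration, the following minimal sketch (Python with NumPy) computes the Mahalanobis distance from a
learned linear transformation; the matrix L below is a hypothetical example, not a transformation learned in
this paper.

import numpy as np

def mahalanobis_sq(x_i: np.ndarray, x_j: np.ndarray, M: np.ndarray) -> float:
    """Squared Mahalanobis distance D(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j)."""
    diff = x_i - x_j
    return float(diff @ M @ diff)

# Hypothetical learned linear transformation L (full rank, so D is a valid metric).
L = np.array([[2.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 3.0]])
M = L.T @ L  # M is positive semidefinite by construction

x_i = np.array([1.0, 2.0, 0.0])
x_j = np.array([0.0, 1.5, 1.0])

# Equal to ||L(x_i - x_j)||^2, the squared distance under the transformation L.
print(mahalanobis_sq(x_i, x_j, M))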
Large Margin Nearest Neighbor (LMNN) (Weinberger et al., 2006) classification learns a Mahalanobis
distance metric for KNN classification. LMNN consists of two steps. In the first step, a set of k
similarly labeled neighbors is identified for each feature vector. In the second step, the Mahalanobis
distance metric is learned using convex optimization.
Similar to our approach, (Xu et al., 2017) searched for a suitable weight vector using PSO. However,
there are several differences: we use a heterogeneous distance function that can handle both nominal and
numeric features, we use a different classifier for evaluation, we apply a different modification of the
PSO algorithm, and our goal is to improve malware detection for a different operating system. (Kong and
Yan, 2013) proposed a malware detection method based on structural information, in which a discriminant
distance metric is learned to cluster malware samples belonging to the same malware family.
3 THE PROPOSED MALWARE
DETECTION MODEL
In this section, we present our proposed malware detection system and describe all of its components. The
architecture of the detection system is illustrated in Fig. 1.
The detection system consists of a metric learning phase and a classification phase. First, relevant
features are extracted from the binaries. Then we split the dataset into two disjoint subsets: $T_{pso}$
for the metric learning phase and $T_{eval}$ for the classification phase. In the metric learning phase,
feature weights are learned from the data. The feature selection method described in Section 3.3 is
performed and, as a result, the dimension of the feature vectors is reduced. The data is then split into
training (80%) and testing (20%) subsets, which are used to compute the fitness function of the PSO
algorithm.
In the classification phase, we evaluate the best weight vector from the metric learning phase using
the KNN classifier. First, we again reduce the dimension of the feature vectors according to the feature
selection results from the metric learning phase. Then we apply KNN with fivefold cross-validation (Picard
and Cook, 1984) to obtain reliable experimental results: $T_{eval}$ is randomly divided into five subsets
of equal size, where four subsets are used for training and one subset for testing. The experiment is
repeated five times on different subsets of the data, and the accuracies obtained for each fold are
averaged to produce a single cross-validation estimate.
We do not use cross-validation in the metric learning phase, since evaluation of the fitness function
is very time-consuming. Note that if we used the same training dataset in the metric learning phase and
in the classification phase, we could possibly obtain better classification results than when the training
datasets of the two phases are disjoint. The aim of this architecture is to show the robustness of the
resulting feature weights.
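A minimal sketch of the classification-phase evaluation, assuming scikit-learn; the arrays X_eval and
y_eval and the stand-in weighted Euclidean distance are placeholders for illustration only, not the paper's
feature vectors or the heterogeneous metric of Section 3.1.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data standing in for T_eval: feature matrix X_eval and labels y_eval.
rng = np.random.default_rng(0)
X_eval = rng.random((200, 10))
y_eval = rng.integers(0, 2, size=200)

# Placeholder feature weights (the paper learns these with PSO).
weights = np.ones(X_eval.shape[1])

def weighted_euclidean(x, y):
    # Weighted Euclidean distance as a stand-in for the weighted HVDM of Eq. (1).
    return np.sqrt(np.sum((weights * (x - y)) ** 2))

knn = KNeighborsClassifier(n_neighbors=5, metric=weighted_euclidean)
scores = cross_val_score(knn, X_eval, y_eval, cv=5)  # fivefold cross-validation
print("mean accuracy:", scores.mean())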
3.1 Heterogeneous Distance Metric
In this section, we describe the weighted heterogeneous distance function used in our experiments. The
distance without weights was proposed in (Jureček and Lórencz, 2018). Let $x$ and $y$ be two feature
vectors of dimension $m$, and let $w_a$ be the weight corresponding to the attribute (feature) $a$. The
weighted distance is defined as follows:
$$D(x, y) = \sqrt{\sum_{a=1}^{m} w_a^2 \, d_a^2(x_a, y_a)} \qquad (1)$$
where
$$d_a(x, y) =
\begin{cases}
H(x, y) & \text{if } a \text{ is a bit array} \\
\delta(x, y) & \text{if } a \text{ is a checksum} \\
\mathrm{Norm\_diff}_a(x, y) & \text{if } a \text{ is numeric} \\
\mathrm{Norm\_vdm}_a(x, y) & \text{otherwise.}
\end{cases} \qquad (2)$$
$H(x, y)$ denotes the Hamming distance, defined for binary vectors $x = (x_1, \dots, x_n)$,
$y = (y_1, \dots, y_n)$ as
$$H(x, y) = |\{i \mid x_i \neq y_i,\ i = 1, \dots, n\}| \qquad (3)$$
Figure 1: Architecture of our proposed malware detection system. (The diagram shows the metric learning
phase: parsing binaries, feature selection, dataset split, and PSO; and the classification phase:
dimensionality reduction, dataset split, and evaluation with KNN using fivefold cross-validation.)
and $\delta(x, y)$ is the characteristic function defined as
$$\delta(x, y) =
\begin{cases}
0 & \text{if } x = y \\
1 & \text{otherwise.}
\end{cases} \qquad (4)$$
The Value Difference Metric (VDM) was introduced by (Stanfill and Waltz, 1986), and the normalized VDM is
defined as
$$\mathrm{Norm\_vdm}_a(x, y) = \sum_{c=1}^{C} \left| \frac{n_{a,x,c}}{n_{a,x}} - \frac{n_{a,y,c}}{n_{a,y}} \right| \qquad (5)$$
where
- $C$ is the number of classes,
- $n_{a,x,c}$ is the number of instances in the training set $T$ that have value $x$ for attribute $a$ and belong to class $c$,
- $n_{a,x}$ is the number of instances in $T$ that have value $x$ for attribute $a$.
The function $\mathrm{Norm\_diff}_a(x, y)$ is defined as
$$\mathrm{Norm\_diff}_a(x, y) = \frac{|x - y|}{4\sigma_a} \qquad (6)$$
where $\sigma_a$ is the standard deviation of the values of the numeric attribute $a$.
The distance $D$ is a modification of the Heterogeneous Value Difference Metric (HVDM) (Wilson and
Martinez, 1997), and it can be used on our PE feature space since it handles both numeric and nominal
attributes.
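A minimal sketch of the weighted heterogeneous distance of Eqs. (1)-(6), assuming NumPy; the helper names,
the counts structure for the VDM, and the example attribute typing are illustrative, not part of the paper.

import numpy as np

def hamming(x: np.ndarray, y: np.ndarray) -> int:
    """Eq. (3): number of differing positions in two bit arrays."""
    return int(np.sum(x != y))

def delta(x, y) -> int:
    """Eq. (4): 0 if the values are equal, 1 otherwise (used for checksum attributes)."""
    return 0 if x == y else 1

def norm_diff(x: float, y: float, sigma_a: float) -> float:
    """Eq. (6): numeric difference normalized by four standard deviations."""
    return abs(x - y) / (4.0 * sigma_a)

def norm_vdm(x, y, counts_a: dict, class_labels) -> float:
    """Eq. (5): normalized Value Difference Metric for a nominal attribute.
    counts_a[v][c] holds n_{a,v,c}; counts_a[v]["total"] holds n_{a,v}."""
    total = 0.0
    for c in class_labels:
        p_x = counts_a[x][c] / counts_a[x]["total"]
        p_y = counts_a[y][c] / counts_a[y]["total"]
        total += abs(p_x - p_y)
    return total

def weighted_distance(x, y, weights, component_dist) -> float:
    """Eq. (1): D(x, y) = sqrt(sum_a w_a^2 * d_a(x_a, y_a)^2).
    component_dist[a] is a callable returning d_a for attribute a, mirroring Eq. (2)."""
    s = 0.0
    for a, w_a in enumerate(weights):
        d_a = component_dist[a](x[a], y[a])
        s += (w_a * d_a) ** 2
    return float(np.sqrt(s))

# Example dispatch mirroring Eq. (2) for a 3-attribute vector: [bit array, checksum, numeric].
sigma_2 = 4.0  # illustrative standard deviation of the numeric attribute
component_dist = [
    lambda a, b: hamming(np.asarray(a), np.asarray(b)),
    delta,
    lambda a, b: norm_diff(a, b, sigma_2),
]
x = [[1, 0, 1, 1], 0xAB12, 10.0]
y = [[1, 1, 1, 0], 0xAB12, 18.0]
print(weighted_distance(x, y, weights=[1.0, 0.5, 2.0], component_dist=component_dist))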
3.2 Particle Swarm Optimization
Particle swarm optimization (PSO) is a stochastic
optimization algorithm proposed by (Eberhart and
Kennedy, 1995). Many variants and modifications of
PSO are described in (Wang et al., 2018).
PSO is a biologically motivated algorithm based on swarm intelligence. Each particle is represented
as a point in the search space, and the quality of each point is determined by a fitness function. Each
particle updates its position, which is influenced by its current velocity, the particle's own best
previous position, and the position of the most successful particle in the swarm. The concepts and notation
of the PSO elements, with respect to our distance metric learning problem applied to malware detection, are
as follows:
- Particle represents a vector of weights $w$. The current position of the $i$-th particle is denoted by
  $x_i$, and $v_i$ denotes its current velocity.
- Swarm (or population) is the array of all particles considered in the PSO algorithm.
- Local best position $p_i$ of the $i$-th particle is its best position among all positions visited so far,
  and $pbest_i$ is the corresponding value of the fitness function $f$, i.e. $pbest_i = f(p_i)$.
- Global best position $p_g$ is the position of the most successful particle in the swarm, and
  $gbest = f(p_g)$.
- Fitness function $f$ is the objective function used to measure the quality of a particle. In our malware
  detection problem, the fitness function is defined as the accuracy of the KNN classifier.
The pseudocode of the PSO algorithm is presented
in Algorithm 1.
Algorithm 1: PSO algorithm.
Input: fitness function f, T_pso
Output: vector of weights
 1: initialize particles with random positions x_i and velocities v_i
 2: repeat
 3:   for each particle x_i do
 4:     compute fitness function f(x_i)
 5:     if f(x_i) > pbest_i then
 6:       pbest_i = f(x_i)
 7:       p_i = x_i
 8:     end if
 9:   end for
10:   select the most successful particle in the swarm so far, and denote it by p_g
11:   for each particle x_i do
12:     v_i = v_i + Rand(0, φ_1) ⊗ (p_i − x_i) + Rand(0, φ_2) ⊗ (p_g − x_i)
13:     x_i = x_i + v_i
14:   end for
15: until maximum number of iterations or sufficiently good fitness is attained
16: return global best position
Rand(0, ε) represents a vector of random numbers uniformly distributed in [0, ε]. The operation ⊗ denotes
component-wise multiplication. Note that each particle is able to memorize its best previous position and
also knows the best position of the whole swarm so far. Each component of the velocity $v$ is kept in the
range $[-V_{max}, V_{max}]$, where the parameter $V_{max}$ influences the search ability of the particles.
To better control the scope of the search and reduce the importance of $V_{max}$, (Shi and Eberhart, 1998)
proposed the following modification of the particle velocity equation (step 12 of Algorithm 1):
$$v_i = \omega v_i + \mathrm{Rand}(0, \varphi_1) \otimes (p_i - x_i) + \mathrm{Rand}(0, \varphi_2) \otimes (p_g - x_i), \qquad (7)$$
where $\omega$ is an inertia weight. Higher values of $\omega$ favor global search, while lower values
favor local search. The parameters $\varphi_1$, $\varphi_2$ and $\omega$ are weights used to balance the
global and the local search.
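A minimal sketch of the velocity and position update of Eq. (7), assuming NumPy; the swarm size,
dimensionality, hyperparameter values, and fitness function below are placeholders, not the paper's actual
configuration (the paper uses KNN accuracy as the fitness).

import numpy as np

rng = np.random.default_rng(42)

n_particles, dim = 40, 20          # illustrative values (the paper uses a swarm of 40)
phi1, phi2, v_max = 1.5, 1.5, 2.0  # placeholder hyperparameters
omega = 1.0  # the paper decreases omega linearly from 1.0 to 0.8; held constant here for brevity

def fitness(position: np.ndarray) -> float:
    """Placeholder fitness; the paper uses KNN accuracy with the weighted distance."""
    return -float(np.sum(position ** 2))

x = rng.random((n_particles, dim))          # positions (candidate weight vectors)
v = rng.uniform(-1, 1, (n_particles, dim))  # velocities
p = x.copy()                                # local best positions
pbest = np.array([fitness(xi) for xi in x])
p_g = p[np.argmax(pbest)]                   # global best position

for _ in range(30):  # number of iterations
    for i in range(n_particles):
        f_i = fitness(x[i])
        if f_i > pbest[i]:
            pbest[i], p[i] = f_i, x[i].copy()
    p_g = p[np.argmax(pbest)]
    # Eq. (7): inertia term plus cognitive and social components (component-wise products).
    v = (omega * v
         + rng.uniform(0, phi1, v.shape) * (p - x)
         + rng.uniform(0, phi2, v.shape) * (p_g - x))
    v = np.clip(v, -v_max, v_max)  # keep each velocity component in [-V_max, V_max]
    x = x + v

print("best fitness:", pbest.max())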
PSO was chosen among other optimization heuristics because its convergence is fast and the algorithm is
easy to implement and to execute in parallel. The drawback of the algorithm is that it is prone to getting
stuck in local optima.
3.3 Feature Selection
In the proposed approach, the feature vector consists of information from the PE file format
(Microsoft, 1999), which is the most widely used file format among malware. The gain ratio (GR)
(Quinlan, 1986) is used to determine the most useful features with respect to discriminating between
malware and benign files.
The gain ratio is a modification of an entropy-based measure called information gain (IG) (Mitchell, 1997).
The information gain $IG(T, a)$ is the expected reduction in entropy caused by knowing the value of an
attribute $a$ relative to the training dataset $T$, and it is defined as
$$IG(T, a) = \mathrm{Entropy}(T) - \sum_{v \in V(a)} \frac{|T_v|}{|T|} \, \mathrm{Entropy}(T_v), \qquad (8)$$
where $V(a)$ denotes the set of all possible values for attribute $a$, and $T_v$ denotes the subset of $T$
for which attribute $a$ has value $v$.
The gain ratio penalizes attributes with large numbers of possible values by incorporating a term called
split information (SI):
$$SI(T, a) = - \sum_{i=1}^{d} \frac{|T_i|}{|T|} \log_2 \frac{|T_i|}{|T|}, \qquad (9)$$
where $T_i$ are the $d$ subsets of the training dataset $T$ resulting from partitioning $T$ by the
$d$-valued attribute $a$. The split information $SI(T, a)$ is the entropy of $T$ with respect to the values
of attribute $a$. The gain ratio is then defined as
$$GR(T, a) = \frac{IG(T, a)}{SI(T, a)}. \qquad (10)$$
The higher the gain ratio, the more relevant the feature.
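A minimal sketch of Eqs. (8)-(10) for a single nominal attribute, assuming NumPy; the toy attribute values
and labels are invented purely for illustration.

import numpy as np

def entropy(labels: np.ndarray) -> float:
    """Shannon entropy of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def gain_ratio(attribute: np.ndarray, labels: np.ndarray) -> float:
    """Eqs. (8)-(10): information gain divided by split information."""
    values, counts = np.unique(attribute, return_counts=True)
    weights = counts / counts.sum()
    # Eq. (8): information gain.
    ig = entropy(labels) - sum(
        w * entropy(labels[attribute == v]) for v, w in zip(values, weights)
    )
    # Eq. (9): split information (entropy of the attribute's value distribution).
    si = float(-np.sum(weights * np.log2(weights)))
    return ig / si if si > 0 else 0.0

# Toy example: a binary attribute against malware (1) / benign (0) labels.
attr = np.array([0, 0, 1, 1, 1, 0])
y = np.array([0, 0, 1, 1, 0, 1])
print(gain_ratio(attr, y))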
The following feature set with the highest gain ratio was extracted and used in our experiments:
- Fields from the PE headers: number of sections; date/time stamp; major and minor versions of the linker,
  operating system, image, and subsystem; sizes and addresses of data directories; DLL characteristics; and
  many others.
- Features from sections and their headers: VirtualSize, VirtualAddress, SizeOfRawData, PointerToRawData,
  Section Flags.
- Resources: the number of resources and the number of types of resources.
- Overlay: the size of the overlay.
- Other features: entropies and checksums of sections, the size of all imports, the number of DLLs
  referred, and the number of APIs referred.
A detailed description of these features can be found in the documentation (Microsoft, 1999).
3.4 Performance Metrics
In this section, we present the performance metrics used to evaluate our proposed approach for the
detection of unknown malicious code. For evaluation purposes, the following standard quantities are
employed:
- True Positive (TP) represents the number of malicious samples classified as malware,
- True Negative (TN) represents the number of benign samples classified as benign,
- False Positive (FP) represents the number of benign samples classified as malware,
- False Negative (FN) represents the number of malicious samples classified as benign.
The performance of our classifier on the test set is measured using three standard parameters. The most
intuitive and commonly used evaluation measure in machine learning is the accuracy (ACC):
$$ACC = \frac{TP + TN}{TP + TN + FP + FN} \qquad (11)$$
It is defined on a given test set as the percentage of correctly classified instances. The second
parameter, the True Positive Rate (TPR), also called the detection rate, is defined as
$$TPR = \frac{TP}{TP + FN} \qquad (12)$$
TPR is the percentage of truly malicious samples that were classified as malware. The third parameter is
the False Positive Rate (FPR), defined as
$$FPR = \frac{FP}{TN + FP} \qquad (13)$$
FPR is the percentage of benign samples that were wrongly classified as malware.
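A minimal sketch computing ACC, TPR, and FPR from confusion-matrix counts (Eqs. (11)-(13)); the example
counts are invented for illustration only.

def classification_metrics(tp: int, tn: int, fp: int, fn: int):
    """Accuracy, true positive rate, and false positive rate from confusion counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)  # Eq. (11)
    tpr = tp / (tp + fn)                   # Eq. (12): detection rate
    fpr = fp / (tn + fp)                   # Eq. (13)
    return acc, tpr, fpr

# Invented counts, only to show the calculation.
print(classification_metrics(tp=960, tn=950, fp=50, fn=40))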
4 EXPERIMENTAL SETUP AND
RESULTS
In this section, we describe the experimental setup and present the results of our experiments.
4.1 Experimental Setup – Dataset and
Implementation
In this research, we use a dataset consisting of 150,145 Windows programs in the PE file format, of which
74,978 are malware and 75,167 are benign programs. The malicious and benign programs were obtained from
the laboratory of our industrial partner and from (VirusShare, 2019).
There are many variants and modifications of the PSO algorithm. Initialization of the population usually
amounts to random generation of particles and their velocities; however, there are more advanced methods,
such as the nonlinear simplex method, centroidal Voronoi tessellations, and many others (Wang et al.,
2018). In our implementation of modified PSO, the results of the feature selection algorithm (described in
Section 3.3) are used to initialize the particles instead of random initialization. The vector of gain
ratio values can be considered a particle $p$, and each particle is initialized as
$p \otimes \mathrm{Rand}(0, \varepsilon)$, where $\varepsilon$ is a small constant. The purpose of this
initialization is to accelerate PSO, i.e. the search space is reduced using the results of the feature
selection algorithm.
Another modification in our implementation is feature scaling of the weight vector. Since each component
$w_i$ of the weight vector has to be non-negative, when computing the fitness function we use a normalized
weight vector where each component is rescaled using min-max normalization:
$$x_{norm} = \frac{x - \min}{\max - \min}, \qquad (14)$$
where $x$ is an original value, and $\min$ and $\max$ are the minimal and maximal values of the original
vector, respectively.
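A minimal sketch, assuming NumPy, of the two modifications described above: gain-ratio-based particle
initialization and min-max rescaling of the weight vector. The gain-ratio values and ε are placeholders,
and the component-wise multiplication in the initialization is a reconstruction of the combination operator
used in the paper, so treat it as an assumption.

import numpy as np

rng = np.random.default_rng(1)

# Placeholder gain-ratio values for each feature (Section 3.3 would supply the real ones).
gain_ratio_values = rng.random(20)
epsilon = 0.1  # illustrative small constant

# PSO-GR initialization: p combined component-wise with Rand(0, eps) (operator assumed to be ⊗).
particle = gain_ratio_values * rng.uniform(0, epsilon, size=gain_ratio_values.shape)

def min_max_normalize(w: np.ndarray) -> np.ndarray:
    """Eq. (14): rescale weight components to [0, 1] so they are non-negative."""
    return (w - w.min()) / (w.max() - w.min())

weights = min_max_normalize(particle)  # normalized weights used in the fitness evaluation
print(weights.min(), weights.max())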
There are several control techniques (Robinson and Rahmat-Samii, 2004) that prevent particles from leaving
the search space. In our implementation, the positions of the particles are not constrained. We use a
static topological structure where each particle is fully informed, i.e. it uses information from the
entire neighborhood.
Our implementation was executed on a single computer platform with two processors (Intel Xeon Gold 6136,
3.0 GHz, 12 cores each) and 32 GB of RAM, running the Ubuntu Server 18.04 LTS operating system.
4.2 Experimental Results
To fine-tune the hyperparameters of our malware detection model, grid search (Bergstra and
Bengio, 2012) was used to explore the following PSO
parameters:
- $\varphi_1, \varphi_2 \in \{0.5, 1.0, 1.5, 2.0\}$,
- $V_{max} \in \{0.5, 1.0, 2.0, 4.0\}$.
The rest of the PSO parameters are kept constant: the population size is 40, and the number of iterations
is 30. At the first iteration, the inertia weight $\omega$ is set to one, and it decreases linearly at each
iteration down to $\omega_{min} = 0.8$. All these parameters were chosen following the guidelines from
(Wang et al., 2018) and (Poli et al., 2007). However, the choice of PSO parameters is problem-specific, and
determining them appropriately and effectively is an open problem (Wang et al., 2018).
We evaluated the performance of KNN (k = 5) with respect to the following initialization techniques in PSO:
- PSO-RAND denotes PSO where the position and velocity of each particle are initialized randomly.
- PSO-GR denotes PSO where the velocity of each particle $p$ is initialized randomly, but its position is
  initialized as $p \otimes \mathrm{Rand}(0, \varepsilon)$, where $\varepsilon$ is a small constant and $p$
  is the vector of gain ratio values.
For our experiments, we used the heterogeneous distance function described in Section 3.1. The
classification results of KNN with fivefold cross-validation are listed in Table 1.
Table 1: Classification results of the KNN classifier with and without feature weights.

KNN              TPR      FPR     ACC
without weights  96.51%   4.03%   96.24%
PSO-RAND         96.69%   3.46%   96.59%
PSO-GR           96.67%   3.30%   96.72%
Using the weighted distance, we reduced the average KNN classification error rate from 3.76% to 3.28%,
i.e. the error rate decreased by 12.77%.
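This relative reduction follows directly from the two error rates:
$$\frac{3.76\% - 3.28\%}{3.76\%} \approx 12.77\%.$$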
5 CONCLUSIONS
We applied the PSO algorithm to the problem of finding the most appropriate feature weights for the
heterogeneous distance function defined over features extracted from the PE file format. Our results
indicate that the classification performance of KNN can be improved by using a weighted distance function.
Compared with the experiment using KNN without weights, the classification error rate of KNN with weights
decreased by 12.77%. As a result, the accuracy of a malware detection system using a geometric-based
classifier such as KNN can be increased significantly by using appropriate feature weights.
For future work, it would be interesting to experiment with several distance metric learning algorithms
and incorporate them into our proposed malware detection system. Another important goal for future work is
to learn weights that vary between malware families.
ACKNOWLEDGEMENTS
The authors acknowledge the support of the OP VVV MEYS funded project CZ.02.1.01/0.0/0.0/16 019/0000765
"Research Center for Informatics".
REFERENCES
Alrabaee, S., Shirani, P., Debbabi, M., and Wang, L.
(2016). On the feasibility of malware authorship at-
tribution. In International Symposium on Foundations
and Practice of Security, pages 256–272. Springer.
Bergstra, J. and Bengio, Y. (2012). Random search for
hyper-parameter optimization. Journal of Machine
Learning Research, 13(Feb):281–305.
Cohen, W. W. (1996). Learning trees and rules with set-
valued features. In AAAI/IAAI, Vol. 1, pages 709–716.
Cover, T. and Hart, P. (1967). Nearest neighbor pattern clas-
sification. IEEE transactions on information theory,
13(1):21–27.
Damodaran, A., Di Troia, F., Visaggio, C. A., Austin,
T. H., and Stamp, M. (2017). A comparison of
static, dynamic, and hybrid analysis for malware de-
tection. Journal of Computer Virology and Hacking
Techniques, 13(1):1–12.
Eberhart, R. and Kennedy, J. (1995). Particle swarm
optimization. In Proceedings of the IEEE inter-
national conference on neural networks, volume 4,
pages 1942–1948. Citeseer.
Egele, M., Scholte, T., Kirda, E., and Kruegel, C. (2012). A
survey on automated dynamic malware-analysis tech-
niques and tools. ACM computing surveys (CSUR),
44(2):6.
Gandotra, E., Bansal, D., and Sofat, S. (2014). Malware
analysis and classification: A survey. Journal of In-
formation Security, 5(02):56.
Ghosh, A. K. (2006). On optimum choice of k in near-
est neighbor classification. Computational Statistics
& Data Analysis, 50(11):3113–3123.
Hsu, C.-M. and Chen, M.-S. (2008). On the design and ap-
plicability of distance functions in high-dimensional
data space. IEEE Transactions on Knowledge and
Data Engineering, 21(4):523–536.
Jureček, M. and Lórencz, R. (2018). Malware detection using a heterogeneous distance function. Computing
and Informatics, 37(3):759–780.
Kephart, J. O. and Arnold, W. C. (1994). Automatic extrac-
tion of computer virus signatures. In 4th virus bulletin
international conference, pages 178–184.
Kong, D. and Yan, G. (2013). Discriminant malware dis-
tance learning on structural information for automated
malware classification. In Proceedings of the 19th
ACM SIGKDD international conference on Knowl-
edge discovery and data mining, pages 1357–1365.
ACM.
Kulis, B. et al. (2013). Metric learning: A survey. Foundations and Trends in Machine Learning,
5(4):287–364.
Luke, S. (2013). Essentials of Metaheuristics. Lulu, second
edition.
Microsoft (1999). Microsoft portable executable and com-
mon object file format specification.
Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, New York.
Nath, H. V. and Mehtre, B. M. (2014). Static malware
analysis using machine learning methods. In Interna-
tional Conference on Security in Computer Networks
and Distributed Systems, pages 440–450. Springer.
Or-Meir, O., Nissim, N., Elovici, Y., and Rokach, L. (2019).
Dynamic malware analysis in the modern era—a state
of the art survey. ACM Computing Surveys (CSUR),
52(5):88.
Picard, R. R. and Cook, R. D. (1984). Cross-validation of
regression models. Journal of the American Statistical
Association, 79(387):575–583.
Poli, R., Kennedy, J., and Blackwell, T. (2007). Particle
swarm optimization. Swarm intelligence, 1(1):33–57.
Quinlan, J. R. (1986). Induction of decision trees. Machine
learning, 1(1):81–106.
Robinson, J. and Rahmat-Samii, Y. (2004). Particle swarm
optimization in electromagnetics. IEEE transactions
on antennas and propagation, 52(2):397–407.
Saad, S., Briguglio, W., and Elmiligi, H. (2019). The cu-
rious case of machine learning in malware detection.
In Proceedings of the 5th International Conference on
Information Systems Security and Privacy - Volume 1:
ICISSP, pages 528–535. INSTICC, SciTePress.
Schultz, M. G., Eskin, E., Zadok, F., and Stolfo, S. J. (2000).
Data mining methods for detection of new malicious
executables. In Proceedings 2001 IEEE Symposium
on Security and Privacy. S&P 2001, pages 38–49.
IEEE.
Shafiq, M. Z., Tabish, S. M., Mirza, F., and Farooq, M.
(2009). PE-Miner: Mining structural information to
detect malicious executables in realtime. In Interna-
tional Workshop on Recent Advances in Intrusion De-
tection, pages 121–141. Springer.
Shi, Y. and Eberhart, R. (1998). A modified particle
swarm optimizer. In 1998 IEEE international confer-
ence on evolutionary computation proceedings. IEEE
world congress on computational intelligence (Cat.
No. 98TH8360), pages 69–73. IEEE.
Stanfill, C. and Waltz, D. L. (1986). Toward memory-based
reasoning. Commun. ACM, 29(12):1213–1228.
VirusShare (2019). Virusshare.com.
Wang, D., Tan, D., and Liu, L. (2018). Particle swarm op-
timization algorithm: an overview. Soft Computing,
22(2):387–408.
Weinberger, K. Q., Blitzer, J., and Saul, L. K. (2006). Dis-
tance metric learning for large margin nearest neigh-
bor classification. In Advances in neural information
processing systems, pages 1473–1480.
Wettschereck, D., Aha, D. W., and Mohri, T. (1997). A
review and empirical evaluation of feature weighting
methods for a class of lazy learning algorithms. Arti-
ficial Intelligence Review, 11(1-5):273–314.
Wilson, D. R. and Martinez, T. R. (1997). Improved het-
erogeneous distance functions. Journal of artificial
intelligence research, 6:1–34.
Witten, I. H., Frank, E., Hall, M. A., and Pal, C. J. (2016).
Data Mining: Practical machine learning tools and
techniques. Morgan Kaufmann.
Xu, Y., Wu, C., Zheng, K., Wang, X., Niu, X., and Lu, T.
(2017). Computing adaptive feature weights with pso
to improve android malware detection. Security and
Communication Networks, 2017.
Yang, L. and Jin, R. (2006). Distance metric learning:
A comprehensive survey. Michigan State Universiy,
2(2):4.
Ye, Y., Li, T., Adjeroh, D., and Iyengar, S. S. (2017). A
survey on malware detection using data mining tech-
niques. ACM Computing Surveys (CSUR), 50(3):41.
Yu, J., Amores, J., Sebe, N., Radeva, P., and Tian, Q.
(2008). Distance learning for similarity estimation.
IEEE Transactions on Pattern Analysis and Machine
Intelligence, 30(3):451–462.
Zhong, W. and Gu, F. (2019). A multi-level deep learning
system for malware detection. Expert Systems with
Applications, 133:151–162.