A Method for Packed (and Unpacked) Malware Detection
by Means of Convolutional Neural Networks

Giovanni Ciaramella¹,², Fabio Martinelli³, Antonella Santone⁴ and Francesco Mercaldo⁴,²

¹ IMT School for Advanced Studies Lucca, Lucca, Italy
² Institute for Informatics and Telematics, National Research Council of Italy (CNR), Pisa, Italy
³ Institute for High Performance Computing and Networking, National Research Council of Italy (CNR), Rende, Italy
⁴ University of Molise, Campobasso, Italy
Keywords:
Malware, Packed Malware, Obfuscation, Deep Learning, Security, Testing.
Abstract:
The current signature-based mechanism implemented by free and commercial antimalware requires the pres-
ence of the signature of the malicious sample to provide protection, i.e., to detect malicious behavior. This is
why malware writers are developing techniques that can change the syntax of the code but leave the semantics
unchanged, i.e., the malware business logic. Among
these techniques is so-called packed malware, i.e.,
malware whose binary code is modified by packers:
software that compresses an executable and packages
it with a stub, a small program capable of decom-
pressing the original code and executing it in mem-
ory. In this way, malware previously detected by an-
timalware is no longer detected in its packed version.
In this paper, we propose a technique to
detect packed malware by exploiting convolutional neural networks. In a nutshell, the proposed method per-
forms static analysis, i.e., it does not require running the application to detect the malicious samples: we start
from the application’s binary code exploited to generate an image that represents the input for a set of deep
learning classifiers. The classifiers aim to classify an
application under analysis as trusted or (packed) ma-
licious. In the experimental analysis, we consider
three different packers (i.e., mpress, BEP, and gzexe)
to generate packed malware, thus demonstrating the
ability of the proposed method to detect both packed
and unpacked malware with promising performance.
1 INTRODUCTION
The main aim of malware, a contraction of mali-
cious software, is to cause damage to victim devices
(i.e., computers but also smartphones), with particu-
lar regard to information gathering, i.e., the ability
to retrieve sensitive information and send it to the
attacker. Malware writers are
constantly focused on developing new techniques to
evade the signature-based detection provided by cur-
rent free and commercial antimalware: one of the last
trends is represented by the so-called packed mal-
ware. The term packed malware refers to malicious
software that has been compressed or obfuscated us-
ing a technique known as "packing". This process in-
volves compressing the code of the malware into a
smaller, often encrypted format, making it harder for
antimalware programs to detect or analyze. Packing
can also involve embedding the malware inside other
legitimate software or files, disguising its true nature.
In the following, we provide several details related to
the packer’s working mechanisms:
Compression and Encryption: The malware’s
original code is compressed or encrypted. When
the packed malware is executed, it decompresses
or decrypts itself in memory to carry out its mali-
cious actions.
Bypassing Detection: Antivirus software often
uses signature-based detection, looking for pat-
terns in code to identify malware. Packing
changes the malware’s appearance, making it look
different from known threats.
Ciaramella, G., Martinelli, F., Santone, A. and Mercaldo, F.
A Method for Packed (and Unpacked) Malware Detection by Means of Convolutional Neural Networks.
DOI: 10.5220/0013210400003979
In Proceedings of the 22nd International Conference on Security and Cryptography (SECRYPT 2025), pages 557-564
ISBN: 978-989-758-760-3; ISSN: 2184-7711
Copyright © 2025 by Paper published under CC license (CC BY-NC-ND 4.0)
557
Multiple Layers of Obfuscation: Malware authors
sometimes use multiple layers of packing, mak-
ing it even more challenging to analyze. Security
tools must first unpack or decrypt each layer be-
fore inspecting the malware’s actual code.
Packed malware complicates both static and dynamic
analysis: static analysis (examining code without run-
ning it) is more complex because the code is not in
its original form, while dynamic analysis (running the
malware in a sandbox environment to observe its be-
havior) can also be thwarted, as some packed malware
checks for signs of being in a virtual environment and
behaves differently to avoid detection (Qiang et al.,
2022).
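To make this concrete: compression and encryption drive a file's byte distribution toward uniformity, so byte-level Shannon entropy is a common heuristic hint of packing. The following is an illustration only, not part of the proposed method; the file path is a placeholder:

```python
import math
from collections import Counter

def byte_entropy(path: str) -> float:
    """Shannon entropy of a file's byte distribution, in bits per byte.
    Plain executables usually score well below the 8-bit maximum, while
    packed or encrypted payloads approach it."""
    data = open(path, "rb").read()
    if not data:
        return 0.0
    counts = Counter(data)
    return -sum((c / len(data)) * math.log2(c / len(data))
                for c in counts.values())

print(f"{byte_entropy('sample.exe'):.2f} bits/byte")  # path is illustrative
```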
Starting from these considerations, we propose a
method to detect packed and unpacked malware in
this paper. The proposed method relies on a Con-
volutional Neural Network (CNN) (He et al., 2024).
We propose representing an application as an im-
age, and we train a set of CNNs to identify the best-
performing configuration for classifying (unpacked)
malware and trusted samples. In the evalu-
ation, we consider a set of packed malware to com-
prehend whether the trained model can detect packed
malware (not exploited in the training phase).
Thus, the paper proceeds as follows: in Section 2
we review the current state of the art related to
(packed) malware detection; in Section 3 we present
the method we designed for the detection of packed
and unpacked malware; the experimental analysis is
discussed in Section 4 and, finally, in the last section
conclusions and future research lines are drawn.
2 RELATED WORK
Over the years, as cybersecurity defenses have im-
proved, malicious actors have introduced several
methodologies to evade malware detectors. This sec-
tion provides a literature review of several methods
that identify packed and unpacked malware.
Authors in (Devi and Nandi, 2012) proposed a
method to identify packed and unpacked malware
in the Windows environment. In detail, they cre-
ated two datasets of 4,075 executable applications,
where 2,954 were malicious programs and 1,121 were
benign executables. The authors applied the UPX
packer to the first dataset, while the second dataset
contained unpacked malware. Then, using Weka,
they applied classification algorithms to both datasets.
Unlike them, we propose a methodology to identify
packed and unpacked malware leveraging deep learn-
ing, by applying several state-of-the-art Convolu-
tional Neural Networks. To do that, we converted
all the datasets we created (one composed of un-
packed executable applications and three obtained
with different packing methods) into images, reach-
ing accuracy values higher than 0.980 in most cases.
Biondi et al. in (Biondi et al., 2019) proposed a
method to identify packed malware leveraging three
classification machine learning algorithms, namely
Naive Bayes, Decision Tree, and Random Forest/Ex-
tra Trees. In detail, the researchers employed a
dataset of 280,000 samples to which they applied
two packing techniques (UPX and TheMida), extract-
ing many features. Moreover, they also performed
classification using unpacked samples. In our proposed
method, we employed a methodology based on Deep
Learning. We applied three packing algorithms on
each malware sample, obtaining four datasets (one
composed of unpacked malware). In the following
step, all Windows executable applications were con-
verted by a script into images and submitted to ten
different Convolutional Neural Networks from the
literature.
Authors in (Rabadi and Teo, 2020) proposed a
method to detect malware in the Windows domain
using machine learning. In detail, they composed
a dataset of benign and malicious executable appli-
cations and extracted some features to train and test
models using two different methods. In the first
method, each API call and the list of its arguments are
presented as one feature. In the second method, each
API call and each element of its arguments are con-
sidered as one feature. Consequently, they achieved
remarkable results in terms of accuracy. In our pro-
posed approach, we employed Deep Learning to train
and test models. Moreover, instead of using API calls,
we converted the entire application into an image to
create our dataset.
In (Ciaramella et al., 2024), authors proposed a
method to classify ransomware, general malware, and
trusted applications belonging to the Windows do-
main. In detail, they employed several Deep Learn-
ing architectures, obtaining the best result in terms of
accuracy using the VGG16 network. Moreover, on
the best model, they also applied the Grad-CAM al-
gorithm to identify which areas of the images turn
out to be crucial for a certain prediction. Unlike
them, we employed a dataset of general malware and
trusted PE files, to which we applied three different
packing methods. Moreover, we also trained and tested
models using unpacked malware. Using all created
datasets, we reached good results in terms of accu-
racy.
3 THE METHOD
In this section we present the proposed method for the
detection of packed and unpacked malware. Figures 1
and 2 respectively present the training and the testing
step related to the proposed method.
Figure 1 shows the workflow of the proposed mal-
ware detection method, which consists of the follow-
ing steps:
Malware Applications. This step involves col-
lecting applications known to be malware. These
samples serve as malicious samples during the
training process.
Trusted Applications. This step involves gather-
ing safe or trusted applications. These samples act
as legitimate samples in training.
Application Label. Each application is labeled as
either ”Malware” or ”Trusted” based on its nature,
which will be used as the target variable in the
training process.
Application Images. The collected applications
are converted into images. To convert a binary
to an image, we treat the sequence of bytes rep-
resenting the binary as the pixels of a gray-scale
PNG image. Depending on the binary size, we
consider a predefined width of 256 pixels and a
variable height. We developed a script to encode
any binary file into a lossless PNG (Mercaldo and
Santone, 2020); a sketch of this conversion is
given after this list.
Deep Learning Network. A deep learning
network is then trained on these labeled im-
ages to learn features that distinguish malware
from trusted applications. We consider several
CNNs in this task: ALEX NET, LE NET, STAN-
DARD CNN, MobileNet, DenseNet, EfficientNet,
ResNet50, VGG16, VGG19 and Inception.
Deep Learning Model. After training, the deep
learning model can predict whether a new appli-
cation is malware or trusted based on the learned
features. This model can be deployed for real-
time malware detection.
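As referenced in the Application Images step, the following is a minimal sketch of the binary-to-image encoding, assuming NumPy and Pillow; zero-padding the final pixel row is our assumption, since the paper does not specify how partial rows are handled:

```python
import numpy as np
from PIL import Image

def binary_to_png(src: str, dst: str, width: int = 256) -> None:
    """Encode a binary file as a lossless gray-scale PNG: one byte per
    pixel, a fixed width of 256 and a height driven by the file size."""
    data = np.frombuffer(open(src, "rb").read(), dtype=np.uint8)
    height = -(-len(data) // width)       # ceiling division
    padded = np.zeros(height * width, dtype=np.uint8)
    padded[: len(data)] = data            # zero-pad the last row (assumption)
    Image.fromarray(padded.reshape(height, width), mode="L").save(dst)

binary_to_png("sample.exe", "sample.png")  # file names are illustrative
```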
Figure 2 shows the testing step of the proposed
method.
As shown in Figure 2, the testing step of the pro-
posed malware detection method includes the follow-
ing steps:
Application Packer: A tool used to create packed
versions of applications, often by compressing or
encrypting their code. This is used to generate
packed applications, which can obfuscate the be-
havior of the software and make detection more
challenging.
Packed Application: Applications processed
through the packer. Due to their obfuscation tech-
niques, these packed applications are often harder
to analyze and detect as malware.
Unseen Malware: This step involves testing the
detection method with new malware samples not
part of the training set, representing real-world
malware scenarios.
Unseen Trusted Application: Similarly, new
trusted applications are used during testing to en-
sure the model’s ability to distinguish between
malware and safe software.
Application Images: The unseen malware,
trusted applications, and packed applications are
transformed into images, which serve as inputs for
the deep learning model.
Deep Learning Model: The trained deep learn-
ing model from the training phase analyzes the ap-
plication images and predicts their nature as either
malware or trusted.
Malware/Trusted Prediction: The model pre-
dicts each application, determining whether it is
malware or trusted. This output helps evaluate the
model’s performance on unseen and packed sam-
ples.
4 EXPERIMENTAL ANALYSIS
To collect real-world samples to evaluate the pro-
posed method, we consider the Dike dataset
(https://github.com/iosifache/DikeDataset), a freely
available collection of trusted and malicious Portable
Executable (PE) and Object Linking and Embedding
(OLE) files. The malware belonging to this dataset
represents the unpacked malware and covers a wide
spectrum of malware categories, i.e., generic trojan,
ransomware, worm, backdoor, spyware, rootkit, en-
crypter, and downloader.
Moreover, to evaluate the robustness of the pro-
posed method against packed malware, we have to
generate packed variants. With the aim of covering a
spectrum of different packers, we experiment with
three of them: mpress, BEP, and gzexe. We consider
mpress as a modern and efficient Windows executable
compressor, with the added benefits of saving space
and making reverse engineering more difficult. On
the other hand, BEP is an older, less commonly used
packer that compresses executables but has been
largely replaced by more modern tools.
Figure 1: The training phase of the proposed method.
Figure 2: The testing phase of the proposed method.
We also take into account gzexe, a simple Linux-
based tool that compresses executables using the gzip
algorithm for space savings on Unix-like systems.
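As a hedged sketch of the packing step: gzexe's single-argument, in-place usage below is standard, while the mpress and BEP command lines are assumptions modeled on the same pattern and should be verified against each packer's documentation:

```python
import subprocess
from pathlib import Path

# Command templates: gzexe compresses the given executable in place;
# the mpress and BEP entries are assumed analogous (verify before use).
PACKERS = {
    "gzexe":  ["gzexe"],
    "mpress": ["mpress"],  # assumption
    "bep":    ["bep"],     # assumption
}

def pack_all(sample_dir: str, packer: str) -> None:
    """Apply the chosen packer to every sample in a directory."""
    for sample in Path(sample_dir).glob("*"):
        subprocess.run(PACKERS[packer] + [str(sample)], check=True)

pack_all("malware_samples/", "gzexe")  # paths are illustrative
```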
Thus, the full dataset is composed of the follow-
ing 5,000 applications:
1000 (unpacked) malware samples;
1000 packed malware samples obtained with the
mpress packer;
1000 packed malware samples obtained with the
BEP packer;
1000 packed malware samples obtained with the
gzexe packer;
1000 trusted samples.
We train the deep learning models with the un-
packed malware and the trusted samples. In the eval-
uation step, we consider the following evaluations (a
minimal evaluation sketch follows the list):
with the unpacked malware and the trusted sam-
ples (i.e., the O dataset);
with the packed malware (obtained with the
mpress packer) and the trusted samples (i.e., the
P1 dataset);
with the packed malware (obtained with the BEP
packer) and the trusted samples (i.e., the P2
dataset);
with the packed malware (obtained with the gzexe
packer) and the trusted samples (i.e., the P3
dataset).
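A minimal sketch of this evaluation loop, assuming the model from the training phase was saved in Keras format with accuracy, precision, and recall as compiled metrics, and that each dataset is laid out as an image folder with malware/ and trusted/ subdirectories (all paths and names are illustrative):

```python
import tensorflow as tf

IMG_SIZE = (224, 224)  # must match the input size of the trained network

def evaluate(model_path: str, dataset_dir: str) -> dict:
    """Score a trained detector on one of the O/P1/P2/P3 test sets."""
    model = tf.keras.models.load_model(model_path)
    test_ds = tf.keras.utils.image_dataset_from_directory(
        dataset_dir, image_size=IMG_SIZE, batch_size=16, shuffle=False)
    loss, acc, prec, rec = model.evaluate(test_ds)
    return {"loss": loss, "accuracy": acc, "precision": prec, "recall": rec}

for name in ("O", "P1", "P2", "P3"):
    print(name, evaluate("detector.keras", f"datasets/test_{name}"))
```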
We conduct the following experiments to evaluate the
effectiveness of the proposed method for the detection
of packed and unpacked malware.
Table 1 shows the hyperparameters we adopted
for the considered DL models; in particular, we re-
port the image size, the learning rate, the number of
epochs, and the batch size used in training.
As shown in Table 1, several well-known archi-
tectures are used, including AlexNet, LeNet, Standard
CNN, MobileNet, DenseNet, EfficientNet, ResNet50,
VGG16, VGG19, and Inception.
From the image size point of view, most mod-
els use an input size of 100x3 (i.e., a height and
width of 100 pixels and 3 color channels). However,
certain models require larger input sizes due to their
architectural design.
Table 1: Hyper-parameters setting.
Model Image size Learning rate Epochs Batch size
ALEX NET 100x3 0.0001 25 16
LE NET 100x3 0.0001 25 16
STANDARD CNN 100x3 0.0001 25 16
MobileNet 224x3 0.0001 25 16
DenseNet 224x3 0.0001 25 16
EfficientNet 224x3 0.0001 25 16
ResNet50 100x3 0.0001 25 16
VGG16 100x3 0.0001 25 16
VGG19 100x3 0.0001 25 16
Inception 299x3 0.0001 25 16
In fact, MobileNet, DenseNet, and EfficientNet use
images sized 224x3, while Inception requires an even
larger input size of 299x3.
Regarding the learning rate, a consistent value
of 0.0001 is used across all models. This low
learning rate suggests careful training to ensure sta-
bility and gradual convergence.
All models are trained for 25 epochs, providing a
uniform duration of training across different architec-
tures.
The batch size is fixed at 16 for all models. This
suggests that the computational resources available
may favor smaller batches, allowing gradient updates
after processing small sets of samples.
The table outlines a consistent set of hyperparam-
eters (learning rate, epochs, and batch size) across all
models, with variations in the image size depending
on the architecture’s input requirements. This setup
allows for a controlled comparison of model perfor-
mance under similar training conditions, focusing on
the impact of different network architectures.
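As an illustration of this setup, the following is a minimal training sketch consistent with Table 1, shown for the MobileNet configuration (224x224x3 input, learning rate 0.0001, 25 epochs, batch size 16) and assuming TensorFlow/Keras with one image folder per class; the paper does not state whether pretrained weights were used, so training from scratch here is an assumption:

```python
import tensorflow as tf

IMG_SIZE, BATCH, EPOCHS, LR = (224, 224), 16, 25, 1e-4

# Directory layout assumed: datasets/train/{malware,trusted}, same for val.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "datasets/train", image_size=IMG_SIZE, batch_size=BATCH)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "datasets/val", image_size=IMG_SIZE, batch_size=BATCH)

# MobileNet backbone with a binary head (malware vs. trusted).
backbone = tf.keras.applications.MobileNet(
    input_shape=IMG_SIZE + (3,), include_top=False,
    weights=None, pooling="avg")  # ImageNet initialization is an alternative
model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255),  # bytes-as-pixels -> [0, 1]
    backbone,
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=LR),
    loss="binary_crossentropy",
    metrics=["accuracy",
             tf.keras.metrics.Precision(name="precision"),
             tf.keras.metrics.Recall(name="recall")])
model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS)
model.save("detector.keras")  # consumed by the evaluation sketch above
```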
Table 2 shows the results related to the training
and validation of the models involved in the experi-
mental analysis. The table provides insight into each
model's ability to generalize from the training data to
the validation set.
First of all, we note that EfficientNet and
ResNet50 achieved the highest training accuracy, pre-
cision, and recall (0.994), indicating their robustness
during training. However, EfficientNet’s validation
performance drops significantly (0.498 across accu-
racy, precision, and recall), suggesting potential over-
fitting. The MobileNet, DenseNet, and Inception net-
works showed strong validation performance (accu-
racy, precision, and recall values close to or above
0.9), suggesting better generalization compared to
models like EfficientNet. ALEX NET and STAN-
DARD CNN both performed well in terms of vali-
dation metrics, with STANDARD CNN slightly out-
performing ALEX NET (0.923 vs. 0.879). LeNet,
VGG16, and VGG19 performed poorly, with valida-
tion accuracy, precision, and recall stuck around 0.5,
implying these models struggled to learn effectively
from the dataset.
ALEX NET: Shows a moderate drop in validation
performance compared to training (validation ac-
curacy of 0.879). This could indicate some over-
fitting, though the gap is not extreme.
LeNet: Both training and validation metrics are
around 0.5, indicating that the model failed to
learn useful features, likely due to its simpler ar-
chitecture being insufficient for the task.
STANDARD CNN: Exhibits strong performance
in both training and validation. This model gener-
alizes well and has a low validation loss (0.295).
MobileNet: Strong validation performance with a
high accuracy of 0.939. The model balances good
generalization with relatively low training loss.
DenseNet: Performs very well on both training
and validation data, with validation accuracy at
0.945, indicating effective feature extraction and
generalization.
EfficientNet: While this model performs excep-
tionally well during training, its validation metrics
(0.498) suggest significant overfitting, as it fails to
generalize.
ResNet50: Very balanced, achieving excellent re-
sults in both training and validation, with valida-
tion accuracy at 0.926.
VGG16 and VGG19: Both models fail to train ef-
fectively, as shown by the high training and vali-
dation loss values. Their architecture may not be
well-suited for this dataset.
Inception: Achieves good performance across
all metrics, with particularly low validation loss
(0.197) and high accuracy (0.943), indicating
good generalization.
Models like ResNet50, DenseNet, MobileNet, and
Inception are strong candidates due to their balance of
low loss and high validation metrics, suggesting bet-
ter generalization. Simpler models like LeNet and the
VGG variants struggled to capture relevant features,
and EfficientNet overfitted despite its high complex-
ity.
Thus, from Table 2 it emerges that the mod-
els obtaining a validation accuracy greater than
0.90 are STANDARD CNN, MobileNet, DenseNet,
ResNet50, and Inception: for this reason, these mod-
els are considered in the testing step.
Table 3 presents the results of the experimental
analysis involving four datasets (O, P1, P2, and P3)
and five models (i.e., STANDARD CNN, MobileNet,
DenseNet, ResNet50, and Inception).
Table 2: Training and validation performances related to loss, accuracy, precision, recall.
Model train loss train acc train prec train rec val loss val acc val prec val rec
ALEX NET 0.040 0.985 0.985 0.985 0.506 0.879 0.879 0.879
LE NET 0.693 0.508 0.508 0.508 0.693 0.500 0.500 0.500
STANDARD CNN 0.033 0.988 0.988 0.988 0.295 0.923 0.923 0.923
MobileNet 0.037 0.991 0.991 0.991 0.413 0.939 0.939 0.939
DenseNet 0.062 0.984 0.984 0.984 0.237 0.945 0.945 0.945
EfficientNet 0.019 0.994 0.994 0.994 2.682 0.498 0.498 0.498
ResNet50 0.021 0.994 0.994 0.994 0.405 0.926 0.926 0.926
VGG16 0.693 0.496 0.496 0.496 0.693 0.499 0.499 0.499
VGG19 0.693 0.491 0.491 0.491 0.693 0.500 0.500 0.500
Inception 0.061 0.988 0.988 0.988 0.197 0.943 0.943 0.943
Table 3: The results of the experimental analysis.
Model Dataset Loss Accuracy Precision Recall F-Measure AUC
STANDARD CNN O 0.337 0.921 0.921 0.921 0.921 0.963
P1 0.180 0.954 0.954 0.954 0.954 0.982
P2 0.156 0.960 0.960 0.960 0.960 0.984
P3 0.174 0.956 0.956 0.956 0.956 0.983
MobileNet O 0.369 0.949 0.949 0.949 0.949 0.970
P1 0.140 0.974 0.974 0.974 0.974 0.987
P2 0.167 0.973 0.973 0.973 0.973 0.990
P3 0.151 0.974 0.974 0.974 0.974 0.988
DenseNet O 0.214 0.947 0.947 0.947 0.947 0.979
P1 0.133 0.966 0.966 0.966 0.966 0.988
P2 0.122 0.968 0.968 0.968 0.968 0.989
P3 0.120 0.968 0.968 0.968 0.968 0.989
ResNet50 O 0.442 0.920 0.920 0.920 0.920 0.960
P1 0.248 0.950 0.950 0.950 0.950 0.979
P2 0.234 0.952 0.952 0.952 0.952 0.980
P3 0.246 0.950 0.950 0.950 0.950 0.979
Inception O 0.193 0.941 0.941 0.941 0.941 0.976
P1 0.175 0.946 0.946 0.946 0.946 0.982
P2 0.157 0.952 0.952 0.952 0.952 0.987
P3 0.165 0.950 0.950 0.950 0.950 0.982
Standard CNN
Dataset O: The Standard CNN model performs
well, with a high accuracy of 0.921 and an AUC
of 0.963, indicating that the model is effective
at distinguishing between malware and trusted
samples. However, the loss is relatively higher
at 0.337 compared to datasets P1, P2, and P3.
Datasets P1, P2, P3: The performance im-
proves consistently across the datasets, with
accuracy peaking at 0.960 for P2 and slightly
dropping to 0.956 in P3. The loss also de-
creases, indicating that the model learns bet-
ter with more complex or larger datasets. The
F-measure and AUC remain consistently high,
suggesting that the model performs well across
these different datasets.
The Standard CNN performs robustly across all
datasets, showing particularly strong general-
ization on P2 and P3, with AUC values con-
sistently above 0.98, indicating good discrimi-
nation ability.
MobileNet
Dataset O: MobileNet starts with a high accu-
racy of 0.949, a loss of 0.369 (slightly higher than
that of the Standard CNN), and an AUC of 0.970. The
model shows a very strong balance between
precision and recall, both at 0.949.
Datasets P1, P2, P3: Performance improves
significantly, with accuracy reaching 0.974 for
P1 and P3, and the loss dropping to 0.140
for P1. The AUC remains consistently high,
around 0.987-0.990, and the F-measure suggests
a high degree of consistency between precision and
recall.
MobileNet performs exceptionally well, im-
proving across larger or more complex datasets.
Its ability to generalize is highlighted by its
minimal fluctuation in accuracy, precision, and
recall. The AUC remains high, confirming its
reliability for classification.
DenseNet
Dataset O: DenseNet shows strong performance
with a loss of 0.214 and an accuracy of 0.947,
slightly lower than MobileNet on this dataset.
Precision, recall, and F-measure are all consis-
tent at 0.947, and the AUC is high at 0.979.
Datasets P1, P2, P3: As the datasets change,
DenseNet consistently improves, with its loss
reducing to 0.120 for P3 and accuracy peaking
at 0.968 across P2 and P3. Precision, recall, and
F-measure remain stable at 0.968, and the AUC
remains around 0.989.
DenseNet shows strong and stable perfor-
mance, with an especially low loss on the P2
and P3 datasets. It has excellent precision, re-
call, and AUC, making it a reliable model for
malware detection.
ResNet50
Dataset O: ResNet50 starts with the highest loss
among the models (0.442) and relatively lower
accuracy (0.920). Its precision, recall, and F-
measure are all consistent at 0.920, and the
AUC is 0.960, indicating moderate discrimina-
tion ability.
Datasets P1, P2, P3: ResNet50 improves in ac-
curacy and other metrics with P1, P2, and P3,
reaching a maximum accuracy of 0.952 for P2.
However, its loss remains relatively high com-
pared to other models (0.246 for P3). The AUC
remains stable around 0.979-0.980, indicating
that the model can still discriminate well be-
tween malware and trusted samples.
ResNet50 performs solidly on the larger
datasets but struggles with higher loss and
slightly lower accuracy compared to DenseNet
and MobileNet. It does, however, maintain
good AUC values across all datasets.
Inception
Dataset O: Inception shows competitive perfor-
mance, with an accuracy of 0.941 and loss of
0.193. Its precision, recall, and F-measure are
all consistent at 0.941, and the AUC is 0.976,
suggesting good performance.
Datasets P1, P2, P3: As with the other mod-
els, Inception’s performance improves with the
more complex datasets, reaching an accuracy
of 0.952 for P2 and a low loss of 0.157. The
AUC peaks at 0.987 for P2, indicating strong
discriminatory power.
Inception performs well, especially on dataset
P2, with high precision, recall, and AUC val-
ues. It strikes a balance between performance
metrics and generalization capability, though it
slightly lags behind MobileNet and DenseNet
on the larger datasets.
We note that all models show better performance
on the P1, P2, and P3 datasets compared to dataset O.
This suggests that the complexity or size of these
datasets helps the models learn more effectively, re-
sulting in higher accuracy, precision, recall, and lower
loss. MobileNet and DenseNet Outperform: Mo-
bileNet and DenseNet consistently outperform the
other models regarding accuracy, precision, recall,
and F-measure, particularly on datasets P1 through
P3. Both models maintain low loss values and high
AUC, making them strong candidates for the task.
ResNet50, while still performing well in terms of ac-
curacy and AUC, suffers from higher loss values, in-
dicating that it struggles to fit the data as effectively
as the other models. Inception shows balanced perfor-
mance across all datasets, with competitive metrics,
though slightly behind MobileNet and DenseNet in
accuracy and loss. MobileNet and DenseNet stand out
for their ability to generalize and maintain high per-
formance across all datasets. Both models show min-
imal fluctuation in accuracy and have high AUC, in-
dicating their effectiveness in malware classification.
The performance improvements across the datasets
suggest that the models benefit from more complex
datasets (i.e., P1, P2, P3). While the ResNet50 and In-
ception models perform adequately, their higher loss
values and slightly lower accuracy suggest they may
not generalize as well as MobileNet and DenseNet,
especially in handling more complex data.
These findings indicate that MobileNet and
DenseNet would be the most reliable models for
malware detection. However, considering that the
MobileNet model obtains slightly higher accuracy
compared with the DenseNet one, we consider the
MobileNet model the best model for the detection of
packed malware. As a matter of fact, the MobileNet
model obtains accuracies equal to 0.949, 0.974,
0.973, and 0.974, while the DenseNet reaches the fol-
lowing accuracies, i.e., 0.947, 0.966, 0.968, and 0.968.
5 CONCLUSION AND FUTURE
WORK
Current signature-based mechanisms used by free and
commercial antimalware solutions rely on having a
known signature of a malicious sample to detect and
block its activity. As a result, malware authors have
developed techniques that modify the syntax of the
malware’s code while preserving its underlying be-
havior or logic. One such method involves creating
packed malware, where packers—software tools that
compress and package an application along with a de-
compression stub—alter the binary code of the mal-
ware. This stub decompresses the packed code in
memory and executes it, allowing a previously de-
tected malware to evade detection in its packed form.
In this paper, we introduce a technique that uti-
lizes convolutional neural networks (CNNs) to detect
packed malware. Our approach involves static analy-
sis, meaning it does not require the execution of the
application to identify malicious samples. Starting
with the binary code of an application, we transform
it into an image that serves as input to a series of deep
learning classifiers. These classifiers aim to determine
whether the application under analysis is trusted or
(packed) malicious.
From the experimental analysis, it emerges that
the MobileNet and the DenseNet models show min-
imal fluctuation in accuracy and have high AUC, in-
dicating their effectiveness in malware classification.
In future work, we plan to consider prediction
explainability, with the aim of understanding which
parts of the images related to malware are symptoms
of the model prediction. Moreover, we will also con-
sider the possibility of detecting ransomware with the
proposed model.
ACKNOWLEDGMENT
This work has been partially supported by EU DUCA,
EU CyberSecPro, SYNAPSE, PTR 22-24 P2.01 (Cy-
bersecurity) and SERICS (PE00000014) under the
MUR National Recovery and Resilience Plan funded
by the EU - NextGenerationEU projects, by MUR -
REASONING: foRmal mEthods for computAtional
analySis for diagnOsis and progNosis in imagING -
PRIN, e-DAI (Digital ecosystem for integrated anal-
ysis of heterogeneous health data related to high-
impact diseases: innovative model of care and re-
search), Health Operational Plan, FSC 2014-2020,
PRIN-MUR-Ministry of Health, the National Plan for
NRRP Complementary Investments D3 4 Health:
Digital Driven Diagnostics, prognostics and therapeu-
tics for sustainable Health care, Progetto MolisCTe,
Ministero delle Imprese e del Made in Italy, Italy,
CUP: D33B22000060001, FORESEEN: FORmal
mEthodS for attack dEtEction in autonomous driv-
iNg systems CUP N.P2022WYAEW and ALOHA: a
framework for monitoring the physical and psycho-
logical health status of the Worker through Object de-
tection and federated machine learning, Call for Col-
laborative Research BRiC -2024, INAIL.
REFERENCES
Biondi, F., Enescu, M. A., Given-Wilson, T., Legay, A.,
Noureddine, L., and Verma, V. (2019). Effective, effi-
cient, and robust packing detection and classification.
Computers & Security, 85:436–451.
Ciaramella, G., Iadarola, G., Martinelli, F., Mercaldo, F.,
and Santone, A. (2024). Explainable ransomware
detection with deep learning techniques. Jour-
nal of Computer Virology and Hacking Techniques,
20(2):317–330.
Devi, D. and Nandi, S. (2012). Detection of packed mal-
ware. In Proceedings of the First International Con-
ference on Security of Internet of Things, pages 22–26.
He, H., Yang, H., Mercaldo, F., Santone, A., and Huang, P.
(2024). Isolation forest-voting fusion-multioutput: A
stroke risk classification method based on the multidi-
mensional output of abnormal sample detection. Com-
puter Methods and Programs in Biomedicine, page
108255.
Mercaldo, F. and Santone, A. (2020). Deep learning
for image-based mobile malware detection. Jour-
nal of Computer Virology and Hacking Techniques,
16(2):157–171.
Qiang, W., Yang, L., and Jin, H. (2022). Efficient and ro-
bust malware detection based on control flow traces
using deep neural networks. Computers & Security,
122:102871.
Rabadi, D. and Teo, S. G. (2020). Advanced windows meth-
ods on malware detection and classification. In Pro-
ceedings of the 36th Annual Computer Security Appli-
cations Conference, pages 54–68.