Adversarial Attacks and Robustness in AI: Methods, Empirical
Analysis, and Defense Mechanisms
Chenglin Song
Beijing Lize International Academy, Beijing, 100073, China
https://orcid.org/0009-0004-2343-0182
Keywords: Adversarial Attacks, Robustness, Deep Learning, Defense Mechanisms, Security.
Abstract: Adversarial attacks pose significant threats to modern artificial intelligence (AI) systems by introducing subtle
perturbations into input data that can drastically alter model predictions. These attacks have serious
implications in safety-critical applications such as autonomous driving and healthcare, where reliability and
robustness are essential. In addition to computer vision systems, adversarial vulnerabilities have been
observed in natural language processing and speech recognition, further highlighting the broad scope of this
issue. This paper provides an integrative review of adversarial attack generation techniques, discusses
empirical findings on AI robustness, and surveys existing defense mechanisms. Through an examination of
state-of-the-art research, current limitations are highlighted, and directions for developing more resilient AI
models are proposed. Practical considerations and potential future applications are also outlined with the goal
of informing both theoretical inquiry and real-world deployment strategies. Recent studies have further
expanded on these topics by emphasizing enhanced adversarial training methods and layered defense
architectures, which are also discussed in the context of new empirical evidence.
1 INTRODUCTION
As artificial intelligence (AI) systems become
increasingly integrated into diverse applications—
ranging from image recognition and speech
processing to medical diagnosis, financial modeling,
and industrial automation—ensuring their security
and dependability has emerged as a paramount
concern. One of the most critical vulnerabilities stems
from adversarial examples, which are inputs
deliberately altered with subtle perturbations. These
changes, often undetectable to the human eye or ear,
can lead to significant misclassifications by AI
models, compromising their reliability. The
pioneering work of (Szegedy et al., 2014) and
(Goodfellow et al., 2015) first exposed these
weaknesses, revealing that even advanced neural
networks possess decision boundaries that are highly
exploitable through minor input modifications. This
discovery has since spurred extensive research into
adversarial vulnerabilities across multiple domains.
The real-world implications of adversarial attacks
are profound and far-reaching. In autonomous driving,
for example, slight alterations to road signs—such as
adding small stickers or modifying colors—can
mislead a vehicle’s perception system, potentially
causing accidents or endangering lives. In healthcare,
adversarial tampering with diagnostic images like X-
rays or MRIs could lead to erroneous diagnoses,
posing risks to patient safety and eroding confidence
in AI-assisted medical tools. Beyond these high-
stakes areas, adversarial threats have also emerged in
natural language processing (e.g., manipulating text to
deceive sentiment analysis), speech recognition (e.g.,
embedding inaudible commands), and reinforcement
learning (e.g., altering reward structures). This
pervasive vulnerability underscores the urgent need
for robust countermeasures to protect AI systems in
an increasingly digitized world.
This paper offers a comprehensive analysis of
adversarial attack methodologies, evaluates the
effectiveness of various defense strategies through
empirical testing, and proposes future research
directions to enhance AI robustness. The rapid
evolution of attack techniques, coupled with the
limitations of existing defenses, has created a
dynamic “arms race” between attackers and
defenders. Recent advancements, including improved
adversarial training techniques and the adoption of
multi-layered defense systems, provide hopeful
pathways forward. This study incorporates these
developments, leveraging empirical data from
benchmark datasets to assess model performance
under adversarial conditions. The paper is organized
as follows: Section 2 reviews key attack and defense
methods, Section 3 presents experimental results and
their practical implications, and Section 4 concludes
with key insights and future research priorities.
2 METHODS
2.1 Attack Methods
The study of adversarial attacks has led to the
development of several distinct strategies, each
targeting specific weaknesses in AI models. These
strategies are broadly classified into gradient-based
attacks, optimization-based attacks, and black-box as
well as transfer attacks, reflecting the growing
complexity of adversarial techniques.
Gradient-based attacks exploit the gradients of the
loss function with respect to the input data to identify
directions where small perturbations can significantly
impact model outputs. The Fast Gradient Sign
Method (FGSM), proposed by (Goodfellow et al.,
2015), generates adversarial examples in a single step
by adjusting the input based on the sign of the
gradient. This method applies a perturbation scaled
by a parameter that controls its magnitude, ensuring
the change remains subtle yet effective. A more
advanced technique, the Projected Gradient Descent
(PGD) attack, refines this approach by iteratively
applying smaller gradient steps and projecting the
result back into a constrained region to limit
perturbation size. Research by (Madry et al.,2018)has
shown that PGD is particularly effective as a first-
order adversary due to its iterative nature.
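
To make these mechanics concrete, a minimal PyTorch sketch of FGSM and PGD is given below; the model handle, perturbation budget, step size, and iteration count are illustrative assumptions rather than settings prescribed by the cited works.

import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.1):
    """Single-step FGSM: perturb the input along the sign of the loss gradient."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()

def pgd_attack(model, x, y, epsilon=0.1, alpha=0.01, steps=40):
    """Iterative PGD: repeated small gradient steps, each projected back into the L-inf ball."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Project back into the epsilon-ball around the clean input and the valid pixel range.
        x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon)
        x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv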
Optimization-based attacks, such as the Carlini &
Wagner (C&W) attack (Carlini & Wagner, 2017),
treat the creation of adversarial examples as an
optimization problem. This
approach seeks the smallest perturbation that causes
misclassification, making it highly effective against
defenses that obscure gradients. Black-box and
transfer attacks, on the other hand, operate without
direct access to model parameters. Black-box attacks
estimate gradients by querying the model with
different inputs and analyzing the resulting
confidence scores, while transfer attacks leverage the
observation that adversarial examples designed for
one model can often deceive others with similar
architectures. Recent studies by (Zhang et al.,2021)
emphasize that transferability remains a significant
challenge, particularly as models grow more complex,
necessitating adaptive defense mechanisms.
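
As a rough illustration of how a black-box attacker can proceed from confidence scores alone, the sketch below estimates the input gradient by finite differences; the query_probs interface and the sampling parameters are hypothetical placeholders rather than any particular system's API.

import numpy as np

def estimate_gradient(query_probs, x, true_label, sigma=1e-3, n_samples=50):
    """Estimate the loss gradient from returned class probabilities, without model internals."""
    grad = np.zeros_like(x)
    base_loss = -np.log(query_probs(x)[true_label] + 1e-12)
    for _ in range(n_samples):
        direction = np.random.randn(*x.shape)
        perturbed_loss = -np.log(query_probs(x + sigma * direction)[true_label] + 1e-12)
        # Each query pair contributes a directional finite-difference estimate.
        grad += (perturbed_loss - base_loss) / sigma * direction
    return grad / n_samples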
2.2 Defense Mechanisms
To counter the evolving landscape of adversarial
attacks, researchers have developed a variety of
defense strategies aimed at enhancing AI robustness.
One widely adopted approach is adversarial training,
which involves augmenting the training dataset with
adversarial examples to improve model resilience.
This method can substantially boost resistance to
specific attack types encountered during training;
however, it is computationally demanding and may
not generalize well to new or adaptive threats. Recent
advancements by (Xie et al., 2020) suggest that
combining adversarial training with regularization
techniques can enhance its adaptability, offering a
potential solution to these challenges.
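
A minimal sketch of such an adversarial training loop is shown below, reusing the pgd_attack sketch from Section 2.1; the equal clean-adversarial weighting and the optimizer and loader handles are assumptions made only for illustration.

import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer, epsilon=0.1):
    """One epoch of training on a mix of clean and PGD-perturbed batches."""
    model.train()
    for x, y in loader:
        # Craft adversarial versions of the current batch on the fly.
        x_adv = pgd_attack(model, x, y, epsilon=epsilon)
        optimizer.zero_grad()
        # Combine clean and adversarial losses before the parameter update.
        loss = 0.5 * F.cross_entropy(model(x), y) + 0.5 * F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()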
Another strategy, gradient masking or obfuscation,
modifies the model’s gradients to make it harder for
attackers to compute effective perturbations. While
this can provide temporary protection, many such
techniques have been circumvented by adaptive
attacks that use alternative methods, such as finite
differences, to approximate gradients. Input
transformations, including random resizing or JPEG
compression, offer a different approach by disrupting
adversarial patterns in the data. These methods are
computationally efficient and provide moderate
improvements in robustness, though their
effectiveness diminishes against sophisticated
attackers who can adapt to these changes. An
emerging area of interest is certified robustness,
which uses formal verification or robust optimization
to provide theoretical guarantees of resilience within
a specific perturbation range. Despite their potential,
these methods face scalability issues, as noted in
recent work by (Kang et al., 2022), which explores
ways to make them more practical for larger networks.
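
For the input-transformation defenses mentioned above, a simple sketch combining random resizing with a JPEG round-trip is given below; the size range, JPEG quality, and reliance on torchvision and Pillow are illustrative choices rather than a prescription from the cited work.

import io
import random
from PIL import Image
from torchvision import transforms

def transform_input(x, min_size=28, max_size=36, jpeg_quality=75):
    """Randomly resize then JPEG-compress an image tensor (C, H, W) scaled to [0, 1]."""
    _, h, w = x.shape
    size = random.randint(min_size, max_size)
    img = transforms.ToPILImage()(x)
    img = img.resize((size, size))
    # Round-trip through JPEG to disrupt fine-grained adversarial patterns.
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=jpeg_quality)
    buf.seek(0)
    img = Image.open(buf).resize((w, h))
    return transforms.ToTensor()(img)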
The ongoing evolution of attack strategies
indicates that no single defense is universally
effective. A consensus is emerging within the
research community that a layered defense
approach—integrating adversarial training, input
transformations, and certified robustness—may offer
the best path to long-term resilience. This multi-
faceted strategy aims to address the diverse and
adaptive nature of adversarial threats, ensuring AI
systems remain secure in dynamic real-world
environments.
3 RESULTS AND DISCUSSION
3.1 Experimental Setup
To evaluate the impact of adversarial attacks and the
effectiveness of defense mechanisms empirically,
experiments were conducted using two standard
image classification datasets: MNIST and CIFAR-10.
The Modified National Institute of Standards and
Technology (MNIST) dataset consists of 60,000
training images and 10,000 testing images, featuring
handwritten digits in 28×28 pixel grayscale format.
The Canadian Institute for Advanced Research
(CIFAR-10) dataset includes 50,000 training images
and 10,000 testing images, with 32×32 pixel color
images across 10 classes.
For the MNIST experiments, a convolutional
neural network (CNN) was employed, consisting of
two convolutional layers with ReLU activations, a
max-pooling layer, and two fully connected layers.
This baseline model achieved approximately 99%
accuracy on clean, unperturbed data. For CIFAR-10,
a deeper CNN inspired by the VGG architecture was
used, reaching around 86% accuracy on clean inputs.
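
One plausible PyTorch rendering of the MNIST baseline is sketched below; the channel widths and hidden-layer size are assumptions, since only the general layer structure (two convolutional layers with ReLU, max-pooling, and two fully connected layers) is reported.

import torch.nn as nn

class MnistCNN(nn.Module):
    """Baseline CNN for 28x28 grayscale digits, following the layer structure described above."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 28x28 -> 14x14
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 14 * 14, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))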
Both models were subjected to three attack
types—FGSM, PGD, and C&W—and tested with
three defense strategies: no defense, adversarial
training, and input transformation (via random
resizing or basic compression). This experimental
design enabled a thorough assessment of how
different attacks and defenses interact, providing
insights into their overall impact on classification
performance.
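
The evaluation protocol can be summarized by a routine of the following form, where attack_fn and defense_fn stand in for the attack and input-transformation sketches above; the interface names are illustrative, not taken from the original experiments.

import torch

def robust_accuracy(model, loader, attack_fn=None, defense_fn=None):
    """Accuracy under an optional attack and an optional per-image input transformation."""
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        # Attacks need gradients with respect to the input, so they run outside no_grad.
        x_eval = attack_fn(model, x, y) if attack_fn is not None else x
        if defense_fn is not None:
            x_eval = torch.stack([defense_fn(img) for img in x_eval])
        with torch.no_grad():
            preds = model(x_eval).argmax(dim=1)
        correct += (preds == y).sum().item()
        total += y.numel()
    return 100.0 * correct / total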
3.2 Quantitative Findings
The results demonstrate that baseline models without
defenses experience significant accuracy declines
when exposed to adversarial attacks, with iterative
methods like PGD and optimization-based C&W
attacks causing the most substantial drops. The
findings for the MNIST and CIFAR-10 datasets are
summarized in Tables 1 and 2, respectively.
Table 1: MNIST Accuracy (%) under Adversarial Attacks and Defenses

Attack   Defense                 Accuracy
FGSM     No Defense              75
FGSM     Adversarial Training    88
FGSM     Input Transformation    85
PGD      No Defense              40
PGD      Adversarial Training    70
PGD      Input Transformation    48
C&W      No Defense              35
C&W      Adversarial Training    62
C&W      Input Transformation    45
Table 2: CIFAR-10 Accuracy (%) under Adversarial Attacks and Defenses

Attack   Defense                 Accuracy
FGSM     No Defense              60
FGSM     Adversarial Training    75
FGSM     Input Transformation    70
PGD      No Defense              25
PGD      Adversarial Training    40
PGD      Input Transformation    30
C&W      No Defense              20
C&W      Adversarial Training    35
C&W      Input Transformation    28
These tables reveal that undefended models suffer
dramatic accuracy losses, with MNIST dropping from
99% to 35% under C&W attacks, and CIFAR-10
declining from 86% to 20%.
Adversarial training consistently improves
performance, though it does not fully neutralize
strong attacks, while input transformation offers
moderate enhancements but remains inadequate
against adaptive threats.
3.3 Discussion of Practical Implications
The experimental results highlight several key
insights with significant implications for deploying
AI systems in safety-critical contexts. The
pronounced vulnerability of baseline models to
adversarial attacks underscores the immediate need
for robust defense mechanisms, as even minor
perturbations can lead to catastrophic failures in areas
like autonomous driving or healthcare. The substantial
accuracy reductions observed—particularly with
PGD and C&W attacks—illustrate the real-world
risks of misclassification, emphasizing the
importance of proactive security measures.
Adversarial training proves effective but is
hindered by its high computational cost and limited
ability to generalize to unseen attacks, presenting
challenges for resource-constrained settings such as
edge computing devices.
This limitation suggests a need for innovative
training methods that optimize robustness while
minimizing resource demands. Input transformations,
while computationally lightweight, provide only
partial protection, indicating that attackers may
eventually develop strategies to overcome these
defenses (Shafahi et al., 2019).
The potential of layered defense systems, which
combine multiple approaches, is supported by recent
research, suggesting that hybrid strategies could
address the shortcomings of individual methods more
effectively (e.g., Kang et al., 2022; Xie et al., 2020).
Additionally, the findings have broader
implications for AI trustworthiness and deployment.
As adversarial vulnerabilities extend beyond image
classification to domains like natural language
processing and speech recognition, developing cross-
modal defense strategies becomes crucial.
The dynamic nature of this field requires
continuous monitoring and adaptation of defense
mechanisms to counter emerging attack techniques.
Furthermore, the societal impact of adversarial
robustness—encompassing user trust, regulatory
compliance, and ethical considerations—warrants
further exploration. Integrating these factors into
future research will ensure that AI systems are not
only secure but also aligned with societal needs and
expectations (Papernot et al., 2017).
4 CONCLUSION
This paper has provided a detailed review of
prominent adversarial attack methods, encompassing
single-step, iterative, and optimization-based
approaches, and has surveyed existing defense
mechanisms, including adversarial training, gradient
masking, input transformations, and certified
robustness.
Empirical evaluations on the MNIST and CIFAR-
10 datasets confirm that adversarial perturbations can
severely degrade AI performance, highlighting the
critical need for robust defenses in safety-critical
applications. While adversarial training and input
transformations enhance resilience, they fall short of
providing comprehensive protection against adaptive
or novel attacks, perpetuating the adversarial arms
race.
The widespread vulnerabilities of current AI
models, particularly without effective defenses, pose
significant risks, with potential misclassifications
leading to serious real-world consequences. Partial
solutions like adversarial training offer improvements
but lack the flexibility to address evolving threats,
underscoring the need for dynamic and adaptive
defense strategies.
Future research should focus on developing
scalable certified defenses that offer theoretical
guarantees of robustness, despite current
computational limitations, and extend validation
across diverse domains such as natural language
processing and speech recognition. Efficient training
pipelines, potentially leveraging transfer learning or
distributed computing, could reduce the
computational burden, making robust AI more
accessible.
Moreover, ethical and regulatory
considerations—such as liability, transparency, and
fairness—require collaboration among technologists,
policymakers, and ethicists to establish robust
governance frameworks.
The adoption of layered defense systems, which
integrate technical innovations with practical
feasibility, represents a promising direction for
enhancing AI security.
As adversarial threats continue to evolve,
sustained research and interdisciplinary cooperation
are essential to developing reliable and secure AI
systems. By addressing these challenges
comprehensively, the field can move toward a future
where AI robustness is a foundational standard,
ensuring its safe and effective deployment across all
sectors.
REFERENCES
Akhtar, N., & Mian, A., 2018. Threat of adversarial attacks
on deep learning in computer vision: A survey. In IEEE
Access, 6, 14410–14430.
Carlini, N., & Wagner, D., 2017. Towards evaluating the
robustness of neural networks. In Proceedings of the
2017 IEEE Symposium on Security and Privacy, 39–57.
Eykholt, K., 2018. Robust physical-world attacks on deep
learning visual classification. In Proceedings of the
IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 1625–1634.
Goodfellow, I. J., Shlens, J., & Szegedy, C., 2015.
Explaining and harnessing adversarial examples. In
International Conference on Learning Representations
(ICLR).
Kang, W., Li, Y., & Zhao, H., 2022. Adversarial robustness
in deep learning: A comprehensive review. In ACM
Computing Surveys, 55(2), Article 45.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., & Vladu,
A., 2018. Towards deep learning models resistant to
adversarial attacks. In International Conference on
Learning Representations (ICLR).
Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik,
Z. B., & Swami, A., 2017. Practical black-box attacks
against machine learning. In Proceedings of the 2017
ACM on Asia Conference on Computer and
Communications Security (ASIACCS), 506–519.
Shafahi, A., Huang, W. R., Studer, C., Feizi, S., &
Goldstein, T., 2019. Are adversarial examples
inevitable? In International Conference on Learning
Representations (ICLR).
Xie, C., Wang, J., Zhang, Z., Ren, Z., & Yuille, A., 2020.
Enhanced adversarial training for robust deep neural
networks. In IEEE Transactions on Neural Networks
and Learning Systems, 31(9), 3414–3426.
Zhang, Y., Chen, X., & Liu, J., 2021. A survey on
adversarial attacks and defenses in deep learning. In
IEEE Access, 9, 112345–112367.