Adversarial Attacks and Robustness in AI: Methods, Empirical
Analysis, and Defense Mechanisms
Chenglin Song
Beijing Lize International Academy, Beijing, 100073, China
https://orcid.org/0009-0004-2343-0182
Keywords: Adversarial Attacks, Robustness, Deep Learning, Defense Mechanisms, Security.
Abstract: Adversarial attacks pose significant threats to modern artificial intelligence (AI) systems by introducing subtle
perturbations into input data that can drastically alter model predictions. These attacks have serious
implications in safety-critical applications such as autonomous driving and healthcare, where reliability and
robustness are essential. In addition to computer vision systems, adversarial vulnerabilities have been
observed in natural language processing and speech recognition, further highlighting the broad scope of this
issue. This paper provides an integrative review of adversarial attack generation techniques, discusses
empirical findings on AI robustness, and surveys existing defense mechanisms. Through an examination of
state-of-the-art research, current limitations are highlighted, and directions for developing more resilient AI
models are proposed. Practical considerations and potential future applications are also outlined with the goal
of informing both theoretical inquiry and real-world deployment strategies. Recent studies have further
expanded on these topics by emphasizing enhanced adversarial training methods and layered defense
architectures, which are also discussed in the context of new empirical evidence.
1 INTRODUCTION
As artificial intelligence (AI) systems become
increasingly integrated into diverse applications—
ranging from image recognition and speech
processing to medical diagnosis, financial modeling,
and industrial automation—ensuring their security
and dependability has emerged as a paramount
concern. One of the most critical vulnerabilities stems
from adversarial examples, which are inputs
deliberately altered with subtle perturbations. These
changes, often undetectable to the human eye or ear,
can lead to significant misclassifications by AI
models, compromising their reliability. The
pioneering work of (Szegedy et al., 2014) and
(Goodfellow et al., 2015) first exposed these
weaknesses, revealing that even advanced neural
networks possess decision boundaries that are highly
exploitable through minor input modifications. This
discovery has since spurred extensive research into
adversarial vulnerabilities across multiple domains.
The real-world implications of adversarial attacks
are profound and far-reaching. In autonomous driving,
for example, slight alterations to road signs—such as
adding small stickers or modifying colors—can
mislead a vehicle’s perception system, potentially
causing accidents or endangering lives. In healthcare,
adversarial tampering with diagnostic images like X-
rays or MRIs could lead to erroneous diagnoses,
posing risks to patient safety and eroding confidence
in AI-assisted medical tools. Beyond these high-
stakes areas, adversarial threats have also emerged in
natural language processing (e.g., manipulating text to
deceive sentiment analysis), speech recognition (e.g.,
embedding inaudible commands), and reinforcement
learning (e.g., altering reward structures). This
pervasive vulnerability underscores the urgent need
for robust countermeasures to protect AI systems in
an increasingly digitized world.
This paper offers a comprehensive analysis of
adversarial attack methodologies, evaluates the
effectiveness of various defense strategies through
empirical testing, and proposes future research
directions to enhance AI robustness. The rapid
evolution of attack techniques, coupled with the
limitations of existing defenses, has created a
dynamic “arms race” between attackers and
defenders. Recent advancements, including improved
adversarial training techniques and the adoption of
multi-layered defense systems, provide hopeful
pathways forward. This study incorporates these
developments, leveraging empirical data from
benchmark datasets to assess model performance
under adversarial conditions. The paper is organized
as follows: Section 2 reviews key attack and defense
methods, Section 3 presents experimental results and
their practical implications, and Section 4 concludes
with key insights and future research priorities.
2 METHODS
2.1 Attack Methods
The study of adversarial attacks has led to the
development of several distinct strategies, each
targeting specific weaknesses in AI models. These
strategies are broadly classified into gradient-based
attacks, optimization-based attacks, and black-box as
well as transfer attacks, reflecting the growing
complexity of adversarial techniques.
Gradient-based attacks exploit the gradients of the
loss function with respect to the input data to identify
directions where small perturbations can significantly
impact model outputs. The Fast Gradient Sign
Method (FGSM), proposed by (Goodfellow et al.,
2015), generates adversarial examples in a single step
by adjusting the input based on the sign of the
gradient. This method applies a perturbation scaled
by a parameter that controls its magnitude, ensuring
the change remains subtle yet effective. A more
advanced technique, the Projected Gradient Descent
(PGD) attack, refines this approach by iteratively
applying smaller gradient steps and projecting the
result back into a constrained region to limit
perturbation size. Research by (Madry et al.,2018)has
shown that PGD is particularly effective as a first-
order adversary due to its iterative nature.
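
To make these mechanics concrete, a minimal PyTorch sketch of FGSM and PGD is given below; the model handle, perturbation budget, step size, and iteration count are illustrative assumptions rather than settings prescribed by the cited works.

import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.1):
    """Single-step FGSM: perturb the input along the sign of the loss gradient."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()

def pgd_attack(model, x, y, epsilon=0.1, alpha=0.01, steps=40):
    """Iterative PGD: repeated small gradient steps, each projected back into the L-inf ball."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Project back into the epsilon-ball around the clean input and the valid pixel range.
        x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon)
        x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv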
Optimization-based attacks, such as the Carlini &
Wagner (C&W) attack (Carlini & Wagner, 2017),
treat the creation of adversarial examples as an
optimization problem. This
approach seeks the smallest perturbation that causes
misclassification, making it highly effective against
defenses that obscure gradients. Black-box and
transfer attacks, on the other hand, operate without
direct access to model parameters. Black-box attacks
estimate gradients by querying the model with
different inputs and analyzing the resulting
confidence scores, while transfer attacks leverage the
observation that adversarial examples designed for
one model can often deceive others with similar
architectures. Recent studies by (Zhang et al.,2021)
emphasize that transferability remains a significant
challenge, particularly as models grow more complex,
necessitating adaptive defense mechanisms.
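
As a rough illustration of how a black-box attacker can proceed from confidence scores alone, the sketch below estimates the input gradient by finite differences; the query_probs interface and the sampling parameters are hypothetical placeholders rather than any particular system's API.

import numpy as np

def estimate_gradient(query_probs, x, true_label, sigma=1e-3, n_samples=50):
    """Estimate the loss gradient from returned class probabilities, without model internals."""
    grad = np.zeros_like(x)
    base_loss = -np.log(query_probs(x)[true_label] + 1e-12)
    for _ in range(n_samples):
        direction = np.random.randn(*x.shape)
        perturbed_loss = -np.log(query_probs(x + sigma * direction)[true_label] + 1e-12)
        # Each query pair contributes a directional finite-difference estimate.
        grad += (perturbed_loss - base_loss) / sigma * direction
    return grad / n_samples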
2.2 Defense Mechanisms
To counter the evolving landscape of adversarial
attacks, researchers have developed a variety of
defense strategies aimed at enhancing AI robustness.
One widely adopted approach is adversarial training,
which involves augmenting the training dataset with
adversarial examples to improve model resilience.
This method can substantially boost resistance to
specific attack types encountered during training;
however, it is computationally demanding and may
not generalize well to new or adaptive threats. Recent
advancements by (Xie et al., 2020) suggest that
combining adversarial training with regularization
techniques can enhance its adaptability, offering a
potential solution to these challenges.
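
A minimal sketch of such an adversarial training loop is shown below, reusing the pgd_attack sketch from Section 2.1; the equal clean-adversarial weighting and the optimizer and loader handles are assumptions made only for illustration.

import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer, epsilon=0.1):
    """One epoch of training on a mix of clean and PGD-perturbed batches."""
    model.train()
    for x, y in loader:
        # Craft adversarial versions of the current batch on the fly.
        x_adv = pgd_attack(model, x, y, epsilon=epsilon)
        optimizer.zero_grad()
        # Combine clean and adversarial losses before the parameter update.
        loss = 0.5 * F.cross_entropy(model(x), y) + 0.5 * F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()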
Another strategy, gradient masking or obfuscation,
modifies the model’s gradients to make it harder for
attackers to compute effective perturbations. While
this can provide temporary protection, many such
techniques have been circumvented by adaptive
attacks that use alternative methods, such as finite
differences, to approximate gradients. Input
transformations, including random resizing or JPEG
compression, offer a different approach by disrupting
adversarial patterns in the data. These methods are
computationally efficient and provide moderate
improvements in robustness, though their
effectiveness diminishes against sophisticated
attackers who can adapt to these changes. An
emerging area of interest is certified robustness,
which uses formal verification or robust optimization
to provide theoretical guarantees of resilience within
a specific perturbation range. Despite their potential,
these methods face scalability issues, as noted in
recent work by (Kang et al., 2022), which explores
ways to make them more practical for larger networks.
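
For the input-transformation defenses mentioned above, a simple sketch combining random resizing with a JPEG round-trip is given below; the size range, JPEG quality, and reliance on torchvision and Pillow are illustrative choices rather than a prescription from the cited work.

import io
import random
from PIL import Image
from torchvision import transforms

def transform_input(x, min_size=28, max_size=36, jpeg_quality=75):
    """Randomly resize then JPEG-compress an image tensor (C, H, W) scaled to [0, 1]."""
    _, h, w = x.shape
    size = random.randint(min_size, max_size)
    img = transforms.ToPILImage()(x)
    img = img.resize((size, size))
    # Round-trip through JPEG to disrupt fine-grained adversarial patterns.
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=jpeg_quality)
    buf.seek(0)
    img = Image.open(buf).resize((w, h))
    return transforms.ToTensor()(img)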
The ongoing evolution of attack strategies
indicates that no single defense is universally
effective. A consensus is emerging within the
research community that a layered defense
approach—integrating adversarial training, input
transformations, and certified robustness—may offer
the best path to long-term resilience. This multi-
faceted strategy aims to address the diverse and
adaptive nature of adversarial threats, ensuring AI
systems remain secure in dynamic real-world
environments.
3 RESULTS AND DISCUSSION
3.1 Experimental Setup
To evaluate the impact of adversarial attacks and the
effectiveness of defense mechanisms empirically,
experiments were conducted using two standard
image classification datasets: MNIST and CIFAR-10.
The Modified National Institute of Standards and
Technology (MNIST) dataset consists of 60,000
training images and 10,000 testing images, featuring
handwritten digits in 28×28 pixel grayscale format.
The Canadian Institute for Advanced Research
(CIFAR-10) dataset includes 50,000 training images
and 10,000 testing images, with 32×32 pixel color
images across 10 classes.
For the MNIST experiments, a convolutional
neural network (CNN) was employed, consisting of
two convolutional layers with ReLU activations, a
max-pooling layer, and two fully connected layers.
This baseline model achieved approximately 99%
accuracy on clean, unperturbed data. For CIFAR-10,
a deeper CNN inspired by the VGG architecture was
used, reaching around 86% accuracy on clean inputs.
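
One plausible PyTorch rendering of the MNIST baseline is sketched below; the channel widths and hidden-layer size are assumptions, since only the general layer structure (two convolutional layers with ReLU, max-pooling, and two fully connected layers) is reported.

import torch.nn as nn

class MnistCNN(nn.Module):
    """Baseline CNN for 28x28 grayscale digits, following the layer structure described above."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 28x28 -> 14x14
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 14 * 14, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))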
Both models were subjected to three attack
types—FGSM, PGD, and C&W—and tested with
three defense strategies: no defense, adversarial
training, and input transformation (via random
resizing or basic compression). This experimental
design enabled a thorough assessment of how
different attacks and defenses interact, providing
insights into their overall impact on classification
performance.
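
The evaluation protocol can be summarized by a routine of the following form, where attack_fn and defense_fn stand in for the attack and input-transformation sketches above; the interface names are illustrative, not taken from the original experiments.

import torch

def robust_accuracy(model, loader, attack_fn=None, defense_fn=None):
    """Accuracy under an optional attack and an optional per-image input transformation."""
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        # Attacks need gradients with respect to the input, so they run outside no_grad.
        x_eval = attack_fn(model, x, y) if attack_fn is not None else x
        if defense_fn is not None:
            x_eval = torch.stack([defense_fn(img) for img in x_eval])
        with torch.no_grad():
            preds = model(x_eval).argmax(dim=1)
        correct += (preds == y).sum().item()
        total += y.numel()
    return 100.0 * correct / total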
3.2 Quantitative Findings
The results demonstrate that baseline models without
defenses experience significant accuracy declines
when exposed to adversarial attacks, with iterative
methods like PGD and optimization-based C&W
attacks causing the most substantial drops. The
findings for the MNIST and CIFAR-10 datasets are
summarized in Tables 1 and 2, respectively.
Table 1: MNIST Accuracy (%) under Adversarial Attacks and Defenses

Attack   Defense                 Accuracy
FGSM     No Defense              75
FGSM     Adversarial Training    88
FGSM     Input Transformation    85
PGD      No Defense              40
PGD      Adversarial Training    70
PGD      Input Transformation    48
C&W      No Defense              35
C&W      Adversarial Training    62
C&W      Input Transformation    45
Table 2: CIFAR-10 Accuracy (%) under Adversarial Attacks and Defenses

Attack   Defense                 Accuracy
FGSM     No Defense              60
FGSM     Adversarial Training    75
FGSM     Input Transformation    70
PGD      No Defense              25
PGD      Adversarial Training    40
PGD      Input Transformation    30
C&W      No Defense              20
C&W      Adversarial Training    35
C&W      Input Transformation    28
These tables reveal that undefended models suffer
dramatic accuracy losses, with MNIST dropping from
99% to 35% under C&W attacks, and CIFAR-10
declining from 86% to 20%.
Adversarial training consistently improves
performance, though it does not fully neutralize
strong attacks, while input transformation offers
moderate enhancements but remains inadequate
against adaptive threats.
3.3 Discussion of Practical Implications
The experimental results highlight several key
insights with significant implications for deploying
AI systems in safety-critical contexts. The
pronounced vulnerability of baseline models to
adversarial attacks underscores the immediate need
for robust defense mechanisms, as even minor
perturbations can lead to catastrophic failures in areas
like autonomous driving or healthcare. The substantial
accuracy reductions observed—particularly with
PGD and C&W attacks—illustrate the real-world
risks of misclassification, emphasizing the
importance of proactive security measures.
Adversarial training proves effective but is
hindered by its high computational cost and limited
ability to generalize to unseen attacks, presenting
challenges for resource-constrained settings such as
edge computing devices.
This limitation suggests a need for innovative
training methods that optimize robustness while
minimizing resource demands. Input transformations,
while computationally lightweight, provide only
partial protection, indicating that attackers may
eventually develop strategies to overcome these
defenses (Shafahi et al., 2019).
The potential of layered defense systems, which
combine multiple approaches, is supported by recent
research, suggesting that hybrid strategies could
address the shortcomings of individual methods more
effectively (e.g., Kang et al., 2022; Xie et al., 2020).
Additionally, the findings have broader
implications for AI trustworthiness and deployment.
As adversarial vulnerabilities extend beyond image
classification to domains like natural language
processing and speech recognition, developing cross-
modal defense strategies becomes crucial.
The dynamic nature of this field requires
continuous monitoring and adaptation of defense
mechanisms to counter emerging attack techniques.
Furthermore, the societal impact of adversarial
robustness—encompassing user trust, regulatory
compliance, and ethical considerations—warrants
further exploration. Integrating these factors into
future research will ensure that AI systems are not
only secure but also aligned with societal needs and
expectations (Papernot et al., 2017).
4 CONCLUSION
This paper has provided a detailed review of
prominent adversarial attack methods, encompassing
single-step, iterative, and optimization-based
approaches, and has surveyed existing defense
mechanisms, including adversarial training, gradient
masking, input transformations, and certified
robustness.
Empirical evaluations on the MNIST and CIFAR-
10 datasets confirm that adversarial perturbations can
severely degrade AI performance, highlighting the
critical need for robust defenses in safety-critical
applications. While adversarial training and input
transformations enhance resilience, they fall short of
providing comprehensive protection against adaptive
or novel attacks, perpetuating the adversarial arms
race.
The widespread vulnerabilities of current AI
models, particularly without effective defenses, pose
significant risks, with potential misclassifications
leading to serious real-world consequences. Partial
solutions like adversarial training offer improvements
but lack the flexibility to address evolving threats,
underscoring the need for dynamic and adaptive
defense strategies.
Future research should focus on developing
scalable certified defenses that offer theoretical
guarantees of robustness, despite current
computational limitations, and extend validation
across diverse domains such as natural language
processing and speech recognition. Efficient training
pipelines, potentially leveraging transfer learning or
distributed computing, could reduce the
computational burden, making robust AI more
accessible.
Moreover, ethical and regulatory
considerations—such as liability, transparency, and
fairness—require collaboration among technologists,
policymakers, and ethicists to establish robust
governance frameworks.
The adoption of layered defense systems, which
integrate technical innovations with practical
feasibility, represents a promising direction for
enhancing AI security.
As adversarial threats continue to evolve,
sustained research and interdisciplinary cooperation
are essential to developing reliable and secure AI
systems. By addressing these challenges
comprehensively, the field can move toward a future
where AI robustness is a foundational standard,
ensuring its safe and effective deployment across all
sectors.
REFERENCES
Akhtar, N., & Mian, A., 2018. Threat of adversarial attacks
on deep learning in computer vision: A survey. In IEEE
Access, 6, 14410–14430.
Carlini, N., & Wagner, D., 2017. Towards evaluating the
robustness of neural networks. In Proceedings of the
2017 IEEE Symposium on Security and Privacy, 39–57.
Eykholt, K., 2018. Robust physical-world attacks on deep
learning visual classification. In Proceedings of the
IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 1625–1634.
Goodfellow, I. J., Shlens, J., & Szegedy, C., 2015.
Explaining and harnessing adversarial examples. In
International Conference on Learning Representations
(ICLR).
Kang, W., Li, Y., & Zhao, H., 2022. Adversarial robustness
in deep learning: A comprehensive review. In ACM
Computing Surveys, 55(2), Article 45.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., & Vladu,
A., 2018. Towards deep learning models resistant to
adversarial attacks. In International Conference on
Learning Representations (ICLR).
Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik,
Z. B., & Swami, A., 2017. Practical black-box attacks
against machine learning. In Proceedings of the 2017
ACM on Asia Conference on Computer and
Communications Security (ASIACCS), 506–519.
Shafahi, A., Huang, W. R., Studer, C., Feizi, S., &
Goldstein, T., 2019. Are adversarial examples
inevitable? In International Conference on Learning
Representations (ICLR).
Xie, C., Wang, J., Zhang, Z., Ren, Z., & Yuille, A., 2020.
Enhanced adversarial training for robust deep neural
networks. In IEEE Transactions on Neural Networks
and Learning Systems, 31(9), 3414–3426.
Zhang, Y., Chen, X., & Liu, J., 2021. A survey on
adversarial attacks and defenses in deep learning. In
IEEE Access, 9, 112345–112367.