This study performs comprehensive prompt
injection attack tests on different widely-used LLM
architectures and compares their detection
performance across various attack types. This study
will provide a detailed analysis of how prompt
injection attacks affect LLMs, examine the
vulnerability differences between attack types, and
evaluate the effectiveness of current defense
mechanisms. This study provides an open dataset,
experimental framework, and model comparison to
contribute both academically and practically to LLM
security research.
To present the full scope of the study, the second
section reviews previous research related to prompt
injection attacks and existing defense strategies. The
third section defines the problem, introduces the
attack categories and dataset structure, and presents
the experimental setup. The fourth section reports the
experimental results across multiple LLMs and
analyzes performance. Finally, the fifth section
concludes the paper by summarizing key
contributions and outlining directions for future
research.
2 RELATED WORKS
The rapid development of deep learning techniques in
NLP has been primarily driven by advances in neural
network architectures. Early models such as
Recurrent Neural Networks (RNN) and Long Short-
Term Memory (LSTM) networks played a pivotal
role in sequential data processing. However, they
exhibited notable shortcomings in capturing long-
range dependencies and enabling parallel
computation (Hochreiter & Schmidhuber, 1997; Cho
et al., 2014). To address these limitations, the
Transformer architecture was introduced, as outlined
in “Attention is All You Need” (Vaswani et al., 2017).
By leveraging self-attention mechanisms,
Transformers provide stronger contextual
representations and enhanced computational
efficiency. They have since become the foundation of
large-scale models such as GPT, BERT, and T5,
which achieve near-human performance in tasks
including text generation, sentiment analysis,
summarization, code generation, and translation
(Devlin et al., 2019).
Despite these advancements, LLMs remain
vulnerable to adversarial manipulation through
malicious prompts. In particular, prompt injection
attacks are designed to mislead models into
generating outputs that deviate from their intended
task scope. Choi and Kim (2024) argued that the root
of these vulnerabilities lies in the models’ limited
command parsing ability and insufficient contextual
filtering, suggesting that structural weaknesses—
rather than adversarial prompts alone—enable such
attacks.
Prompt injection attacks are typically divided into
two categories: direct and indirect. Li and Zhou
(2024) demonstrated that goal-driven direct attacks,
often optimized through advanced techniques, can not
only alter generated outputs but also redefine the
model’s task boundaries. These findings highlight
how adversaries exploit models’ task adherence to
redirect outputs entirely. In contrast, Thapa and Park
(2024) showed that indirect attacks are particularly
difficult to detect with conventional filtering
methods, advocating for forensic analysis-based
detection as a more robust alternative. Such attacks
are especially concerning because they can bypass
internal safety mechanisms in addition to
manipulating content. Expanding on this, Ferrag
(2025) proposed a taxonomy of prompt injection
attack surfaces, including content injection, context
manipulation, and task redirection, thereby providing
a structured framework for developing defense
strategies. Similarly, Singh and Verma (2023)
demonstrated that the vulnerability of LLMs varies
across architectures, underscoring the necessity for
architecture-specific countermeasures.
As these threats increase, prompt injection attacks
are now widely recognized as a major concern in both
academia and industry. For example, the OWASP
Foundation listed prompt injection as a top security
risk in its Top 10 for LLM Applications (OWASP,
2024). Likewise, the National Institute of Standards
and Technology (NIST) highlighted risks such as
mission drift, information leakage, and system
manipulation in its Generative AI Risk Management
Framework (NIST, 2023).
From a defense perspective, current mitigation
strategies predominantly rely on rule-based methods
and content filtering. However, these approaches
exhibit critical weaknesses in real-world applications.
Chen and Kumar (2025) demonstrated that such
defenses often suffer from high false-positive rates
and poor detection of adversarial content. Therefore,
their effectiveness and scalability are significantly
limited.
Although significant progress has been made in
the literature on prompt injection attacks, current
research still faces key structural and methodological
limitations. A major issue is the lack of
comprehensive, publicly available datasets that
reflect real-world scenarios and diverse attack types,
making systematic evaluation difficult. Furthermore,
the limited number of comparative analyses across
different architectures restricts understanding of
model vulnerabilities. Finally, the absence of