Beyond Rules: How Large Language Models Are Redefining
Cryptographic Misuse Detection
Zohaib Masood and Miguel Vargas Martin
Ontario Tech University, Oshawa, Canada
Keywords:
Cryptographic Misuse Detection, Large Language Models, Static Analysis.
Abstract:
The use of Large Language Models (LLMs) in software development is rapidly growing, with developers in-
creasingly relying on these models for coding assistance, including security-critical tasks. Our work presents
a comprehensive comparison between traditional static analysis tools for cryptographic API misuse detec-
tion—CryptoGuard, CogniCrypt, and Snyk Code—and the LLMs—GPT, Llama, Claude, and Gemini. Using
benchmark datasets (OWASP, CryptoAPI, and MASC), we evaluate the effectiveness of each tool in identi-
fying cryptographic misuses. Our findings show that GPT 4-o-mini surpasses current state-of-the-art static
analysis tools on the CryptoAPI and MASC datasets, though it lags on the OWASP dataset. Additionally,
we assess the quality of LLM responses to determine which models provide actionable and accurate advice,
giving developers insights into their practical utility for secure coding. This study highlights the comparative
strengths and limitations of static analysis versus LLM-driven approaches, offering valuable insights into the
evolving role of AI in advancing software security practices.
1 INTRODUCTION
Protecting sensitive data on digital devices from
eavesdropping or forgery relies primarily on cryptog-
raphy. To ensure effectiveness in protection, the cryp-
tographic algorithms employed must be conceptually
secure, implemented accurately, and utilized securely
within the relevant application. Despite the existence
of mature and still secure cryptographic algorithms,
numerous studies have pointed out that application
developers face challenges in utilizing the Applica-
tion Programming Interfaces (APIs) of libraries that
incorporate these algorithms. As an illustration, Lazar
et al. (Lazar et al., 2014) examined 269 vulnerabilities
related to cryptography and discovered that merely
17% are associated with flawed algorithm implemen-
tations, while the remaining 83% stem from the mis-
use of cryptographic APIs by application develop-
ers. Additional research indicates that around 90%
of applications utilizing cryptographic APIs include
at least one instance of misuse (Chatzikonstantinou
et al., 2016; Egele et al., 2013).
Software designers and developers need to tackle
this security flaw by ensuring that their applications
encrypt the sensitive data they handle and store. De-
spite the availability of educational materials (Graff
and Wyk, 2003) aimed at raising awareness and pro-
viding guidance on secure code development, many
developers remain unaware of security considera-
tions (Xie et al., 2011). Training initiatives do not reach all programmers (Xie et al., 2011), and security is frequently treated as a secondary, desirable goal rather than a mandatory one (Whitten, 2004), depending on the perceived risk and criticality of the application being developed. Typically, de-
velopers prioritize meeting functionality and time-to-
market requirements as their primary goals.
In exploring the factors contributing to this preva-
lent misuse, researchers previously triangulated find-
ings from four empirical studies, including a survey
involving Java developers with prior experience us-
ing cryptographic APIs (Nadi et al., 2016). Their
findings reveal that a significant majority of partici-
pants encountered difficulties in utilizing the respec-
tive APIs. Several other tools (Rahaman et al., 2019; Krüger et al., 2017; Zhang et al., 2019; Kafader and Ghafari, 2021) have been developed to detect cryptographic misuses; however, the robustness of these tools is still in question (Zhang et al., 2023; Ami et al., 2022). In addition, novice developers have started adopting AI-assisted tools to solve their programming problems. Recent advancements encompass GitHub's Copilot (GitHub AI Pair Programmer, nd), DeepMind's AlphaCode (Li et al., 2022),
Amazon’s Q Developer (Amazon Q Developer, nd),
Tabnine (Tabnine, nd), Google’s Gemini (Gem-
ini, nd), Meta’s Llama (Llama, nd), Anthropic’s
Claude (Claude, nd), Qwen’s QwenLM (QwenLM,
nd) and OpenAI's ChatGPT (ChatGPT, nd)—nine systems capable of translating a problem description into code. Prior results suggest that OpenAI's ChatGPT reliably produces Java programming solu-
tions known for their elevated readability and well-
organized structure (Ouh et al., 2023).
Prior research has primarily tested Large Lan-
guage Models (LLMs) using limited benchmark
datasets, offering an initial understanding of their per-
formance. However, a comprehensive study evalu-
ating LLM effectiveness across a broader range of
benchmarks is still lacking, and no prior work has
closely examined the quality of LLM responses, par-
ticularly their accuracy and actionability for practi-
cal use. This study addresses these gaps by ana-
lyzing LLM performance across diverse datasets and
introducing a framework to assess response qual-
ity. We assess LLM efficacy in identifying crypto-
graphic misuses and compare it with state-of-the-art
(SOTA) static cryptographic analysis tools from lit-
erature, such as CogniCrypt (Krüger et al., 2017),
CryptoGuard (Rahaman et al., 2019), and an industry-
based tool, Snyk Code (Snyk, nd). To evaluate LLM
accuracy, we used well-known benchmark datasets
for cryptographic misuse detection, namely OWASP
Bench (OWASP Benchmark, 2016) and CryptoAPI
Bench (Afrose et al., 2019). Additionally, we tested
LLM robustness in detecting mutated test cases,
where many static tools often struggle (Ami et al.,
2022), and evaluated the quality of LLM responses
using two metrics, Actionability and Specificity, to
assess whether responses can help developers identify
and fix misuses.
Our research aims to address the following
research questions (RQs):
RQ1. How effective are LLMs in detecting crypto-
graphic misuses compared to other static tools?
The static tools tested focus on slightly different
pattern sets, leading to varied trade-offs in precision
and recall, both among the tools themselves and
in comparison to LLMs. To evaluate the effective-
ness of LLMs in detecting cryptographic misuses
compared to static tools, we ran test cases from the
CryptoAPI and OWASP benchmarks. Based on the
findings, GPT had a better detection rate across both
benchmarks. For CryptoAPI, GPT missed only 3 true
instances, with CryptoGuard lagging behind at 24
misses. In the OWASP benchmark, GPT identified
all true instances, while CryptoGuard missed 40 true
misuse cases. However, for the OWASP benchmark,
GPT had a high false positive rate, indicating that no
tool is universally superior across both benchmarks.
RQ2. How robust are LLMs in detecting mutated
test cases that other static tools fail to detect?
To evaluate the robustness of LLMs, we ran test
cases from a manually curated MASC dataset (Ami
et al., 2022) against LLMs to see if LLMs can detect
these mutations of cryptographic code effectively.
Our results suggest that GPT performed better than
other LLMs in detecting cryptographic mutations.
RQ3. How do prevalent LLMs compare in detect-
ing cryptographic misuse and providing actionable,
specific guidance for developers?
To address this, we compared the performance of
LLMs across both benchmarks and a mutated dataset
to evaluate which LLM performs best. Additionally,
we introduced a method to assess whether an LLM’s
response can assist developers in fixing misuses.
Since LLMs often generate text-heavy responses that
may not effectively help developers, we implemented
a keyword-based approach to analyze LLM outputs.
This approach identifies which LLM provides more
actionable and specific guidance for developers to
address misuse instances.
Our main contributions from this work are as fol-
lows:
We are the first to conduct a comprehensive eval-
uation of the effectiveness of LLMs in detect-
ing cryptographic misuses, comparing their per-
formance to that of SOTA static analysis tools. It
highlights the strengths and weaknesses of LLMs
compared to static tools, offering insights into
their reliability and applicability for secure soft-
ware development.
We assess the robustness of LLMs by evaluating
their ability to detect cryptographic misuse in mu-
tated test cases that static tools often miss. This
contribution explores whether LLMs can adapt to
variations in misuse patterns, providing an under-
standing of their resilience and potential superior-
ity in handling unconventional or complex misuse
scenarios.
To systematically assess the practicality of LLM-
generated guidance, we develop a keyword-based
framework that scans LLM responses for action-
able elements. This framework serves as a tool to
measure the usability of LLM outputs for devel-
opers, supporting the identification of models that
best fulfill developers’ needs in addressing cryp-
tographic misuse. To the best of our knowledge,
this is the first work that explores the actionability
and specificity of LLM responses.
In the next section, we cover the motivation for
our work. Section 3 discusses background and related
work. Section 4 explains the methodology, while Sec-
tion 5 presents the results and findings. Section 6 cov-
ers the discussion, followed by Section 7 on ecolog-
ical validity. Section 8 mentions the limitations and
future directions, and finally we conclude our paper
in Section 9.
2 MOTIVATION
To proactively identify vulnerabilities before soft-
ware release, individuals without specialized exper-
tise—such as software developers or quality assur-
ance teams—are likely to integrate cryptographic
API misuse detectors, commonly known as crypto-
detectors, into the Continuous Integration/Continuous
Delivery (CI/CD) pipeline. Examples of these de-
tectors include CryptoGuard (Rahaman et al., 2019)
and CogniCrypt (Krüger et al., 2017). Consequently,
the incapability of a crypto detector to identify an in-
stance of misuse that it claims to detect has a direct
impact on the security of end-user software. We illus-
trate this challenge with an example.
Imagine Johnny, a Java developer utilizing Cryp-
toGuard (Rahaman et al., 2019), an advanced crypto detector, to pinpoint
cryptographic vulnerabilities in his software before
its release. In one of his applications, Johnny opts to
employ the DES cipher, outlined as follows:
Cipher cipher = Cipher.getInstance("des");
This signifies an instance of misuse involving the
use of the vulnerable DES cipher. CryptoGuard falls
short in recognizing this vulnerability because Johnny
utilizes “des” instead of “DES” as the parameter (Ami
et al., 2022). However, this presents a concern be-
cause the difference in lower case for the parameter
lacks functional significance, as Java officially sup-
ports both upper case and lower case. Due to Crypto-
Guard’s oversight of this vulnerability, Johnny might
erroneously believe that his application is secure and
may proceed with releasing it to end-users.
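To make the scenario concrete, the snippet below is a minimal, hypothetical sketch (the class and method names are ours, not part of CryptoGuard or the benchmarks): the Java Cryptography Architecture treats algorithm names case-insensitively, so the lowercase transformation behaves exactly like "DES", yet a detector keyed on the exact uppercase literal can miss it. The second method shows a commonly recommended alternative.

import javax.crypto.Cipher;

public class JohnnyExample {

    // Misuse: DES is a weak cipher. The JCA accepts "des" and "DES"
    // interchangeably, so a rule matching only the literal "DES" may miss this.
    static Cipher weakCipher() throws Exception {
        return Cipher.getInstance("des");
    }

    // A commonly recommended replacement: AES in GCM mode.
    static Cipher strongerCipher() throws Exception {
        return Cipher.getInstance("AES/GCM/NoPadding");
    }
}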
The same example was tested with ChatGPT
by providing a starting prompt “Can you detect
cryptographic API misuse in the given Java code?”.
ChatGPT generated the following response:
“The line of code you provided initializes a
Cipher object in Java using the DES (Data Encryp-
tion Standard) algorithm. However, it’s important
to note that DES is considered insecure for many
applications due to its small key size (56 bits) and
vulnerability to brute-force attacks. It is generally
recommended to use more secure algorithms like AES
(Advanced Encryption Standard) instead.”
In other words, these static tools for cryptographic
misuse detection could have fundamental flaws pre-
venting them from detecting even straightforward in-
stances of crypto-API misuse found in applications.
On the other hand, ChatGPT’s accurate response to
such test cases raises a curiosity to test whether LLMs
can effectively detect cryptographic misuses better
than these static analysis tools. Moreover, novice developers, intrigued by the capabilities of LLMs, have started adopting the technology by using such platforms to look for coding solutions. Despite this increasing reliance on AI for code solutions, little research has explored whether prominent platforms, such as ChatGPT, are effective in detecting these cryptographic misuses, or whether the code they provide for cryptographic problems is secure. This insight guides our
approach to systematically compare widely known
crypto detectors with LLMs.
Although this early analysis showed promising results for OpenAI's ChatGPT, we conducted an in-depth study to assess the effectiveness of LLMs in detecting cryptographic misuses. LLMs can generate and understand code in various programming languages; however, the static tools and benchmark datasets used for comparison in our study rely only on Java. Therefore, the scope of our work is limited to Java as a programming language.
3 BACKGROUND
Recently, security researchers have expressed signif-
icant interest in externally validating static analysis
tools (Zhang et al., 2023; Ami et al., 2022). Specifi-
cally, there’s a growing recognition that while static
analysis security tools are theoretically sound, they
can be “soundy” in practice. This means they consist
of a core set of sound decisions but also include cer-
tain strategically unsound choices made for practical
reasons, such as performance or precision consider-
ations (CryptoGuardOSS, 2020). In this section, we
provide a brief overview of the static analysis tools
that will be used for comparison with LLMs. More-
over, we also discuss similar work done in this domain
so far.
Table 1: Overview of Existing Work on LLM-Based Detection.
Name Benchmarks LLM-based Detection Evaluating Quality of LLM Response
CryptoAPI MASC OWASP
Firouzi et al. (Firouzi et al., 2024) × × ×
Xia et al. (Xia et al., 2024) × ×
Our Work
CryptoGuard (Rahaman et al., 2019) expands on
Soot (Soot, 2020), a widely used program anal-
ysis framework for statically analyzing Java byte-
code (Vallée-Rai et al., 1999). It targets vulnerabili-
ties associated with CWE-327, CWE-295, CWE-330,
CWE-326, CWE-798, and CWE-757, using context-
and field-sensitive backward and forward slicing for
precise detection. However, since eight of its rules
involve constant value usage, traditional slicing tech-
niques may falsely flag constants unrelated to secu-
rity. To mitigate this, CryptoGuard integrates re-
finement algorithms that reduce false positives using
cryptographic domain knowledge.
CogniCrypt (Krüger et al., 2017) employs rules specified in a domain-specific language (DSL) called CrySL (Krüger et al., 2021) to identify API mis-
uses. CogniCrypt translates CrySL rules into context-
sensitive, flow-sensitive, and demand-driven static
analysis, enabling users to enhance the tool’s capa-
bilities by creating new rules. In case of identified
API misuse, CogniCrypt provides guidance on its res-
olution, such as substituting an insecure parameter
value with a secure one. Its pattern set pertains to
CWE-327, CWE-295, CWE-330, CWE-326, CWE-
798, and CWE-757.
Snyk Code (Snyk, nd) is a static application secu-
rity testing (SAST) tool that helps developers identify
and fix code vulnerabilities. It integrates with devel-
opment environments, continuously scanning code-
bases and providing actionable remediation insights.
Notable for its developer-friendly approach, Snyk
Code supports popular IDEs, CI/CD pipelines, and
repositories. It uses machine learning and semantic
analysis for accurate and relevant issue detection and
supports multiple programming languages and frame-
works, ensuring versatility.
Static analysis tools often suffer from a high
rate of false positives, flagging issues that do not
pose a threat and creating a disconnect between re-
ported misuse alerts and real vulnerabilities. Chen et
al. (Chen et al., 2024) examine these limitations by
analyzing the rules, models, and implementations of
such tools, highlighting the need for improvements in
their precision and usability. In contrast, LLMs are
becoming popular for code analysis. Fang et al. (Fang
et al., 2024) found that advanced LLMs like ChatGPT
3.5 and 4.0 show promise in accurate code review.
This indicates that LLMs could reduce false positives
common in static tools, offering a more reliable alter-
native for vulnerability detection. Liu et al. (Liu et al.,
2024) highlight ChatGPT’s potential in vulnerability
management, demonstrating its effectiveness in tasks
like generating software bug report titles. As LLMs
gain popularity, researchers are exploring their use in
detecting cryptographic misuses (Firouzi et al., 2024;
Xia et al., 2024). However, no prior research has in-
vestigated how LLMs’ text-based responses help de-
velopers fix misuse instances.
Despite advances in static analysis tools for de-
tecting cryptographic misuse, gaps remain in their
coverage and adaptability to varied misuse patterns.
Existing tools struggle with mutated misuse instances,
reducing their effectiveness in real-world scenarios.
Prior research (Masood and Martin, 2024) has ex-
plored LLMs for detecting cryptographic misuses but
has been limited in model comparison. This study
extends previous work by including a broader range
of LLMs for a more thorough evaluation against
static tools and among LLMs. It introduces ROC-
annotated plots for clearer performance comparisons.
Our study evaluates static tools and LLMs using
OWASP and CryptoAPI benchmarks, testing LLMs’
robustness with mutated misuse cases. We also in-
troduce a keyword-based approach to assess the ac-
tionability and specificity of LLM responses in aid-
ing code correction. By addressing these aspects, our
work explores the potential of replacing static tools
with LLMs to help developers write secure code. Ta-
ble 1 highlights the unique contributions of our work
compared to prior research.
4 METHODOLOGY
This section describes our methodology for evaluat-
ing LLMs in cryptographic misuse detection. We as-
sessed LLMs against various datasets and further ana-
lyzed their responses to determine their effectiveness
in detecting misuse instances.
4.1 Selection Criteria
4.1.1 Static Tools
A range of tools exist for detecting cryptographic vul-
nerabilities in Java code, with most prior research fo-
cusing on this language. We selected two academic
tools, CryptoGuard (Rahaman et al., 2019) and Cog-
niCrypt SAST (Krüger et al., 2017), and one industry
tool, Snyk Code (Snyk, nd). We prioritized tools that
accept JAR files and are generic, excluding Android-
specific tools. We also favored widely recognized
tools to enhance the credibility of our comparison
with LLM models. CryptoGuard and CogniCrypt
were set up on an Ubuntu VM, with CryptoGuard re-
quiring Java 8 and CogniCrypt requiring Java 11 or
higher. Snyk Code was installed as a Visual Studio
Code extension. Table 2 lists the static analysis tools
and their versions. Results from all tools were ex-
tracted as JSON files and analyzed against benchmark
labels.
Table 2: Version of Static Analysis Tools.
Tool Version
CryptoGuard 04.05.03
CogniCrypt 3.0.2 and 4.0.1
Snyk Code 2.18.2
4.1.2 Benchmark Datasets
To evaluate the tools’ effectiveness, we conducted ex-
tensive online searches for open-source benchmarks
that classify programs based on correct or incor-
rect use of cryptographic APIs. We selected three
widely used datasets: CryptoAPI Bench (Afrose
et al., 2019), OWASP Benchmark (OWASP Benchmark, 2016), and
MASC dataset (Ami et al., 2022). These datasets pro-
vide test cases for assessing cryptographic misuses in
Java programs.
The CryptoAPI Bench (Afrose et al., 2019) con-
sists of 182 test cases, with 181 labeled and one
unlabeled. Of the labeled cases, 144 are true mis-
uses, and 37 are false misuses. The unlabeled case
was excluded from testing.
The OWASP Benchmark (OWASP Benchmark, 2016) is a Java
test suite evaluating vulnerability detection meth-
ods, incorporating vulnerabilities from the Com-
mon Weakness Enumeration (CWE - Common
Weakness Enumeration, 2021). Version 1.2 con-
tains 2,740 Java programs, but the study focuses
on 975 programs categorized into weak cryptog-
raphy, weak hashing, and weak randomness, with
477 showing misuse and 498 showing proper use.
The MASC dataset, designed by Ami et al. (Ami
et al., 2022), applies mutation testing to identify
vulnerabilities in crypto detectors, assessing their
ability to withstand code modifications. We man-
ually selected 30 test cases from MASC’s mini-
mal test suites, targeting five prevalent flaw types
in modern crypto detectors.
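As a brief, hypothetical illustration of the style of mutation MASC applies (this instance is ours and is not drawn verbatim from the dataset), a detector that expects the weak algorithm name as a single string literal can be challenged by code that assembles the name at runtime:

// Hypothetical MASC-style mutation: equivalent at runtime to
// Cipher.getInstance("DES"), but the weak algorithm name is built
// dynamically instead of appearing as one string literal.
String algo = new StringBuilder("D").append("ES").toString();
Cipher c = Cipher.getInstance(algo);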
The test case outline for the benchmarking
datasets is shown in Table 3. We chose existing
benchmark datasets for two reasons: they are pub-
licly accessible, curated by diverse stakeholders, en-
suring a comprehensive and replicable empirical com-
parison, and benchmarks like OWASP Benchmark are
widely recognized and influential in the industry. Us-
ing these datasets with static analysis tools and LLMs
helped answer RQ1, while testing LLMs with muta-
tions addressed RQ2.
Table 3: Labels of Benchmarking Datasets.
Benchmarking Dataset True Labels False Labels Total
CryptoAPI Bench 144 37 181
MASC 29 1 30
OWASP 477 498 975
4.1.3 Large Language Models
The popular LLM models selected for this study
include OpenAI’s GPT, Google’s Gemini, Meta’s
Llama, and Anthropic’s Claude, chosen for their
advanced capabilities in converting text to code
and identifying vulnerabilities through code analysis.
Each misuse instance was provided with a standard-
ized prompt (details in Appendix), which included in-
formation like the method name, starting line number,
highlighted message, message description (reason for
misuse), lines of code with the misuse, and a label
indicating cryptographic misuse. The LLM responses
followed a similar pattern, providing the misuse label,
method name, starting line number, code line, high-
lighted message, cause explanation, and best prac-
tices. The LLM models and their versions are listed
in Table 4.
Table 4: Version of Large Language Models.
LLM Version Release Date
Meta’s Llama Llama 3.0 April 18, 2024
Open AI’s GPT gpt-4o-mini-2024-07-18 July 18, 2024
Google’s Gemini gemini-1.5-flash-002 September 24, 2024
Anthropic’s Claude Claude 3.5 Haiku October 22, 2024
4.2 Evaluation Metrics
The accuracy of each static tool and LLM is measured using eight metrics: True Positive, False Positive, False Negative, True Negative, Precision, Recall, F1-score, and F2-score.
While the F1-score is an effective measure for
evaluating a tool’s performance, it should not be the
only criterion for tool selection. In a security context,
placing more emphasis on recall through the F2-score is more reliable than relying solely on the F1-score. For
this reason, our study uses the F2-score alongside the
F1-score.
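For reference, both scores are instances of the standard F-beta measure, which weights recall β times as heavily as precision; with precision P and recall R:

\[
F_\beta = \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R}, \qquad
F_1 = \frac{2PR}{P+R}, \qquad
F_2 = \frac{5PR}{4P+R}.
\]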
4.2.1 LLM Specific Metrics
Since Large Language Models generate text, tradi-
tional performance metrics such as precision, recall,
and F-score are insufficient for fully assessing their
capabilities. Therefore, in addition to these metrics,
we employed two measures influenced by Redmiles
et al. (Redmiles et al., 2020) to evaluate security and
privacy advice: Actionability and Specificity.
To evaluate these metrics, we utilized the keywords proposed by Rostami and Karlsson (Rostami and Karlsson, 2024) for deducing actionability in information security policies, using their list to judge whether an LLM response is actionable. The keywords are listed in Table 5; a minimal sketch of the check follows the table. For instance, if the text contains words such as “should”, “need”, or “must”, we categorize it as “Actionable”. Furthermore, a response is considered to have higher “Specificity” if it includes a link to a resource (e.g., a URL) or a CWE reference.
Table 5: Actionable Keywords.
Keywords
never
need
should
not
forbidden
must
shall
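The sketch below illustrates this keyword check (our own illustrative code, not the exact script used in the study); a response counts as actionable if any keyword from Table 5 occurs as a whole word, matched case-insensitively.

import java.util.List;
import java.util.regex.Pattern;

public class ActionabilityCheck {

    // Keywords from Table 5 (Rostami and Karlsson, 2024).
    private static final List<String> KEYWORDS =
            List.of("never", "need", "should", "not", "forbidden", "must", "shall");

    // A response is labeled actionable if any keyword occurs as a whole word.
    static boolean isActionable(String response) {
        for (String kw : KEYWORDS) {
            Pattern p = Pattern.compile("\\b" + kw + "\\b", Pattern.CASE_INSENSITIVE);
            if (p.matcher(response).find()) {
                return true;
            }
        }
        return false;
    }
}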
For example, one of the mutated test cases replaces AES with DES by replacing “A” with “D”. The misuse line is as follows:

Cipher c = Cipher.getInstance("AES".replace("A", "D"));
For this particular case, GPT gave the misuse de-
scription as:
“The usage of "AES".replace("A", "D") is a significant misuse. This expression incorrectly modifies the string that is supposed to define the algorithm name, changing it from "AES" to "DES". Since "DES" is a less secure algorithm, this can lead to vulnerabilities in the encryption implementation realized by this code. This does not represent a proper way to specify or select cryptographic algorithms, as it can easily be manipulated or misunderstood.”
The presence of an actionable keyword “not” in
GPT’s response provides clear feedback to developers
to fix this misuse instance.
In contrast, the absence of actionable keywords in
an LLM response may limit developers’ ability to un-
derstand and address the misuse effectively. For in-
stance, GPT provided the following response in an-
other case where AES was replaced by DES:
“The issue arises because the program allows for the
selection of weak ciphers (like DES) through method
chaining without any safeguards or validation mech-
anisms to ensure that only strong ciphers are used.
This directly impacts the security posture of the ap-
plication by making it susceptible to various crypto-
graphic attacks, thereby constituting a cryptographic
misuse.”
Here, the lack of actionable keywords may limit
the guidance that novice developers need, reducing
the response’s effectiveness in directing them toward
an appropriate fix and failing to emphasize the poten-
tial consequences of the misuse.
4.3 Experiments
Figure 1 shows our evaluation methodology for LLM-
based cryptographic misuse detection. The method-
ology is divided into three sections: LLM vs Static
Tools, Comparing LLMs, and Evaluating LLM re-
sponses.
Figure 1: Evaluation Methodology for Comparing Static
Tools with LLMs.
4.3.1 LLM vs Static Tools
We began our experiments by testing LLMs and static
tools with the CryptoAPI and OWASP benchmarks to
evaluate their effectiveness in detecting cryptographic
misuses, using the evaluation metrics outlined in the
metrics section. This helped us understand how well
each approach identifies cryptographic misuses and
allowed us to compare the strengths of LLMs with
those of static tools. This contributed to our findings
on RQ1.
4.3.2 Comparing LLMs
In the second phase of our experiments, we tested
LLMs using mutated test cases from the MASC
dataset, along with CryptoAPI and OWASP bench-
marks, to evaluate their effectiveness in detecting
cryptographic misuses. This enabled us to assess
the robustness of these LLMs in identifying crypto-
graphic misuse across varied and altered scenarios,
concluding our findings for RQ2.
4.3.3 Evaluating LLM Responses
Additionally, we analyzed the responses of LLMs us-
ing the Actionability and Specificity metrics. If a re-
sponse from an LLM contained a keyword from Ta-
ble 5, it was considered actionable. The number of
actionable responses in a particular dataset represents
the total actionability of the LLM for that dataset. To
determine the Specificity of an LLM response, we
manually reviewed each response to check if the LLM
provided a link, referenced a CWE, or included rec-
ommended code that could assist developers in fixing
the misuse. If the response from an LLM contained
a link, CWE reference, or fixed code, it was consid-
ered specific. The number of test cases with specific
responses in a particular dataset indicates the speci-
ficity of the LLM and is compared with other LLMs
to assess specificity.
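Although specificity was judged manually in our study, the criteria lend themselves to a simple automated pre-screen. The sketch below is illustrative only (in particular, the code-fence heuristic is our assumption and was not part of the manual review): it flags responses containing a URL, a CWE identifier, or a Markdown-style code block for closer inspection.

import java.util.regex.Pattern;

public class SpecificityCheck {

    private static final Pattern URL  = Pattern.compile("https?://\\S+");
    private static final Pattern CWE  = Pattern.compile("CWE-\\d+", Pattern.CASE_INSENSITIVE);
    // Heuristic assumption: suggested fixed code arrives inside Markdown-style fences.
    private static final Pattern CODE = Pattern.compile("```");

    static boolean isSpecific(String response) {
        return URL.matcher(response).find()
                || CWE.matcher(response).find()
                || CODE.matcher(response).find();
    }
}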
This comprehensive approach offered a clearer
understanding of the responses provided by LLMs,
assessing whether each response qualified as action-
able advice for developers. Additionally, it revealed
whether any links provided by the LLMs effectively
guide developers to relevant resources for addressing
specific cryptographic misuses, thus contributing in-
sights for RQ3.
5 RESULTS AND FINDINGS
In this section, we present the results of our work
on detecting cryptographic misuses for CryptoAPI
and OWASP benchmarks, gathered using the selected
static tools and GPT. Furthermore, we extended the
results by comparing LLMs on the MASC dataset to
determine which LLM performed better. We also an-
alyzed the responses from LLMs to gain insights into
which one offered more actionable and specific ad-
vice.
5.1 LLMs vs Static Tools
5.1.1 CryptoAPI Benchmark
For the CryptoAPI benchmark, GPT 4-o-mini achieved
the highest F1-score of 87.6%, F2-score of 93.5%,
and a recall of 97.9%, accurately detecting 141 out
of 144 misuse cases. CryptoGuard followed with an
F1-score of 84.8%, F2-score of 83.9%, and a recall of
83.3%, missing 24 misuse cases. Detailed metrics are
shown in Table 6.
Figure 2: Tools Comparison on CryptoAPI Benchmark.
GPT 4-o-mini’s high accuracy suggests its effec-
tiveness as an alternative to traditional static analy-
sis tools for cryptographic misuse detection. Unlike
static tools that depend on fixed rules, GPT 4-o-mini
utilizes advanced pattern recognition and contextual
adaptability, enabling it to handle complex misuse
cases with high precision and recall, although it shows
a slightly higher rate of false positives.
Figure 3: Detection Results for CryptoAPI Benchmark
Across Different Tools.
This adaptability allows GPT 4-o-mini to outper-
form static tools like CryptoGuard and CogniCrypt,
which are constrained by predefined rules and may
miss specific misuse types, such as Java’s Secret Key
Factory or Hostname Verifier API cases. Key metrics
are illustrated in Figure 2 and Figure 3.
Table 6: Evaluation Metrics for CryptoAPI and OWASP Benchmark.
Benchmark Tool True Positive False Positive False Negative True Negative Precision (%) Recall (%) F1 Score (%) F2 Score (%)
CryptoAPI
GPT 4-o-mini 141 37 3 0 79.2 97.9 87.6 93.5
CryptoGuard 120 19 24 18 86.3 83.3 84.8 83.9
CogniCrypt 4.0.1 113 28 31 9 80.1 78.5 79.3 78.8
CogniCrypt 3.0.2 110 31 34 6 78.0 76.4 77.2 76.7
Snyk Code 93 30 51 7 75.6 64.6 69.7 66.5
OWASP
CryptoGuard 437 27 40 471 94.2 91.6 92.9 92.1
GPT 4-o-mini 477 498 0 0 48.9 100.0 65.7 82.7
Snyk Code 477 498 0 0 48.9 100.0 65.7 82.7
CogniCrypt 4.0.1 259 200 218 298 56.4 54.3 55.3 54.7
CogniCrypt 3.0.2 259 475 218 23 35.3 54.3 42.8 49.0
Finding 1 RQ1: GPT 4-o-mini achieved the high-
est F-score (F1-score: 87.6%, F2-score: 93.5%)
among tools for CryptoAPI benchmark, showing
strong cryptographic misuse detection despite not
being specialized, followed closely by Crypto-
Guard. Snyk Code and CogniCrypt had higher er-
ror rates, making them less reliable overall.
5.1.2 OWASP Benchmark
For the OWASP benchmark, CryptoGuard achieved the
highest F1-score of 92.9%, F2-score of 92.1%, with a
precision of 94.2% and a recall of 91.6%, effectively
detecting true misuse cases while minimizing false
positives. Snyk Code and GPT 4-o-mini followed
with an F1-score of 65.7% and F2-score of 82.7%,
where the higher F2-score reflects stronger emphasis
on recall, highlighting GPT 4-o-mini’s ability to catch
more misuse cases despite lower precision. Results
are summarized in Table 6.
Figure 4: Tools Comparison on OWASP Benchmark.
GPT 4-o-mini’s high false positive rate stems from
its text-based responses, which frequently suggest
best practices even when no misuse is present. This
Figure 5: Detection Results for OWASP Benchmark Across
Different Tools.
led to labeling cases as “TRUE” due to best practice
recommendations, even without actual misuse, reduc-
ing its precision in identifying true misuse cases. Ad-
ditionally, the OWASP test cases had a higher average
line count compared to CryptoAPI test cases, increas-
ing the amount of code GPT had to analyze. With
longer code samples, GPT often provided general rec-
ommendations rather than focusing solely on detect-
ing cryptographic misuses, leading to more false posi-
tives. However, GPT performed more accurately with
smaller code samples than with longer ones. Fig-
ure 4 and Figure 5 illustrate the key metrics for tools
tested on the OWASP Benchmark, showcasing Cryp-
toGuard’s superior performance compared to other
tools.
Finding 2 RQ1: CryptoGuard outperforms other
tools on the OWASP benchmark with a 92.1% F2-
score, while GPT and Snyk Code lag behind at
82.7%, demonstrating CryptoGuard’s higher accu-
racy in detecting cryptographic misuses.
Figure 6: TPR vs FPR for Static Tools and GPT.
5.1.3 ROC Comparison
Receiver Operating Characteristic (ROC) curves help visualize and compare the trade-off between true and false positives across different detection thresholds, providing an intuitive way to assess how well each tool distinguishes between correct and incorrect cryptographic API usage. The comparison of static
tools with GPT across the CryptoAPI and OWASP
benchmarks revealed varied performance. Crypto-
Guard effectively balanced true positive detection and
false positive reduction, outperforming other tools, as
shown in Figure 6. GPT 4-o-mini had high recall
but also a high false positive rate, indicating chal-
lenges with precision in LLM-based vulnerability de-
tection. CogniCrypt and Snyk Code performed simi-
larly to random guessing on CryptoAPI. CogniCrypt
struggled with diverse misuse instances, leading to
higher false positives. The CryptoAPI results were
affected by an imbalance in misuse instances. Reduc-
ing false positives is crucial for practical use, and fu-
ture research should include probability mechanisms
for more accurate comparisons.
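For completeness, the TPR and FPR used in these comparisons follow the standard definitions and can be computed directly from the confusion-matrix counts reported in Table 6:

\[
\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}.
\]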
Finding 3 RQ1: CryptoGuard demonstrates a
good balance between detecting true positives and
minimizing false positives, outperforming other
tools and GPT in some cases.
5.1.4 Time Comparison
The benchmarks were run on tools configured in an
Ubuntu VM hosted on a laptop with the following
specs: Windows 11 OS, i7-1255U CPU (1.70 GHz),
16 GB RAM. The VMs used for the tools had 10 GB
RAM, an 8 GB JVM heap, and 4 processors.
Test cases were executed on GPT via an API, and
response times for each benchmark were recorded.
Static analysis tools, including CryptoGuard, Cog-
niCrypt, and Snyk Code, were run using CLI com-
mands and Visual Studio Code extension. The time
taken by each tool for both benchmarks is shown in
Table 7.
Table 7: Execution Time of Tools for CryptoAPI and
OWASP Benchmark.
Tools CryptoAPI OWASP
CryptoGuard 1m 48s 10m 59s
CogniCrypt 3.0.2 54s 16m 35s
CogniCrypt 4.0.1 1m 7s 23m 26s
Snyk Code 8s 52s
GPT 4-o-mini 8m 45s 71m
Snyk Code was the fastest tool across both bench-
marks, with execution times of 8 seconds for Cryp-
toAPI and 52 seconds for OWASP. CryptoGuard per-
formed well on the CryptoAPI benchmark (1m 48s)
but took longer on the OWASP benchmark (10m 59s).
GPT 4-o-mini had the longest execution times, re-
quiring 8m 45s for CryptoAPI and 71 minutes for
OWASP, highlighting a trade-off between detection
accuracy and slower processing speed, which could
limit scalability for large datasets.
Finding 4 RQ1: Snyk Code is the fastest tool for
both benchmarks, followed by CryptoGuard. GPT
4-o-mini, while strong in detection, has slower
processing speeds, limiting its scalability for larger
datasets.
5.2 Comparing LLMs with Datasets
The performance of selected LLMs was evaluated on
benchmark datasets, including mutated test cases, to
identify the best model. GPT 4-o-mini was accessed
via API, while Gemini 1.5 Flash, Llama 3.0, and
Claude Haiku 3.5 were accessed via chat interfaces.
The models were tested on 30 random test cases from
the CryptoAPI and OWASP benchmarks, as well as
30 manually curated MASC test cases.
For the CryptoAPI benchmark, Claude achieved
the highest F2-score of 94.3% and recall of 100%, fol-
lowed by GPT (97.9% recall, 93.5% F2-score), and
Gemini with the lowest F2-score of 85.5%. For the
OWASP benchmark, GPT led with an F2-score of
82.7%, followed by Gemini at 81.1%, while Llama
and Claude had the lowest F2-scores, both at 76.9%.
All LLMs demonstrated perfect recall (100%), with
Claude showing better performance for the Cryp-
toAPI benchmark and GPT performing better for the
OWASP benchmark, as detailed in Table 8.
On the mutation dataset (MASC), GPT and
Claude both achieved 100% recall and an F2-score
of 99.3%, detecting all 29 true misuse cases. Gemini,
however, had the lowest performance with a recall of
79.3% and an F2-score of 82.7%, identifying 23 out
of 29 misuse cases. These results emphasize the ef-
fectiveness of LLMs in scenarios where static tools
face challenges.
Finding 1 RQ2: Claude and GPT achieved com-
parable F2-scores of 99.3% on the MASC bench-
mark, performing better than other LLMs and in-
dicating similar effectiveness in detecting cryp-
tographic misuses. GPT outperformed all mod-
els on the OWASP benchmark with an F2-score
of 82.7%, while Claude delivered the best results
on the CryptoAPI benchmark with an F2-score of
94.3%.
5.2.1 ROC Comparison
Among LLMs, Gemini 1.5 Flash performs better than the others, correctly categorizing false instances as false and falling to the left of the random guessing line (Figure 7 to Figure 9). (Note: Gemini, Llama, and Claude were evaluated against 30 random test cases from the CryptoAPI and OWASP datasets.)
Figure 7: TPR vs FPR of LLMs for CryptoAPI.
Figure 8: TPR vs FPR of LLMs for OWASP.
In contrast, GPT-4-o-mini, Llama 3.0, and Claude
3.5 Haiku are closer to the random guessing line,
correctly identifying true instances but misclassify-
ing false ones. These results are influenced by class
imbalance in MASC, where true instances dominate,
and only a single false instance is present. Addition-
ally, the results are affected by the fact that, apart from
GPT, all other LLMs were tested against only a subset
of misuse instances from the CryptoAPI and OWASP
benchmarks. While LLMs show potential, reducing
false positives and addressing class imbalance remain
crucial for ensuring a fair comparison.
Table 8: Comparison of LLMs on Different Benchmarks.
Benchmark LLM True Positive False Positive False Negative True Negative Precision (%) Recall (%) F1 Score (%) F2 Score (%)
CryptoAPI
GPT 4-o-mini 141 37 3 0 79.2 97.9 87.6 93.5
Llama 3.0 22 7 1 0 75.9 95.7 84.6 91.0
Claude 3.5 Haiku 23 7 0 0 76.7 100 86.8 94.3
Gemini Flash 1.5 20 5 3 2 80.0 87 83.3 85.5
MASC
GPT 4-o-mini 29 1 0 0 96.7 100 98.3 99.3
Llama 3.0 28 1 1 0 96.6 96.6 96.6 96.6
Claude 3.5 Haiku 29 1 0 0 96.7 100 98.3 99.3
Gemini Flash 1.5 23 0 6 1 100 79.3 88.5 82.7
OWASP
GPT 4-o-mini 477 498 0 0 48.9 100 65.7 82.7
Llama 3.0 12 18 0 0 40 100 57.1 76.9
Claude 3.5 Haiku 12 18 0 0 40 100 57.1 76.9
Gemini Flash 1.5 12 14 0 4 46.2 100 63.2 81.1
Figure 9: TPR vs FPR of LLMs for MASC.
Finding 2 RQ2: Gemini strikes a good balance
between detecting true positives and minimizing
false positives, although GPT and other LLMs out-
perform it with higher recall and F-scores.
5.3 Evaluating LLM Responses
Actionability and Specificity were used to evaluate
LLM responses. Actionability was determined by
checking for actionable keywords, while Specificity
was assessed by the inclusion of CWE references, ex-
ternal resources, or fixed code. The percentage of
actionable and specific responses was calculated for
each benchmark to compare LLMs. The results for
actionability and specificity across benchmarks are
presented in Table 9.
GPT 4-o-mini provided the most actionable re-
sponses on MASC (63.3%) and CryptoAPI (61.4%)
but had 0% specificity across all benchmarks, indicat-
ing a lack of detailed guidance. Llama 3.0 achieved
the highest actionability on OWASP (90.0%) but also
showed no specificity. In contrast, Claude 3.5 Haiku
excelled in specificity, scoring 43.3% on OWASP
and 46.7% on MASC, while maintaining high action-
ability on OWASP (86.7%) and moderate levels on
CryptoAPI (53.3%). Overall, GPT 4-o-mini demon-
strated strong actionability but lacked detailed guid-
ance. Llama 3.0 performed best in actionability for
OWASP, while Claude 3.5 Haiku balanced action-
ability with the highest specificity by providing CWE
references, external links, or fixable code for crypto-
graphic misuses.
Figure 10: GPT Actionable Keywords Occurrence.
Figures 10 to 13 show the occurrence of action-
able keywords for the selected LLMs. The most fre-
quently used keywords across benchmarks are “not”
and “should”. In both the OWASP and CryptoAPI
benchmarks, “should” is common, with 996 instances
in GPT for OWASP, and a lower but consistent pres-
Table 9: Actionability and Specificity of LLMs Across Benchmark Datasets.
Benchmark LLM Actionable Responses Specific Responses Total Responses Actionability (%) Specificity (%)
CryptoAPI
GPT 4-o-mini 162 0 264 61.4 0.0
Llama 3.0 14 2 30 46.7 6.7
Claude Haiku 3.5 16 2 30 53.3 6.7
Gemini Flash 1.5 9 2 30 30.0 6.7
MASC
GPT 4-o-mini 19 0 30 63.3 0.0
Llama 3.0 15 2 30 50.0 6.7
Claude Haiku 3.5 15 14 30 50.0 46.7
Gemini Flash 1.5 7 7 30 23.3 23.3
OWASP
GPT 4-o-mini 1584 0 1909 83.0 0.0
Llama 3.0 27 0 30 90.0 0.0
Claude Haiku 3.5 26 13 30 86.7 43.3
Gemini Flash 1.5 21 6 30 70.0 20.0
Figure 11: Gemini Actionable Keywords Occurrence.
Figure 12: Llama Actionable Keywords Occurrence.
Figure 13: Claude Actionable Keywords Occurrence.
ence across other benchmarks, providing positive ac-
tionable feedback for developers. Similarly, “not”
appears frequently, especially in OWASP, with 1404
instances in GPT, providing negative actionability
and guiding developers on cryptographic misuses.
“Forbidden” and “shall” were the least used keywords across the benchmarks for the selected LLMs. The low frequency of stricter keywords such as “need”, “must”, “never”, and “shall” suggests a preference for softer language over strict requirements, with responses focusing on suggestion and negation to indicate which actions to take or avoid.
Finding RQ3: GPT 4-o-mini excels in action-
ability across multiple benchmarks, providing ac-
tionable feedback on things to consider and avoid,
but lacks specificity. In contrast, while Claude
3.5 Haiku exhibits lower actionability, it provides
greater specificity compared to other LLMs.
6 DISCUSSION
Strengths of Large Language Models. LLMs have an inherent advantage over static analysis tools, which rely on a rule-based approach: any deviation of a misuse instance from a rule is likely to cause the tool to miss it. LLM detection capability, by contrast, is not tied to a specific rule set, and LLMs can detect misuse instances across varied code, as highlighted by the results on mutated test cases. Similarly, LLMs are not bound to a single programming language; they can read and understand many languages, whereas static analysis tools are built for one. LLMs also have no framework dependency, and integrations are increasingly being offered to adopt LLMs within various frameworks, which will likely increase their use in software development. Even though LLMs currently have a high false positive rate for detecting cryptographic misuses, this is likely to improve as they are trained on more data. Their better accuracy than static analysis tools in detecting true alerts in a security context shows that LLMs can be utilized in this domain.
Benchmark Inconsistencies. Our analysis of the
CryptoAPI benchmark revealed inconsistencies, in-
cluding missing test cases in the dataset that were
listed in the label sheet, such as CredentialInStringBBCase2.java and PredictableSeedsABPMCase2.java.
This mismatch can lead to false negatives and mis-
represent the tools’ effectiveness. Additionally, some
Java test cases lacked corresponding labels, causing
tools to detect misuse without reference. To ensure
consistency, we excluded these unlisted test cases,
emphasizing the need for a comprehensive, standard-
ized dataset to properly evaluate tool effectiveness
across all Java API categories.
7 ECOLOGICAL VALIDITY
7.1 Applicability to Real World
Scenarios
Our experiments utilized the OWASP Benchmark, a
well-known open-source Java suite that serves as a
standard for assessing application security. The Cryp-
toAPI benchmark was also included based on a re-
view of previous studies, where it has been exten-
sively used. Additionally, the MASC dataset was se-
lected to capture the diverse ways in which code can
be written by developers. The results obtained from
these benchmarks provide insights into the effective-
ness of different tools and LLMs in detecting vul-
nerabilities across various domains. Our focus was
specifically on cryptographic misuses, making the re-
sults highly relevant for comparing our tool with both
industry-standard and academic tools, many of which
are evaluated against these benchmarks. Since these
benchmarks are commonly used, their results are a
reliable measure of a tool’s accuracy in identifying
cryptographic misuses.
The increasing adoption of LLMs by novice de-
velopers, often in place of traditional search engines,
reflects a shift in how coding solutions are sourced.
Many coding environments, such as Visual Studio
Code, now integrate LLM-based assistants like Chat-
GPT, allowing developers to prompt these models di-
rectly. With LLMs offering quick, context-specific
answers, novice developers increasingly rely on these
tools not only for general coding guidance but also
for secure coding practices, including cryptographic
implementations. As more developers turn to LLMs
for solutions, they inherently depend on the models’
security accuracy to prevent cryptographic misuses in
their code. At the same time, integrating LLM-based
detection into the CI/CD pipeline ensures that cryp-
tographic issues are automatically flagged before de-
ployment. This helps teams enforce secure coding
practices continuously, reduces the risk of introduc-
ing security flaws into production code, and addresses
cryptographic vulnerabilities as part of the develop-
ment lifecycle with minimal manual effort from de-
velopers. Therefore, it becomes critical to evaluate
how well LLMs address security contexts, ensuring
that the solutions provided are both effective and safe.
Understanding this trend highlights the real-world im-
portance of our work, which assesses the quality and
security of LLM responses in situations where devel-
opers may not be aware of potential misuses.
7.2 Threats to Validity
Threats to Internal Validity. The scope of our findings
may be limited by the specific tools and datasets cho-
sen for testing. To address this, we plan to expand
our evaluation by including additional tools and test-
ing them on examples drawn from real-world cryp-
tographic code. Another potential limitation is data
leakage, as the open-source benchmarks used for test-
ing might have been utilized in the training of LLMs,
potentially affecting our results; however, the high
number of false positives suggests otherwise. In the
future, we plan to evaluate test cases by anonymiz-
ing class names to prevent any possible data leakage.
LLMs may hallucinate non-existent vulnerabilities or
incorrect patterns, leading to false positives that re-
quire manual verification. Additionally, the action-
ability and specificity metrics evaluated in our study
may not fully translate into practical guidance for de-
velopers to fix code in every instance.
Threats to External Validity. Our results may not
generalize beyond the specific tools and datasets used
in this study. To address this, we plan to expand our
evaluation by incorporating additional tools and real-
world cryptographic benchmarks. Additionally, our
findings are specific to Java and may not apply in the
same way to other programming languages. More-
over, since the mutated test cases we used were specific to Java, the accuracy of detecting mutated misuses might not carry over to other programming languages.
8 FUTURE WORK
Future research could explore testing LLMs on real-
world codebase benchmarks, such as the Apache
CryptoAPI benchmark, to assess performance on
complex cryptographic scenarios. Expanding the
study to include newer versions of LLMs, particu-
larly those specialized in code generation, may pro-
vide insights into the benefits of using code-focused
models over general-purpose ones. Additionally, test-
ing LLMs on benchmarks in other programming lan-
guages could reveal language-specific strengths and
limitations.
9 CONCLUSION
Cryptographic misuses in software pose sig-
nificant security risks. Our study investigated
how LLMs compare with traditional static anal-
ysis tools in detecting such vulnerabilities. We
evaluated their performance across established
benchmarks—CryptoAPI, OWASP, and MASC—and
found that LLMs, particularly GPT, often achieved
higher accuracy than static tools like CryptoGuard,
especially on the CryptoAPI and MASC datasets.
However, CryptoGuard outperformed LLMs on the
OWASP dataset, highlighting that no single approach
is universally superior.
While LLMs show strong potential, their rela-
tively high false positive rate may hinder adoption
by overwhelming developers with alerts. To address
this, we introduced Actionability and Specificity as
complementary metrics, offering deeper insight be-
yond precision and recall. Results showed that GPT
offered more actionable suggestions, while Claude
excelled in specificity—underscoring trade-offs in
LLM-generated guidance. Integration of LLMs into
CI/CD pipelines or development environments could
enhance secure coding practices, but their effective-
ness will rely on reducing false positives and improv-
ing contextual relevance. As these models evolve,
they could complement existing tools and provide
real-time, intelligent support for secure software de-
velopment.
ACKNOWLEDGEMENTS
We thank the support of a Discovery Grant from the
Natural Sciences and Engineering Research Council
of Canada (Grant # RGPIN/005919-2018) and the
Vulnerability Research Centre at the Communications
Security Establishment.
REFERENCES
Afrose, S., Rahaman, S., and Yao, D. (2019). CryptoAPI-
Bench: A Comprehensive Benchmark on Java Cryp-
tographic API Misuses. In 2019 IEEE Cybersecurity
Development, pages 49–61.
Amazon Q Developer (n.d.). Amazon Q Developer. https://
aws.amazon.com/q/developer/. Retrieved August 21,
2024.
Ami, A. S., Cooper, N., et al. (2022). Why Crypto-detectors
Fail: A Systematic Evaluation of Cryptographic Mis-
use Detection Techniques. In 2022 IEEE Symposium
on S&P, pages 614–631.
ChatGPT (n.d.). ChatGPT. https://chat.openai.com/. Re-
trieved August 21, 2024.
Chatzikonstantinou, A., Ntantogian, C., et al. (2016). Eval-
uation of Cryptography Usage in Android Applica-
tions. In Proceedings of the 9th EAI International
Conference on Bio-Inspired Information and Commu-
nications Technologies, BICT’15, page 83–90, Brus-
sels, BEL. ICST.
Chen, Y., Liu, Y., et al. (2024). Towards Precise Report-
ing of Cryptographic Misuses. In Proceedings 2024
NDSS.
Claude (n.d.). Claude. https://claude.ai/. Retrieved Decem-
ber 21, 2024.
CryptoGuardOSS (2020). cryptoguardoss/cryptoguard.
https://github.com/CryptoGuardOSS/cryptoguard.
[Online; accessed August 21, 2024].
CWE - Common Weakness Enumeration (2021). CWE -
Common Weakness Enumeration. https://cwe.mitre.
org. [Online; accessed August 21, 2024].
Egele, M., Brumley, D., et al. (2013). An Empirical Study
of Cryptographic Misuse in Android Applications. In
Proceedings of the 2013 ACM SIGSAC Conference
on Computer & Communications Security, CCS ’13,
page 73–84, New York, NY, USA. Association for
Computing Machinery.
Fang, C., Miao, N., et al. (2024). Large Language Models
for Code Analysis: Do LLMs Really Do Their Job?
In 33rd USENIX Security Symposium, pages 829–846,
Philadelphia, PA. USENIX Association.
Firouzi, E., Ghafari, M., et al. (2024). ChatGPT’s Poten-
tial in Cryptography Misuse Detection: A Compar-
ative Analysis with Static Analysis Tools. In Pro-
ceedings of the 18th ACM/IEEE International Sym-
posium on Empirical Software Engineering and Mea-
surement, ESEM ’24, page 582–588, New York, NY,
USA. ACM.
Gemini (n.d.). Gemini. https://gemini.google.com/. Re-
trieved August 21, 2024.
GitHub AI Pair Programmer (n.d.). GitHub AI Pair Pro-
grammer. https://copilot.github.com. Retrieved Au-
gust 26, 2024.
Graff, M. G. and Wyk, K. R. V. (2003). Secure Coding:
Principles and Practices. O’Reilly & Associates, Inc.,
USA.
Kafader, S. and Ghafari, M. (2021). Fluentcrypto: Cryptog-
raphy in easy mode. In 2021 ICSME, pages 402–412.
Krüger, S., Nadi, S., et al. (2017). CogniCrypt: Support-
ing developers in using cryptography. In 2017 32nd
IEEE/ACM International Conference on ASE, pages
931–936.
Krüger, S., Späth, J., et al. (2021). CrySL: An Extensible
Approach to Validating the Correct Usage of Crypto-
graphic APIs. IEEE Transactions on Software Engi-
neering, 47(11):2382–2400.
Lazar, D., Chen, H., et al. (2014). Why does cryptographic
software fail? A case study and open problems. In
Proceedings of 5th Asia-Pacific Workshop on Systems,
APSys ’14, New York, NY, USA. ACM.
Li, Y., Choi, D., et al. (2022). Competition-
level code generation with AlphaCode. Science,
378(6624):1092–1097.
Liu, P., Liu, J., et al. (2024). Exploring ChatGPT’s Capabil-
ities on Vulnerability Management. In 33rd USENIX
Security Symposium, pages 811–828, Philadelphia,
PA. USENIX Association.
Llama (n.d.). Llama. https://www.llama.com/. Retrieved
September 21, 2024.
Masood, Z. and Martin, M. V. (2024). Beyond static tools:
Evaluating large language models for cryptographic
misuse detection.
Nadi, S., Krüger, S., et al. (2016). Jumping through hoops:
why do Java developers struggle with cryptography
APIs? In Proceedings of the 38th International Con-
ference on SE, ICSE ’16, page 935–946, New York,
NY, USA. Association for Computing Machinery.
Ouh, E. L., Gan, B. K. S., et al. (2023). ChatGPT, Can
You Generate Solutions for My Coding Exercises? An
Evaluation on Its Effectiveness in an Undergraduate
Java Programming Course. In Proceedings of the 2023
Conference on Innovation and Technology in Com-
puter Science Education V. 1, ITiCSE 2023, pages 54–
60, New York, NY, USA. Association for Computing
Machinery.
OWASP Benchmark (2016). OWASP Benchmark. https://
owasp.org/www-project-benchmark/. Accessed May,
2024.
QwenLM (n.d.). QwenLM. https://qwenlm.ai/. Retrieved
December 21, 2024.
Rahaman, S., Xiao, Y., et al. (2019). CryptoGuard: High
Precision Detection of Cryptographic Vulnerabilities
in Massive-sized Java Projects. In Proceedings of
the 2019 ACM SIGSAC Conference on Computer and
Communications Security, CCS ’19, page 2455–2472,
New York, NY, USA. Association for Computing Ma-
chinery.
Redmiles, E. M., Warford, N., et al. (2020). A Comprehen-
sive Quality Evaluation of Security and Privacy Ad-
vice on the Web. In 29th USENIX Security Sympo-
sium, pages 89–108. USENIX Association.
Rostami, E. and Karlsson, F. (2024). Qualitative Con-
tent Analysis of Actionable Advice in Information
Security Policies Introducing the Keyword Loss of
Specificity Metric. Information & Computer Security,
32(4):492–508.
Snyk (n.d.). Snyk Code Code Security Analysis and
Fixes - Developer First SAST. https://snyk.io/product/
snyk-code/. Retrieved August 20, 2024.
Soot (2020). Soot. https://github.com/soot-oss/soot. Ac-
cessed: August 26, 2024.
Tabnine (n.d.). Tabnine. https://www.tabnine.com. Re-
trieved August 21, 2024.
Vallée-Rai, R., Co, P., et al. (1999). Soot - a Java byte-
code optimization framework. In Proceedings of the
1999 Conference of the Centre for Advanced Studies
on Collaborative Research, CASCON ’99, page 13.
IBM Press.
Whitten, A. (2004). Making Security Usable. PhD the-
sis, Carnegie Mellon University, 5000 Forbes Avenue,
Pittsburgh, PA, USA.
Xia, Y., Xie, Z., et al. (2024). Exploring Automatic Crypto-
graphic API Misuse Detection in the Era of LLMs.
Xie, J., Lipford, H. R., et al. (2011). Why do program-
mers make security errors? In 2011 IEEE Symposium
on Visual Languages and Human-Centric Computing,
pages 161–164, Los Alamitos, CA, USA. IEEE Com-
puter Society.
Zhang, L., Chen, J., et al. (2019). CryptoREX: Large-scale
Analysis of Cryptographic Misuse in IoT Devices. In
22nd International Symposium on RAID, pages 151–
164, Chaoyang District, Beijing. USENIX Associa-
tion.
Zhang, Y., Kabir, M. M. A., et al. (2023). Automatic De-
tection of Java Cryptographic API Misuses: Are We
There Yet? IEEE Transactions on Software Engineer-
ing, 49(1):288–303.
APPENDIX
LLM Prompt. Large Language Models (LLMs)
received standardized prompts for each test case from
the three benchmarks to ensure their responses were
consistent. The specific prompt used for each test
case is shown in Listing 1. During the experiments,
we kept the default settings for model hyper-
parameters like temperature, Top P, and frequency
penalty to maintain each model’s natural response
style. Responses from GPT 4-o-mini were collected
automatically through an API, while responses from
Llama, Claude, and Gemini were gathered manually
from their respective chat interfaces.
I want you to detect "Cryptographic misuses" in the given Java code by considering the cryptographic misuse definitions below.

Cryptographic misuses are deviations from best practices while incorporating cryptographic algorithms into your software that could potentially be exploited by an adversary. Broadly, cryptographic misuses can be referred to as bad programming practices that create vulnerabilities and are associated with design flaws and unsafe architectural choices.

Identify the following for each cryptographic misuse and include it in your response. Follow the same template for reporting multiple misuses.

1 - Include a label as YES/NO if there are specifically cryptographic misuses in the given code or not.
2 - Include the name of the method containing the misuse.
3 - Specify a message highlighting the type of misuses. Provide a description in the explanation as to why it is highlighted as a misuse.
4 - State the line from the code containing the misuses. Specifically mention the starting line number that is causing misuse.

Code:
// Code for each test case is placed here

Listing 1: Prompt Used to Query LLMs
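For reproducibility, the snippet below shows one way the prompt could be submitted to the OpenAI Chat Completions endpoint from Java. It is a minimal sketch, not the exact harness used in our experiments: it assumes an API key in the OPENAI_API_KEY environment variable and keeps the request JSON deliberately simple instead of building it with a JSON library.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class GptQuery {

    public static void main(String[] args) throws Exception {
        String apiKey = System.getenv("OPENAI_API_KEY");

        // Prompt from Listing 1 with the test case appended; escaped for JSON.
        String prompt = "I want you to detect \\\"Cryptographic misuses\\\" in the given Java code ...";

        String body = """
                {"model": "gpt-4o-mini",
                 "messages": [{"role": "user", "content": "%s"}]}
                """.formatted(prompt);

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://api.openai.com/v1/chat/completions"))
                .header("Authorization", "Bearer " + apiKey)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println(response.body());   // raw JSON containing the model's reply
    }
}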
LLM Responses. The responses from the LLMs
were fairly similar in format due to the structured
prompt, though some variations were observed. For
instance, in certain cases, the explanation for misuse
from GPT 4-o-mini appeared under tags such as
“Misuse Description”, while in others, it was labeled
as “Explanation” or a similar term. An example
of this can be seen in Listing 3, which shows the
response for the CryptoAPI Bench test case listed
in Listing 2. The misuse in this instance involves
using the DES algorithm and the “DES” string to
initialize the key. The detailed findings are discussed
in Section 5.
public class BrokenCryptoABMCCase1
{
    public static void main(String[] args) throws NoSuchPaddingException, NoSuchAlgorithmException, InvalidKeyException
    {
        BrokenCryptoABMC1 bc = new BrokenCryptoABMC1();
        String crypto = "DES/ECB/PKCS5Padding";
        String cryptokey = "DES";
        bc.go(crypto, cryptokey);
    }
}

Listing 2: CryptoAPI Bench Test Case: BrokenCryptoABMCCase1.java
Response:
1 - **Label**: YES
2 - **Method containing the misuse**: main
3 - **Message highlighting the type of misuses**: Use of weak cryptographic algorithm and mode of operation
**Explanation**: The code uses DES (Data Encryption Standard) in ECB (Electronic Codebook) mode, which is considered weak and insecure. DES itself has been deprecated due to its small key size and susceptibility to brute-force attacks, while ECB mode exposes patterns in plaintext since identical plaintext blocks produce identical ciphertext blocks. This can lead to serious vulnerabilities, allowing attackers to perform cryptanalysis or deduce information about the plaintext. Best practices recommend using stronger algorithms (e.g., AES) and secure modes of operation (e.g., GCM or CBC with proper IV management).
4 - **Line from the code containing the misuses**: 10-11 (Lines mentioning String crypto = "DES/ECB/PKCS5Padding"; and String cryptokey = "DES";)

Listing 3: GPT 4-o-mini Response for BrokenCryptoABMCCase1.java