
Krüger, S., Nadi, S., et al. (2017). CogniCrypt: Supporting developers in using cryptography. In 2017 32nd IEEE/ACM International Conference on ASE, pages 931–936.
Krüger, S., Späth, J., et al. (2021). CrySL: An Extensible Approach to Validating the Correct Usage of Cryptographic APIs. IEEE Transactions on Software Engineering, 47(11):2382–2400.
Lazar, D., Chen, H., et al. (2014). Why does cryptographic software fail? A case study and open problems. In Proceedings of 5th Asia-Pacific Workshop on Systems, APSys ’14, New York, NY, USA. ACM.
Li, Y., Choi, D., et al. (2022). Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097.
Liu, P., Liu, J., et al. (2024). Exploring ChatGPT’s Capabilities on Vulnerability Management. In 33rd USENIX Security Symposium, pages 811–828, Philadelphia, PA. USENIX Association.
Llama (n.d.). Llama. https://www.llama.com/. Retrieved September 21, 2024.
Masood, Z. and Martin, M. V. (2024). Beyond static tools: Evaluating large language models for cryptographic misuse detection.
Nadi, S., Krüger, S., et al. (2016). Jumping through hoops: why do Java developers struggle with cryptography APIs? In Proceedings of the 38th International Conference on SE, ICSE ’16, pages 935–946, New York, NY, USA. Association for Computing Machinery.
Ouh, E. L., Gan, B. K. S., et al. (2023). ChatGPT, Can You Generate Solutions for My Coding Exercises? An Evaluation on Its Effectiveness in an Undergraduate Java Programming Course. In Proceedings of the 2023 Conference on Innovation and Technology in Computer Science Education V. 1, ITiCSE 2023, pages 54–60, New York, NY, USA. Association for Computing Machinery.
OWASP Benchmark (2016). OWASP Benchmark. https://owasp.org/www-project-benchmark/. Retrieved May 2024.
QwenLM (n.d.). QwenLM. https://qwenlm.ai/. Retrieved December 21, 2024.
Rahaman, S., Xiao, Y., et al. (2019). CryptoGuard: High Precision Detection of Cryptographic Vulnerabilities in Massive-sized Java Projects. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, CCS ’19, pages 2455–2472, New York, NY, USA. Association for Computing Machinery.
Redmiles, E. M., Warford, N., et al. (2020). A Comprehensive Quality Evaluation of Security and Privacy Advice on the Web. In 29th USENIX Security Symposium, pages 89–108. USENIX Association.
Rostami, E. and Karlsson, F. (2024). Qualitative Content Analysis of Actionable Advice in Information Security Policies – Introducing the Keyword Loss of Specificity Metric. Information & Computer Security, 32(4):492–508.
Snyk (n.d.). Snyk Code — Code Security Analysis and Fixes - Developer First SAST. https://snyk.io/product/snyk-code/. Retrieved August 20, 2024.
Soot (2020). Soot. https://github.com/soot-oss/soot. Retrieved August 26, 2024.
Tabnine (n.d.). Tabnine. https://www.tabnine.com. Retrieved August 21, 2024.
Vallée-Rai, R., Co, P., et al. (1999). Soot - a Java bytecode optimization framework. In Proceedings of the 1999 Conference of the Centre for Advanced Studies on Collaborative Research, CASCON ’99, page 13. IBM Press.
Whitten, A. (2004). Making Security Usable. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, USA.
Xia, Y., Xie, Z., et al. (2024). Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs.
Xie, J., Lipford, H. R., et al. (2011). Why do programmers make security errors? In 2011 IEEE Symposium on Visual Languages and Human-Centric Computing, pages 161–164, Los Alamitos, CA, USA. IEEE Computer Society.
Zhang, L., Chen, J., et al. (2019). CryptoREX: Large-scale Analysis of Cryptographic Misuse in IoT Devices. In 22nd International Symposium on RAID, pages 151–164, Chaoyang District, Beijing. USENIX Association.
Zhang, Y., Kabir, M. M. A., et al. (2023). Automatic Detection of Java Cryptographic API Misuses: Are We There Yet? IEEE Transactions on Software Engineering, 49(1):288–303.
APPENDIX
LLM Prompt. Large Language Models (LLMs) received standardized prompts for each test case from the three benchmarks to ensure their responses were consistent. The specific prompt used for each test case is shown in Listing 1. During the experiments, we kept the default settings for model hyperparameters such as temperature, Top P, and frequency penalty to maintain each model's natural response style. Responses from GPT-4o-mini were collected automatically through an API, while responses from Llama, Claude, and Gemini were gathered manually from their chat interfaces.
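The standardized-prompt step can be sketched as follows. This is an illustrative Python sketch, not the authors' actual collection code: the template wording is abridged, and the function name and placeholder names are assumptions (the full prompt appears in Listing 1). The resulting string would then be sent to each model with its default hyperparameters.

```python
# Illustrative sketch only; template text, names, and placeholders are
# assumptions for exposition, not the exact prompt from Listing 1.
PROMPT_TEMPLATE = (
    'I want you to detect "Cryptographic misuses" in the given Java code '
    "by considering the cryptographic misuse definitions below.\n\n"
    "{definitions}\n\n"
    "Java code:\n{code}"
)

def build_prompt(java_code: str, definitions: str) -> str:
    """Build the identical prompt for every benchmark test case so that
    responses stay comparable across models and benchmarks."""
    return PROMPT_TEMPLATE.format(definitions=definitions, code=java_code)
```

Keeping the template fixed and substituting only the test-case code is what makes responses comparable across the three benchmarks; for example, `build_prompt('Cipher.getInstance("DES");', misuse_definitions)` yields the prompt for one test case.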
I want you to detect "Cryptographic misuses" in the given Java code by considering the cryptographic misuse definitions below.

Cryptographic misuses are deviations from best practices while incorporating